Hey folks,
Yes, I know there were server problems last night, resulting in multiple copies of the issue being sent. The problem started with the initial send around 5:15 PM, when the server started rejecting connections. For unknown reasons, the load average was so high (80) that the server couldn’t do anything. When it recovered 5 or 10 minutes later, only 3,600 of the 23,450 copies of the issue had been sent, and Sendy wasn’t auto-retrying.
When our developer was able to take a look around 9 PM, he couldn’t see any reason for the problem, so we duplicated the issue, added a short note apologizing for resending it to the 3,600 people who had already received it (there was no way to determine who had and had not received it), and resent.
I watched it send, and although I was a bit perturbed about the high load average while it was sending (11), it seemed to chug through all the copies. Because I’d never had to watch a send before, I didn’t know what to expect at the end. Sendy reported it was done with 23,427 sends, but the load average remained high, indicating that the server was still doing something.
What I didn’t realize was that it was still sending copies of the issue. Lauri and I have gotten lots of reports of people receiving 4-8 copies. All of my duplicates came in before 10:43 PM Eastern last night, but according to our Amazon SES stats, 41,000 copies were sent today, presumably in the early hours. Unfortunately, I’ve been sick, so I went to bed shortly after the sending completed and didn’t see Lauri’s Slack message about the duplicates.
It seems to have stopped now, and the load average on the server is down around 1 again. I don’t yet have any idea what could have happened, but our developer knows about it and is looking into it.
Rest assured, I’ll be keeping an eye on sends, but this is Unix juju on a setup that has worked flawlessly for years, so I’m really not sure what happened or how to tell if it will happen again.
I’m going to delete all the posts and topics reporting this since they’ll just clutter things up and don’t provide any useful information now that we know the problem happened.