Yesterday Campfire was up and down all morning, US time. From 8am CST to a little after noon, Campfire was throwing error messages or the service simply wouldn’t work.
Like many of our customers, our entire company runs on Campfire. Needless to say, it was not a good morning. Having your main line of active communication go dark like that does not make for a happy Tuesday.
We’re deeply sorry for such a prolonged period of instability. When Campfire can’t give you a consistent dial tone for four hours, something is not right.
If you care about the technical details, I’ll elaborate below. If you don’t, know that we’ve set everything in motion to fix all the problems we discovered.
What went wrong in technical terms
The whole debacle started when eight of our physical machines running dozens of KVM virtual machines all started having trouble at the same time. The virtual machines had severely degraded network and CPU performance, causing them to grind to a halt. The investigation to find out what was wrong took too long, but we eventually decided to do rolling restarts of the machines.
The restart took down a number of services that Campfire is heavily dependent on, namely memcached and redis. While we’ve done a lot of work on memcached and making sure that the service can keep trucking when something happens to it, we haven’t done that for redis.
The latter is a more recent introduction to our infrastructure and we’ve been relying on it for a critical part of the Campfire workflow without doing our due diligence on how it behaves when it fails. Shame on us.
So when redis went down, we had Campfire services running that would hang waiting on redis. That alone is bad enough, but what made matters worse was that it was actually hanging inside of a MySQL transaction! This meant that MySQL got all gobbed up with open transactions that placed locks on key tables that in turn made everything grind to a halt. Bah!
Ultimately the solution was to track down and restart every Campfire service that was hanging waiting for redis and locking up MySQL. But it took us far too long to come to this realization, in part because some of the lock-ups happened slowly instead of all at once. So we were chasing bad theories for a while instead of just nuking everything from orbit and starting over.
What we’re doing to prevent this
We’ve already started a number of projects to prevent this from happening again. First, we’re moving our critical redis servers off KVM and onto physical, dedicated hardware to make it all both faster and more reliable.
Second, we’re setting up fire drills to deal with the redis lockups. The redis client should behave much better with regard to timeouts, similar to how well memcached handles failure.
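For the technically curious, here is a rough idea of what "behaving better with regard to timeouts" means. This is a minimal Python sketch, not our actual client code: the `redis_ping` helper, its host and port arguments, and the half-second default are all assumptions made up for illustration. It talks to the server over a plain socket with a timeout, so a dead or stalled redis produces a quick "no" instead of a hung process.

```python
import socket


def redis_ping(host, port, timeout=0.5):
    """Return True if the server answers PING within `timeout` seconds.

    Hypothetical helper for illustration only. The timeout covers both
    connecting and waiting for the reply, so the caller never blocks
    for much longer than `timeout` on either step.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # Redis accepts simple inline commands terminated by CRLF.
            sock.sendall(b"PING\r\n")
            return sock.recv(64).startswith(b"+PONG")
    except OSError:
        # Covers timeouts, refused connections, and resets alike.
        return False
```

A caller can then treat `False` the same way it would treat a cache miss and carry on without redis, rather than waiting indefinitely.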
Third, we’re going to get all redis interaction out of MySQL transaction blocks, so that problems with redis don’t also cause problems with MySQL.
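To illustrate keeping redis out of transaction blocks, here is a minimal sketch under stated assumptions. It uses Python's built-in sqlite3 as a stand-in for MySQL, and the `post_message` function, the `messages` table, and the `notify` callback (standing in for the redis call) are hypothetical names, not Campfire's real code. The point is only the ordering: the transaction is committed first, and the call to the outside service happens afterwards, when no locks are held.

```python
import sqlite3


def post_message(db, notify, room_id, body):
    """Store a message, then notify an external service (e.g. redis).

    Hypothetical sketch: `db` is a sqlite3 connection standing in for
    MySQL, and `notify` stands in for whatever talks to redis.
    """
    # The transaction contains database work only, so it stays short.
    with db:  # commits on success, rolls back on an exception
        cur = db.execute(
            "INSERT INTO messages (room_id, body) VALUES (?, ?)",
            (room_id, body),
        )
        message_id = cur.lastrowid

    # The transaction is already committed at this point. If notify is
    # slow or fails, it can no longer keep database locks open.
    try:
        notify(room_id, message_id)
    except Exception:
        pass  # a real service would log this and possibly retry later

    return message_id
```

The trade-off in this sketch is that a failed notification no longer rolls back the stored message, so the notification has to be something that can safely be retried or skipped.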
Fourth, we’re building a tool chain that handles emergency tasks like failing over, restarting services in bulk, and dealing with other issues as a completely automated, one-button solution (instead of scrambling to do it by hand when stuff is down).
Again, we’re really sorry to have thrown a wrench in the Tuesday morning Campfire sessions. We ran around like headless chickens for a minute or two until we got an AIM group chat set up as a feeble backup for the rescue team. Campfire is our lifeline to each other. It just can’t be down for this long.
Thanks for your understanding, loyalty, and your business.




