Slack engineers have revealed what went wrong on May 12, when Slack went down for several hours amid mass remote working during the coronavirus pandemic.
Slack has now published a detailed technical review of the outage as well as an account of its incident response team, which was caught off guard by the outage.
Both posts aim to offer transparency about how Slack engineers, who also depend on Slack to communicate, handle recovery when things go haywire.
Early on, Slack was pitched as an email-killer, and Microsoft answered Slack with Teams. But during the coronavirus pandemic, a bigger rivalry between Microsoft Teams and Zoom emerged as employees were forced to conduct video meetings from home.
Slack, too, claimed to have seen usage spike during the pandemic, and its CEO, Stewart Butterfield, proclaimed that Teams was not even a competitor to Slack these days.
So what happened at Slack when Slack went down at 16:45 Pacific Time on May 12?
Like everyone else during lockdown, Slack's incident response team turned to a Zoom video meeting, which was not organized on Slack, but via a company-wide email – the medium that Slack was supposed to kill in the workplace.
"In such unfortunate situations where we aren't able to rely on Slack, we have prescribed alternative methods of communication," says Ryan Katkov, a senior engineering manager at Slack.
"Following incident runbooks, we quickly moved the incident to Zoom and elevated the response to a Sev-1, our highest severity level."
Zoom appears to have adeptly handled Slack's crisis communications needs when Slack was down.
"Engineers from different teams joined on the Zoom call, scribes were recording the conversation on Zoom for later dissemination, opinions and solutions were offered. Graphs and signals were shared," writes Katkov.
"It's a flurry of action and adrenaline that the Incident Commander must guide and keep the right people focused so that the most appropriate action can be taken, balancing risk vs restoring service. Multiple workstreams of investigation were established and assigned with the goal of bringing remediation in the most efficient manner."
But, as Laura Nolan, a Slack site reliability engineer, explains, Slack's systems began tripping up at 8:30am, when alarms were raised over its Vitess database tier, which serves user data and faced a spike in requests after Slack made a configuration change that triggered a "long-standing performance bug".
Her account of the outage – 'A terrible, horrible, no-good, very bad day at Slack' – reveals that the extra load on Slack's database was contained by rolling back the configuration change. But the incident had a knock-on effect for its main web app tier, which carried "significantly higher numbers of instances" during the pandemic.
"We increased our instance count by 75% during the incident, ending with the highest number of webapp hosts that we've ever run to date. Everything seemed fine for the next eight hours – until we were alerted that we were serving more HTTP 503 errors than normal," explains Nolan.
She details a series of issues that affected Slack's fleet of HAProxy load-balancer instances. Those issues ended up affecting its web app instances, which grew to outnumber the available slots in the HAProxy configuration.
"If you think of the HAProxy slots as a giant game of musical chairs, a few of these webapp instances were left without a seat. That wasn't a problem – we had more than enough serving capacity," she writes.
"However, over the course of the day, a problem developed. The program that synced the host list generated by consul template with the HAProxy server state had a bug. It always attempted to find a slot for new webapp instances before it freed slots taken up by old webapp instances that were no longer running."
That program began to fail and exit early because it was unable to find any empty slots, so the running HAProxy instances weren't getting their state updated.
"As the day passed and the webapp autoscaling group scaled up and down, the list of backends in the HAProxy state became more and more stale."
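Nolan's description of the sync program maps onto a short sketch. The following is a hypothetical Python reconstruction of the find-before-free ordering she describes (the function and host names are illustrative, not Slack's actual code): because new hosts are seated before dead ones are evicted, the program exits early the moment every slot is taken, and its cleanup step never runs.

```python
# Hypothetical reconstruction of the sync bug Nolan describes -- not Slack's
# actual code. The program that copied the consul-template host list into the
# HAProxy server state tried to seat every NEW webapp instance before it
# freed slots still held by instances that no longer existed.

def buggy_sync(slots, live_hosts):
    """Sync HAProxy server slots against the current host list.

    `slots` maps slot index -> hostname (or None for an empty slot).
    Returns True if the sync completed, False if it bailed out early.
    """
    occupied = {h for h in slots.values() if h is not None}
    new_hosts = [h for h in live_hosts if h not in occupied]

    # Bug: try to seat new hosts first...
    for host in new_hosts:
        free = [i for i, h in slots.items() if h is None]
        if not free:
            # ...and exit early when every slot is taken, so the stale
            # entries below are never cleaned up.
            return False
        slots[free[0]] = host

    # Freeing slots held by dead hosts only happens if seating succeeded.
    for i, h in slots.items():
        if h is not None and h not in live_hosts:
            slots[i] = None
    return True

slots = {0: "web-a", 1: "web-b"}            # every slot already taken
ok = buggy_sync(slots, ["web-b", "web-c"])  # web-a died, web-c replaced it
print(ok, slots)  # → False {0: 'web-a', 1: 'web-b'} -- dead web-a keeps its seat
```

Reversing the order — free slots held by dead hosts first, then seat the new ones — would remove the early exit in this sketch, which matches Nolan's account of why the state went stale only after the slots filled up.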
The outage happened after Slack scaled down its web app tier in line with the end of the business day in the US, when traffic typically falls.
"Autoscaling will preferentially terminate older instances, so this meant that there were no longer enough older webapp instances remaining in the HAProxy server state to serve demand."
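Combined with oldest-first termination, the stale server state was enough to starve the load balancer. This toy timeline (my own construction, with invented slot counts and host names, not Slack's tooling) reproduces the shape of the failure: the fleet churns all day while the buggy sync bails out early, and the evening scale-down then removes precisely the hosts HAProxy still has in its slots.

```python
# Toy timeline of the failure mode -- illustrative only, with invented
# slot counts and host names; not Slack's tooling.
from collections import deque

SLOTS = 6  # fixed pool of HAProxy server slots

def sync(slots, live):
    """Same find-before-free bug: seat new hosts, then free dead ones."""
    occupied = {h for h in slots if h is not None}
    for host in live:
        if host in occupied:
            continue
        try:
            slots[slots.index(None)] = host  # seat a new host
        except ValueError:
            return                           # no empty slot: bail out early
    for i, h in enumerate(slots):
        if h is not None and h not in live:
            slots[i] = None                  # unreached once slots fill up

slots = [None] * SLOTS
fleet = deque(f"web-{i}" for i in range(6))  # oldest instances at the left
sync(slots, fleet)                           # all six slots filled

fleet.extend(["web-6", "web-7"])             # scale-up: no seats left over
sync(slots, fleet)                           # bails out; harmless so far

# Daytime churn: each cycle retires the oldest instance and adds a new one.
# Every sync bails out early, so the slots silently go stale.
for i in range(8, 11):
    fleet.popleft()
    fleet.append(f"web-{i}")
    sync(slots, fleet)

# Evening scale-down terminates the oldest instances first -- which are the
# only ones HAProxy still has in its server state.
for _ in range(3):
    fleet.popleft()

usable = [h for h in slots if h in fleet]
print(f"{len(usable)} of {SLOTS} slots point at live webapp hosts")
# → 0 of 6 slots point at live webapp hosts
```

In this exaggerated version every slot ends up pointing at a terminated host; in Slack's case enough stale entries accumulated that the remaining live backends couldn't serve demand, producing the HTTP 503s.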
That problem was fixed, but Nolan and her team couldn't initially understand why Slack's monitoring hadn't flagged the issue earlier. It seems the monitoring system's long record of working without failure led engineers to ignore it until it broke.
"The broken monitoring hadn't been noticed partly because this system 'just worked' for a long time, and didn't require any change. The wider HAProxy deployment that this is part of is also relatively static. With a low rate of change, fewer engineers were interacting with the monitoring and alerting infrastructure," explains Nolan.
The other reason engineers were paying less attention to the HAProxy stack was that Slack is moving to Envoy Proxy for its ingress load-balancing needs.
According to Nolan, Slack's new load-balancing infrastructure using Envoy with an xDS control plane for endpoint discovery isn't susceptible to the problems that caused its May outage.