GitHub: We're sorry for all the outages, here's what's went wrong

GitHub is beefing up the hardware behind its databases to prevent outages from reoccurring.
Written by Liam Tung, Contributing Writer

The Microsoft-owned code-sharing service GitHub is making improvements to its "MySQL1" database cluster after repeated outages over the past week affecting many of its 73 million users

GitHub has admitted that its service hasn't been holding up for developers over the past week due to issues affecting the "health of our database", resulting in a degraded experience for developers. 

"We know this impacts many of our customers' productivity and we take that very seriously," GitHub's senior vice president of engineering, Keith Ballinger, said in a blogpost

"The underlying theme of our issues over the past few weeks has been due to resource contention in our mysql1 cluster, which impacted the performance of a large number of our services and features during periods of peak load," he explained. 

SEE: Worried your developers will quit? These are the 5 things that coders say keep them happy at work

Repeated GitHub outages over the past week have spawned numerous complaints on social media. Reports of incidents on downdetector.com spiked on March 23, with most of them about push and pull requests failing for projects. 

Ballinger highlights four multi-hour incidents on March 16, 17, 22, and 23 that lasted between two and five hours each.

GitHub outages are a problem for developers because of software that's hosted on the service. GitHub is also important for keeping enterprise apps running. Microsoft acquired GitHub for $7.5 billion in 2018 as part of its shift to Linux in Windows, the Azure cloud and, more broadly, open-source software development.  

Ballinger explains that the March 16 outage at 14:09 UTC lasted 5 hours and 36 minutes. GitHub's MySQL1 database was over capacity, which caused outages that affected git operations, webhooks, pull requests, API requests, issues, GitHub Packages, GitHub Codespaces, GitHub Actions, and GitHub Pages services.

"The incident appeared to be related to peak load combined with poor query performance for specific sets of circumstances," he notes. 

GitHub does have failover options at hand but these failed, too. On March 17, an outage started at 13:46 UTC and lasted two hours and 28 minutes. 

"We were not able to pinpoint and address the query performance issues before this peak, and we decided to proactively failover before the issue escalated. Unfortunately, this caused a new load pattern that introduced connectivity issues on the new failed-over primary, and applications were once again unable to connect to mysql1 while we worked to reset these connections," he notes. 

Then more outages occurred on March 22 and March 23, with both lasting just under three hours. 

"In this third incident, we enabled memory profiling on our database proxy in order to look more closely at the performance characteristics during peak load. At the same time, client connections to mysql1 started to fail, and we needed to again perform a primary failover in order to recover," Ballinger says of the March 22 incident.

SEE: Want to get ahead at work? Try using this often underrated skill

Then on March 23, it throttled webhook traffic and is using that control to mitigate future issues when its database can't handle peak loads. 

The Microsoft-owned company is taking steps to prevent its database cluster becoming overwhelmed with traffic across its services. It is conducting an audit of load patterned, rolling out multiple performance fixes for the affected database, moving traffic to other databases and attempting to reduce failover times. 

"We sincerely apologize for the negative impacts these disruptions have caused. We understand the impact these types of outages have on customers who rely on us to get their work done every day and are committed to efforts ensuring we can gracefully handle disruption and minimize downtime," said Ballinger. 

GitHub will disclose more details in its March availability report within a few weeks.

Editorial standards