A deeper dive into Facebook's MySQL 8.0 migration

Facebook's MySQL 8.0 migration highlighted issues that are common to enterprise system upgrades. We take a closer look at the lessons that Facebook learned and how it could smooth out the process next time.
Written by Tony Baer (dbInsight), Contributor

This week, Facebook's development organization announced that it had completed the type of task enterprises dread: a major upgrade to its core database. Larry Dignan outlined the high points in his overview yesterday. And in that overview, he more than hinted at the trials and tribulations that the migration encountered.

Database migrations happen all the time, and there's no shortage of advice on how to conduct them. But Facebook's detailed blog post yielded valuable lessons, which we'll get into in more detail here. It was a massive, multiyear project, especially because Facebook has such a large array of MySQL instances.

First off, what was all the fuss about? There's little question that MySQL 8.0 represents a major generational upgrade to the platform. The upgrade was so significant that Oracle, which owns MySQL, was prompted to undertake a major refresh and upgrade of its own cloud MySQL service.

There's a long laundry list of changes in the 8.0 release, but we'll highlight a few of them here. It starts with manageability: MySQL 8.0 adds the transactional data dictionary that is otherwise standard for enterprise-grade databases. There's greater simplicity, too, in the form of atomic DDL: the data dictionary updates, storage engine operations, and binary log writes triggered by a data definition language (DDL) statement are now committed as a single atomic operation, so a crash mid-statement no longer leaves them out of sync. Table encryption has been streamlined, and there is improved support for extended data types such as BLOB, TEXT, GEOMETRY, and JSON.
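The practical payoff of atomic DDL shows up in something as simple as a multi-table drop. The sketch below is illustrative, with hypothetical table names:

```sql
-- MySQL 8.0 atomic DDL (illustrative; table names are hypothetical).
-- The data dictionary update, storage engine change, and binary log
-- write all commit together: if the server crashes mid-statement, the
-- whole operation rolls back rather than leaving one table dropped
-- and the other intact.
DROP TABLE orders_archive, orders_staging;
```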

There's the cool new feature of "invisible indexes," which lets you test the impact of removing an index without having to physically drop it. It's analogous to advanced indexing features in Oracle Autonomous Database and Microsoft Azure SQL Database that allow testing of alternative indexing schemes, for evaluating whether some indexes are wasting space, or whether new or modified ones could retrieve data with less compute overhead.
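In MySQL 8.0 syntax, toggling an index's visibility is a quick metadata change (the table and index names here are hypothetical):

```sql
-- Hide an index from the optimizer without dropping it; the index is
-- still maintained on writes, so it can be restored instantly.
ALTER TABLE orders ALTER INDEX idx_customer_id INVISIBLE;

-- Watch query plans and performance; if anything regresses, flip it back.
ALTER TABLE orders ALTER INDEX idx_customer_id VISIBLE;
```

Because the hidden index is still maintained, the test is reversible at any moment, unlike an actual drop and rebuild.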

But as any experienced DBA will attest, there's always a price to pay when upgrading, such as destabilizing applications, which is why most organizations postpone migration until the need is too great. Not coincidentally, most cloud Database-as-a-Service (DBaaS) providers play up the promise of eliminating the pain of upgrades because they take those headaches off the shoulders of customers.

So, it's not surprising that at the top of the list of lessons learned is dealing with compatibility issues. Let's start with customizations, which are typical of entrenched enterprise data installations. Not surprisingly, this has also long been an issue in the enterprise applications world, where automated migration tools typically leave off when it comes to customizations. In fact, SAP now encourages customers to keep their core implementations vanilla and instead abstract customizations behind APIs that should remain stable. Facebook did not have that luxury with the MySQL migration. As Larry noted in his account, only 1,500 of 2,300 custom patches made the move; the rest were deprecated.

A related issue is API compatibility, and this is where Facebook discovered a hidden "gotcha." As noted, customers typically put off migrations of large enterprise systems because of the time and disruption involved, and in this case, that meant that Facebook skipped a cycle. Instead of going from MySQL 5.6 to 5.7, it skipped the interim release and went directly to 8.0. The result was the need to do detective work on supported APIs; as the team learned the hard way, a number of APIs had changed since the 5.6 release in ways that were not covered by the 8.0 documentation. That cost additional time.

The migration in some cases also involved the underlying storage engine. MySQL supports pluggable storage engines, and since 2016, Facebook has been moving its user-facing MySQL implementations from InnoDB, the most common engine used with MySQL, to MyRocks, an open-source storage engine that Facebook itself developed on top of RocksDB. The advantage of MyRocks is more efficient compression and writes.
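Because storage engines are pluggable, switching a table to MyRocks is, at the DDL level, a one-line change. The table definition below is a hypothetical sketch:

```sql
-- MyRocks registers as the ROCKSDB engine; an existing InnoDB table
-- can be converted with ALTER TABLE ... ENGINE=ROCKSDB, or a table can
-- be created on MyRocks directly:
CREATE TABLE user_events (
  user_id BIGINT NOT NULL,
  event_time DATETIME NOT NULL,
  payload JSON,
  PRIMARY KEY (user_id, event_time)
) ENGINE=ROCKSDB;
```

As the rest of the story makes clear, the hard part was never that one line but everything around it: testing, replication, and data validation.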

Moving from InnoDB to MyRocks required a "shadow" testing framework that captured production traffic and replayed it against test instances. But the process wasn't foolproof; it missed issues such as how MyRocks would handle transaction write deadlocks. The moral of the story is that while you can simulate some scenarios, some bugs won't show up until after the fact, so you should build after-the-fact fixes into any migration and testing plan.

Because of all the expected and unexpected compatibility issues, Facebook played it very cautiously when it came to moving data. Instead of the customary process of replicating tables or SQL statements, Facebook went row by row – a very painstaking process. The need for such drastic action was driven by the large portfolio of apps and the likelihood of encountering interdependencies: data used by Application A may have been derived from processing by Application B.
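A row-by-row (or small-batch) copy trades speed for verifiability: each primary-key range can be copied, checksummed against the source, and retried independently. A minimal sketch of the idea, with hypothetical table and column names:

```sql
-- Copy one small primary-key range at a time...
INSERT INTO t_new (id, payload)
  SELECT id, payload FROM t_old WHERE id >= 1 AND id < 10001;

-- ...then verify the chunk before moving on, e.g. with a rolling checksum.
SELECT COUNT(*) AS n, SUM(CRC32(CONCAT_WS('#', id, payload))) AS chk
  FROM t_old WHERE id >= 1 AND id < 10001;
-- Run the same query against t_new and compare n and chk.
```

If a chunk fails verification, only that range needs to be recopied, which matters when interdependent applications are reading the data throughout the migration.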

The unsurprising lesson is that migrations of enterprise databases are rarely trivial.

This is largely why Facebook took the path commonplace among large enterprises: holding off until absolutely necessary. And because that meant skipping an intermediate version upgrade, it paid a price – enough of one that Larry posed the question in his piece of whether all the effort was worth it.

The question is whether these headaches could be avoided or minimized in the future, when, say, there is a MySQL 9.0. One answer – if you're not Facebook – is moving to a cloud Database-as-a-Service (DBaaS), where cloud vendors continually keep the version current and supposedly insulate customers from underlying platform changes.

But if you're still trying this at home, Larry hinted at the answer: Start planning now for that eventuality. More specifically, do the dot-release upgrades, and in the long run, start abstracting customizations behind APIs where possible. Maybe in the future, machine learning could provide an assist in predicting where compatibility issues could arise, but this is certainly one instance where human intuition must remain in control.
