Twitter's 3 lessons learned in service oriented architecture

Slow and steady wins the race in building out Twitter's internal architecture. 'Failure is always an option,' says the company's lead service developer.
Written by Joe McKendrick, Contributing Writer

A service oriented architecture is best built gradually and incrementally. It's a three-step process: 1) make the smallest change possible, 2) verify that it works, 3) repeat.

Those are the lessons learned by Twitter's development team, as explained by Jeremy Cloud, leader of the company's Tweet service team, at the recent QConNY conference. The challenge in providing failure-resistant services that support more than 400 million messages a day is to allow for failure, he says.

Cloud's top three takeaways are the following:

1. In order to make big changes, make very small changes. "If you want to make a really big change, you can increase your chance of success by making many small changes," says Cloud. "A couple of times at Twitter, we tried to do something big where we put our heads down for six months, to build something really big and bring it up at the end. But invariably, these things fail -- things change on the ground during that time."

The key to success, Cloud says, is to "1) make the smallest possible change you can make; 2) verify that it works; and 3) repeat. Do it over and over. With each step, you’re making course corrections, and you can minimize the risk at each step."

In practice, Cloud says, that means deploying often and deploying incrementally. "When we develop a new feature, the first thing we do is deploy the build with the feature off. Then we turn the feature on to maybe 1%. The same thing applies to how we do a new service. We try to find the smallest thing you can break off to start with, code that up and slowly turn traffic up to it."
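The "deploy with the feature off, then turn it on to maybe 1%" approach can be sketched as a percentage-based feature flag. This is only an illustrative sketch, not Twitter's implementation: the function names, the `new_timeline` feature name, and the choice of hashing user IDs into 10,000 buckets are all assumptions made for the example.

```python
import hashlib


def rollout_bucket(feature: str, user_id: str) -> int:
    """Map a (feature, user) pair to a stable bucket in the range 0..9999."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 10_000


def is_enabled(feature: str, user_id: str, percent: float) -> bool:
    """Return True if the feature is on for this user at the given rollout percentage."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    # Each bucket is 0.01% of users, so 1% corresponds to buckets 0..99.
    return rollout_bucket(feature, user_id) < round(percent * 100)


# Hypothetical rollout: ship with the feature off (0%), then raise it to 1%.
if __name__ == "__main__":
    print(is_enabled("new_timeline", "user-42", 0.0))
    print(is_enabled("new_timeline", "user-42", 1.0))
```

Because the bucket depends only on the feature name and user ID, a user who sees the feature at 1% keeps seeing it as the percentage is raised, which makes each step a small, verifiable change.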

2. Integration testing is very hard. Testing for all possible scenarios with multiple services is one of the most difficult pieces of SOA, Cloud admits. "Planning your integration test and strategy is very hard. We’ve learned, unfortunately the hard way, that integration on SOA is actually really hard."

Cloud considers integration testing an area that his team still hasn't been able to address to its satisfaction. "This is largely an unsolved problem that we have at Twitter; it's a manual and tedious problem," he says, pointing out that with single, monolithic applications, testing could be conducted with one button.

In SOA, "when you have dozens of services that all have to work together, it presents a pretty complex picture. The hard problem in SOA is testing services that span multiple servers. You need to launch all of those services with the change. The services need to effectively point them all at each other. The services may have downstream dependencies."  The only way to address this is to think about integration testing as early in the process as possible, Cloud concludes.

3. Failure is always an option. With SOA, "you're still going to have a lot of unsolved problems," Cloud says. "A single request could turn into thousands of [remote procedure calls]... and every RPC call is a chance for failure." The best approach is to be able to "degrade gracefully," says Cloud. "You want to make sure you want your system to degrade as quickly as possible. Avoid single points of failure. You [want] to make sure that if any one node in your system goes down, that’s no problem. You want to make sure you have plenty of redundancy in your system, and have a fallback strategy. If things start going bad, what are you going to do? If you lose 25% of your cluster?"
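To see why every RPC is "a chance for failure," assume (purely for illustration) that each call succeeds independently 99.99% of the time: a request that fans out into 1,000 calls then succeeds end to end only about 90% of the time, since 0.9999 to the power 1,000 is roughly 0.905. A minimal sketch of the redundancy-plus-fallback idea follows. The function and node names are hypothetical, not from Cloud's talk, and a real system would also need timeouts and more selective error handling.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def call_with_fallback(
    replicas: Sequence[Callable[[], T]],
    fallback: Callable[[], T],
) -> T:
    """Try each redundant replica in turn; serve a degraded result if all fail."""
    for replica in replicas:
        try:
            return replica()
        except Exception:
            # This node is down or misbehaving; move on to the next one.
            continue
    # Every replica failed, so degrade gracefully instead of raising an error.
    return fallback()


# Hypothetical nodes used only to exercise the sketch.
def broken_node() -> List[str]:
    raise ConnectionError("node unavailable")


def healthy_node() -> List[str]:
    return ["tweet 1", "tweet 2"]


def empty_timeline() -> List[str]:
    # Degraded response: an empty timeline rather than an error page.
    return []


if __name__ == "__main__":
    print(call_with_fallback([broken_node, healthy_node], empty_timeline))
    print(call_with_fallback([broken_node, broken_node], empty_timeline))
```

With one healthy replica the caller never notices the broken node; with none, it still gets a usable (if empty) answer.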

Actually, a full-blown failure isn't the challenge -- it's the much more common partial failures that are difficult to manage. "It turns out that most of the problems you’ll encounter are not full failures, but partial failures," he says. "They’re much harder to imagine -- the number of ways something can fail is mind-boggling." He recommends executing a new service or function in production, and purposely "unplugging an entire rack if you’re running your own data center. If reliability matters -- and it should -- you really need to spend most of your development time thinking about failure cases -- implementing failure scenarios and actually testing them in production."
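Unplugging a rack in production is not something a short example can reproduce, but the same idea of deliberately exercising partial failures can be sketched in-process. The wrapper below is a hypothetical stand-in, not a tool Cloud describes: it makes a chosen fraction of calls fail so that the calling code's failure handling can be tested. The 25% rate echoes the "lose 25% of your cluster" scenario, and the seeded random generator is an assumption to keep runs repeatable.

```python
import random
from typing import Callable, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a failed call."""


def with_fault_injection(
    call: Callable[[], T],
    failure_rate: float,
    rng: random.Random,
) -> Callable[[], T]:
    """Wrap a call so that roughly `failure_rate` of invocations fail."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")

    def wrapped() -> T:
        # rng.random() is in [0.0, 1.0), so a rate of 0 never fails
        # and a rate of 1 always fails.
        if rng.random() < failure_rate:
            raise InjectedFault("simulated partial failure")
        return call()

    return wrapped


def count_failures(call: Callable[[], object], attempts: int) -> int:
    """Invoke the call repeatedly and count how many attempts fail."""
    failures = 0
    for _ in range(attempts):
        try:
            call()
        except InjectedFault:
            failures += 1
    return failures


if __name__ == "__main__":
    flaky = with_fault_injection(lambda: "ok", 0.25, random.Random(2013))
    print(count_failures(flaky, 1000), "of 1000 calls failed")
```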

(A full video of Jeremy Cloud's presentation, "Decomposing Twitter: Adventures in Service Oriented Architecture," is available at the InfoQ site.)
