Since it was founded in 2001, the hotel reservation site Booking.com has grown at a rapid rate: customers now book over a million hotel rooms every day.
Such growth has enormous implications for IT: because its systems have to deal with a vast number of transactions -- and quickly -- they have to be extremely powerful and, above all, fast and reliable.
At the Flash Forward storage event in London, Booking.com's product owner for storage, Peter Buschman, offered some insight into how the company accomplishes this.
"People think that we must be doing some really amazing stuff. We must be using holographic storage and quantum computers to solve our problem, but it's actually very boring stuff," he says.
So what do they do?
"It is easier to talk about all of the things that we don't do, because what we actually do is a few boring things that can actually scale," he says.
The company "throws hardware at problems", he explains, and uses blade servers "like other companies use virtual machines".
Booking.com has automated commissioning systems set up so that "within 30 minutes we can spin up an instance of an application on bare-metal hardware and we can take it down as quickly and recycle it into another piece of hardware".
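Buschman doesn't describe the tooling behind this, but the spin-up/tear-down/recycle lifecycle he sketches can be pictured as a simple pool of bare-metal machines. The sketch below is purely illustrative -- all names are hypothetical and the actual provisioning (imaging, network config) is elided:

```python
# Hypothetical sketch of the bare-metal "spin up / tear down / recycle"
# lifecycle described above -- names and workflow are illustrative only.
from dataclasses import dataclass, field

@dataclass
class HardwarePool:
    free: list = field(default_factory=lambda: ["blade-01", "blade-02", "blade-03"])
    in_use: dict = field(default_factory=dict)   # blade -> application

    def spin_up(self, app: str) -> str:
        """Claim a free blade and provision the application onto it."""
        blade = self.free.pop()
        self.in_use[blade] = app        # imaging/config would happen here
        return blade

    def recycle(self, blade: str) -> None:
        """Take the instance down and return the blade to the pool."""
        del self.in_use[blade]
        self.free.append(blade)         # ready for the next application

pool = HardwarePool()
blade = pool.spin_up("search-frontend")
pool.recycle(blade)                     # same blade can now host another app
```

The point of the analogy to virtual machines is exactly this: the hardware is treated as a fungible resource that applications are scheduled onto and off again.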
The vast majority of this hardware consists of blade systems, because the company can get more density in its racks that way.
"We just buy [standard] systems and throw them at all sorts of problems," he says. The use of large numbers of off-the-shelf systems "produces weird effects like 100 percent write workloads" as opposed to the more usual balance of reads and writes.
Why the weird effects?
"When you throw that much bare metal, that much compute, and that much memory at every single problem, you will cache a lot of data in RAM," he explains.
That's because at any one time, Booking.com is holding a huge number of in-flight transactions -- customers making up their minds about whether to book that holiday or not -- and each of those transactions sits in memory until it either completes or is abandoned.
That mass of traffic means that the system generates millions of input/output operations per second.
Buschman explains the company's choice to use SSDs and RAID 0: "Sometimes we get asked why we do this, but some of our applications are often storage constrained and we are growing our applications by throwing large numbers of blade systems at them."
RAID 1 offers greater reliability than RAID 0, but the latter offers better performance. Clearly for Booking.com's systems, which are processing thousands of holiday bookings, performance is crucial.
More and more companies like Booking.com are using SSDs in their systems because of their performance -- but there are other issues the company must deal with as well, Buschman points out.
"These blade systems can only fit two drives and if these drives are SSD, then it is more economical to put those SSDs into RAID 0 and double the capacity," whereas the alternative is to get more reliability with RAID 1 and take a performance hit. Also, "the reliability of those SSDs is an order of magnitude at least better than spinning drives so it makes complete sense to do this."
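Buschman's reasoning can be made concrete with some back-of-the-envelope arithmetic. The annualized failure rates (AFRs) below are illustrative assumptions, not figures from the talk -- they simply reflect his "order of magnitude" claim:

```python
# Back-of-the-envelope comparison of a two-drive RAID 0 stripe vs a
# RAID 1 mirror. The AFR figures are illustrative assumptions only.
def raid_tradeoff(drive_tb, afr):
    raid0_capacity = 2 * drive_tb            # stripe across both drives
    raid1_capacity = drive_tb                # mirror: half the capacity
    # RAID 0 loses data if EITHER drive fails during the year:
    raid0_loss = 1 - (1 - afr) ** 2
    # RAID 1 loses data only if BOTH fail (ignoring rebuild windows):
    raid1_loss = afr ** 2
    return raid0_capacity, raid1_capacity, raid0_loss, raid1_loss

# Assumed SSD AFR ~0.5% vs spinning disk ~5% ("an order of magnitude" apart)
for name, afr in [("ssd", 0.005), ("hdd", 0.05)]:
    cap0, cap1, loss0, loss1 = raid_tradeoff(drive_tb=15, afr=afr)
    print(f"{name}: RAID0 {cap0} TB, loss {loss0:.4f} | "
          f"RAID1 {cap1} TB, loss {loss1:.6f}")
```

Under these assumptions, a two-SSD RAID 0 stripe loses data less often per year (roughly 1 percent) than a single spinning drive (roughly 5 percent), while delivering double the capacity -- which is the economics behind the choice.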
The increasing capacity of SSDs means they are getting to the point now "where SSDs in blade chassis can now viably take most of our production databases or workloads -- especially the 15TB SSDs that are out now", he says.
Booking.com also uses non-redundant top-of-rack switches. "This surprises people quite a lot," Buschman says. "Our standard production racks have three blade chassis with 48 servers in them. All of these blades have two network ports, so we could viably double-connect them and have redundant network connections, but we deliberately do not do this."
Why? Not because it saves money, he says, tongue in cheek, but because it encourages his developers to plan for fault tolerance in their applications.
"If they have to accommodate the risk of a failed top-of-rack switch with its 48 servers, then they will think a lot more about the safety of the application layer."
"Out of pure pragmatism" they mix traffic types on interfaces, he says. "It goes against the best practice that you should separate your data traffic from your storage traffic... because when you are dealing with 10Gbps Ethernet, you can very easily mix these things."
This practice doesn't have an impact unless you are saturating your network interfaces, he says.
As for things like quality of service and jumbo frames, two best practices which could help with performance problems: "We have never had to turn these on," he says.
The company does "collect metrics obsessively" because "we make it very easy for developers to generate metrics in their applications and throw them into the metrics collection system", he says.
"This generates millions of IOPs and hundreds of terabytes of data and we like it."
They also do fault injection on a regular basis -- "I wish I could say it was always intentional," he adds -- but they make efforts to regularly test things like multi-path applications. "Every time we do these, we regularly find that something is missing or broken, but the way that you find things is by breaking them on purpose and not by waiting for it to happen," he says.
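A minimal sketch of what "breaking things on purpose" looks like for a multi-path setup: deliberately fail one path, then verify that traffic actually fails over. The names here are illustrative, not Booking.com's tooling:

```python
# Minimal fault-injection sketch against a hypothetical multi-path
# client: kill one path on purpose and check that traffic fails over.
class MultiPathClient:
    def __init__(self, paths):
        self.paths = dict.fromkeys(paths, True)   # path -> healthy?

    def fail(self, path):
        self.paths[path] = False                  # the injected fault

    def send(self, payload):
        for path, healthy in self.paths.items():
            if healthy:
                return (path, payload)            # first healthy path wins
        raise RuntimeError("no healthy paths left")

client = MultiPathClient(["tor-switch-a", "tor-switch-b"])
client.fail("tor-switch-a")                       # break it on purpose...
path, _ = client.send("heartbeat")                # ...and verify failover
assert path == "tor-switch-b"
```

The value of the test is in the assertion: if failover is "missing or broken", the injection finds it before a real switch failure does.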
So what principles does Booking.com use in defining storage when they deploy? Buschman names eleven:
We question dated best practices. We find that new best practices often invalidate previous best practices that everybody else has been recommending.
We build many failure domains. We don't consolidate too much stuff on any one system, and we try to limit the potential disruption from the loss of any one system, no matter how reliable it is. Human error can take anything offline.
We bias towards simplicity versus complexity. Put simply, if you have to think too much about something, then it is probably too complex.
We accept that anything we do is probably going to involve bottlenecks and we try to keep those bottlenecks as thin as humanly possible. This means that you over-commit on memory and capacity.
We design for a long service life. This is especially true of flash systems today. Well-designed flash systems are doing so well that their service life is five years plus and, if you talk to certain vendors, they will tell you that the drives are not even on track to hit their wear level in ten years. So I expect not to be decommissioning the systems we are implementing today in three years, when they are off the books, nor in five years, or even seven, eight, or nine years.
Allow for human error in design. Any mistake, innocent or malicious, can kill any storage system.
We try to anticipate the evolution of technology.
'Just in time' capacity planning: there is an opportunity cost to buying more storage than you actually need. You can lock yourself out of advances in technology and price drops.
Scalable building blocks: this means having many failure domains. Don't put too much stuff in any one thing. And the way things are going today, instead of storing things in big racks, you can store lots of things in many small bricks.
Compact and modular: That's just how you should try to build your architecture so you can shift things around.
Tried and tested configuration: We like it when stuff doesn't break.
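The opportunity cost behind the 'just in time' capacity-planning principle above can be made concrete with a small worked example. The assumed price decline (around 20 percent per year for $/TB) is an illustrative figure, not one from the talk:

```python
# Worked example of the opportunity cost of buying storage early.
# Assumes $/TB falls ~20% a year -- an illustrative rate only.
def buy_all_now(tb_needed_per_year, price_per_tb, years):
    """Buy the full multi-year requirement today, at today's price."""
    return tb_needed_per_year * years * price_per_tb

def buy_just_in_time(tb_needed_per_year, price_per_tb, years, decline=0.20):
    """Buy each year's tranche as needed, at that year's lower price."""
    total = 0.0
    for year in range(years):
        total += tb_needed_per_year * price_per_tb * (1 - decline) ** year
    return total

upfront = buy_all_now(100, 300, 3)       # 300 TB bought at today's $300/TB
jit = buy_just_in_time(100, 300, 3)      # 100 TB per year as needed
print(f"up-front ${upfront:,.0f} vs just-in-time ${jit:,.0f}")
```

Under these assumptions the just-in-time buyer spends $73,200 instead of $90,000 for the same 300 TB -- and, just as Buschman argues, stays free to adopt denser drives (like the 15TB SSDs he mentions) as they arrive.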