
MongoDB CTO: How our new WiredTiger storage engine will earn its stripes

With general availability of the next version of MongoDB looming as early as next month, CTO Eliot Horowitz believes it represents an important step for the open-source database.
Written by Toby Wolpe, Contributor
Eliot Horowitz: Performance improvements. Image: MongoDB

Last week, the first release candidate of MongoDB 2.8 shipped with two new additions that could prove highly significant for the open-source document database.

The first is designed to address one of its long-standing shortcomings, according to MongoDB CTO and co-founder Eliot Horowitz.

Version 2.8 comes with the WiredTiger storage engine, from the architects who originally developed the Berkeley DB open-source library, now owned by Oracle.

"The biggest feedback we've got over the past 12 to 18 months is we need to do a much better job on high write-volume workloads - workloads where there's a lot of writes or a very high mix of reads and writes. Mongo historically has suffered in those sorts of workloads a bit," he said.

"So the big thing in 2.8 is we've built a new storage-engine API and we're fully integrating and supporting our first third-party storage engine, WiredTiger."

Horowitz described WiredTiger as a modern, high-performance, high-throughput storage engine that will offer the NoSQL database greatly improved performance for certain types of workloads. Provided the engine proves a success, it will become the default in MongoDB 3.0, probably due in the third quarter of 2015.

"The key feature that MongoDB has been missing is document-level locking, which is ingrained in WiredTiger from the beginning," he said.

"From a user standpoint, the new storage engine gives you a whole set of things around concurrency, better throughput, better efficiency on hardware, disk compression - so a lot of features rolled into this one major change."

In addition to compression and record-level locking, WiredTiger also gives MongoDB multi-version concurrency control (MVCC), multi-document transactions, and support for log-structured merge-trees, or LSM trees, for very high insert workloads.
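MVCC is the mechanism that lets readers proceed without blocking concurrent writers: each write creates a new version of a document, and a reader sees a consistent snapshot as of the moment it started. The idea can be sketched in a few lines of Python; this is a toy illustration of multi-version reads, not WiredTiger's actual data structures:

```python
import threading

class MVCCStore:
    """Toy multi-version store: each write appends a new version, and a
    reader can pin a snapshot so later writes stay invisible to it.
    An illustrative sketch of the MVCC idea only."""

    def __init__(self):
        self._lock = threading.Lock()  # protects version bookkeeping only
        self._versions = {}            # doc_id -> list of (txn_id, value)
        self._next_txn = 0

    def write(self, doc_id, value):
        with self._lock:
            self._next_txn += 1
            self._versions.setdefault(doc_id, []).append((self._next_txn, value))

    def snapshot(self):
        """Pin the current point in time for a reader."""
        with self._lock:
            return self._next_txn

    def read(self, doc_id, snapshot=None):
        """Return the newest version at or before `snapshot` (default: latest)."""
        with self._lock:
            snap = self._next_txn if snapshot is None else snapshot
            for txn, value in reversed(self._versions.get(doc_id, [])):
                if txn <= snap:
                    return value
        return None

store = MVCCStore()
store.write("u1", {"name": "Ada"})
snap = store.snapshot()                       # a reader pins this snapshot
store.write("u1", {"name": "Ada Lovelace"})   # a concurrent update lands
print(store.read("u1", snap))   # the pinned reader still sees the old version
print(store.read("u1"))         # new readers see the update
```

The reader holding `snap` never blocks the second write and never sees a half-applied state, which is the property that lets WiredTiger serve reads alongside a heavy write load.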

"It has a whole number of features and performance improvements that the Mongo community has been asking for, for a while," Horowitz said.

The second important element to ship with 2.8, due for general availability in December or early January, is the on-premises version of the MongoDB Management Service cloud automation tools launched last month.

"One of the challenges of MongoDB is when you're using a distributed database, you don't have a few servers, you have dozens or hundreds. So a tool that can manage that infrastructure and do rolling upgrades and zero-downtime changes is pretty critical," Horowitz said.

"A tool that can really manage your large Mongo farm, that can do seamless upgrades and changes without having to write a lot of scripts or do a lot of work and that can handle the nuances of the processing, is critical. That's all going to be available in 2.8."

Having made decisions such as the number of machines and the amount of data replication, operators can leave the tool to build a cluster automatically.

"For example, if you're using [Amazon] EC2, it will go provision the machines for you, it will deploy MongoDB, it will deploy the version you choose, it will set everything up and tell you when it's done," he said.

The same tools can then be used to monitor and back up the system, or increase the size of the cluster for extra capacity.

"You can go into the UI, click a few buttons, add a few more shards and it will expand the cluster for you. When you want to do an upgrade, again you go into the UI and change the version for the cluster. Now when you hit go, because it's doing a zero-downtime upgrade, it may actually take hours to do the upgrade," he said.

"If you have no load on the system, it can take minutes. If you have a very large, heavily-loaded system, it can take a while but you don't have to think about it. It will just be doing this in the background. It's designed to be as safe and as fault-tolerant as humanly possible so that it doesn't damage your running system."
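The reason a zero-downtime upgrade can take hours is the rolling order it must follow: secondaries are taken offline and upgraded one at a time, and the primary hands off its role before being upgraded last, so the replica set serves traffic throughout. A simulated sketch of that ordering, with hypothetical `upgrade_one` and `step_down` callbacks standing in for the automation tool's real actions:

```python
def rolling_upgrade(members, primary, upgrade_one, step_down):
    """Sketch of a rolling-upgrade order: upgrade each secondary in turn,
    then step the primary down and upgrade it last. `upgrade_one` and
    `step_down` are illustrative callbacks, not a real MMS API."""
    order = []
    for member in members:
        if member != primary:
            upgrade_one(member)           # only one secondary offline at a time
            order.append(member)
    new_primary = step_down(primary)      # an upgraded secondary takes over
    upgrade_one(primary)                  # old primary is upgraded last
    order.append(primary)
    return order, new_primary

# Simulated three-node replica set:
upgraded = []
order, new_primary = rolling_upgrade(
    members=["node-a", "node-b", "node-c"],
    primary="node-a",
    upgrade_one=upgraded.append,
    step_down=lambda p: "node-b",  # pretend node-b wins the election
)
print(order)  # the old primary comes last, so writes are never interrupted
```

Each step waits for the restarted member to catch up before the next begins, which is why a heavily loaded cluster stretches the process out while an idle one finishes in minutes.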

The addition of the automation tool is one of the main elements in a drive to find ways to improve MongoDB's ease of use at scale in production.

"MongoDB is still a very new product relative to an Oracle or a Postgres, so there's a lot of room for improvement. So it's everything from performance in certain situations, to administrability, to how complex the system is to manage - making it simpler and easier for people living with it day to day," he said.

The other big area of activity lies in improving the way MongoDB integrates with other software.

"In any big enterprise, MongoDB isn't going to be the only database. You're going to have your relational databases, your data warehouses, you need to do analytics, you've got to integrate with all the different security and autonomic systems," Horowitz said.

"So we're really doing a lot of work facilitating making MongoDB part of the ecosystem, part of the fabric, so that it's not a standalone system. In 3.0 we'll make advances in many of those areas but there's going to be a very long road before we have all those buckets where we want them to be."

The ultimate aim is that companies should be able to opt for a relational database such as Oracle or choose MongoDB based purely on the form of the data.

"Now for Mongo it's still early, so there are technical limitations but that's our goal. Our mission is that the only thing you have to think about is the data model. Whichever data model is right for your application, that's how you choose your database," he said.

"Then once you have the document data model, you need the equivalent data warehouse for documents, which may be Hadoop, which may be something that doesn't exist yet, or it may be some of the traditional data warehouses if they adapt and take on some of the document challenges."

Horowitz said in the end it should depend on the type of data, how much it changes, what it looks like, and how you interact with it.

"If you're interacting with bank balances, a relational database works great. If you're interacting with user profiles, a document database works a lot better," he said.

"If you're trying to store a user profile, for example, and you try to break that down into a relational model, you've got to have a different table for different attributes. If you've got a list of phone numbers, if you've got a list of likes, friends, all these things become different tables."

Horowitz said in a relational data model a user profile collection averages about 70 different tables.

"In Mongo you can have one, maybe two collections, where you've got one document that represents the entirety of a user profile. You may have other things associated with them but it greatly simplifies the overall data model," he said.
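The contrast Horowitz draws can be made concrete with plain Python data structures: one nested document versus the relational decomposition where every list becomes its own table. The field names here are illustrative, not a fixed MongoDB schema:

```python
# Document model: one nested record holds the entire profile.
profile = {
    "_id": "u42",
    "name": "Sam",
    "phones": [{"type": "mobile", "number": "555-0100"}],
    "likes": ["databases", "cycling"],
    "friends": ["u7", "u19"],
}

# Relational decomposition: each list of attributes becomes a separate
# table of rows keyed back to the user.
users = [("u42", "Sam")]
phones = [("u42", "mobile", "555-0100")]
likes = [("u42", "databases"), ("u42", "cycling")]
friends = [("u42", "u7"), ("u42", "u19")]

# Reassembling the profile from tables means joining on the user key;
# with the document model the same read is a single lookup.
rebuilt_likes = [like for (user, like) in likes if user == "u42"]
assert rebuilt_likes == profile["likes"]
```

Multiply the pattern across every multi-valued attribute of a real profile and the table count climbs quickly, which is the simplification Horowitz is pointing at when he says one or two collections can replace dozens of tables.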

That simplification improves the way programmers interact with it, and how quickly and easily they can develop using it.

"MongoDB has been more suited for the developers for a long time and now we're catching up on the operator side. But at the end of the day, the choice of whether you use MongoDB or not often resides in the decision around the data model - and the data model has a higher impact on the developer than anyone else and on developer productivity," Horowitz said.
