How to build massive scale infrastructure, without slowing down

SendGrid explains the engineering behind the giant scale of its email and communications systems.

SendGrid is one of the leading email delivery systems with over 82,000 paying customers around the world. Twilio is a cloud communications provider and allows software developers to programmatically make and receive phone calls, send and receive text messages, and provide other communication functions. Earlier this month, the communications specialist Twilio bought SendGrid for $3bn. Before the deal closed, ZDNet spoke to SendGrid's VP of technical operations, Jamie Tischart, about email and its key requirements of throughput and reliability.

ZDNet: You have just been acquired by Twilio. How are the two companies to merge?

Tischart: We are pretty excited about the convergence of these two companies. Both are really developer focused, and there are a lot of great things we can do for our really large, combined customer base.

Tischart: "We envision ourselves as the world's most trusted communications channel."

Image: SendGrid

ZDNet: How do you see yourselves?

We envision ourselves as the world's most trusted communications channel and being able to serve all our customers with all of the channels that they might need with a really strong API basis to enable developers to create their solutions on our platform.

As a company where do you think you have come from and where do you see yourselves going to in the future?

SendGrid came from a couple of engineers who were fairly serial entrepreneurs, starting up these new companies and really struggling with how to approach email, how to get higher delivery rates and how to get things through to their customers. And what they did was they created a very strong API vision that would help companies find a much better way to communicate with their customers. It was primarily through the email channel but in a platform model that supported any communications channel that the customers would need.

We focused a lot on having massive scale. Apart from the scale, we really pride ourselves on working closely with the ESPs across the industry to improve deliverability on behalf of our customers, so that they can rely on a service that ensures a key communication gets through, whether it is a bill, a password reset or whatever.

We focus on ensuring that their communications flow is reliable and will be fast, even if they want to send hundreds of millions of messages — and that somebody will be there to support them.

You say you scale massively, so what's the secret to doing that?

Everybody is looking for that silver bullet of 'how do you scale?' Now if you go back nine years or so, we didn't have Amazon and elastic scalability and services to rely upon. We had to really build services so that they were resilient and could scale independently. Thankfully, that all kind of bridged into the cloud-native design. Being able to scale services independently as they needed to grow with our customer growth was a big part of it, and none of this came from one quick thing or one magic bullet.

So, we started by talking about what our projections were for the coming year and Black Friday. Some of our data scientists dug in to start predicting what our growth looked like and that would be from message size and the number of customers, and so on, and we used that data to start planning out how are we really going to support this size?
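
To make the planning step concrete, here is a minimal sketch of the kind of projection described above. It is not SendGrid's model: the function names, the compound-growth assumption and the peak multiplier are all hypothetical, chosen only to show how a growth estimate can be turned into a peak-day figure to plan capacity against.

```python
def projected_monthly_messages(current: float, monthly_growth: float, months: int) -> float:
    """Compound a monthly growth rate forward (hypothetical model, not SendGrid's)."""
    if months < 0:
        raise ValueError("months must be non-negative")
    return current * (1 + monthly_growth) ** months


def peak_day_messages(monthly_messages: float,
                      days_in_month: int = 30,
                      peak_multiplier: float = 2.0) -> float:
    """Estimate the busiest day as a multiple of the average day.

    peak_multiplier is an assumed stand-in for a Black Friday style spike.
    """
    if days_in_month <= 0:
        raise ValueError("days_in_month must be positive")
    return monthly_messages / days_in_month * peak_multiplier


if __name__ == "__main__":
    # Illustrative inputs only: 50 billion messages a month growing 3% a month.
    in_a_year = projected_monthly_messages(50e9, 0.03, 12)
    print(f"Projected monthly volume: {in_a_year:,.0f}")
    print(f"Assumed peak day: {peak_day_messages(in_a_year):,.0f}")
```

In practice the inputs would come from historical data such as message size and customer counts, as the interview describes, rather than from a single fixed growth rate.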

We planned out a number of different approaches. The first was, how do we optimise our architecture to support that? How do we make it native enough to scale independently so that we could have horizontal scale across all of our servers?

We're in the process of moving to Amazon, so we have to figure out how do we do a hybrid approach of data centre and Amazon, and how do we get our software to run that way?

And what we really started to look at was, how do we do consistent performance testing across all of our environments, whether they're internal, all the way to production? And what we looked at was applying data models and applying the ability to identify any soft spots within our architecture.

So we were consistently testing each piece of our system, every week, every month or every quarter, along with failure-mode analysis of how the pieces interact, so that we could identify where we needed to shore up anything and then apply that well in advance of the load coming.

And, of course, we're running a business every day, so we had to be strategic about how we did this. With a lot of the architecture we built, we were able to turn things off and on in production so that we could force more load onto the things that were still running. That gave us real-world knowledge of where our systems had challenges or were running very well.
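
The arithmetic behind that technique can be sketched roughly as follows. The function names and the assumption that traffic is spread evenly across instances are hypothetical, not a description of SendGrid's tooling: the point is only that switching instances off raises the load on the ones left, up to a limit set by per-instance capacity.

```python
import math


def per_instance_load(total_rps: float, active_instances: int) -> float:
    """Requests per second each instance sees when traffic is spread evenly."""
    if active_instances <= 0:
        raise ValueError("at least one instance must stay active")
    return total_rps / active_instances


def min_safe_instances(total_rps: float, capacity_rps: float) -> int:
    """Smallest pool size that keeps every instance at or below its capacity."""
    if capacity_rps <= 0:
        raise ValueError("capacity must be positive")
    return max(1, math.ceil(total_rps / capacity_rps))


if __name__ == "__main__":
    total = 1200.0  # illustrative production traffic, requests per second
    for active in (6, 4, 3):
        print(active, "instances ->", per_instance_load(total, active), "rps each")
    print("Do not go below", min_safe_instances(total, 500.0), "instances")
```

Shrinking the pool one step at a time, while watching latency and error rates, shows how close each instance can get to its limit before real traffic is affected.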

This is something that people get really nervous about, always thinking about that "golden" rule: never test in production! We have a very different belief. We are constantly testing in production, much like the chaos model, where we are running with the minimum amount of architecture and software possible.

The other thing that we pride ourselves on is preventative maintenance. What are the little nagging things that need doing? Do I need to get an oil change for my car so that I know it's going to run optimally? So, do I know my system is going to run optimally with every one of our things? What are the things that are impacting our development team today? Let's go fix those.

That kind of preparation meant that we had a very good understanding through our data analysis and data science, of what we were going to expect.

One of the big things is preparation then?

It's so much about preparation and multi-year preparation at that. You need to do that when you are up and above 50 billion messages each month. It is a serious issue as we are trying to make sure we are right in our predictions and analysis of the math data.

You are clearly very experienced in scalability. From that experience, what are the issues that you think IT managers should be looking out for?

One of the traps we all fall into I think is really getting focused on the overall scaling and forgetting to look at the individual components themselves.

Some of your large capacity is hiding the true fragility of your components. Now, one of the key learnings — and we had one of these this year — was that we had something that was able to handle the capacity but was hiding an area that wasn't. And as soon as we were able to scale down that first area of the architecture, we started to see scalability problems with other components.

So my advice to people is, keep peeling it down to all the layers. Find innovative ways to push the scale to a service level, or component-level pieces, rather than doing systems-type scale, or infrastructure-type scale. You've got to get down to the real detailed pieces of your service to actually find those issues. Otherwise, the problems are going to happen at those times when you are at a really large scale and have the most demand for your services from your customers.  
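
One way to picture the trap is a simple pipeline in which a stage that saturates early shields the stages behind it. The sketch below is a toy model with hypothetical stage names and capacities, not SendGrid's architecture. It contrasts an end-to-end test, which only reveals the first stage to saturate, with a component-level check that tests every stage against the target load.

```python
from typing import List, Tuple

Stage = Tuple[str, float]  # (name, maximum messages per second)


def saturated_stages(offered_load: float, stages: List[Stage]) -> List[str]:
    """Stages that receive more traffic than they can handle in an end-to-end test."""
    saturated = []
    flow = offered_load
    for name, capacity in stages:
        if flow > capacity:
            saturated.append(name)
            flow = capacity  # a saturated stage only passes on what it can handle
    return saturated


def undersized_stages(target_load: float, stages: List[Stage]) -> List[str]:
    """Component-level check: every stage whose own capacity is below the target."""
    return [name for name, capacity in stages if capacity < target_load]


if __name__ == "__main__":
    pipeline = [("intake", 80.0), ("delivery", 90.0)]
    print(saturated_stages(100.0, pipeline))   # only intake shows up
    print(undersized_stages(100.0, pipeline))  # both stages are too small
    # After scaling intake up, the hidden delivery limit is exposed.
    print(saturated_stages(100.0, [("intake", 150.0), ("delivery", 90.0)]))
```

In this toy model the end-to-end test flags only the intake stage, because it throttles traffic before delivery ever sees the full load. Testing each component directly against the target surfaces both problems before peak demand does.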