ACM Queue has a fascinating interview with Phil Smoot, a product unit manager with MSN. In that capacity, he manages the product teams responsible for Hotmail, one of the largest Web-based services on the Internet. Most of us will never build or manage a service as large as Hotmail, but even so, there are important lessons in what he's learned that apply to any service that hopes to grow beyond a single server.
Phil says "mimicking Internet loads on our QA lab machines is a hard engineering problem." No one wants to build a QA environment as large as their production environment, and when production runs to thousands of servers, it's impossible. One of the techniques Hotmail uses is rolling out new versions and features piecemeal: try it on 10 servers, then 100, then push it to the entire site. This lets operations engineers see the new system in operation without affecting every user.
That strategy comes with its own set of problems: for this to work, you need to build servers and clients so that they interact well with versions N and N+1 at the same time. This approach means that some features may take several version iterations to completely push out.
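To make the piecemeal rollout concrete, here is a minimal sketch in Python. It is not Hotmail's tooling: the function names, the `deploy` and `healthy` callbacks, and the tenfold stage growth are all assumptions chosen to mirror the "10, then 100, then everything" pattern described above.

```python
def rollout_stages(total_servers, first_stage=10, factor=10):
    """Return the cumulative server count reached at each rollout stage."""
    stages = []
    size = first_stage
    while size < total_servers:
        stages.append(size)
        size *= factor
    stages.append(total_servers)  # the final stage always covers the whole site
    return stages


def staged_rollout(servers, deploy, healthy, first_stage=10, factor=10):
    """Deploy in widening stages and stop after the first unhealthy stage.

    `deploy` and `healthy` are hypothetical callbacks supplied by the caller.
    Returns the servers that received the new version.
    """
    done = 0
    for target in rollout_stages(len(servers), first_stage, factor):
        batch = servers[done:target]
        for server in batch:
            deploy(server)
        done = target
        if not all(healthy(server) for server in batch):
            break  # leave the rest of the site on version N
    return servers[:done]
```

When a stage fails, the remaining servers keep running version N alongside the servers already on N+1, which is exactly why both versions have to interoperate.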
Another technique Hotmail uses for scaling QA is building tools to replay live-site transactions in the QA testbed. Being able to extrapolate single-node loadings into system loadings is also an important skill. This is easier to do when the service doesn't have strong cross-server coupling.
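As a rough illustration of extrapolating from one node to the whole system, the sketch below assumes the weakly coupled case, where capacity scales roughly linearly with server count. The function name, the 30 percent headroom, and the example figures are assumptions, not numbers from the interview.

```python
import math


def servers_needed(peak_requests_per_sec, node_capacity_rps, headroom=0.3):
    """Estimate server count from a single-node load test.

    Assumes near-linear scaling (little cross-server coupling) and keeps a
    fraction of each node's measured capacity in reserve as headroom.
    """
    usable_rps = node_capacity_rps * (1 - headroom)
    return math.ceil(peak_requests_per_sec / usable_rps)


# Hypothetical figures: one QA node sustains 500 requests/sec and the
# site peaks at 100,000 requests/sec.
estimate = servers_needed(100_000, 500)
```

The stronger the coupling between servers, the less this kind of straight-line estimate can be trusted.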
Phil prescribes automation and up-front instrumentation as key factors in making a service scalable from a staff perspective. Another important rule: consistency. Phil says "The key is getting the underlying operational infrastructure in place and then being disciplined across all parts of the organization so that you’re all marching the same way, so that all deployments come through one way, all imaging comes through one way, and all applications generate errors and alarms and get monitored in the same way. You’re going to be able to get economic advantages by scaling out your operational people less and less because everything’s consistent."
Phil's advice on capacity planning is to think in terms of clusters. "[F]or this many users we require X servers, Y networking, and it costs Z dollars. The ideal is that clusters can be built out in a cookie-cutter fashion."
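The cookie-cutter idea reduces capacity planning to simple arithmetic. The sketch below is one way to express it; `ClusterSpec`, its fields, and every number in the usage example are hypothetical stand-ins for the X, Y, and Z in Phil's quote.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    users: int         # users one cluster is sized to serve
    servers: int       # X: servers per cluster
    switches: int      # Y: networking gear per cluster
    cost_dollars: int  # Z: cost of building one cluster


def plan(total_users, spec):
    """Return how many identical clusters are needed, with totals."""
    clusters = -(-total_users // spec.users)  # integer ceiling division
    return {
        "clusters": clusters,
        "servers": clusters * spec.servers,
        "switches": clusters * spec.switches,
        "cost_dollars": clusters * spec.cost_dollars,
    }


# Hypothetical cluster: 1M users, 40 servers, 2 switches, $500,000.
spec = ClusterSpec(users=1_000_000, servers=40, switches=2, cost_dollars=500_000)
budget = plan(2_500_000, spec)
```

Because every cluster is identical, growing the service means raising one number instead of redesigning the deployment.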
Operational scalability is affected by the number of unique platforms and systems, which Phil calls SKUs in the article. Keeping things simple matters, both in the number of platform types and in the software running on them. We found this out when we built iMall. We initially deployed with Veritas clustering on the database. The clustering tools reduced the number of failures, but when the servers did go down, bringing them back up was much more difficult. The complexity just wasn't worth it.
You may not be running a megaservice, but with SOA gaining ground, more and more of us are running services of one type or another. At some point, the operational costs start to override the engineering costs. That's a critical crossover point where engineering your system for operational scalability starts to make a lot of sense.