My fellow World Economic Forum Technology Pioneer, David Sifry, the founder of Technorati, was also in Dalian, China for the “Meeting of New Champions” or “Summer Davos” as the Chinese like to call it. During Davos in January, I had the great misfortune of pitching Alfresco against Technorati in a competition between tech pioneer companies. As fantastically well as Alfresco is doing, Technorati has the temerity to compete against Google in blog search and win.
I got the chance to talk to Dave during the conference and ask him some questions on the technology and architecture behind Technorati, the internet blog search site. I thought that someone who could take ordinary computer components and build a huge internet architecture could possibly teach something to people running enterprise architectures that are puny in comparison.
Technorati is a web site that tracks blogs, pictures and any user generated content and allows you to search those sites about what people are thinking, seeing and hearing. When a new or urgent situation breaks out, you can do worse than to search Technorati for immediate reaction. Every day, every hour, every second, Technorati is indexing over 10 million blogs with over 10 billion objects. Technorati’s user base is doubling every six months and quick and accurate response is critical for retaining those users.
I asked Dave about his architecture and what applicability there might be for enterprise architectures.
John Newton: In building Technorati, what were some of the issues that you had in architecting your systems?
David Sifry: I was looking at just temporal information. I had no idea how big it could get. When I looked at the architecture, instead of architecting it right, I architected it for right now. I had no big budget and I didn’t want to wait six months to build it. Also, I had no idea what the killer app would be.
I focused on data flexibility. At the time, that meant putting everything into a relational database. That was okay while the size of the indexes was less than RAM and about a million blocks of data. That was okay while there were fewer than 20 million blogs.
The next generation took advantage of data parallelism. That meant that, upon an update, we sent a signal to all the other systems. We expanded the data over several “shards” [segments of data partitioned on different databases on separate machines].
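[A minimal sketch of URL-based sharding as described above. It is an illustration under stated assumptions, not Technorati's actual code: the shard count, the hash choice and the names `shard_for_url` and `ShardedStore` are all hypothetical, and in-memory dictionaries stand in for the separate database machines.]

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for illustration


def shard_for_url(url: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a URL to a shard index using a hash that is stable across runs."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Toy store: each dict stands in for a database on its own machine."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, url: str, record: dict) -> None:
        # Every write for a given URL lands on exactly one shard.
        self.shards[shard_for_url(url, len(self.shards))][url] = record

    def get(self, url: str):
        # Reads go straight to the shard that owns the URL.
        return self.shards[shard_for_url(url, len(self.shards))].get(url)
```

[Because each URL lives on exactly one shard, a lookup by URL touches a single machine, but a query such as "the latest posts on a subject" has to ask every shard.]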
What was surprising was that we were writing as much data as we were reading. At this point Technorati was as big as some of the biggest OLTP systems. Even so, maintaining data integrity was important, because you wouldn’t want the link count [count of how many other blogs point to a particular URL] to be out of sync. This put real pressure on the system. At the same time, we realized that time was a more important dimension than URL. People didn’t want to sort or search on URL; they wanted to search on time [i.e., what are the latest blogs on a particular subject?].
By this point, we understood the application more and more. The app [Technorati] is about real time access. You need to be able to count on finding the latest information on a subject. That’s when we built the third architecture. Scaling was well understood and we built the shards on time rather than on URLs. Instead of putting data into a DBMS, we put it into special purpose databases. It was more of a bus-based architecture. Each database could be scalable and grow as big as we needed.
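[A minimal sketch of sharding on time rather than on URL, to show why "latest first" queries become cheap. The one-shard-per-day granularity and the names `TimeShardedIndex`, `add` and `latest` are assumptions made for this illustration, not details from the interview.]

```python
from collections import defaultdict
from datetime import datetime


class TimeShardedIndex:
    """Toy index with one shard per calendar day (hypothetical granularity)."""

    def __init__(self):
        # Maps a date to the list of (timestamp, url) entries for that day.
        self.shards = defaultdict(list)

    def add(self, timestamp: datetime, url: str) -> None:
        # New posts always land in the shard for their own day.
        self.shards[timestamp.date()].append((timestamp, url))

    def latest(self, limit: int) -> list:
        """Return up to `limit` URLs, newest first, reading the newest shards first."""
        results = []
        for day in sorted(self.shards, reverse=True):
            posts = sorted(self.shards[day], reverse=True)
            results.extend(posts[: limit - len(results)])
            if len(results) >= limit:
                break  # older shards are never touched
        return [url for _, url in results]
```

[With this layout a query for the most recent posts usually reads only the newest shard or two, and older shards can be moved to cheaper storage without affecting it.]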
JN: The notion of shards - did you call it that at the time? I have been looking into shards and I was only aware of or heard of them for about the last year.
DS: Back in 2002 when we were pitching this to VCs, I at least explained the theory. I just thought through the problem carefully. Doing it this way, we could add hundreds of systems, lots of cheap CPUs, RAM and disks. It provides inherent parallelism. I can’t believe that I was the first one to think this up.
JN: How big does this architecture scale?
DS: We are loading one terabyte a day into Technorati. That’s 100 million blogs or about 10 billion objects. A lot of it is new types of tagged data. There are about a half billion videos and photos.
With all that data, you have to think about what do you throw away? We can’t really delete anything, because we are potentially losing an asset. We don’t delete anything. So we take data out of the spin cycle. [Transitory data used in preparation.] We take the long-term data and put it into low latency storage.
When data is doubling in size every six months, that means that only one quarter of it is a year old or older. We don’t need to worry about old data.
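[The arithmetic behind that claim can be checked in a few lines. This is a sketch that assumes steady exponential growth; the function name `fraction_older_than` is hypothetical.]

```python
def fraction_older_than(months: float, doubling_months: float = 6.0) -> float:
    """Fraction of today's data that already existed `months` ago,
    assuming the total doubles every `doubling_months` months."""
    return 0.5 ** (months / doubling_months)


# One year is two doublings, so a quarter of today's data is at least a year old.
print(fraction_older_than(12))  # 0.25
```

[The same function gives one half for data at least six months old, so at any moment most of the data set is recent.]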
JN: How do you deal with large number of users with very large data sets?
DS: Any off-the-shelf tool falls over. There is a lot of interesting analysis to do on old data, but no off-the-shelf tools can handle that much data. It’s only just now that some tools can handle it.
JN: What are those tools?
DS: One is Greenplum, by a bunch of O’Reilly guys. If you use ordinary data warehouse tools, they would just scream and shout.
JN: Actually what I was originally referring to was the fact that you are showing lots of data to users who are not used to enterprise information management tools. How do you present this information to consumer-level users? How do you deal with the user interface and visualization of all this data?
DS: Gotcha. It depends on what the user wants to get out of Technorati. If the user wants search results, then we give it to them. Sometimes they want to browse or discover information. We have spent a lot of time on visual design. Then we give them lots of bright, shiny things for them to click on. Things like metadata, video or other links.
We have used enterprise class web tools to analyze what users are doing. We look at the click stream and see what is successful or not. That helps to make the information contextual.
One of the big mistakes that we made was not doing this [buying click stream analysis tools] sooner. It was only $80K. Up to that point it was so much trial and error. I’m glad we finally did it. Now we can see how much time a user spends on a feature. We can see page views, goals per visitor.
JN: So what do you measure on Technorati?
DS: Measuring a web site is like forecasting the weather. Yesterday it was sunny and today it is cloudy. Why is it cloudy? Sometimes you have no idea. Sometimes you realize that a change in barometric pressure has a lot to do with it.
We look at the number of newbies, number of reports, session lengths and then measure them against prior periods. It’s not always consistent.
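[A minimal sketch of measuring metrics against a prior period, as described above. The metric names and the function `percent_change` are hypothetical and are not taken from any tool Technorati used.]

```python
def percent_change(current: dict, prior: dict) -> dict:
    """Percentage change per metric between two periods.

    Metrics with no prior value, or a prior value of zero, are skipped
    because a percentage change is undefined for them.
    """
    changes = {}
    for name, value in current.items():
        base = prior.get(name)
        if base:
            changes[name] = (value - base) * 100.0 / base
    return changes


# Hypothetical weekly figures, for illustration only.
this_week = {"new_users": 120, "avg_session_seconds": 90, "reports": 5}
last_week = {"new_users": 100, "avg_session_seconds": 120, "reports": 0}
print(percent_change(this_week, last_week))
# {'new_users': 20.0, 'avg_session_seconds': -25.0}
```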
I had never built a B2C site before. I just focused on me, on what I wanted. That worked well for a while when I was the target audience. But we have to build for a broader audience.
JN: At Alfresco, we measure conversions. Are you measuring things like performance? Does that affect retention of users?
DS: Of course, but if the system is falling down, then even performance doesn’t matter. So I don’t get too stressed out about it.
JN: When we met at Davos you wanted to move Technorati to be the Internet Now! Is that still the case?
DS: Everything is shifting. I wanted it to be a site that everyone is able to use. We forgot about the core users that just wanted to find out about blogs and any real time information. In an attempt to jump the chasm, we chased after 100 million users and tried to be everything to everyone. Now we try to make blogs and user driven content available for those looking for that.
Also, performance has improved significantly. Now I notice how slow other sites are. This is a total tribute to the engineering team. Everything is easier and faster.
Pretty soon we will have a whole lot of stuff that we have been working on for a year.
JN: Can you say what it is?
DS: I don’t pre-announce.
JN: What does the Technorati brand stand for today?
DS: Good question. What’s popping up now on the internet, especially user generated content? It’s about users tagging user generated content and finding it.
JN: Who are your competitors?
DS: I probably sound like the typical entrepreneur, but nobody really, seriously. Google provides blog search, but other than that nobody really. Other people, like Digg and del.icio.us, are trying to identify and tag information, but they aren’t really competition.
JN: What do you want Technorati to be in two years’ time? Five years would be ridiculous.
DS: I would like Technorati to be a profitable business that is strongly differentiated. It will be the place that you would go for mobile, RSS or push information. For all that you would come to Technorati.