Social-networking site MySpace may have slipped behind Facebook, but it still handles up to six billion visitor records a day.
With a major revamp due later this year, designed to help MySpace make up ground on its rival, the company — snapped up by Rupert Murdoch's NewsCorp for $580m (then £332m) in 2005 — is predicting a big jump in activity on the site and a corresponding surge in the records it handles daily — up to 10 billion. The man charged with the sizable task of making sense of that information is chief data architect Don Watters.
ZDNet UK caught up with him recently to ask him about MySpace's relaunch plans and the main issues with managing and analysing substantial volumes of information.
Q: How would you describe your job and the main challenges you face?
A: As chief data architect for MySpace, I pretty much have tactical purview over the entire data platform. Not only the data warehouse, but also the data-mining platform and the data development platform, which is all the front-end data you see at MySpace — so anything to do with your profile or music and video data. It's my responsibility to ensure the data is secure, safe, reliable and available in real time.
The biggest challenges we have are to do with scale. We have been doing it a while, but it's still not easy to deal with billions of records a day and still maintain some kind of coherency within the system. We're still struggling with new data as it comes in, [with] integrating it and making it available not only to internal users but to our customers.
Can you give an example of the information you provide to customers?
The easiest concrete example is something like an artist's dashboard, where we've given the artists who are on MySpace information about what their user base looks like. Just demographically — they don't get to see any detailed user data.
By demographic, we show artists over time what's happening on their part of the site, so that they can get a better understanding of who is actually doing what. Then they can either adjust their message or their site, or maybe go to the towns where they're seeing a lot of activity.
We do an incredible amount of data crunching to be able to figure that out, because it's not always easy to take in information from users. They may say they are 103 years old and live at the North Pole, and we just have to believe them.
Or we can do the opposite, and do some introspection and try to figure out what [the artist's] friends look like and who that person is, based on other information. We use crowdsourcing, where you take multiple sources and try to figure out what's going on, forming a single point of view from that crowd.
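MySpace has not published how this inference works, but a minimal sketch of the majority-vote idea Watters describes — checking a user's claimed attribute against what their friends report — might look like this in Python. All names, data and the agreement threshold here are illustrative, not MySpace's actual implementation:

```python
from collections import Counter

def infer_attribute(claimed_value, friend_values, min_agreement=0.5):
    """Infer a likely value for a profile attribute (e.g. home town)
    by comparing the user's self-reported value against what their
    friends report. If a clear majority of friends agree on one value,
    trust the crowd; otherwise fall back to the claimed value.

    NOTE: illustrative sketch only -- the threshold and the
    majority-vote rule are assumptions, not MySpace's algorithm.
    """
    if not friend_values:
        return claimed_value
    value, count = Counter(friend_values).most_common(1)[0]
    if count / len(friend_values) >= min_agreement:
        return value
    return claimed_value

# A self-described resident of the North Pole whose friends
# mostly report the same home town:
print(infer_attribute("North Pole", ["Austin", "Austin", "Austin", "Dallas"]))
# -> Austin
```

In practice a system like this would weigh many signals at once (age, location, interests), but the principle is the same: many noisy sources, resolved into one crowd-derived point of view.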
Can you provide a sense of the scale of the number crunching?
It's a massive amount of information. We're doing somewhere in the order of three to six billion records a day.
As MySpace changes over the next six months to reinvigorate our brand, we are going to do things that will change the front end and make even more activity happen. If you think about what Twitter does and what Facebook does, you'll see a lot of things that are similar in concept, but not similar in product.
So today on MySpace, the centre panel is called the activity stream. You can filter that by many different aspects, which is something that nobody else really does. But to be able to do that in real time is actually quite a challenge. To be able then to report on what's going on on the site, so that people understand which features are and are not being used, how they are being used and what ways users traverse the site — what pages they are hitting on the way — that takes an incredible amount of data.
MySpace is going to go through a giant product relaunch towards the end of this year, and that means the way we are doing business on the front end is going to change completely for us. We are going to need to compare and contrast what happened before with what happens afterwards, so that we have a clear picture of which parts of the change are working and which might not be hitting with our customers.
How is the use of visitor data evolving?
I've been at MySpace about a year and a half. Before that I was at Disney, ESPN and ABC, and they do a fair amount of traffic as well. They are probably about half the size of MySpace, but they still do a tremendous amount of traffic, especially through ESPN. So I do have a sense of how things have changed over the years.
What is happening is a big shift in the amount of compute power available and what we can do with it. Having the compute power continue to grow over time is really what is allowing us to do more data analysis and get more value out of the data.
When I first started at Disney we were at probably 3TB total of data in the entire environment. When I left they were probably at about 30TB. Here at MySpace, I'm at one petabyte just in the warehouse and we have something in the order of 2.5 petabytes in production for data development — and that's just for user data. Those numbers will just continue to grow and grow.
So there has been a huge increase in data and computing power, but what about the analysis tools?
We've been working with Aster Data for a while and they have just released a number of analytic functions that allow us to do even more with the data. We've been working with them in terms of getting new functionality into the product, which is extending their SQL-MapReduce framework.
Can you give an example of what these analytic functions allow you to do?
One of the easiest things to understand — but one of the most difficult problems to solve — is a user session. When a user comes on the site they may or may not be logged in, or have a cookie or a consistent IP address from one day to another. We have to try to figure out who the user is over the course of the day, so that we can say how long a particular user has spent on the site. We literally look through billions of records a day to try to figure that out.
Part of the issue with that is it's always based on a single user, so most compute solutions would involve an iterative process to look through those records. However, given that the data is in a database, most database operations are set-based. In other words, they look at large amounts of data all at once.
What SQL-MapReduce allows us to do is iterate over that set, in terms of trying to find the user information throughout that set. [This] then gives us the ability to do things such as discover how that user traversed the site in the same session. What categories of the site did they visit — were they part of the music pages or the blog pages or the video pages? — and then show that traversal.
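Watters doesn't spell out the algorithm, but the per-user iteration he describes — stitching raw page views into sessions and then reading off the path of categories visited — can be sketched in a few lines of Python. The 30-minute inactivity cutoff and the event format are assumptions for illustration; in Aster's SQL-MapReduce the same logic would run as a function invoked from SQL over data partitioned by user, rather than as client-side code:

```python
from itertools import groupby
from operator import itemgetter

SESSION_GAP = 30 * 60  # seconds of inactivity that ends a session (assumed cutoff)

def sessionize(events):
    """Group page-view events into per-user sessions and return, for each
    session, the ordered list of site categories the user traversed.

    `events` is an iterable of (user_id, timestamp, category) tuples.
    This is exactly the kind of iterative, per-user pass that is awkward
    to express as a single set-based SQL query.
    """
    sessions = []
    ordered = sorted(events, key=itemgetter(0, 1))  # by user, then time
    for user, rows in groupby(ordered, key=itemgetter(0)):
        path, last_ts = [], None
        for _, ts, category in rows:
            if last_ts is not None and ts - last_ts > SESSION_GAP:
                sessions.append((user, path))  # gap too long: close session
                path = []
            path.append(category)
            last_ts = ts
        sessions.append((user, path))  # close the user's final session
    return sessions

events = [
    ("u1", 0,    "music"),
    ("u1", 300,  "blog"),
    ("u1", 9000, "video"),   # more than 30 minutes later: a new session
    ("u2", 100,  "music"),
]
print(sessionize(events))
# -> [('u1', ['music', 'blog']), ('u1', ['video']), ('u2', ['music'])]
```

At MySpace's scale the point of SQL-MapReduce is that this per-user loop runs inside the database, in parallel across partitions, instead of pulling billions of rows out to an application.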
Are the new analytic functions things you have worked on?
SQL-MapReduce has been around a long time. What is happening now is that Aster has come back and got third parties and internal folk from their teams either to build SQL-MapReduce functions or find ones that have been built by other teams, like those from within MySpace. [It has] then coalesced all of those into one package, so that everyone has the opportunity to use them.
What is the most important recent development in data analytics?
Data analytics used to be a very specialised field. It used to be something that you had to have a post-graduate degree to work on, or you had to have very specialised knowledge in a software package that only a few people knew about.
We now have the ability to bring in analysts who know SQL but don't necessarily understand in-depth analysis. Then we only need a few people who understand the specifics behind everything and how to group things together. So I see this as the new age in analysis. We no longer need specialised packages to do in-depth analysis of the data. That change is going to open up a lot of things.