Yahoo talks social graphs, data analytics and Hadoop

Yahoo outlined how it's going to use data from its partnerships with Facebook and Twitter, its use of Hadoop in data warehousing and its analytics plan.
Written by Larry Dignan, Contributor

In a talk at a Pacific Crest Securities technology conference, Scott Burke, vice president of data and analytics at Yahoo, shed light on the company's data management plan. Burke oversees a consolidated team that looks to aggregate all data platforms, data warehousing and analytics tools in the company. Burke's comments are notable because Yahoo has decided to partner with the social networking leaders, Facebook and Twitter, instead of trying to build rival services as Google has. Yahoo also has its Microsoft search partnership, which won't affect the company's data analytics efforts.

Among the key points:

On Yahoo's data collection today, Burke said that Yahoo gathers search query data, signals about where a consumer sits in the buying-decision funnel, toolbar information and information shared with large publishers. Zip code information entered for weather reports is also used to deliver local news. Increasingly, Yahoo is able to connect social graph information from Facebook and Twitter streams as well as from its own services such as Yahoo Messenger.

Consumer and ad analytics are merging and Yahoo's information systems are following suit. Burke noted:

There are a lot of things in common that over the years at Yahoo! had been built up separately. So our knowledge of what a consumer's profile is, when you engage with Yahoo! Sports and you share with us what your fantasy teams are, you share with us your interests, that is also certainly valuable for providing better features in sports, but it's also valuable to certain advertisers who want to target people that are interested in that particular region and that particular team.

Yahoo's take on the social graph. Burke talked a lot about how Yahoo's partnership with Facebook fits with the company's plan. Yahoo will also provide Twitter access via Yahoo Mail later this year. Burke said:

I think that the actual value of the social graph is more about providing services to the consumers.

What we're giving you is we're giving you an aggregated way to see access to all of your social network data, and then what we want to do is take it a step further and we can actually add value to those social streams because of the content assets that Yahoo! has. So if you sent me a note saying, hey, do you want to go to a concert tonight in Vail? On Facebook, that's a bunch of text. On Yahoo!, we can start to link that actually to pictures, videos, to information that we actually have that Facebook has no interest in licensing, because they are not a media play. So they don't have the content access to enrich the status and make it more actionable. So what you'll see Yahoo! doing is really trying to commoditize this stream of information and make it really into a channel for both consumers as well as for advertisers, who could then come to Yahoo! and get access to syndicating ads across multiple social networks, whereas they can't go to Facebook today to run an ad campaign across Facebook and Twitter. That's the kind of role that Yahoo! can play at the center point of the social space.

On the Associated Content deal, Burke said that Yahoo plans to integrate its analytics with a vast freelance network. The aim: Build relevant editorial packages quickly and promote them heavily. Ultimately, Yahoo may be able to cut its content licensing costs.

What's the line between adding value and overstepping the boundaries on privacy? Burke said:

We've taken the position that we will anonymize at 90 days the vast majority of our personally identifiable information. Because frankly, we can run the business just fine without that information in most areas. We keep it for certain exceptions, like legal or fraud or abuse cases, but you don't have to maintain that level of personally identifiable information for the long-term. And I think that it's a real area right now of scrutiny. I think it's going to be an increased area of scrutiny, especially because of the moves some of the competition is making. But we intend to be proactive and maintain that line. And at the same time, we have enough information to do aggregated and category-based targeting that works quite well for advertisers. And advertisers are not pushing to get access to personally identifiable information. They just want to find access to the segments of consumers that are interested in their products.
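As a rough illustration of the policy Burke describes, the sketch below strips identifying fields from a record once it reaches 90 days of age, with an exception flag for legal, fraud or abuse cases. The field names, the `on_hold` flag and the choice to drop fields outright are assumptions made for the example; Burke did not say how Yahoo implements its anonymization.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)           # the 90-day window Burke cites
PII_FIELDS = ("user_id", "ip", "email")  # hypothetical field names

def anonymize(record, now, on_hold=False):
    """Return a copy of record, without PII fields once it is 90 days old.

    on_hold is a hypothetical flag for the legal, fraud or abuse
    exceptions, where the identifying fields are kept.
    """
    if on_hold or now - record["timestamp"] < RETENTION:
        return dict(record)
    # Keep only the non-identifying fields, which still support
    # aggregated and category-based targeting.
    return {k: v for k, v in record.items() if k not in PII_FIELDS}
```

A record that has aged out keeps its non-identifying fields, such as the query text or an interest category, which is the kind of data the aggregated targeting Burke mentions would rely on.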

The natural follow-up to Burke's answer revolved around consumer privacy worries. Burke said:

I think the sensitivity is that things online are more measurable, and they are more visible now than -- and it's also more of a novelty. The type of -- the places that Facebook is pushing the envelope are a novelty online. The fact that I can see what school most of you went to, it's perfectly public, because you've made it public on Facebook. And any developer can look you up by e-mail address and get those data elements back. I think that's a sea change in consumer perceptions on privacy. And it really remains to be seen if that's where consumers are going to settle and actually be satisfied with that. I think that's obviously a very active area. I think it's a very true statement that the offline marketing world has a lot more personally identifiable data about people and the credit agencies. And there are decades of efforts that have built those things, and there's a very strong lobbying movement to protect that type of a business model. I do think there's an over reaction to some of what's happening online. But at the same time, because it's online and in the computer systems, you can do it at scale in a way that hasn't been possible off line.

Finally, Burke talked a little about Yahoo's data management engine, which revolves around Hadoop. Burke said:

We are making a bet on Hadoop on the grid technology, and we have the largest grid clusters in production today on the internet running Hadoop. We -- our intent, our data warehousing problem is of larger scale than what you can solve with a traditional data warehousing technology, like a Teradata or others. We're talking about warehouses in the tens of petabytes range and growing. So, that amount of raw data, you cannot put into an Oracle system or a Greenplum appliance. So, we're definitely adopting the grid. We have many systems today in production running on Hadoop and over time, our basic technology stack is going to be Hadoop for the raw event level data and the data processing.

That said, Oracle will be used for high throughput and fast query performance. At the dashboard level, Burke said Yahoo will use MicroStrategy for analytics.
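To make the split between raw event processing and fast querying concrete, here is a minimal sketch of the map and reduce pattern that Hadoop popularized, written in plain Python rather than against Hadoop's own API. The event fields (`property`, `category`) and function names are hypothetical. The idea is that raw event-level records are rolled up into small aggregates, which are the sort of output a query database or dashboard tool could then serve.

```python
from collections import defaultdict

def map_event(event):
    # Map step: emit one (key, 1) pair per raw event.
    # The key fields are hypothetical, chosen for illustration.
    yield (event["property"], event["category"]), 1

def reduce_counts(key, values):
    # Reduce step: collapse all values for one key into a single total.
    return key, sum(values)

def run_job(events):
    # Group mapped pairs by key. Hadoop does this "shuffle" across
    # machines; here it happens in memory on one machine.
    groups = defaultdict(list)
    for event in events:
        for key, value in map_event(event):
            groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in groups.items())
```

Because each event is mapped independently and each key is reduced independently, the same logic can be spread across many machines, which is what makes the approach attractive at the tens-of-petabytes scale Burke cites.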
