Short clips: Technorati on sifting through splogs

October 28, 2008, 2:20pm PDT | Length: 00:01:01
Dorion Carroll, vice president of engineering for Technorati, discusses the challenges inherent in trying to index the growing blogosphere. Because the company grew right along with it, it was able to evolve defenses, like keyword and posting-frequency heuristics, against the onslaught of spam blogs.

Transcript


Sumi Das: And what about these sites that you don't want your users to really be bombarded with? You know, we are talking about spam blogs, the splogs, the scraper sites that people aren't particularly interested in. How does your technology filter those out?

Dorion Carroll: With the advent of Yahoo! Pipes, with lots of crawlers, with RSS, it's really easy to fabricate sites. Pump up keywords, do some simple word substitutions, plop AdSense on it and make money on other people's content without actually giving anything back. We definitely want to weed those out. Some of the things that we've done, and I think this has been part of our advantage having grown up with the blogosphere, is as those problems started to surface, we were able to grow our defenses against them.

We have a number of defenses right up front, and one of the things that's interesting about blogs is we don't have to try to go guess where the blog updates are. Blogs ping. When you hit "publish" on your blog post, it sends a message out, and there are a number of services that aggregate pings, basically saying here's a site that says it's changed. It doesn't say what's changed. It doesn't say whether anything actually has changed. You then have to go look at the site, compare it to the last time you saw it and decide what you want to do with that.

Over 95 percent of the pings that we process today are from known spam sources, known to us as spam. All we've ever seen from there is spam. We don't want that stuff and we can drop that on the floor. But a lot of spam still gets to the next line of defense. We then have Bayesian filters, we do keywords, we look at a number of different heuristics, so posting frequencies. If you are seeing many, many posts per minute, it's not a human being. So, there are definitely signatures you can look for.
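Editor's illustration: the layered defenses Carroll describes (drop pings from known spam sources, check whether the site actually changed, then apply a posting-frequency heuristic) can be sketched roughly as below. This is not Technorati's code. The class name, the return labels, the per-minute threshold, and the use of a content hash for change detection are all assumptions made for the example, and the Bayesian and keyword filters he mentions are left out.

```python
import hashlib
from collections import defaultdict, deque


class PingTriage:
    """Hypothetical ping triage; names and thresholds are illustrative only."""

    def __init__(self, known_spam_sources, max_posts_per_minute=5):
        self.known_spam_sources = set(known_spam_sources)
        self.max_posts_per_minute = max_posts_per_minute  # assumed threshold
        self.last_seen_hash = {}                  # site -> digest of last content
        self.recent_posts = defaultdict(deque)    # site -> timestamps of changes

    def handle_ping(self, site, content, now):
        """Return 'dropped', 'unchanged', 'suspect', or 'accepted'.

        `content` is the page text fetched after the ping; `now` is a
        timestamp in seconds.
        """
        # Stage 1: sources that have only ever sent spam are dropped outright.
        if site in self.known_spam_sources:
            return "dropped"

        # Stage 2: a ping does not say what changed (or whether anything did),
        # so compare the fetched content with what was seen last time.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if self.last_seen_hash.get(site) == digest:
            return "unchanged"
        self.last_seen_hash[site] = digest

        # Stage 3: posting-frequency heuristic over a 60-second sliding window.
        window = self.recent_posts[site]
        window.append(now)
        while window and now - window[0] >= 60:
            window.popleft()
        if len(window) > self.max_posts_per_minute:
            return "suspect"  # too many posts per minute to be a human
        return "accepted"
```

In a sketch like this, a site flagged "suspect" would presumably be handed to further filters rather than rejected outright; the transcript does not say how the real system combines its signals.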

==== Transcribed by Automatic Sync Technologies ====

