Podcast: Google talks about its newly launched CodeSearch

Google launched its newest search product, Google CodeSearch, today and is making the rounds to talk about it. To find out more, I interviewed Google product manager Tom Stocky.
Written by David Berlind

Using the embedded player at the top of this blog, you can stream the interview to your desktop or download it manually. If you're already subscribed to ZDNet's IT Matters series of podcasts, it should turn up on your computer and/or your portable audio player automatically.

So, what's cool about CodeSearch? Well, if you're a software developer or you want to become one, finding source code to reuse (or just source code to learn from) can be a daunting task. Most of what's publicly available on the Internet is tucked away in source code repositories and archives (ZIP files, Concurrent Versions System repositories, etc.). So, it's not in the sort of file that's easily crawled and indexed the way today's search engines crawl and index Web pages. Sure, search engines like Google now index other file types (like PDF). But a code repository contains multiple files, and it has an entire structure that includes things like the license under which the code is available (e.g., open source). According to Stocky, Google's CodeSearch has indexed the billions of lines of code that are hiding in these repositories. Even better, you don't have to dive into the repository to see the code. Although Stocky disagreed with my characterization of it as caching, CodeSearch's results can display the actual code that was a hit, with the specific text that matched your search string highlighted.
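To make the idea of "showing the hit and highlighting the match" concrete, here is a minimal Python sketch. It is not how CodeSearch works internally (the interview doesn't say); the function name `find_hits`, the `**` marker, and the sample C program are all assumptions made up for illustration.

```python
import re


def find_hits(source, query, marker="**"):
    """Return (line_number, highlighted_line) for every line containing query.

    The match is a literal, case-insensitive substring match; `marker`
    wraps each matched span to stand in for on-page highlighting.
    """
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        if pattern.search(line):
            # Wrap every occurrence of the query on this line with the marker.
            highlighted = pattern.sub(
                lambda m: f"{marker}{m.group(0)}{marker}", line
            )
            hits.append((line_number, highlighted.strip()))
    return hits


# Hypothetical sample file, standing in for code pulled out of an archive.
SAMPLE = '''#include <stdio.h>
int main(void) {
    puts("Hello World");
    return 0;
}
'''

for line_number, text in find_hits(SAMPLE, "hello world"):
    print(line_number, text)
```

A search like this only finds text that literally appears in the file, so a query that describes what the code does will miss unless a comment happens to use the same words.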

If there's one incredibly obvious feature missing from Google's CodeSearch, it's social networking. To find code, you kind of have to have an idea of what the code might look like. You cannot, for example, depend on CodeSearch to return reliable results if you search on "Perl parser for eBay listings". Such a search might work if there were Perl-based parsers for eBay listings whose developers had included that same text in their comments. But the best way to solve the problem, if you ask me, is to let users of CodeSearch openly tag the code the way Flickr allows open and social tagging of photos or del.icio.us allows open/social tagging of Web pages. No worries though. I took the cached version of some code I found through CodeSearch that contains the text "Hello World" and tagged it with the tag "helloworld" in del.icio.us. You can see it here (problem solved, kinda).

I suggested to Stocky that CodeSearch could use this sort of social networking feature, and I asked him a bunch of other questions, including what role CodeSearch plays in Google's business model. Here's a sampling of what Stocky said:

Stocky on what CodeSearch is: Today we're launching Google Code Search on Google Labs which gives software programmers a single place to search publicly accessible source code.  It looks like a Google search page. But instead of searching Web pages like the typical Google.com search engine, we're searching over billions of lines of code and we're trying to create a search engine that's useful for everyone from computer science students to serious programmers to even hobbyists and code-enthusiasts.
Stocky on the definition of publicly accessible code: In most cases, this is open source code that has an explicit open source license and in those cases, we do actually list the license -- something like the GPL license for example. But other times, it's just things that people have hosted in public places that are accessible for everyone.
Stocky on whether humans are involved: [It's] all done algorithmically... Our crawler goes out and basically does the equivalent of downloading these [zip files and archives], opening them up, and then indexing all of the individual files inside of them.
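
The "open the archive, then index each file inside it" step that Stocky describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's crawler: the names `index_archive` and `make_sample_archive`, the simple word-level tokenizer, and the in-memory sample ZIP are all hypothetical, and the download step is skipped entirely.

```python
import io
import re
import zipfile
from collections import defaultdict

# Assumption: treat identifier-like words as the searchable tokens.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def index_archive(archive_bytes):
    """Open a ZIP archive held in memory and index every file inside it.

    Returns a dict mapping each lowercase token to a sorted list of
    (path, line_number) pairs where that token appears.
    """
    index = defaultdict(set)
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories have no content to index
            text = archive.read(info).decode("utf-8", errors="replace")
            for line_number, line in enumerate(text.splitlines(), start=1):
                for token in TOKEN.findall(line):
                    index[token.lower()].add((info.filename, line_number))
    return {token: sorted(hits) for token, hits in index.items()}


def make_sample_archive():
    """Build a tiny ZIP in memory so the sketch runs without any download."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "hello/hello.c",
            '#include <stdio.h>\nint main(void) { puts("Hello World"); }\n',
        )
        archive.writestr("hello/LICENSE", "GPL\n")
    return buffer.getvalue()


sample_index = index_archive(make_sample_archive())
print(sample_index["hello"])  # where the word "hello" appears in the archive
```

Indexing per file rather than per archive is what lets a result point at one specific file and line, and it leaves room to record things like a license file found alongside the code.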
