GitHub built a new search engine for code 'from scratch' in Rust

GitHub built a new code-focused search engine in Rust because popular text search engines couldn't scale enough.
Written by Liam Tung, Contributing Writer

The Rust programming language continues to grow in popularity and now developer platform GitHub has used it to build its new code-focused search engine, Blackbird. 

GitHub wants users to turn to its search engine, which is currently in beta, instead of perusing forums for answers.


Rust is consistently the most loved (but not the most widely used) programming language among developers, according to developer question-and-answer site Stack Overflow.

As a project built from scratch, Blackbird is an interesting reference for Rust, which is more often adopted for building new features in projects previously written in C/C++, and is more popular for systems programming than for building apps. The CTO of Microsoft Azure last year declared that all new projects should be written in Rust rather than C/C++ because of its memory safety features.

But why build a search engine from scratch when GitHub could use another open-source solution, such as Apache Cassandra, Solr, or Elasticsearch?

"At first glance, building a search engine from scratch seems like a questionable decision. Why would you do that? Aren't there plenty of existing, open source solutions out there already? Why build something new?" writes GitHub's Timothy Clem.

His short answer is that GitHub hasn't found success using general text search products to power code search.     

"The user experience is poor, indexing is slow, and it's expensive to host. There are some newer, code-specific open source projects out there, but they definitely don't work at GitHub's scale," he writes. 

GitHub started experimenting with Elasticsearch in 2011, but Clem notes it took "months" to index GitHub's then roughly eight million repositories. Today, GitHub supports about 200 million dynamic code repositories.

GitHub's Blackbird currently supports searching across about 45 million repositories, so it provides only partial coverage, but it still enables code searching across 115 terabytes of code and 15.5 billion documents for programs written in Python, Java, and JavaScript.

The Rust-written custom search engine, Blackbird, is more efficient and gives GitHub "substantial storage savings via deduplication and guarantees a uniform load distribution across shards", according to Pavel Avgustinov, VP of software engineering at GitHub.  

He argues GitHub's scale means it can't simply use the Unix 'grep' (global regular expression print) utility for search. Processing hundreds of terabytes of code in memory would be too slow, and queries would take too long.


Clem notes that deduplication and Blackbird's approach to indexing cut the 115 terabytes of code it needed to search down to 28 terabytes of unique content. The index itself is now 25 terabytes.
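
To give a rough feel for how deduplication and even sharding can go together, here is a minimal Rust sketch. It is not GitHub's code. It assumes that each document is identified by a hash of its content, that identical content is indexed only once, and that the same hash picks the shard. The function names (content_id, shard_for, dedup_and_shard) are hypothetical, and the standard library's DefaultHasher stands in for the stronger content hash a production system would likely use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical content ID: a 64-bit hash of the document text.
/// Assumption: a real system would use a cryptographic hash to make
/// collisions between different documents practically impossible.
fn content_id(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// Maps a content ID to one of `num_shards` shards. Because hash values
/// are spread roughly evenly, so is the load across shards.
fn shard_for(id: u64, num_shards: u64) -> u64 {
    id % num_shards
}

/// Keeps the first copy of each distinct document and places it on the
/// shard chosen by its content ID. Returns one list of documents per shard.
fn dedup_and_shard<'a>(docs: &[&'a str], num_shards: u64) -> Vec<Vec<&'a str>> {
    assert!(num_shards > 0, "at least one shard is required");
    let mut seen = HashSet::new();
    let mut shards = vec![Vec::new(); num_shards as usize];
    for &doc in docs {
        let id = content_id(doc);
        // `insert` returns false when the ID was already present,
        // which means this content is a duplicate and is skipped.
        if seen.insert(id) {
            shards[shard_for(id, num_shards) as usize].push(doc);
        }
    }
    shards
}
```

Calling dedup_and_shard with a list of files that contains the same file twice would store that file once, which is the kind of saving behind the drop from 115 terabytes to 28 terabytes described above. Keying the shard on content rather than on the repository is one plausible way to avoid a few very large repositories overloading a single shard.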
