As a new project, it is an interesting reference for Rust, which is more often adopted to build new features in projects previously written in C/C++, and is more popular for systems programming than for building apps. Last year, the CTO of Microsoft Azure declared that all new projects should be written in Rust rather than C/C++ because of its memory safety features.
But why build a search engine from scratch when GitHub could use another open-source solution, such as Apache Cassandra, Solr, or Elasticsearch?
"At first glance, building a search engine from scratch seems like a questionable decision. Why would you do that? Aren't there plenty of existing, open source solutions out there already? Why build something new?" writes GitHub's Timothy Clem.
His short answer is that GitHub hasn't found success using general text search products to power code search.
"The user experience is poor, indexing is slow, and it's expensive to host. There are some newer, code-specific open source projects out there, but they definitely don't work at GitHub's scale," he writes.
GitHub started experimenting with Elasticsearch in 2011, but Clem notes it took "months" to index GitHub's then roughly eight million repositories. Today, GitHub supports about 200 million dynamic code repositories.
The Rust-written custom search engine, Blackbird, is more efficient and gives GitHub "substantial storage savings via deduplication and guarantees a uniform load distribution across shards", according to Pavel Avgustinov, VP of software engineering at GitHub.
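To make the deduplication and load-distribution claim concrete, here is a minimal Rust sketch of one way those two properties can fall out of the same design: keying every document by a hash of its content. It is an illustration under stated assumptions, not Blackbird's actual implementation. The function names (`content_key`, `shard_for`, `assign`, `stored_copies`) are hypothetical, and the standard library's `DefaultHasher` stands in for whatever content identifier a production system would use.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hypothetical content identifier: a 64-bit hash of the file's text.
/// (Assumption: a real system would use a stronger, stable content hash.)
fn content_key(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// Picks a shard from the content key alone, so identical content
/// always lands on the same shard, whichever repository it came from.
fn shard_for(key: u64, shard_count: u64) -> u64 {
    key % shard_count
}

/// Assigns documents to shards, keeping one entry per unique content key.
/// Duplicate files collapse into a single stored entry.
fn assign(documents: &[&str], shard_count: u64) -> Vec<HashSet<u64>> {
    let mut shards: Vec<HashSet<u64>> = vec![HashSet::new(); shard_count as usize];
    for doc in documents {
        let key = content_key(doc);
        shards[shard_for(key, shard_count) as usize].insert(key);
    }
    shards
}

/// Total number of entries actually stored across all shards.
fn stored_copies(shards: &[HashSet<u64>]) -> usize {
    shards.iter().map(|s| s.len()).sum()
}
```

Because the shard is chosen from the content hash rather than from the repository name, a very large or very popular repository cannot overload a single shard, and a file copied into thousands of forks is stored once. That is the general idea behind the two benefits quoted above, though the details of GitHub's scheme may differ.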
He argues GitHub's scale means it can't use Unix 'grep' (global regular expression print) for search. Scanning hundreds of terabytes of code in memory for every query would simply be too slow, and queries would take too long.
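A rough back-of-envelope calculation shows why brute-force scanning does not scale. The Rust sketch below is illustrative only: the helper `scan_seconds` is hypothetical, and the throughput and core count used in the example are assumed figures, not numbers reported by GitHub.

```rust
/// Seconds needed to scan `corpus_bytes` once, assuming the work is split
/// perfectly across `cores`, each scanning `bytes_per_sec_per_core`.
/// All inputs are illustrative assumptions, not measured values.
fn scan_seconds(corpus_bytes: f64, bytes_per_sec_per_core: f64, cores: f64) -> f64 {
    corpus_bytes / (bytes_per_sec_per_core * cores)
}
```

For example, `scan_seconds(100e12, 1e9, 32.0)` models a 100 TB corpus scanned at an assumed 1 GB per second per core on 32 cores. It comes out at 3,125 seconds, or roughly 52 minutes for a single query, before accounting for any other users searching at the same time.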