Elasticsearch at Scale - Kiln and GitHub

Most of us are not exposed to data at real scale. It is getting more common but still I appreciate that more progressive companies that have to fight with large volumes of data are open about it and talk about their problems and solutions. GitHub and Fog Creek are two of the larger users of Elasticsearch and both have published articles and interviews on their setup. It's interesting that both of these companies are using it for a very specialized use case, source code search. As I have recently read the article on Kiln as well as the interview with the folks at GitHub I'd like to summarize some of the points they made. Visit the original links for in depth information.

Elasticsearch at Fog Creek for Kiln

In this article on InfoQ Kevin Gessnar, a developer at Fog Creek describes the process of migrating the code search of Kiln to Elasticsearch.

Initial Position

Kiln allows you to search on commit messages, filenames and file contents. For commit messages and filenames they were initially using the full text search features of SQL Server. For the file content search they were using a tool called OpenGrok that leverages Ctags to analyze the code and stores it in a Lucene index. This provided them will all of the features they needed but unfortunately the solution couldn't scale with their requirements. Queries took several seconds up to the timeout value of 30 seconds.

It's interesting to see that they decided against Solr because of poor read performance on heavy writes. Would be interesting to see if this is still the case for current versions.

Scale

They are indexing several million documents every day, which comes to terabytes of data. They are still running their production system on two nodes only. These are numbers that really surprised me. I would have guessed that you need more nodes for this amount of data (well, probably those are really big machines). They only seem to be using Elasticsearch for indexing and search but retrieve the result display data from their primary storage layer.

Elasticsearch at GitHub

Andrew Cholakian, who is doing a great job with writing his book Exploring Elasticsearch in the open, published an interview with Tim Pease and Grant Rodgers of GitHub on their Elasticsearch setup, going through a lot of details.

Initial Position

GitHub used to have their search based on Solr. As the volume of data and search increased they needed a solution that scales. Again, I would be interested if current versions of Solr Cloud could handle this volume.

Scale

They are really searching big data. 44 Amazon EC2 instances power search on 2 billion documents which make up 30 terabyte of data. 8 instances don't hold any data but are only there to distribute the queries. They are planning to move from the 44 Amazon instances to 8 larger physical machines. Besides their user facing data they are indexing internal data like audit logs and exceptions (it isn't clear to me from the interview if in this case Elasticsearch is their primary data store which would be remarkable). They are using different clusters for different data types so that the external search is not affected when there are a lot of exceptions.

Challenges

Shortly after launching their new search feature people started discovering that you could also search for files people had accidentally commited like private ssh keys or passwords. This is an interesting phenomen where just the possibility for better retrieval made a huge difference. All the information had been there before but it just couldn't be found easily. This led to an increase in search volume that was not anticipated. Due to some configuration issues (suboptimal Java version, no setting for minimum of masters) their cluster became unstable and they had to disable search for the whole site.

Further Takeaways

A Common Theme

Both of these articles have some observations in common: