What Is Special About This? Significant Terms in Elasticsearch

I have used Elasticsearch a few times now to analyze Twitter data for conferences. Popular hashtags and mentions, which can be extracted using facets, show what is hot at a conference. But you can go even further and see what makes each hashtag special. In this post I would like to show you the significant terms aggregation that is available as of Elasticsearch 1.1. I am using the tweets from last year's Devoxx, as those contain enough documents to play around with.

Aggregations

Elasticsearch 1.0 introduced aggregations, which can be used similarly to facets but are far more powerful. To see why they are useful, let's take a step back and look at facets, which are often used to extract statistical values and distributions. One useful example for facets is the total count of a hashtag:

curl -XGET "http://localhost:9200/devoxx/tweet/_search" -d'
{
    "size": 0,
    "facets": {
        "hashtags": {
            "terms": {
                "field": "hashtag.text",
                "size": 10,
                "exclude": [
                    "devoxx", "dv13"
                ]
            }
        }
    }
}'

We request a facet called hashtags that uses the terms of hashtag.text and returns the 10 top values with the counts. We are excluding the hashtags devoxx and dv13 as those are very frequent. This is an excerpt of the result with the popular hashtags:

"facets": {
    "hashtags": {
        "_type": "terms",
        "missing": 0,
        "total": 19219,
        "other": 17908,
        "terms": [
            {
                "term": "dartlang",
                "count": 229
            },
            {
                "term": "java",
                "count": 216
            },
            {
                "term": "android",
                "count": 139
            },
            [...]

Besides the statistical information we are retrieving here, facets are often used to offer refinements on search results. A common use is to display categories or features of products on eCommerce sites.

Starting with Elasticsearch 1.0 you can have the same behaviour by using one of the new aggregations, in this case a terms aggregation:

curl -XGET "http://localhost:9200/devoxx/tweet/_search" -d'
{
    "size" : 0,
    "aggs" : {
        "hashtags" : {
            "terms" : {
                "field" : "hashtag.text",
                "exclude" : "devoxx|dv13"
            }
        }
    }
}'

Instead of requesting facets we are now requesting a terms aggregation for the field hashtag.text. The exclusion is now based on a regular expression instead of a list. The result looks similar to the facet return values:

"aggregations": {
    "hashtags": {
        "buckets": [
            {
                "key": "dartlang",
                "doc_count": 229
            },
            {
                "key": "java",
                "doc_count": 216
            },
            {
                "key": "android",
                "doc_count": 139
            },
            [...]

Each value forms a so-called bucket that contains a key and a doc_count.
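To make the correspondence between the two response formats concrete, here is a minimal Python sketch that maps facet entries onto the bucket shape. The function name facet_to_buckets is made up for this illustration; only the field names term, count, key and doc_count come from the responses shown above.

```python
def facet_to_buckets(facet_terms):
    """Convert facet entries (term/count) into aggregation-style buckets (key/doc_count)."""
    return [{"key": entry["term"], "doc_count": entry["count"]} for entry in facet_terms]


# Values copied from the facet excerpt in this post.
facet_terms = [
    {"term": "dartlang", "count": 229},
    {"term": "java", "count": 216},
    {"term": "android", "count": 139},
]

buckets = facet_to_buckets(facet_terms)
for bucket in buckets:
    print(bucket["key"], bucket["doc_count"])
```

The mapping is purely a renaming of fields, which is why existing facet-based code can usually be moved to the terms aggregation with little effort.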

But aggregations are not only a replacement for facets. Multiple aggregations can be combined to give more information on the distribution of different fields. For example, we can see the users that used a certain hashtag by adding a second terms aggregation for the field user.screen_name:

curl -XGET "http://localhost:9200/devoxx/tweet/_search" -d'
{
    "size" : 0,
    "aggs" : {
        "hashtags" : {
            "terms" : {
                "field" : "hashtag.text",
                "exclude" : "devoxx|dv13"
            },
            "aggs" : {
                "hashtagusers" : {
                    "terms" : {
                        "field" : "user.screen_name"
                    }
                }
            }
        }
    }
}'

Using this nested aggregation we now get a list of buckets for each hashtag. This list contains the users that used the hashtag. This is a short excerpt for the #scala hashtag:

 
"key": "scala",
"doc_count": 130,
"hashtagusers": {
    "buckets": [
        {
            "key": "jaceklaskowski",
            "doc_count": 74
        },
        {
            "key": "ManningBooks",
            "doc_count": 3
        },
        [...]

We can see that a single user is responsible for more than half of the tweets with this hashtag (74 of 130). A very dedicated user.
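If you want to compute that share in code rather than by eye, a small Python sketch like the following works on the parsed JSON response. The helper name top_bucket_share is hypothetical, and the data contains only the two user buckets visible in the truncated excerpt above.

```python
def top_bucket_share(parent_bucket, sub_agg_name):
    """Return (key, share) of the largest sub-bucket relative to the parent's doc_count."""
    sub_buckets = parent_bucket[sub_agg_name]["buckets"]
    top = max(sub_buckets, key=lambda b: b["doc_count"])
    return top["key"], top["doc_count"] / parent_bucket["doc_count"]


# The #scala bucket as shown in the excerpt (remaining user buckets are elided there).
scala_bucket = {
    "key": "scala",
    "doc_count": 130,
    "hashtagusers": {
        "buckets": [
            {"key": "jaceklaskowski", "doc_count": 74},
            {"key": "ManningBooks", "doc_count": 3},
        ]
    },
}

user, share = top_bucket_share(scala_bucket, "hashtagusers")
print(f"{user}: {share:.0%} of #{scala_bucket['key']} tweets")
```

Running the same helper over every hashtag bucket is a quick way to spot hashtags that are dominated by a single account.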

Using aggregations we can get information that we were not able to get with facets alone. If you are interested in more details about aggregations in general or the metrics aggregations I haven't touched here, Chris Simpson has written a nice post on the feature, there is a nice visual one at the Found blog, another one here and of course there is the official documentation on the Elasticsearch website.

Significant Terms

Elasticsearch 1.1 contains a new aggregation, the significant terms aggregation. It allows you to do something very useful: For each bucket that is created you can see the terms that make this bucket special.

Significant terms are calculated by comparing a foreground frequency (which is the frequency of the bucket you are interested in) with a background frequency (which for Elasticsearch 1.1 always is the frequency of the complete index). This means it will collect any results that have a high frequency for the current bucket but not for the complete index.

For our example we can now check for the hashtags that are often used with a certain mention. This is different from what the terms aggregation does. The significant terms aggregation only returns those terms that occur often for a certain mentioned user but not as frequently for all users. This is what Mark Harwood calls the uncommonly common.

curl -XGET "http://localhost:9200/devoxx/tweet/_search" -d'
{
    "size" : 0,
    "aggs" : {
        "mentions" : {
            "terms" : {
                "field" : "mention.screen_name"
            },
            "aggs" : {
                "uncommonhashtags" : {
                    "significant_terms" : {
                        "field" : "hashtag.text"
                    }
                }
            }
        }
    }
}'

We request a normal terms aggregation for the mentioned users. Using a nested significant_terms aggregation we can see any hashtags that are often used with the mentioned user but not so often in the whole index. This is a snippet for the account of Brian Goetz:

{
    "key": "BrianGoetz",
    "doc_count": 173,
    "uncommonhashtags": {
        "doc_count": 173,
        "buckets": [
            {
                "key": "lambda",
                "doc_count": 13,
                "score": 1.8852860861614915,
                "bg_count": 33
            },
            {
                "key": "jdk8",
                "doc_count": 8,
                "score": 0.7193691737111163,
                "bg_count": 32
            },
            {
                "key": "java",
                "doc_count": 21,
                "score": 0.6601749139630457,
                "bg_count": 216
            },
            {
                "key": "performance",
                "doc_count": 4,
                "score": 0.6574225667412876,
                "bg_count": 9
            },
            {
                "key": "keynote",
                "doc_count": 9,
                "score": 0.5442707998673785,
                "bg_count": 52
            },
            [...]

You can see that there are some tags that are targeted a lot at the keynote by Brian Goetz and are not that common for the whole index.
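To get a feeling for how foreground and background frequency turn into a score, here is a rough Python sketch. It assumes the score is the absolute change in frequency multiplied by the relative change, which is consistent with the numbers in the excerpt above but is not taken from the Elasticsearch source. The function name jlh_score is my own, and the index size of 11457 documents is an assumption: the post does not state it, and the value was picked because it reproduces the scores shown.

```python
def jlh_score(fg_count, fg_total, bg_count, bg_total):
    """Absolute change times relative change of a term's frequency (assumed heuristic)."""
    fg_freq = fg_count / fg_total  # frequency of the term inside the bucket
    bg_freq = bg_count / bg_total  # frequency of the term in the whole index
    if bg_freq == 0 or fg_freq <= bg_freq:
        return 0.0  # term is not more common in the bucket than in the index
    return (fg_freq - bg_freq) * (fg_freq / bg_freq)


BUCKET_SIZE = 173    # doc_count of the BrianGoetz bucket, from the excerpt
INDEX_SIZE = 11457   # ASSUMPTION: total number of documents, not stated in the post

# term -> (doc_count, bg_count), copied from the excerpt
terms = {"lambda": (13, 33), "jdk8": (8, 32), "java": (21, 216)}

for term, (doc_count, bg_count) in terms.items():
    score = jlh_score(doc_count, BUCKET_SIZE, bg_count, INDEX_SIZE)
    print(term, round(score, 4))
```

Note how java has the highest doc_count in the bucket but ranks below lambda: it is common everywhere in the index, so its relative change is much smaller.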

Some more ideas what we could look at with the significant terms aggregation:

Besides these impressive analytics features, significant terms can also be used for search applications. A useful example is given in the Elasticsearch documentation itself: if a user searches for "bird flu", automatically display a link to a search for H5N1, which should be very common in the result documents but not in the corpus as a whole.

Conclusion

With significant terms Elasticsearch has again added a feature that might very well offer surprising new applications and use cases for search. Not only is it important for analytics but it can also be used to improve classic search applications. Mark Harwood has collected some really interesting use cases on the Elasticsearch blog. If you'd like to read another post on the topic you can see this post at QBox-Blog that introduces significant terms as well as the percentile and cardinality aggregations.