Tuesday, September 10, 2013

Simple Event Analytics with ElasticSearch and the Twitter River

Tweets can say a lot about an event. Which hashtags people use and when they tweet can both be interesting to look at. Some of the questions you might want answered:

  • Who tweeted the most?
  • What are the dominant keywords/hashtags?
  • When do people tweet the most?
  • And, most importantly: Is there a correlation between the time of day and the number of tweets mentioning coffee or beer?

During this year's FrOSCon I indexed all relevant tweets in ElasticSearch using the Twitter River. In this post I'll show you how to index tweets in ElasticSearch so that you have a dataset to run analytics on. We will see how to answer the first two questions using the ElasticSearch Query DSL. Next week I will show how Kibana can help you get a visual representation of the data.

Indexing Tweets in ElasticSearch

To run ElasticSearch you need to have a recent version of Java installed. Then you can just download the archive and unpack it. It contains a bin directory with the necessary scripts to start ElasticSearch:

bin/elasticsearch -f

The -f option keeps ElasticSearch running in the foreground, so you can stop it with Ctrl-C. You can check that your installation is working by opening http://localhost:9200 in your browser.

After stopping it again, we need to install the ElasticSearch Twitter River, which uses the Twitter streaming API to fetch all the tweets we are interested in.

bin/plugin -install elasticsearch/elasticsearch-river-twitter/1.4.0

Twitter doesn't allow anonymous access to its API anymore, so you need to register for OAuth access at https://dev.twitter.com/apps. Choose a name for your application and generate the key and token. You will need them to configure the plugin via the REST API. In the configuration you pass your OAuth information, any keywords you would like to track, and the index that should be used to store the data.

curl -XPUT localhost:9200/_river/frosconriver/_meta -d '
{
    "type" : "twitter",
    "twitter" : {
        "oauth" : {
            "consumer_key" : "YOUR_KEY",
            "consumer_secret" : "YOUR_SECRET",
            "access_token" : "YOUR_TOKEN",
            "access_token_secret" : "YOUR_TOKEN_SECRET"
        },
        "filter" : {
            "tracks" : "froscon"
        }
    },
    "index" : {
        "index" : "froscon",
        "type" : "tweet",
        "bulk_size" : 1
    }
}
'

The index doesn't need to exist yet; it will be created automatically. I am using a bulk size of 1 because there aren't many tweets. If you are indexing a lot of data, consider a higher value so that documents are sent to the index in larger batches.
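If you register rivers for several events, the request body can be generated instead of edited by hand. The following is only a sketch in Python: it reuses the field names from the curl call, while the helper name build_river_meta, its parameters, and the bulk size of 100 are made up for illustration.

```python
import json


def build_river_meta(track, index_name, oauth, bulk_size=1):
    """Return the body for PUT /_river/<river_name>/_meta as a dict."""
    return {
        "type": "twitter",
        "twitter": {
            "oauth": dict(oauth),  # copy so the caller's dict is not shared
            "filter": {"tracks": track},
        },
        "index": {
            "index": index_name,
            "type": "tweet",
            "bulk_size": bulk_size,
        },
    }


# Placeholder credentials, same as in the curl call.
OAUTH = {
    "consumer_key": "YOUR_KEY",
    "consumer_secret": "YOUR_SECRET",
    "access_token": "YOUR_TOKEN",
    "access_token_secret": "YOUR_TOKEN_SECRET",
}

# A busier event might use a larger bulk size; 100 is an arbitrary example.
meta = build_river_meta("froscon", "froscon", OAUTH, bulk_size=100)
print(json.dumps(meta, indent=4))
```

The printed JSON can be saved to a file and passed to curl with -d @filename in place of the inline body.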

After issuing the call you should see some information in the logs that the river is starting and receiving data. You can see how many tweets there are in your index by issuing a count query:

curl 'localhost:9200/froscon/_count?pretty=true'
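To watch the index fill up over time, the count can also be read from a script. This is a rough sketch using only the Python standard library; the function names are hypothetical, and the sample response body (including the number 42) is invented to show the shape of the answer rather than taken from the FrOSCon data.

```python
import json
from urllib.request import urlopen


def parse_count(response_text):
    """Extract the document count from a _count response body."""
    return json.loads(response_text)["count"]


def fetch_count(base_url="http://localhost:9200", index="froscon"):
    """Ask a running ElasticSearch node for the number of documents in an index."""
    with urlopen(f"{base_url}/{index}/_count") as resp:
        return parse_count(resp.read().decode("utf-8"))


# Invented example of a _count response; only the "count" field is used here.
sample = '{"count": 42, "_shards": {"total": 5, "successful": 5, "failed": 0}}'
print(parse_count(sample))
```

With a node running locally, fetch_count() performs the same request as the curl call.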

You can see the basic structure of the documents created by looking at the mapping that is created automatically.

http://localhost:9200/froscon/_mapping?pretty=true

The result is quite long, so I am not replicating it here. It contains all the relevant information you might be interested in, such as the user who tweeted, the location of the user, the text, the mentions and any links in the tweet.

Doing Analytics Using the ElasticSearch REST API

Once you have enough tweets indexed, you can do some analytics using the ElasticSearch REST API and the Query DSL. This requires some understanding of the query syntax, but you should be able to get started by skimming through the documentation.

Top Tweeters

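First, we'd like to see who tweeted the most. This can be done by querying for all documents and faceting on the user name. The names and their tweet counts are then returned in a separate facets section of the response. Setting size to 0 suppresses the matching documents themselves, so only the facet is returned.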

curl -X POST "http://localhost:9200/froscon/_search?pretty=true" -d '
  {
    "size": 0,
    "query" : {
      "match_all" : {}
    },
    "facets" : {
      "user" : { 
        "terms" : {
          "field" : "user.screen_name"
        } 
      }                            
    }
  }
'
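The counts come back under facets.user.terms in the response. As a sketch of how to read them from a script, the Python snippet below parses a shortened response. The screen names and counts in it are invented placeholders and not the FrOSCon results; the helper name top_terms is hypothetical as well.

```python
import json

# Hypothetical, shortened response: the screen names and counts are invented
# for illustration and are not the FrOSCon results.
SAMPLE_RESPONSE = """
{
  "hits": {"total": 6, "max_score": 0.0, "hits": []},
  "facets": {
    "user": {
      "_type": "terms",
      "missing": 0,
      "total": 6,
      "other": 0,
      "terms": [
        {"term": "alice", "count": 3},
        {"term": "bob", "count": 2},
        {"term": "carol", "count": 1}
      ]
    }
  }
}
"""


def top_terms(response_text, facet_name):
    """Return (term, count) pairs of a terms facet, in the order they were returned."""
    response = json.loads(response_text)
    return [(entry["term"], entry["count"])
            for entry in response["facets"][facet_name]["terms"]]


for name, count in top_terms(SAMPLE_RESPONSE, "user"):
    print(f"{name}: {count}")
```

The same helper works for any terms facet, since only the facet name changes.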

Those are the top tweeters for FrOSCon:

Dominant Keywords

The dominant keywords can also be retrieved using a facet query, this time on the text of the tweet. There are a lot of German tweets for FrOSCon, and the text field is processed using the StandardAnalyzer, which only removes English stopwords, so it might be necessary to exclude some German terms explicitly. You might also want to remove other common terms that indicate retweets or are part of URLs.

curl -X POST "http://localhost:9200/froscon/_search?pretty=true" -d '
  {
    "size": 0,
    "query" : {
      "match_all" : {}
    },
    "facets" : {
      "keywords" : { 
        "terms" : {
          "field" : "text", 
          "exclude" : ["froscon", "rt", "t.co", "http", "der", "auf", "ich", "my", "die", "und", "wir", "von"] 
        }
      }                            
    }
  }
'
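The exclude list tends to grow as you inspect the results, so it can help to keep it in named groups and generate the request body. This is a sketch under assumptions: the grouping, the constant names and the helper build_keyword_query are invented for illustration, and size is the terms facet option for the number of entries to return (20 is an arbitrary choice).

```python
import json

# Terms from the exclude list of the request, grouped only for readability.
EVENT_TERMS = ["froscon"]
TWITTER_NOISE = ["rt", "t.co", "http"]
STOPWORDS = ["der", "auf", "ich", "my", "die", "und", "wir", "von"]


def build_keyword_query(field, exclude, size=10):
    """Build a match_all search with a terms facet that skips the given terms."""
    return {
        "size": 0,  # no hits, only the facet
        "query": {"match_all": {}},
        "facets": {
            "keywords": {
                "terms": {
                    "field": field,
                    "size": size,  # number of facet entries to return
                    "exclude": list(exclude),
                }
            }
        },
    }


query = build_keyword_query("text", EVENT_TERMS + TWITTER_NOISE + STOPWORDS, size=20)
print(json.dumps(query, indent=2))
```

The printed body can be posted to the _search endpoint exactly like the handwritten one.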

Those are the dominant keywords for FrOSCon:

  • talk (no surprise for a conference)
  • slashme
  • teamix (a company that does very good marketing. Unfortunately, in this case the attention is mostly because their fluffy Tux got stolen. The tweet about it is the most retweeted tweet in the data.)

Summary

Using the Twitter River, it is really easy to get some data into ElasticSearch. The Query DSL makes it easy to extract useful information. Next week we will have a look at Kibana, which doesn't necessarily require a deep understanding of ElasticSearch queries and can visualize our data.

About Florian Hopf

I am working as a freelance software developer and consultant in Karlsruhe, Germany. If you liked this post you can follow me on Twitter or subscribe to my feed to get notified of new posts. If you think I could help you and your company and you'd like to work with me please contact me directly.
