Thursday, July 23, 2015

ActiveMQ as a Message Broker for Logstash

When scaling Logstash it is common to add a message broker that temporarily buffers incoming messages before they are processed by one or more Logstash nodes. Data is pushed to the broker either by a shipper like Beaver that reads logfiles and sends each event to the broker, or by the application itself, which can send log events directly using something like a Log4j appender.

A common option is to use Redis as a broker that stores the data in memory, but other options like Apache Kafka are possible as well. Sometimes organizations are not that keen to introduce lots of new technology and prefer to reuse existing infrastructure. ActiveMQ is a widely used messaging and integration platform that supports different protocols and looks like a perfect fit for use as a message broker. Let's look at the options for integrating it.

Setting up ActiveMQ

ActiveMQ can easily be set up using the scripts that ship with it. On Linux it's just a matter of executing ./activemq console. Using the admin console at http://127.0.0.1:8161/admin/ you can create new queues and even enqueue messages for testing.
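For reference, a minimal setup on a Linux machine might look like the following; the version number is only an example, use whatever release you downloaded from the ActiveMQ site.

# unpack the downloaded release (version is just an example)
tar xzf apache-activemq-5.11.1-bin.tar.gz
cd apache-activemq-5.11.1/bin

# start the broker in the foreground
./activemq console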

Consuming messages with AMQP

An obvious way to try to connect ActiveMQ to Logstash is using AMQP, the Advanced Message Queuing Protocol. It's a standard protocol that is supported by different messaging platforms.

There used to be a Logstash input for AMQP but unfortunately it has been renamed to rabbitmq-input because RabbitMQ is the main system that is supported.

Let's see what happens if we try to use the input with ActiveMQ.

input {
    rabbitmq {
        host => "localhost"
        queue => "TestQueue"
        port => 5672
    }
}

output {
    stdout {
        codec => "rubydebug"
    }
}

We tell Logstash to connect to localhost on the standard port and consume messages from a queue named TestQueue. The result should just be dumped to standard output. Unfortunately Logstash only issues errors because it can't connect.

Logstash startup completed
RabbitMQ connection error: . Will reconnect in 10 seconds... {:level=>:error}

In the ActiveMQ logs we can see that our parameters are correct, but unfortunately the two systems speak different dialects of AMQP: the Logstash input talks AMQP 0.9.1 while ActiveMQ only accepts AMQP 1.0 connections.

 WARN | Connection attempt from non AMQP v1.0 client. AMQP,0,0,9,1
org.apache.activemq.transport.amqp.AmqpProtocolException: Connection from client using unsupported AMQP attempted
...

So bad luck with this option.

Consuming messages with STOMP

The aptly named Simple Text Oriented Messaging Protocol is another option that is supported by ActiveMQ. Fortunately there is a dedicated input for it. It is not included in Logstash by default but can be installed easily.

bin/plugin install logstash-input-stomp

Afterwards we can just use it in our Logstash config.

input {
    stomp {
        host => "localhost"
        destination => "TestQueue"
    }
}

output {
    stdout {
        codec => "rubydebug"
    }
}

This time we are better off: Logstash can connect and dumps our message to standard output.

bin/logstash --config stomp.conf 
Logstash startup completed
{
       "message" => "Can I kick it...",
      "@version" => "1",
    "@timestamp" => "2015-07-22T05:42:35.016Z"
}

Consuming messages with JMS

Though the stomp input works there is yet another option that is not released yet but can already be tested: the jms input supports JMS, the Java Message Service, which is the standard way of doing messaging on the JVM.

Currently you need to build the plugin yourself (which didn't work on my machine, most likely because of my outdated local JRuby installation).
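In case you want to give it a try yourself, building and installing an unreleased input plugin roughly follows these steps; the gem file name is a placeholder that depends on the plugin version, and the exact commands may differ slightly between Logstash versions.

git clone https://github.com/logstash-plugins/logstash-input-jms.git
cd logstash-input-jms
gem build logstash-input-jms.gemspec

# from the Logstash installation directory, install the locally built gem
bin/plugin install /path/to/logstash-input-jms/logstash-input-jms-<version>.gem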

Getting data in ActiveMQ

Now that we know of ways to consume data from ActiveMQ it is time to think about how to get data in. When using Java you can use something like a Log4j or Logback appender that pushes the log events directly to the queue using JMS.
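To illustrate what such an appender does internally, here is a minimal Java sketch that pushes a single log event to the TestQueue using the plain JMS API and the ActiveMQ client library. Only the queue name and the default broker URL are taken from the examples above, everything else is an assumption; a real appender would of course hook into the logging framework and serialize the events, e.g. as JSON.

import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;

public class LogEventProducer {

    public static void main(String[] args) throws Exception {
        // connect to a local broker on the default OpenWire port
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616");
        Connection connection = factory.createConnection();
        connection.start();
        try {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("TestQueue");
            MessageProducer producer = session.createProducer(queue);

            // a real appender would serialize the full log event here
            TextMessage message = session.createTextMessage("Can I kick it...");
            producer.send(message);
        } finally {
            connection.close();
        }
    }
}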

When it comes to shipping data, unfortunately none of the more popular shippers seems to be able to push data to ActiveMQ. If you know of a solution that can be used it would be great if you could leave a comment.

All in all it should be possible to use ActiveMQ as a broker for Logstash, but it might require some more work when it comes to shipping data.

Friday, February 6, 2015

Fixing Elasticsearch Allocation Issues

Last week I was working with some Logstash data on my laptop. There are around 350 indices that contain the Logstash data and one index that holds the metadata for Kibana 4. When starting the single node cluster I have to wait a while until all indices are available. Some APIs can be used to see the progress of the startup process.

The cluster health API gives general information about the state of the cluster and indicates if the cluster health is green, yellow or red. After a while the number of unassigned shards didn't change anymore but the cluster still stayed in a red state.

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 1850,
  "active_shards" : 1850,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1852
}

One shard couldn't be recovered: 1850 primary shards were active but there should have been 1851. To see the problem we can use the cat indices API that shows us all indices and their health.

curl http://localhost:9200/_cat/indices
[...]
yellow open logstash-2014.02.16 5 1 1184 0   1.5mb   1.5mb 
red    open .kibana             1 1                        
yellow open logstash-2014.06.03 5 1 1857 0     2mb     2mb 
[...]

The .kibana index didn't turn yellow. It only consists of one primary shard that couldn't be allocated.

Restarting the node and closing and opening the index didn't help. Looking at elasticsearch-kopf I could see that the primary and the replica shard both were unassigned (you need to tick the checkbox that says hide special to see the index).

Fortunately there is a way to bring the cluster in a yellow state again. We can manually allocate the primary shard on our node.

Elasticsearch provides the Cluster Reroute API that can be used to allocate a shard on a node. When trying to allocate the shard of the index .kibana I first got an exception.

curl -XPOST "http://localhost:9200/_cluster/reroute" -d'
{
    "commands" : [ {
          "allocate" : {
              "index" : ".kibana", "shard" : 0, "node" : "Jebediah Guthrie"
          }
        }
    ]
}'

[2015-01-30 13:35:47,848][DEBUG][action.admin.cluster.reroute] [Jebediah Guthrie] failed to perform [cluster_reroute (api)]
org.elasticsearch.ElasticsearchIllegalArgumentException: [allocate] trying to allocate a primary shard [.kibana][0], which is disabled

Fortunately the message already tells us the problem: by default you are not allowed to allocate primary shards because of the danger of losing data. If you'd like to allocate a primary shard you need to tell Elasticsearch explicitly by setting the property allow_primary.

curl -XPOST "http://localhost:9200/_cluster/reroute" -d'
{
    "commands" : [ {
          "allocate" : {
              "index" : ".kibana", "shard" : 0, "node" : "Jebediah Guthrie", "allow_primary": "true"
          }
        }
    ]
}'

For me this helped and my shard got reallocated and the cluster health turned yellow.

I am not sure what caused the problems but it is very likely related to the way I am working locally: I regularly send my laptop to sleep, which is something you would never do on a server. Nevertheless I have seen this problem a few times locally, which justifies writing down the steps necessary to fix it.

Friday, January 23, 2015

Logging to Redis using Spring Boot and Logback

When doing centralized logging, e.g. using Elasticsearch, Logstash and Kibana or Graylog2, you have several options available for your Java application. You can write your standard application logs and parse those using Logstash, either consuming them directly or shipping them to another machine using something like logstash-forwarder. Alternatively you can write in a more appropriate format like JSON directly, so the processing step doesn't need that much work for parsing your messages. A third option is to write to a different data store directly, which acts as a buffer for your log messages. In this post we are looking at how we can configure Logback in a Spring Boot application to write the log messages to Redis directly.

Redis

We are using Redis as a log buffer for our messages. Not everyone is happy with Redis but it is a common choice. Redis stores its content in memory, which makes it well suited for fast access, but it can also sync the data to disk when necessary. A special feature of Redis is that values can be of different data types like strings, lists or sets. Our application uses a single key and value pair where the key is the name of the application and the value is a list that contains all our log messages. This way we can handle several logging applications in one Redis instance.

When testing your setup you might also want to look into the data that is stored in Redis. You can access it using the redis-cli client. I collected some useful commands for validating your log messages are actually written to Redis.

Command             Description
KEYS *              Show all keys in this Redis instance
LLEN key            Show the number of messages in the list for key
LRANGE key 0 99     Show the first 100 messages in the list for key
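For example, assuming the application writes to the key my-spring-boot-app that is configured below, a quick check from redis-cli could look like this:

redis-cli
LLEN my-spring-boot-app
LRANGE my-spring-boot-app 0 9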

The Logback Config

When working with Logback most of the time an XML file is used for all the configuration. Appenders are the components that send the log output somewhere. Loggers are used to set log levels and attach appenders to certain parts of the application.

With Spring Boot, Logback is available for any application that uses spring-boot-starter-logging, which is also a dependency of the common spring-boot-starter-web. The configuration can be added to a file called logback.xml that resides in src/main/resources.

Spring Boot comes with a file and a console appender that are already configured correctly. We can include the base configuration in our file to keep all the predefined settings.

For logging to Redis we need to add another appender. A good choice is the logback-redis-appender, which is rather lightweight and uses the Java client Jedis. The log messages are written to Redis as JSON directly so it's a perfect match for something like Logstash. We can make Spring Boot log to a local instance of Redis by using the following configuration.

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <include resource="org/springframework/boot/logging/logback/base.xml"/>
    <appender name="LOGSTASH" class="com.cwbase.logback.RedisAppender">
        <host>localhost</host>
        <port>6379</port>
        <key>my-spring-boot-app</key>
    </appender>
    <root level="INFO">
        <appender-ref ref="LOGSTASH" />
        <appender-ref ref="CONSOLE" />
        <appender-ref ref="FILE" />
    </root>
</configuration>

We configure an appender named LOGSTASH that is an instance of RedisAppender. Host and port are set for a local Redis instance, key identifies the Redis key that is used for our logs. There are more options available like the interval to push log messages to Redis. Explore the readme of the project for more information.

Spring Boot Dependencies

To make the logging work we of course have to add a dependency on the logback-redis-appender to our pom. Depending on your Spring Boot version you might see some errors in your log file complaining that methods are missing.

This is because Spring Boot manages the dependencies it uses internally and the versions for jedis and commons-pool2 do not match the ones that we need. If this happens we can configure the versions to use in the properties section of our pom.

<properties>
    <commons-pool2.version>2.0</commons-pool2.version>
    <jedis.version>2.5.2</jedis.version>
</properties>
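For completeness, the appender dependency itself might be declared like this; the version is only an example, check the project page for the current coordinates.

<dependency>
    <groupId>com.cwbase</groupId>
    <artifactId>logback-redis-appender</artifactId>
    <!-- version is an example, use the latest release -->
    <version>1.1.3</version>
</dependency>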

Now the application will start and you can see that it sends the log messages to Redis as well.

Enhancing the Configuration

Having the host and port configured in the logback.xml is not the best thing to do. When deploying to another environment with different settings you have to change the file or deploy a custom one.

The Spring Boot integration of Logback allows setting some of the configuration options, like the file to log to and the log levels, using the main configuration file application.properties. Unfortunately this is special treatment for a fixed set of values and, as far as I could see, you can't add custom ones.

But fortunately Logback supports the use of environment variables so we don't have to rely on configuration files. Having set the environment variables REDIS_HOST and REDIS_PORT you can use the following configuration for your appender.

    <appender name="LOGSTASH" class="com.cwbase.logback.RedisAppender">
        <host>${REDIS_HOST}</host>
        <port>${REDIS_PORT}</port>
        <key>my-spring-boot-app</key>
    </appender>

We can even go one step further. To only activate the appender when the property is set you can add conditional processing to your configuration.

    <if condition='isDefined("REDIS_HOST") &amp;&amp; isDefined("REDIS_PORT")'>
        <then>
            <appender name="LOGSTASH" class="com.cwbase.logback.RedisAppender">
                <host>${REDIS_HOST}</host>
                <port>${REDIS_PORT}</port>
                <key>my-spring-boot-app</key>
            </appender>
        </then>
    </if>

You can use a Java expression for deciding whether the block should be evaluated. When the appender is not available Logback will just log an error and use any other appenders that are configured. For this to work you need to add the Janino library to your pom.
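A minimal sketch of that dependency, assuming the standard Janino coordinates; the version is only an example.

<dependency>
    <groupId>org.codehaus.janino</groupId>
    <artifactId>janino</artifactId>
    <!-- version is an example, use a current release -->
    <version>2.7.8</version>
</dependency>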

Now the appender is activated based on the environment variables. If you like you can skip the setup for local development and only set the variables on production systems.

Conclusion

Getting started with Spring Boot or logging to Redis alone is very easy, but getting some of the details right takes a bit of work. It's worth the effort though: once you get used to centralized logging you don't want to have your systems running without it anymore.

Friday, December 12, 2014

Use Cases for Elasticsearch: Analytics

In the last post in this series we have seen how we can use Logstash, Elasticsearch and Kibana for doing logfile analytics. This week we will look at the general capabilities for doing analytics on any data using Elasticsearch and Kibana.

Use Case

We have already seen that Elasticsearch can be used to store large amounts of data. Instead of putting data into a data warehouse, Elasticsearch can be used to do analytics and reporting. Another use case is social media data: companies can look at what happens with their brand if they have the possibility to easily search it. Data can be ingested from multiple sources, e.g. Twitter and Facebook, and combined in one system. Visualizing data in tools like Kibana can help with exploring large data sets. Finally, mechanisms like Elasticsearch's aggregations can help with finding new ways to look at the data.

Aggregations

Aggregations provide everything the now deprecated facets have been providing, and a lot more. They can combine and count values from different documents and therefore show you what is contained in your data. For example, if you have tweets indexed in Elasticsearch you can use the terms aggregation to find the most common hashtags. For details on indexing tweets in Elasticsearch see this post on the Twitter River and this post on the Twitter input for Logstash.

curl -XGET "http://localhost:9200/devoxx/tweet/_search" -d'
{
    "aggs" : {
        "hashtags" : {
            "terms" : { 
                "field" : "hashtag.text" 
            }
        }
    }
}'

Aggregations are requested using the aggs keyword, hashtags is a name I have chosen to identify the result and the terms aggregation counts the different terms for the given field (Disclaimer: For a sharded setup the terms aggregation might not be totally exact). This request might result in something like this:

"aggregations": {
      "hashtags": {
         "buckets": [
            {
               "key": "dartlang",
               "doc_count": 229
            },
            {
               "key": "java",
               "doc_count": 216
            },
[...]

The result is available under the name we have chosen. Aggregations put the counts into buckets that consist of a value and a count. This is very similar to how faceting works, only the names are different. For this example we can see that there are 229 documents for the hashtag dartlang and 216 containing the hashtag java.

This could also be done with facets alone but there is more: aggregations can even be combined. You can nest another aggregation in the first one that for every bucket will give you more buckets for another criterion.

curl -XGET "http://localhost:9200/devoxx/tweet/_search" -d'
{
    "aggs" : {
        "hashtags" : {
            "terms" : { 
                "field" : "hashtag.text" 
            },
            "aggs" : {
                "hashtagusers" : {
                    "terms" : {
                        "field" : "user.screen_name"
                    }
                }
            }
        }
    }
}'

We still request the terms aggregation for the hashtag. But now we have another aggregation embedded, a terms aggregation that processes the user name. This will then result in something like this.

               "key": "scala",
               "doc_count": 130,
               "hashtagusers": {
                  "buckets": [
                     {
                        "key": "jaceklaskowski",
                        "doc_count": 74
                     },
                     {
                        "key": "ManningBooks",
                        "doc_count": 3
                     },
    [...]

We can now see the users that have used a certain hashtag. In this case one user used one hashtag a lot. This is information that is not available that easily with queries and facets alone.

Besides the terms aggregation we have seen here there are lots of other interesting aggregations available and more are added with every release. You can choose between bucket aggregations (like the terms aggregation) and metrics aggregations that calculate values from the buckets, e.g. averages or other statistical values.
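As a sketch of such a combination, the following request nests an avg metrics aggregation inside the hashtag buckets; it assumes the tweets contain a numeric field user.followers_count, which is not shown in the examples above.

curl -XGET "http://localhost:9200/devoxx/tweet/_search" -d'
{
    "aggs" : {
        "hashtags" : {
            "terms" : { 
                "field" : "hashtag.text" 
            },
            "aggs" : {
                "avg_followers" : {
                    "avg" : {
                        "field" : "user.followers_count"
                    }
                }
            }
        }
    }
}'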

Visualizing the Data

Besides the JSON output we have seen above, the data can also be used for visualizations. This is something that can be prepared even for a non-technical audience. Kibana is one of the options that is often used for logfile data but it can be used for data of all kinds, e.g. the Twitter data we have already seen above.

There are two bar charts that display the term frequencies for the mentions and the hashtags. We can already see easily which values are dominant. Also, the date histogram to the right shows at what time most tweets are sent. All in all these visualizations can provide a lot of value when it comes to trends that are only seen when combining the data.

The image shows Kibana 3, which still relies on the facet feature. Kibana 4 will instead provide access to the aggregations.

Conclusion

This post ends the series on use cases for Elasticsearch. I hope you enjoyed reading it and maybe you learned something new along the way. I can't spend that much time blogging anymore but new posts will be coming. Keep an eye on this blog.

Friday, September 19, 2014

Use Cases for Elasticsearch: Index and Search Log Files

In the last posts we have seen some of the properties of using Elasticsearch as a document store, for searching text content and for geospatial search. In this post we will look at how it can be used to index and store log files, a very useful application that can help developers and operations with maintaining applications.

Logging

When maintaining larger applications that are either distributed across several nodes or consist of several smaller applications, searching for events in log files can become tedious. You might already have been in the situation that you have to find an error and need to log in to several machines and look at several log files. Using Linux tools like grep can be fun sometimes but there are more convenient ways. Elasticsearch and the projects Logstash and Kibana, commonly known as the ELK stack, can help you with this.

With the ELK stack you can centralize your logs by indexing them in Elasticsearch. This way you can use Kibana to look at all the data without having to log in on the machine. This can also make Operations happy as they don't have to grant access to every developer who needs to have access to the logs. As there is one central place for all the logs you can even see different applications in context. For example you can see the logs of your Apache webserver combined with the log files of your application server, e.g. Tomcat. As search is core to what Elasticsearch is doing you should be able to find what you are looking for even more quickly.

Finally Kibana can also help you with becoming more proactive. As all the information is available in real time you also have a visual representation of what is happening in your system in real time. This can help you in finding problems more quickly, e.g. you can see that some resource starts throwing Exceptions without having your customers report it to you.

The ELK Stack

For logfile analytics you can use all three applications of the ELK stack: Elasticsearch, Logstash and Kibana. Logstash is used to read and enrich the information from log files. Elasticsearch is used to store all the data and Kibana is the frontend that provides dashboards to look at the data.

The logs are fed into Elasticsearch using Logstash that combines the different sources. Kibana is used to look at the data in Elasticsearch. This setup has the advantage that different parts of the log file processing system can be scaled differently. If you need more storage for the data you can add more nodes to the Elasticsearch cluster. If you need more processing power for the log files you can add more nodes for Logstash.

Logstash

Logstash is a JRuby application that can read input from several sources, modify it and push it to a multitude of outputs. For running Logstash you need to pass it a configuration file that determines where the data is and what should be done with it. The configuration normally consists of an input and an output section and an optional filter section. This example takes the Apache access logs, does some predefined processing and stores them in Elasticsearch:

input {
  file {
    path => "/var/log/apache2/access.log"
  }
}

filter {
  grok {
    match => { message => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch_http {
    host => "localhost"
  }
}

The file input reads the log files from the path that is supplied. In the filter section we have defined the grok filter that parses unstructured data and structures it. It comes with lots of predefined patterns for different systems. In this case we are using the complete Apache log pattern but there are also more basic building blocks like patterns for parsing email and IP addresses and dates (which can be lots of fun with all the different formats).
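If you also want the event timestamp to be taken from the log line instead of the time of processing, a date filter can be added to the filter section. A minimal sketch, assuming the standard Apache timestamp format that the COMBINEDAPACHELOG pattern captures in the timestamp field:

filter {
  grok {
    match => { message => "%{COMBINEDAPACHELOG}" }
  }
  date {
    # parse the Apache timestamp and use it as the event's @timestamp
    match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
  }
}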

In the output section we are telling Logstash to push the data to Elasticsearch using HTTP. We are using a server on localhost; for most real world setups this would be a cluster on separate machines.

Kibana

Now that we have the data in Elasticsearch we want to look at it. Kibana is a JavaScript application that can be used to build dashboards. It accesses Elasticsearch from the browser so whoever uses Kibana needs to have access to Elasticsearch.

When using it with Logstash you can open a predefined dashboard that will pull some information from your index. You can then display charts, maps and tables for the data you have indexed. This screenshot displays a histogram and a table of log events but there are more widgets available like maps and pie and bar charts.

As you can see you can extract a lot of data visually that would otherwise be buried in several log files.

Conclusion

The ELK stack can be a great tool to read, modify and store log events. Dashboards help with visualizing what is happening. There are lots of inputs in Logstash and the grok filter supplies lots of different formats. Using those tools you can consolidate and centralize all your log files.

Lots of people are using the stack for analyzing their log file data. One of the articles that is available is by Mailgun, who are using it to store billions of events. And if that's not enough, read this post on how CERN uses the ELK stack to help run the Large Hadron Collider.

In the next post we will look at the final use case for Elasticsearch: Analytics.

Friday, August 29, 2014

Use Cases for Elasticsearch: Geospatial Search

In the previous posts we have seen that Elasticsearch can be used to store documents in JSON format and distribute the data across multiple nodes as shards and replicas. Lucene, the underlying library, provides the implementation of the inverted index that can be used to search the documents. Analyzing is a crucial step for building a good search application.

In this post we will look at a different feature that can be used for applications you would not immediately associate Elasticsearch with. We will look at the geo features that can be used to build applications that can filter and sort documents based on the location of the user.

Locations in Applications

Location based features can be useful for a wide range of applications. For merchants the web site can present the closest point of service for the current user. Or there is a search facility for finding points of service according to a location, often integrated with something like Google Maps. For classifieds it can make sense to sort them by distance from the user searching; the same is true for any search for locations like restaurants and the like. Sometimes it also makes sense to only show results that are within a certain area around the user; in this case we need to filter by distance. Maybe the user is looking for a new apartment and is not interested in results that are too far away from his workplace. Finally, locations can also be of interest when doing analytics. Social media data can tell you where something interesting is happening just by looking at the amount of status messages sent from a certain area.

Most of the time locations are stored as a pair of latitude and longitude, which denotes a point. The combination of 48.779506, 9.170045 for example points to the Liederhalle Stuttgart, which happens to be the venue of Java Forum Stuttgart. Geohashes are an alternative means to encode latitude and longitude. They can be stored with arbitrary precision, so a geohash can also refer to a larger area instead of a point.

When calculating a geohash the map is divided into several buckets or cells. Each bucket is identified by a base 32 encoded value. The complete geohash then consists of a sequence of characters. Each following character marks a bucket within the previous bucket, so you are zooming in on the location. The longer the geohash string, the more precise the location. For example u0wt88j3jwcp is the geohash for the Liederhalle Stuttgart. The prefix u0wt on the other hand covers the area of Stuttgart and some of the surrounding cities.

The hierarchical nature of geohashes and the possibility to express them as strings make them a good choice for storing in the inverted index. You can create geohashes using the original geohash service or, more visually appealing, using the nice GeohashExplorer.

Locations in Elasticsearch

Elasticsearch accepts lat and lon for specifying latitude and longitude. These are two documents, one for a conference in Stuttgart and one for a conference in Nuremberg.

{
    "title" : "Anwendungsfälle für Elasticsearch",
    "speaker" : "Florian Hopf",
    "date" : "2014-07-17T15:35:00.000Z",
    "tags" : ["Java", "Lucene"],
    "conference" : {
        "name" : "Java Forum Stuttgart",
        "city" : "Stuttgart",
            "coordinates": {
                "lon": "9.170045",
                "lat": "48.779506"
            }
    } 
}
{
    "title" : "Anwendungsfälle für Elasticsearch",
    "speaker" : "Florian Hopf",
    "date" : "2014-07-15T16:30:00.000Z",
    "tags" : ["Java", "Lucene"],
    "conference" : {
        "name" : "Developer Week",
        "city" : "Nürnberg",
            "coordinates": {
                "lon": "11.115358",
                "lat": "49.417175"
            }
    } 
}

Alternatively you can use the GeoJSON format, accepting an array of longitude and latitude. If you are like me be prepared to hunt down why queries aren't working just to notice that you messed up the order in the array.

The field needs to be mapped with a geo_point field type.

{
    "properties": {
          […],
       "conference": {
            "type": "object",
            "properties": {
                "coordinates": {
                    "type": "geo_point",
                    "geohash": "true",
                    "geohash_prefix": "true"
                }
            }
       }
    }
}

By passing the optional attribute geohash, Elasticsearch will automatically store the geohash for you as well. Depending on your use case you can also store all the parent cells of the geohash using the parameter geohash_prefix. As the values are just strings this is a normal ngram indexing operation which stores the different prefixes of a term, e.g. u, u0, u0w and u0wt for u0wt.

With our documents in place we can now use the geo information for sorting, filtering and aggregating results.

Sorting by Distance

First, let's sort all our documents by distance from a point. This would allow us to build an application that displays the closest location for the current user.

curl -XPOST "http://localhost:9200/conferences/_search " -d'
{
    "sort" : [
        {
            "_geo_distance" : {
                "conference.coordinates" : {
                    "lon": 8.403697,
                    "lat": 49.006616
                },
                "order" : "asc",
                "unit" : "km"
            }
        }
    ]
}'

We are requesting to sort by _geo_distance and are passing in another location, this time Karlsruhe, where I live. Results should be sorted ascending so the closer results come first. As Stuttgart is not far from Karlsruhe it will be first in the list of results.

The score for the documents will be empty. Instead there is a field sort that contains the distance of each location from the one provided. This can be really handy when displaying the results to the user.

Filtering by Distance

For some use cases we would like to filter our results by distance. Some online real estate agencies for example provide the option to only display results that are within a certain distance from a point. We can do the same by passing in a geo_distance filter.

curl -XPOST "http://localhost:9200/conferences/_search" -d'
{
   "filter": {
      "geo_distance": {
         "conference.coordinates": {
            "lon": 8.403697,
            "lat": 49.006616
         },
         "distance": "200km",
         "distance_type": "arc"
      }
   }
}'

We are again passing the location of Karlsruhe. We request that only documents in a distance of 200km should be returned and that the arc distance_type should be used for calculating the distance. This will take into account that we are living on a globe.

The resulting list will only contain one document, Stuttgart, as Nuremberg is just over 200km away. If we use the distance 210km both of the documents will be returned.

Geo Aggregations

Elasticsearch provides several useful geo aggregations that allow you to retrieve more information on the locations of your documents, e.g. for faceting. On the other hand, as we have the geohash as well as its prefixes indexed, we can retrieve all of the cells our results are in using a simple terms aggregation. This way you can let the user drill down on the results by filtering on a cell.

curl -XPOST "http://localhost:9200/conferences/_search" -d'
{
    "aggregations" : {
        "conference-hashes" : {
            "terms" : {
                "field" : "conference.coordinates.geohash"
            }
        }
    }
}'

Depending on the precision we have chosen while indexing this will return a long list of prefixes for hashes but the most important part is at the beginning.

[...]
   "aggregations": {
      "conference-hashes": {
         "buckets": [
            {
               "key": "u",
               "doc_count": 2
            },
            {
               "key": "u0",
               "doc_count": 2
            },
            {
               "key": "u0w",
               "doc_count": 1
            },
            [...]
        }
    }

Stuttgart and Nuremberg both share the parent cells u and u0.

As an alternative to the terms aggregation you can also use specialized geo aggregations, e.g. the geo distance aggregation for forming buckets of distances.
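A sketch of such a request, again measuring from Karlsruhe; the bucket boundaries are arbitrary examples.

curl -XPOST "http://localhost:9200/conferences/_search" -d'
{
    "aggregations" : {
        "conference-distances" : {
            "geo_distance" : {
                "field" : "conference.coordinates",
                "origin" : {
                    "lon": 8.403697,
                    "lat": 49.006616
                },
                "unit" : "km",
                "ranges" : [
                    { "to" : 100 },
                    { "from" : 100, "to" : 300 },
                    { "from" : 300 }
                ]
            }
        }
    }
}'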

Conclusion

Besides the features we have seen here, Elasticsearch offers a wide range of geo features. You can index shapes and query by arbitrary polygons, either by passing them in or by referencing an indexed polygon. When geohash prefixes are turned on you can also filter by geohash cell.
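A sketch of such a geohash cell filter, assuming the mapping with geohash prefixes shown above; the precision value is just an example.

curl -XPOST "http://localhost:9200/conferences/_search" -d'
{
    "filter" : {
        "geohash_cell" : {
            "conference.coordinates" : {
                "lat": 48.779506,
                "lon": 9.170045
            },
            "precision": 4,
            "neighbors": true
        }
    }
}'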

With the new HTML5 geolocation features, location-aware search and content delivery will become more important. Elasticsearch is a good fit for building these kinds of applications.

Two users in the geo space are Foursquare, a very early user of Elasticsearch, and Gild, a recruitment agency that does some magic with locations.

Wednesday, August 13, 2014

Resources for Freelancers

More than half a year ago I wrote a post on my first years as a freelancer. While writing the post I noticed that there are quite a few resources I would like to recommend, which I deferred to another post that never got written. Last weekend at SoCraTes we had a very productive session on freelancing. We talked about different aspects, from getting started, kinds of and reasons for freelancing, to handling your sales pipeline.

David and I recommended some resources on getting started so this is the perfect excuse to write the post I planned to write initially.

I will keep it minimal and only present a short description of each point.

Softwerkskammer Freelancer Group
Our discussion at SoCraTes led to founding a new group in the Softwerkskammer community. We plan to exchange knowledge and probably even work opportunities.
The Freelancers' Show
A podcast on everything freelancing. It started as the Ruby Freelancers but the topics have always been general. Fun to listen to; when it comes to software development you might also want to listen to the Ruby Rogues.
Book Yourself Solid
I read this when getting started with freelancing. Helps you with deciding what you want to do and with marketing yourself.
Get Clients Now
A workbook with daily tasks for improving your business. It's a 28 day program that contains some really good ideas and helps you work on getting more work.
Duct Tape Marketing
A book on improving your marketing activities. I took less out of this book than the other two.
Email Course by Eric Davis
Eric Davis, one of the hosts of the Freelancers' Show also provides a free email course for freelancers.
Mediafon Ratgeber Selbstständige
A German book on all practical issues you have to take care of.

There is also stuff that is only slightly related to freelancing but helped me on the way, either through learning or motivation.

Technical Blogging
A book that can help you get started with blogging. It can be motivating but also contains some good tips.
My Blog Traffic Sucks.
A short book on blogging. This book led to my very frequent blog publishing habit.
Confessions of a Public Speaker
A very good and entertaining read on presenting.
The $100 Startup
Not exactly about freelancing but about small startups of all kinds, a very entertaining read about people working for themselves.

I am sure I forgot some of the things that helped me but I hope one of the resources can help you and your freelancing business. If you are missing something feel free to leave a comment.