Donnerstag, 10. April 2014

The Absolute Basics of Indexing Data

Ever wondered how a search engine works? In this post I would like to show you a high-level view of the internal workings of a search engine and how it can be used to give fast access to your data. I won't go into any technical details; what I am describing here holds true for any Lucene-based search engine, be it Lucene itself, Solr or Elasticsearch.

Input

Normally a search engine is agnostic to the original source of the data it indexes. Most often you push data into it via an API, and that data already needs to be in the expected format, mostly strings and simple data types like integers. It doesn't matter whether this data originally resides in a document in the filesystem, on a website or in a database.

Search engines work with documents that consist of fields and values. Though not always used directly, you can think of documents as JSON documents. For this post imagine we are building a book database. In our simplified world a book just consists of a title and one or more authors. These are two example documents:

{
    "title" : "Search Patterns",
    "authors" : [ "Morville", "Callender" ],
}
{
    "title" : "Apache Solr Enterprise Search Server",
    "authors" : [ "Smiley", "Pugh" ]
}

Even though the structure of both documents is the same in our case, the format of the documents doesn't need to be fixed. Both documents could have totally different attributes; nevertheless both could be stored in the same index. In reality you will try to keep the documents similar; after all, you need a way to handle them in your application.

Lucene itself doesn't even have the concept of a key. But of course you need a key to identify your documents when updating them. Both Solr and Elasticsearch have ids that can either be chosen by the application or be autogenerated.

Analyzing

For every field that is indexed a special process called analyzing is employed. What it does can differ from field to field. For example, in a simple case it might just split the terms on whitespace and remove any punctuation so Search Patterns would become two terms, Search and Patterns.
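To make this more concrete, here is a minimal sketch of how you can inspect what an analyzer produces, assuming a recent Lucene version (older 4.x releases additionally required a Version argument in the analyzer constructors). The WhitespaceAnalyzer used here only splits on whitespace; removing punctuation or changing case is the job of other tokenizers and filters, as we will see below.

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzeExample {

    public static void main(String[] args) throws IOException {
        // split the field value into tokens on whitespace
        Analyzer analyzer = new WhitespaceAnalyzer();
        try (TokenStream stream = analyzer.tokenStream("title", "Search Patterns")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                // prints "Search" and "Patterns"
                System.out.println(term.toString());
            }
            stream.end();
        }
    }
}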

Index Structure

An inverted index, the structure search engines use, is similar to a map that contains search terms as keys and references to documents as values. This way searching is just a lookup of the term in the index, a very fast operation. These might be the terms that are indexed for our example documents:

Field     Term         Document Id
title     Apache       2
          Enterprise   2
          Patterns     1
          Search       1,2
          Server       2
          Solr         2
author    Callender    1
          Morville     1
          Pugh         2
          Smiley       2
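Conceptually, the lookup side of such an index is little more than a map from a term to the ids of the documents containing it. This is only a sketch in plain Java to illustrate the idea, leaving out everything else Lucene does:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class InvertedIndexSketch {

    // term -> ids of the documents that contain the term
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void add(String term, int documentId) {
        Set<Integer> ids = index.get(term);
        if (ids == null) {
            ids = new HashSet<>();
            index.put(term, ids);
        }
        ids.add(documentId);
    }

    public Set<Integer> search(String term) {
        // searching is just a map lookup
        Set<Integer> ids = index.get(term);
        return ids == null ? new HashSet<Integer>() : ids;
    }
}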

A real index contains more information, like positions to enable phrase queries and term frequencies for calculating the relevancy of a document for a certain search term.

As we can see, the index holds a reference to the document. This document, which is also stored with the search index, doesn't necessarily have to be the same as our input document. For each field you can determine whether you would like to keep the original content, which is normally controlled via an attribute named stored. As a general rule, you should store all the fields that you would like to display with the search results. When indexing lots of complete books whose full text you don't need to display on a results page, it might be better not to store that text at all. You can still search it, as the terms are available in the index, but you can't access the original content.
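With plain Lucene this decision is made per field when building a document. A small sketch; the field names and values are just examples:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;

// ...
Document doc = new Document();
// the title is searchable and its original value is kept for the result list
doc.add(new TextField("title", "Search Patterns", Field.Store.YES));
// the full text is searchable, but the original content is not kept
doc.add(new TextField("content", "the complete text of the book ...", Field.Store.NO));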

More on Analyzing

Looking at the index structure above we can already imagine how the search process for a book might work. The user enters a term, e.g. Solr, and this term is then used to look up the documents that contain it. This works fine as long as the user types the term correctly: a search for solr won't match anything in our current example.

To mitigate those difficulties we can use the analyzing process already mentioned above. Besides the tokenization that splits the field value into tokens we can do further preprocessing like removing tokens, adding tokens or modifying tokens (TokenFilter).

For our book case it might at first be enough to do lowercasing on the incoming data. A field value Solr will then be stored as solr in the index. To enable the user to also search for Solr with an uppercase letter we need to do analyzing for the query as well. Often it is the same process that is used for indexing, but there are also cases where index and query analyzers differ.
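As a sketch of this, Lucene's StandardAnalyzer combines tokenization with lowercasing, so running both the field value and the query term through it yields the same token (again assuming a recent Lucene version with a no-arg constructor):

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class LowercaseExample {

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer();
        // both the indexed value and the query term end up as "solr"
        System.out.println(firstToken(analyzer, "Solr"));
        System.out.println(firstToken(analyzer, "SOLR"));
    }

    private static String firstToken(Analyzer analyzer, String text) throws IOException {
        try (TokenStream stream = analyzer.tokenStream("title", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            stream.incrementToken();
            stream.end();
            return term.toString();
        }
    }
}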

The analyzing process not only depends on the content of the documents (field types, language of text fields) but also on your application. Take one common scenario: adding synonyms for terms to the index. You might think that you can just take a huge list of synonyms like WordNet and add those to your application. This might in fact degrade the search experience of your users, as there will be too many false positives. Also, for certain terms from your users' domain, WordNet might not contain the correct synonyms at all.

Duplication

When designing the index structure there are two competing forces: Often you either optimize for query speed or for index size. If you have a lot of data you probably need to take care that you only store data that you really need and even only put terms in the index that are necessary for lookups. Oftentimes for smaller datasets the index size doesn't matter that much and you can design your index for query performance.

Let's look at an example that can make sense for both cases. In our book information system we would like to display an alphabetic navigation for the last name of the author. If the user clicks on A, all the books by authors whose last name starts with the letter A should be displayed. When using the Lucene query syntax you can do something like this with its wildcard support: just issue a query that contains the letter the user clicked and a trailing *, e.g. a*.

Wildcard queries have become very fast with recent Lucene versions; nevertheless there still is a query-time cost. You can also choose another way: when indexing the data you can add another field that just stores the first letter of the name. This is what the relevant configuration might look like in Elasticsearch, but the concept is the same for Lucene and Solr:

"author": {
    "type": "multi_field",
    "fields": {
        "author" : {
            "type": "string"
        },
        "letter" : {
            "type": "string",
            "analyzer": "single_char_analyzer"
        }
    }
}

Under the hood, another term dictionary for the field author.letter will be created. For our example it will look like this:

Field            Term    Document Id
author.letter    C       1
                 M       1
                 P       2
                 S       2

Now, instead of issuing a wildcard query on the author field we can directly query the author.letter field with the letter. You can even build the navigation itself from all the terms in the index, using techniques like faceting to extract all the available terms for a field.

Conclusion

These are the basics of indexing data for a search engine. The inverted index structure makes searching really fast by moving some processing to the indexing phase. When we are not bound by any index size concerns we can design our index for query performance and add additional fields that duplicate some of the data. This query-oriented design is what makes search engines similar to the way lots of NoSQL solutions are used.

If you'd like to go deeper into the topic I recommend watching the talk What is in a Lucene index by Adrien Grand. He covers some of the concepts I have mentioned here (and a lot more) and also shows how they are implemented in Lucene.

Sonntag, 6. April 2014

BEDCon - Berlin Expert Days 2014

BEDCon is over again. This marks the third year I have been there and it is still the cheapest general conference I have seen in Germany. If you are looking for a nice conference in Berlin you should definitely consider it.

Interesting Talks

The three most interesting talks I attended:

  • Java App Servers are Dead by Eberhard Wolff. Contains some good arguments for why deploying applications to app servers might not be the best idea. I especially liked the idea that the target application server is a dependency of your project (because you need a certain version) and dependencies should be explicit in your source tree.
  • Resilience with Hystrix by Uwe Friedrichsen. A more advanced talk that was perfect for me because I had seen an introduction to fault tolerance by Uwe Friedrichsen at XPDays Germany last year. Hystrix is a library I will definitely check out.
  • Log Management with Graylog2 by Lennart Koopmann. Graylog2 is a full application for doing log management that includes Elasticsearch, MongoDB and a Play application. Lennart mentioned some interesting numbers about existing installations, a system of impressive scale.

Talks I would have liked to see:

My Talks

Surprisingly I had two talks accepted for BEDCon. I am especially glad that the more important talk on Search-Driven Applications with Tobias of Exensio went really well and we were talking to a packed room. For me giving a talk in a team is far less stressful than giving a talk alone. I am looking forward to giving this talk again at Entwicklertag Karlsruhe in May. The slides are available on Speaker Deck.

My second, short talk on Akka also went OK; the slides are online. If you are interested in Akka you can also have a look at my blog posts on message passing concurrency with Akka and supervision in Akka.

Tweets

I know, I am repeating myself, but again I stored all the tweets for the conference in Elasticsearch so I can look at them using Kibana. The usual rules apply: each retweet counts as a separate tweet. I am using a sharded version of the index so some of the counts might not be totally exact.

Distribution

There seem to be more tweets on the first day. I also had the impression that there were fewer people there on the second day. The first day has a spike that is probably related to the blackout at around 12.

Hashtags

This is a surprise: elasticsearch beats any other hashtag. logstash, kibana, springmvc, roca ... very specialized hashtags as well. As you can see from the numbers there weren't that many tweets at all.

Mentions

An even bigger surprise for me: I seem to have been mentioned a lot. Looking into this, it is caused by retweets counting as mentions as well; I had some tweets that got retweeted a few times.

Negativity

There is one thing that had me quite upset. During a short power failure (which was not related to BEDCon at all) I was watching a short talk. The speaker had nothing better to do than to insult the technicians who were there to help him. I don't get this attitude and I hope to never see this speaker again at any conference I am attending.

BEDCon is a great conference, all the people involved are doing a great job. I hope in the following years I can find the time to go there again.

Freitag, 28. März 2014

JavaLand 2014: The Conference, the Talks, the Tweets

This week, on Tuesday and Wednesday, parts of the German Java community met for JavaLand, the community's own conference. In this post I would like to talk about the conference in general and some of the talks I attended, and show you some facts about what happened on Twitter.

Location and Organization

The conference has the most unusual setting one can imagine: A theme park, Phantasialand in Brühl. Though I was rather sceptical about this it really turned out to be a good choice. The rooms for the talks were mostly well suited and it was a nice place for meeting outside. On Tuesday evening and at some times during the day some of the attractions were open for use. Though I am not really into theme parks I have to admit that some of them really were fun.

Another unusual choice of the organizers: there were no lunch breaks. Every talk started on the full hour and lasted 45 minutes, from 09:00 in the morning until the evening. You had 15 minutes to change rooms and get a coffee. The attendees could decide for themselves whether they would like to eat a really quick lunch in 15 minutes or skip one of the slots completely. This avoids some of the problems with long queues at other conferences or with breaks that are too long for some participants.

The quality of the food was excellent, with buffets during lunch time and even on Tuesday evening. There were several main dishes, different salads and dessert. Some of the best conference catering I have ever seen.

There were quite a few community events, e.g. the Java User Group café or the innovation lab with an Oculus Rift and a quadrocopter.

The Talks

Most of the talks were in German but there were also some in English. I doubt that you get a lot of value if you don't speak German and go to JavaLand just for the talks, though there were several high-profile non-German speakers. On several partner stages, companies presented talks given by their employees. I had an especially good impression of the Codecentric stage and regret that I couldn't go to see Patrick Peschlow talking on Elasticsearch, a talk I heard good things about.

Some of the talks had a common theme:

Reactive Programming

There was a surprising amount of talks on reactive programming. First Roland Kuhn, the project lead of the Akka project, introduced the principles of building reactive applications. He showed how choosing an event-driven approach can lead to scalable, resilient and responsive applications. He didn't go into technical details but only introduced the concepts.

At the end of day one, Michael Pisula gave a technical introduction into the actor framework Akka, also referring to a lot of the concepts Roland Kuhn mentioned in the morning. Though I already have a good grasp of Akka there were some details that were really useful.

At the beginning of the second day Niko Köbler and Heiko Spindler gave another talk on reactive programming. It was a mixture of concepts and technical details, showing MeteorJS, a framework for combining JavaScript on the client side with the server side, and again Akka.

Profiling

There were two sessions related to profiling. First Fabian Lange of Codecentric showed different aspects of profiling in general and some details on microbenchmarking. I will especially take a closer look at jmh, a microbenchmarking tool in the OpenJDK.

In "JVM and application bottleneck troubleshooting with simple tools" Daniel Witkowski of Azul demoed an example process of analyzing a problematic application. He used system tools and VisualVM to find and identify problems. A rather basic talk but nevertheless it is good to keep some of the practices in mind.

Building Web Applications

Felix Müller and Nils Wloka presented two different web frameworks. Play, presented by Felix Müller, is a framework that is mostly developed by Typesafe, the company behind Scala and Akka. You can use it from Java as well as Scala, though the templates are always written in Scala. I started building an application using Play a while ago and got quite a good impression. If I had a use case for a full-stack framework I would definitely have another look.

Nils Wloka did a session that obviously wasn't for everyone: 45 minutes of live coding in Clojure building a complete web application for voting on conference talks using Ring, Compojure and Hiccup. I am currently working on a simple application using the same stack so I could follow at least most of his explanations. I especially liked that he managed to finish the app, deploy it to OpenShift and used it to let the audience vote on his talk. Quite impressive.

I like that both frameworks support rapid development: you don't need to restart the server as often as with common Java web development, and a lot of changes are visible instantly.

Miscellaneous

There were two more talks I'd like to mention: On the second day Patrick Baumgartner gave an excellent introduction to Neo4J, the graph database on the JVM. He showed several use cases (including a graph of whiskey brands that spawned quite a discussion on the properties of whiskey in the audience) and some of the technical details. Though I just recently attended a long talk on Neo4J at JUG Karlsruhe and had already seen a similar talk by Patrick at BEDCon last year, it was really entertaining and good for refreshing some of the knowledge.

Finally, the highlight of the conference for me: on the first day Stefan Tilkov gave one of his excellent talks on architecture. He showed how splitting applications and building for replacement instead of building for reuse can lead to cleaner applications, and gave several examples of companies that have employed these principles. I have never attended a bad talk by Stefan, so if he's giving a talk at a conference you are attending you should definitely go see it.

The Tweets

As I have done before for Devoxx and other conferences, I tracked all of the tweets with the hashtags #javaland and #javaland14 and stored them in Elasticsearch, which gave me the possibility to analyze them with Kibana. I started tracking several months before the conference but I am only showing some of my findings for the conference days here, as those can give good insight into which topics were hot. Each retweet counts as a separate tweet, so if there is one tweet that gets retweeted a lot this will have a strong impact on the numbers.

Timing

Looking at the distribution of the tweets for the two conference days we can see that there are a lot of tweets in the morning of the first day and surprisingly in the evening of the first day. I suspect those are mostly tweets by people announcing that they are now starting to ride the Black Mamba. Of course this might also be related to the start of the Java 8 launch event but I like the Black Mamba theory better.

Mentions

A good indication of popular speakers are the mentions. It's a huge surprise that the Twitter handle of the conference is only in third place. The two speakers Arun Gupta and David Blevins seem to be rather popular on Twitter (or they just had a really long discussion).

Hashtags and Common Terms

To see the trends we can have a look at the hashtags people used. I excluded #javaland as it is contained in every tweet anyway.

The Java 8 launch event is dominant but Java EE 7 and JavaFX are both strong as well. There was quite some interest in Stephen Chin and his Nighthacking sessions. Neo4J and Asciidoctor are quite a surprise (but not #wegotbeer, a reasonable hashtag).

Hashtags are one thing but we can also look at all the terms in the text of the tweets. I excluded a long list of common words so this is my curated list of the important terms.

"Great", "thanks" and "cool" ... I am not that deep into statistics but it seems to me that people liked the conference.

Conclusion

JavaLand was a fantastic conference. I had some doubts about the setting in the theme park but it was really great. I will definitely go there again next year; if you are in the area you should think about going there as well. Thanks to all the organizers for doing a great job.

Freitag, 21. März 2014

Book Review: Instant Apache Solr for Indexing Data How-to

Indexing, the process of putting data into a search engine, is often the foundation of everything when building a search-based application. With Instant Apache Solr for Indexing Data How-to, Alexandre Rafalovitch has written a book dedicated to different aspects of indexing.

The book is written in a cookbook style with tasks that are solved using different features of Apache Solr. Each task is classified with a difficulty level (simple, intermediate, advanced). The book shows you how to work with collections, index text and binary content (using HTTP and Java) and how to use the Data Import Handler. You will learn about the difference between soft and hard commits, how the UpdateRequestProcessor works (showing the useful ScriptUpdateProcessor) and the functionality of atomic updates. The final task describes an approach to index content in multiple languages.

Though it is quite short, the book contains some really useful information. As it is dedicated to indexing alone you can't really use it for learning how to work with all aspects of Apache Solr, but you can get some bang for the buck for your daily work. The only thing I missed in the book is a more detailed description of additional filters and tokenizers. Nevertheless you get quite a lot of value from the book for a reasonable price.

If you are looking for more in-depth information I recommend Apache Solr 3 Enterprise Search Server (which covers an older version) or the recently finished Solr in Action.

Freitag, 14. März 2014

Building a Navigation from a Search Index

Though the classic use case for search engines is keyword search, nowadays they are often used to serve content that doesn't resemble a search result list at all. As they are optimized for read access they make good candidates for serving any part of a website that needs a dynamic query mechanism. Faceting is most often used to display filters for refining the search results, but it can also be used to build hierarchical navigations from results. I am using Solr in this post but the concept can also be used with Elasticsearch or even plain Lucene.

Browsing Search Results

What we are trying to achieve can often be seen on e-commerce sites but is also useful for serving content. You will be familiar with the navigation that Amazon provides: you can browse the categories that are displayed in a hierarchical navigation.

Of course I am not familiar with the implementation details of how they are storing their content, but search engines can be used to build something like this. You can index different types (editorial content and products) and tag those with categories. The navigation and the page content are then built from the search results that are returned.

For a simple example we are indexing just products, two books and one record. Two books by Haruki Murakami are in the categories Books/Fiction and Books/Non-Fiction. The record by 13 & God is in the category Music/Downbeat. The resulting navigation should then be something like this:

  • Books
    • Non-Fiction
      • Haruki Murakami
    • Fiction
      • Haruki Murakami
  • Music
    • Downbeat
      • 13 & God

PathHierarchyTokenizer

Lucene provides the PathHierarchyTokenizer that can be used to split path like hierarchies. It takes a path as input and creates segments from it. For example when indexing Books/Non-Fiction/Haruki Murakami it will emit the following tokens:

  • Books
  • Books/Non-Fiction
  • Books/Non-Fiction/Haruki Murakami

What is important: It doesn't just split the string on a delimiter but creates a real hierarchy with all the parent paths. This can be used to build our navigation.
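You can see this behavior directly from Java. A minimal sketch, assuming a recent Lucene version where the tokenizer has a no-arg constructor and the delimiter defaults to /:

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.path.PathHierarchyTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PathHierarchyExample {

    public static void main(String[] args) throws IOException {
        Tokenizer tokenizer = new PathHierarchyTokenizer();
        tokenizer.setReader(new StringReader("Books/Non-Fiction/Haruki Murakami"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            // prints the three tokens listed above, one per line
            System.out.println(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
    }
}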

Solr

Let's see an example with Solr. The configuration and a unit test are also available on GitHub.

We are using a very simple schema with documents that only contain a title and a category:

<fields>
    <field name="title" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
    <field name="category" type="category" indexed="true" stored="false"/>
</fields>

The title is just a string field but the category is a custom field that uses the PathHierarchyTokenizerFactory.

<fieldType name="category" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.PathHierarchyTokenizerFactory" delimiter="/"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
    </analyzer>
</fieldType>

When indexing data the path is split according to the rules of the PathHierarchyTokenizer. When querying, we take the query term as it is, so we can have exact matches.

Suppose we are now indexing three documents that are in the three categories we have seen above:

URL=http://localhost:8082/solr/update/json
curl $URL -H 'Content-type:application/json' -d '
[
  {"title":"What I Talk About When I Talk About Running", "category":"Books/Non-Fiction/Haruki Murakami"}, 
  {"title":"South of the Border, West of the Sun", "category":"Books/Fiction/Haruki Murakami"}, 
  {"title":"Own Your Ghost", "category":"Music/Downbeat/13&God"}
]'
curl "$URL?commit=true"

We can easily request the navigation using facets. We query on all documents but return no documents at all. A facet is returned that contains our navigation structure:

curl http://localhost:8082/solr/collection1/select?q=*%3A*&rows=0&wt=json&indent=true&facet=true&facet.field=category
{
  "responseHeader":{
    "status":0,
    "QTime":1},
  "response":{"numFound":3,"start":0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "category":[
        "Books",2,
        "Books/Fiction",1,
        "Books/Fiction/Haruki Murakami",1,
        "Books/Non-Fiction",1,
        "Books/Non-Fiction/Haruki Murakami",1,
        "Music",1,
        "Music/Downbeat",1,
        "Music/Downbeat/13&God",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

When displaying the navigation you can now simply split the paths according to their hierarchies. The queries that are executed for displaying the content contain a filter query that filters on the currently selected navigation item.

curl "http://localhost:8082/solr/collection1/select?q=*%3A*&wt=json&fq=category:Books"
{
  "responseHeader":{
    "status":0,
    "QTime":27},
  "response":{"numFound":2,"start":0,"docs":[
      {
        "title":"What I Talk About When I Talk About Running"},
      {
        "title":"South of the Border, West of the Sun"}]
  }}

Using tags and exclusions you can even build the navigation using the same request that queries for the filtered documents.

curl "http://localhost:8082/solr/collection1/select?q=*%3A*&wt=json&fq=%7B%21tag%3Dcat%7Dcategory:Books&facet=true&facet.field=%7B%21ex%3Dcat%7Dcategory"
{
  "responseHeader":{
    "status":0,
    "QTime":2},
  "response":{"numFound":2,"start":0,"docs":[
      {
        "title":"What I Talk About When I Talk About Running"},
      {
        "title":"South of the Border, West of the Sun"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "category":[
        "Books",2,
        "Books/Fiction",1,
        "Books/Fiction/Haruki Murakami",1,
        "Books/Non-Fiction",1,
        "Books/Non-Fiction/Haruki Murakami",1,
        "Music",1,
        "Music/Downbeat",1,
        "Music/Downbeat/13&God",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

If the navigation you are retrieving from the search engine is your single source of truth you might also have to add a sort mechanism; by default all facets are sorted by their count. This works in our simple case but not in the real world. To have them sorted in a defined way you can add numeric identifiers. You would then index paths like 100|Books/110|Non-Fiction/111|Haruki Murakami and request alphanumeric sorting via facet.sort=index. When displaying the result you just remove the front of the string.
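Stripping those sort prefixes when rendering the navigation is straightforward. A minimal sketch in plain Java; the | delimiter is just the convention chosen above:

import java.util.ArrayList;
import java.util.List;

// ...
// turns "100|Books/110|Non-Fiction" into the labels "Books" and "Non-Fiction"
public static List<String> displayLabels(String facetValue) {
    List<String> labels = new ArrayList<>();
    for (String segment : facetValue.split("/")) {
        int delimiter = segment.indexOf('|');
        // keep only the part after the numeric sort prefix
        labels.add(delimiter >= 0 ? segment.substring(delimiter + 1) : segment);
    }
    return labels;
}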

As you are now using the search engine to build the navigation you immediately get the benefits of its filtering mechanisms. Only use categories that have online articles? Add a filter query fq=online:true. Make sure that there are no categories with products that are out of stock? fq=inStock:true.

Conclusion

Search engines offer great flexibility when delivering content. A lot of their functionality can be used to build applications that pull most of their content from the index.

Donnerstag, 6. März 2014

Prefix and Suffix Matches in Solr

Search engines are all about looking up strings. The user enters a query term that is then retrieved from the inverted index. Sometimes a user is looking for a value that is only a substring of values in the index, and the user might be interested in those matches as well. This is especially important for languages like German that contain compound words like Semmelknödel, where Knödel means dumpling and Semmel specifies the kind.

Wildcards

For demoing the approaches I am using a very simple schema. Documents consist of a text field and an id. The configuration as well as a unit test are also available on GitHub.

<fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="text" type="text_general" indexed="true" stored="false"/>
</fields>
<uniqueKey>id</uniqueKey>
<types>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>
</types>

One approach that is quite popular when doing prefix or suffix matches is to use wildcards when querying. This can be done programmatically, but you need to take care that any user input is escaped correctly. Suppose you have the term dumpling in the index and a user enters the term dump. If you want to make sure that the query term matches the document in the index, you can just add a wildcard to the user query in the code of your application, so the resulting query would be dump*.

Generally you should be careful when doing too much magic like this: if a user is in fact looking for documents containing the word dump she might not be interested in documents containing dumpling. You need to decide for yourself if you would like to have only matches the user is interested in (precision) or show the user as many probable matches as possible (recall). This heavily depends on the use cases for your application.

You can improve the user experience a bit by boosting exact matches for the term. You need to create a more complicated query, but this way documents containing an exact match will score higher:

dump^2 OR dump*

When creating a query like this you should also take care that the user can't add terms that will make the query invalid. The SolrJ method escapeQueryChars of the class ClientUtils can be used to escape the user input.
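A minimal sketch of building such a query with SolrJ; the boost factor and the variable names are just examples:

import org.apache.solr.client.solrj.util.ClientUtils;

// ...
String userInput = "dump";
// escape characters that have a special meaning in the Lucene query syntax
String escaped = ClientUtils.escapeQueryChars(userInput);
// exact match boosted, plus the wildcard variant
String query = escaped + "^2 OR " + escaped + "*";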

If you are now taking suffix matches into account the query can get quite complicated and creating a query like this on the client side is not for everyone. Depending on your application another approach can be the better solution: You can create another field containing NGrams during indexing.

Prefix Matches with NGrams

NGrams are substrings of your indexed terms that you can put in an additional field. Those substrings can then be used for lookups so there is no need for any wildcards. Using the (e)dismax handler you can automatically set a boost on your field that is used for exact matches so you get the same behaviour we have seen above.
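Conceptually, edge n-grams of a term are nothing more than its prefixes within a length range. A plain-Java sketch of what ends up in the index for a field configured with minGramSize=3 and maxGramSize=15:

import java.util.ArrayList;
import java.util.List;

// ...
public static List<String> edgeNGrams(String term, int minGram, int maxGram) {
    List<String> grams = new ArrayList<>();
    for (int length = minGram; length <= Math.min(maxGram, term.length()); length++) {
        // prefixes of the term, e.g. "dum", "dump", ... for "dumpling"
        grams.add(term.substring(0, length));
    }
    return grams;
}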

For prefix matches we can use the EdgeNGramFilter that is configured for an additional field:

...
    <field name="text_prefix" type="text_prefix" indexed="true" stored="false"/>
...
    <copyField source="text" dest="text_prefix"/>
...    
    <fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.LowerCaseTokenizerFactory"/>
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        </analyzer>
    </fieldType>

During indexing time the text field value is copied to the text_prefix field and analyzed using the EdgeNGramFilter. Grams are created for any length between 3 and 15, starting from the front of the string. When indexing the term dumpling this would be:

  • dum
  • dump
  • dumpl
  • dumpli
  • dumplin
  • dumpling

During query time the term is not split again so that the exact match for the substring can be used. As usual, the analyze view of the Solr admin backend can be a great help for seeing the analyzing process in action.

Using the dismax handler you can now pass in the user query as it is and just advise it to search on your fields by adding the parameter qf=text^2,text_prefix.

Suffix Matches

With languages that have compound words it's a common requirement to also do suffix matches. If a user queries for the term Knödel (dumpling) it is expected that documents that contain the term Semmelknödel also match.

Using Solr versions up to 4.3 this is no problem. You can use the EdgeNGramFilterFactory to create grams starting from the back of the string.

...
    <field name="text_suffix" type="text_suffix" indexed="true" stored="false"/>
...    
    <copyField source="text" dest="text_suffix"/>
...
    <fieldType name="text_suffix" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="back"/>
        </analyzer>
        <analyzer type="query">
            <tokenizer class="solr.KeywordTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
    </fieldType>
...

This creates suffixes of the indexed term that also contain the term knödel, so our query works.

But when using more recent versions of Solr you will encounter a problem during indexing:

java.lang.IllegalArgumentException: Side.BACK is not supported anymore as of Lucene 4.4, use ReverseStringFilter up-front and afterward
    at org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter.<init>(EdgeNGramTokenFilter.java:114)
    at org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter.<init>(EdgeNGramTokenFilter.java:149)
    at org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory.create(EdgeNGramFilterFactory.java:52)
    at org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory.create(EdgeNGramFilterFactory.java:34)

You can't use the EdgeNGramFilterFactory anymore for suffix ngrams. But fortunately the stack trace also advises us how to fix the problem. We have to combine it with ReverseStringFilter:

<fieldType name="text_suffix" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
        <filter class="solr.ReverseStringFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
        <filter class="solr.ReverseStringFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    </analyzer>
</fieldType>

This will now yield the same results as before.

Conclusion

Whether you go for manipulating your query by adding wildcards or use the NGram approach heavily depends on your use case and is also a matter of taste. Personally I am using NGrams most of the time, as disk space normally isn't a concern for the kind of projects I am working on. Wildcard search has become a lot faster in Lucene 4, so I doubt there is a real benefit there anymore. Nevertheless I tend to do as much processing as I can during indexing time.

Mittwoch, 26. Februar 2014

Introduction to the ODF Toolkit

Microsoft Office has been the dominant office suite, and unfortunately it still is. For a long time not only were the programs closed, but also the file format.

Open Document

Nevertheless there are open alternatives available, most notably LibreOffice/Apache OpenOffice.org. In 2005 the OASIS consortium standardized Open Document, an open alternative to the proprietary world of Microsoft. Open Document is heavily influenced by the OpenOffice.org file format but is supported by multiple office suites and viewers.

Open Document files are zip files that contain some XML documents. You can go ahead and unzip any documents you might have:

unzip -l aufwaende-12.ods 
Archive:  aufwaende-12.ods
  Length      Date    Time    Name
---------  ---------- -----   ----
       46  2012-12-31 15:16   mimetype
      815  2012-12-31 15:16   meta.xml
     8680  2012-12-31 15:16   settings.xml
   171642  2012-12-31 15:16   content.xml
     3796  2012-12-31 15:16   Thumbnails/thumbnail.png
        0  2012-12-31 15:16   Configurations2/images/Bitmaps/
        0  2012-12-31 15:16   Configurations2/popupmenu/
        0  2012-12-31 15:16   Configurations2/toolpanel/
        0  2012-12-31 15:16   Configurations2/statusbar/
        0  2012-12-31 15:16   Configurations2/progressbar/
        0  2012-12-31 15:16   Configurations2/toolbar/
        0  2012-12-31 15:16   Configurations2/menubar/
        0  2012-12-31 15:16   Configurations2/accelerator/current.xml
        0  2012-12-31 15:16   Configurations2/floater/
    22349  2012-12-31 15:16   styles.xml
      993  2012-12-31 15:16   META-INF/manifest.xml
---------                     -------
   208321                     16 files

The mimetype file determines what kind of document it is (in this case application/vnd.oasis.opendocument.spreadsheet), and META-INF/manifest.xml lists the files in the archive. The most important file is content.xml, which contains the body of the document.

Server Side Processing

Though there are quite a few viewers and editors for Open Document available, when it comes to the server side the situation used to be different. For processing Microsoft Office files there is the Java library Apache POI, which provides a lot of functionality to read and manipulate Microsoft Office files. But if you wanted to process Open Document files, nearly your only option was to install OpenOffice.org on the server and talk to it by means of its UNO API. Not exactly an easy thing to do.

ODF Toolkit

Fortunately there is light at the end of the tunnel: the ODF Toolkit project, currently incubating at Apache, provides lightweight access to files in the Open Document format from Java. As the name implies it's a toolkit, consisting of multiple projects.

The heart of it is the schema generator, which ingests the Open Document specification that is available as a RelaxNG schema. It provides a template-based facility to generate files from the ODF specification. Currently it only generates Java classes, but it can also be used to create different files (think of documentation or accessors for different programming languages).

The next layer of the toolkit is ODFDOM. It provides templates that generate classes for DOM access of elements and attributes of ODF documents. Additionally it provides facilities like packaging and document encryption.

For example, you can list the file paths of an ODF document using the OdfPackage class:

// open the package and list the paths of all contained files
OdfPackage pkg = OdfPackage.loadPackage("aufwaende-12.ods");
Set<String> filePaths = pkg.getFilePaths();

If you are familiar with the Open Document spec ODFDOM will be the only library you need. But if you are like most of us and don't know all the elements and attributes by heart, there is another project for you: the Simple API. It provides easy access to a lot of the features you might expect from a library like this: you can deal with higher-level abstractions like paragraphs for text or rows and cells in the spreadsheet world, or search for and replace text.

This code snippet creates a spreadsheet, adds some cells to it and saves it:

// create a new spreadsheet document and get its first sheet
SpreadsheetDocument doc = SpreadsheetDocument.newSpreadsheetDocument();
Table sheet = doc.getSheetByIndex(0);
// put a label and a numeric value into the first two cells
sheet.getCellByPosition(0, 0).setStringValue("Betrag");
sheet.getCellByPosition(1, 0).setDoubleValue(23.0);
doc.save(File.createTempFile("odf", ".ods"));
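Reading values back works the same way. A small sketch, assuming a spreadsheet like the one from the unzip example above exists on disk and contains a text and a numeric cell in its first row (error handling omitted):

// load an existing spreadsheet and read the cell values back
SpreadsheetDocument loaded = SpreadsheetDocument.loadDocument("aufwaende-12.ods");
Table firstSheet = loaded.getSheetByIndex(0);
String label = firstSheet.getCellByPosition(0, 0).getStringValue();
double amount = firstSheet.getCellByPosition(1, 0).getDoubleValue();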

Code

If you are interested in seeing more code using the ODF Toolkit you can have a look at the cookbook that contains a lot of useful code snippets for the Simple API. Additionally you should keep an eye on this blog for the second part of the series where we will look at an application that extracts data from spreadsheets.