Thursday, 24 January 2013

Make your Filters Match: Faceting in Solr

Facets are a great search feature that lets users easily navigate to the documents they are looking for. Solr makes them really easy to use, though when naively querying for facet values you might see some unexpected behaviour. Read on to learn the basics of what happens when you pass in filter queries for faceting. I'll also show how you can leverage local params to choose a different query parser when selecting facet values.

Introduction

Facets are a way to display categories next to a user's search results, often with a count of how many results fall into each category. The user can then select one of those facet values to retrieve only the results that are assigned to that category. This way users don't have to know which category they are looking for when entering the search term, as all the available categories are delivered with the search results. This approach is really popular on sites like Amazon and eBay and is a great way to guide the user.

Solr brought faceting to the Lucene world and arguably the feature was an important driving factor for its success (Lucene 3.4 introduced faceting as well). Facets can be built from terms in the index, custom queries and ranges, though in this post we will only look at field facets.

As a very simple example consider this schema definition:

<fields>
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="text" type="text_general" indexed="true" stored="true"/>
<field name="author" type="string" indexed="true" stored="false"/>
</fields>

There are three fields: the id, a text field holding the title that we'd probably like to search on, and an author. The author is defined as a string field, which means it is not analyzed at all. The faceting mechanism uses the indexed term value and not a stored value, so we want to make sure that the original value is preserved. I explicitly don't store the author information to make it clear that we are working with the indexed value.

Let's index some book data with curl (see this GitHub repo for the complete example including some unit tests that execute the same functionality using Java).

curl http://localhost:8082/solr/update -H "Content-Type: text/xml" --data-binary \
'<add><doc>
<field name="id">1</field>
<field name="text">On the Shortness of Life</field>
<field name="author">Seneca</field>
</doc>
<doc>
<field name="id">2</field>
<field name="text">What I Talk About When I Talk About Running</field>
<field name="author">Haruki Murakami</field>
</doc>
<doc>
<field name="id">3</field>
<field name="text">The Dude and the Zen Master</field>
<field name="author">Jeff "The Dude" Bridges</field>
</doc>
</add>'
curl http://localhost:8082/solr/update -H "Content-Type: text/xml" --data-binary '<commit />'
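If you would rather generate the update message than write it by hand, the same payload can be built programmatically. This is only a minimal sketch in Python using the standard library. The helper name build_add_xml is made up for illustration, and the resulting string would still have to be posted to the update URL shown above.

```python
import xml.etree.ElementTree as ET

def build_add_xml(books):
    """Return a Solr <add> update message for a list of field dicts."""
    add = ET.Element("add")
    for book in books:
        doc = ET.SubElement(add, "doc")
        for name, value in book.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value  # ElementTree escapes &, < and > when serializing
    return ET.tostring(add, encoding="unicode")

# The three example books from this post.
books = [
    {"id": "1", "text": "On the Shortness of Life", "author": "Seneca"},
    {"id": "2", "text": "What I Talk About When I Talk About Running",
     "author": "Haruki Murakami"},
    {"id": "3", "text": "The Dude and the Zen Master",
     "author": 'Jeff "The Dude" Bridges'},
]

payload = build_add_xml(books)
print(payload)
```

The double quotes in the third author name need no special treatment in XML element content, so the value reaches the index unchanged.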

And verify that the documents are available:

curl 'http://localhost:8082/solr/query?q=*:*'
{
  "responseHeader":{
    "status":0,
    "QTime":3,
    "params":{
      "q":"*:*"}},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"1",
        "text":"On the Shortness of Life"},
      {
        "id":"2",
        "text":"What I Talk About When I Talk About Running"},
      {
        "id":"3",
        "text":"The Dude and the Zen Master"}]
  }}

I'll omit parts of the response in the following examples. We can also have a look at the shiny new administration view of Solr 4 to see all terms that are indexed for the field author.

Each of the author names is indexed as one term.

Faceting

Let's move on to the faceting part. To let the user drill down on search results there are two steps involved. First you tell Solr that you would like to retrieve facets with the results. Facets are contained in an extra section of the response and consist of the indexed term as well as a count. As with most Solr parameters you can either send the necessary options with the query or preconfigure them in solrconfig.xml. This query has faceting on the author field enabled:

curl "http://localhost:8082/solr/query?q=*:*&facet=on&facet.field=author"
{
  "responseHeader":{...},
  "response":{"numFound":3,"start":0,"docs":[
      {
        "id":"1",
        "text":"On the Shortness of Life"},
      {
        "id":"2",
        "text":"What I Talk About When I Talk About Running"},
      {
        "id":"3",
        "text":"The Dude and the Zen Master"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[
        "Haruki Murakami",1,
        "Jeff \"The Dude\" Bridges",1,
        "Seneca",1]},
    "facet_dates":{},
    "facet_ranges":{}}}
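Note that the facet values arrive as one flat list that alternates between term and count. A client has to pair them up again before displaying them. Here is a minimal sketch in Python, assuming the JSON layout shown above. The helper name facet_pairs is made up for illustration.

```python
import json

def facet_pairs(flat):
    """Turn Solr's flat [term, count, term, count, ...] list into (term, count) pairs."""
    return list(zip(flat[0::2], flat[1::2]))

# The relevant part of the response shown above.
response = json.loads(r'''
{"facet_counts": {"facet_fields": {"author": [
    "Haruki Murakami", 1,
    "Jeff \"The Dude\" Bridges", 1,
    "Seneca", 1]}}}
''')

authors = facet_pairs(response["facet_counts"]["facet_fields"]["author"])
for term, count in authors:
    print(f"{term} ({count})")
```

The term in each pair is exactly the indexed value, which is what has to be sent back to Solr when the user selects it.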

This is what the same faceting configuration looks like in solrconfig.xml:

<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q">*:*</str>  
    <str name="echoParams">none</str>
    <int name="rows">10</int>
    <str name="df">text</str>
    <str name="facet">on</str>
    <str name="facet.field">author</str>
    <str name="facet.mincount">1</str>
  </lst>
</requestHandler>

This way we don't have to pass the parameters with the query anymore and can see which parts of the query change.

Common Filtering

When a user chooses a facet value you issue the same query again, this time adding a filter query that restricts the search results to those that have this value set for the field. In our case the user would only see books by one particular author. Let's start simple and pretend that a user can't handle the massive amount of 3 search results and is only interested in books by Seneca:

curl 'http://localhost:8082/solr/select?fq=author:Seneca'
{
  "responseHeader":{...},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"1",
        "text":"On the Shortness of Life"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[
        "Seneca",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

Works fine. We added a filter query that restricts the results to only those that are written by Seneca. Note that there is only one facet value left because the search results don't contain any books by other authors. Let's see what happens when we try to filter the results to see only books by Haruki Murakami. We need to URL encode the blank, the rest of the query stays the same:

curl 'http://localhost:8082/solr/select?fq=author:Haruki%20Murakami'
{
  "responseHeader":{...},
  "response":{"numFound":0,"start":0,"docs":[]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[]},
    "facet_dates":{},
    "facet_ranges":{}}}

No results. Why is that? The default query parser for filter queries is the Lucene query parser. It splits the query on whitespace, so even though the field is indexed unanalyzed this is probably not the query we expected. The result of the parsing process is not a single term query as in our first example. It's a boolean query that consists of two term queries: author:Haruki text:murakami. If you are familiar with the Lucene query syntax this won't be a surprise to you: if you prefix a term with a field name and a colon it will search on this field, otherwise it will search on the default field we declared in solrconfig.xml.

How can we fix it? Simple, just turn it into a phrase by surrounding the words with double quotes:

curl 'http://localhost:8082/solr/select?fq=author:"Haruki%20Murakami"'
{
  "responseHeader":{...},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"2",
        "text":"What I Talk About When I Talk About Running"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[
        "Haruki Murakami",1]},
    "facet_dates":{},
    "facet_ranges":{}}}
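In an application you would normally build the phrase from the facet value the user clicked. This is a minimal sketch in Python, assuming the base URL from the examples above. The helper name phrase_filter is made up for illustration. Backslashes and double quotes inside the value are escaped so they cannot end the phrase early.

```python
from urllib.parse import urlencode

def phrase_filter(field, value):
    """Build field:"value" for the Lucene query parser.

    Backslashes and double quotes inside the value are backslash-escaped.
    """
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'

fq = phrase_filter("author", "Haruki Murakami")
# urlencode takes care of the blank and the quotes for us.
url = "http://localhost:8082/solr/select?" + urlencode({"fq": fq})
print(fq)
print(url)
```

Leaving the URL encoding to the library avoids having to encode the blank by hand, as we did in the curl calls.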

Or, if you prefer, you can also escape the blank using the backslash, which yields the same result:

curl 'http://localhost:8082/solr/select?fq=author:Haruki\%20Murakami'

Fun fact: I am not that good at picking examples. If we are filtering on our last author we will be surprised (at least I scratched my head for a while):

curl 'http://localhost:8082/solr/select?fq=author:Jeff%20"The%20Dude"%20Bridges'
{
  "responseHeader":{...},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "id":"3",
        "text":"The Dude and the Zen Master"}]
  },
  "facet_counts":{
    "facet_queries":{},
    "facet_fields":{
      "author":[
        "Jeff \"The Dude\" Bridges",1]},
    "facet_dates":{},
    "facet_ranges":{}}}

This actually seemed to work, though we neither turned it into a phrase nor escaped the blanks. If we look at how the Lucene query parser handles this query we see immediately why it returns a result. As with the last example, this is turned into a boolean query in which only the first clause is executed against the author field. The other two clauses search on the default field, and in this case "The Dude" matches the text field: author:Jeff text:"the dude" text:bridges. If you just want to match on the author field you can escape the blanks and quotes as we did in the example before:

curl 'http://localhost:8082/solr/select?fq=author:Jeff\%20\"The\%20Dude\"\%20Bridges'

I'll spare you the response.
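Escaping by hand gets tedious quickly, so here is a minimal sketch of a helper in Python. The character list is my assumption, loosely modelled on what SolrJ's ClientUtils.escapeQueryChars escapes, so check it against the Solr version you are running. The helper name escape_term is made up for illustration.

```python
from urllib.parse import quote

# Characters the Lucene query parser treats as syntax (assumed list, see above).
SPECIAL = set('\\+-!():^[]"{}~*?|&;/')

def escape_term(value):
    """Backslash-escape query syntax characters and whitespace in a raw term."""
    return "".join(
        "\\" + ch if ch in SPECIAL or ch.isspace() else ch
        for ch in value
    )

fq = "author:" + escape_term('Jeff "The Dude" Bridges')
print(fq)
# Percent-encode the whole filter query before putting it into the URL.
print("http://localhost:8082/solr/select?fq=" + quote(fq, safe=""))
```

The escaped filter query is the same one that is sent by the curl call above, only without the manual percent-encoding.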

Using Local Params to set the Query Parser

At ApacheCon Europe in November, Erik Hatcher gave a really interesting presentation on query parsers in Solr in which he introduced another, probably cleaner way to do this: you can use the local params syntax to choose a different query parser. As we have learnt, filter queries default to the Lucene query parser. You can change the parser for the main query by setting the defType parameter, either via request parameters or in solrconfig.xml, but I am not aware of any way to set it for the filter queries. As our terms are unanalyzed, the correct thing to do is to use a TermQuery, which can be built using the TermQParserPlugin. To use this parser we can set it explicitly in the filter query:

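# -g keeps curl from interpreting the curly braces; %27 and %22 are the encoded single and double quotes
curl -g 'http://localhost:8082/solr/select?fq={!term%20f=author%20v=%27Jeff%20%22The%20Dude%22%20Bridges%27}'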

Or, for better readability, the filter query parameter without the URL encoding (not something you can pass to curl as is):

fq={!term f=author v='Jeff "The Dude" Bridges'}

The local params are enclosed in curly braces. The value term is a shorthand for type='term', f is the field the TermQuery should be built for and v is the value. Though this might look quirky at first, this is a really powerful feature, especially since you can reference other request parameters from the local params. Consider this configuration of a request handler:

<requestHandler name="/selectfiltered" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="q">*:*</str>  
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="wt">json</str>
    <str name="indent">true</str>
    <str name="df">text</str>
    <str name="facet">on</str>
    <str name="facet.field">author</str>
    <str name="facet.mincount">1</str>
  </lst>
  <lst name="appends">
    <str name="fq">{!term f=author v=$author}</str>
  </lst>
</requestHandler>

The defaults are largely the same as above. Only the appends section is new; it adds further parameters to every request. The local params are similar to the ones we passed via curl, but the actual filter value is replaced by the variable $author. The value can now be passed in cleanly via an aptly named parameter:

curl 'http://localhost:8082/solr/selectfiltered?author=Jeff%20"The%20Dude"%20Bridges'
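To compare the two variants from a client's point of view, here is a minimal sketch in Python. The function names are made up for illustration and the base URL is the one used throughout this post. For the explicit variant I use the form {!term f=author}value, where the text after the closing brace is taken as the value. As far as I know this is equivalent to passing v and it avoids quoting the value inside the local params.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8082/solr"

def term_filter_url(field, value):
    """Explicit variant: send the {!term} filter query with the request."""
    # Everything after the closing brace is used as the raw term value.
    fq = "{!term f=" + field + "}" + value
    return BASE + "/select?" + urlencode({"fq": fq})

def handler_url(author):
    """Configured variant: /selectfiltered appends the filter itself."""
    return BASE + "/selectfiltered?" + urlencode({"author": author})

author = 'Jeff "The Dude" Bridges'
print(term_filter_url("author", author))
print(handler_url(author))
```

In both cases the client only has to URL encode the raw facet value. No query syntax escaping is needed because the term parser never interprets the value.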

There are a lot of powerful features in Solr that are not that commonly used. To see this example in Java, have a look at the GitHub repository for this blog post.

About Florian Hopf

I am working as a freelance software developer and consultant in Karlsruhe, Germany. If you liked this post you can follow me on Twitter or subscribe to my feed to get notified of new posts. If you think I could help you and your company and you'd like to work with me please contact me directly.
