Importing Atom feeds in Solr using the Data Import Handler

I am working on a search solution that makes some of the content I am producing available through one search interface. One of the content stores is the blog you are reading right now, which, among other options, makes its content available as an Atom feed.

Solr, my search server of choice, provides the Data Import Handler, which can be used to import data on a regular basis from sources like databases (via JDBC) or remote XML sources like Atom feeds.

The Data Import Handler used to be a core part of Solr, but starting with 3.1 it ships as a separate jar and is no longer included in the standard war. I am using Maven with overlays for development, so I have to add a dependency for it:

<dependencies>
  <dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr</artifactId>
    <version>3.6.0</version>
    <type>war</type>
  </dependency>
  <dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-dataimporthandler</artifactId>
    <version>3.6.0</version>
    <type>jar</type>
  </dependency>
</dependencies>
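
If you are not building your webapp with Maven, you can alternatively load the jar at runtime using a lib directive in solrconfig.xml (the dir value is an assumption that depends on where you unpacked the Solr distribution):

<!-- pick up the Data Import Handler jar from the dist folder of the
     Solr distribution; adjust dir and regex to match your layout -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar"/>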

To enable the Data Import Handler you have to add a request handler to your solrconfig.xml. Request handlers are registered for a certain URL and, as the name suggests, are responsible for handling incoming requests:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

The file data-config.xml that is referenced here contains the mapping logic as well as the endpoint to access:

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
    <dataSource type="URLDataSource" encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>
    <document>
        <entity name="blog"
                pk="url"
                url="http://fhopf.blogspot.com/feeds/posts/default?max-results=100"
                processor="XPathEntityProcessor"
                forEach="/feed/entry" transformer="DateFormatTransformer,HTMLStripTransformer,TemplateTransformer">
            <field column="title" xpath="/feed/entry/title"/>
            <field column="url" xpath="/feed/entry/link[@rel='alternate']/@href"/>
            <!-- 2012-03-07T21:35:51.229-08:00 -->
            <field column="last_modified" xpath="/feed/entry/updated" 
                dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss.SSS" locale="en"/>
            <field column="text" xpath="/feed/entry/content" stripHTML="true"/>
            <field column="category" xpath="/feed/entry/category/@term"/>
            <field column="type" template="blog"/> 
        </entity>
    </document>
</dataConfig>

First we configure which datasource to use. This is where you would plug in a different implementation, for example when fetching documents from a database.
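
For example, a minimal sketch of a JDBC datasource (the driver and connection settings are placeholders, not part of this setup):

<!-- example only: a database source instead of the URL source above -->
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/blogdb"
            user="solr"
            password="secret"/>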

The document element describes the fields that will be stored in the index. The attributes of the entity element determine where and how to fetch the data, most importantly the url and the processor. forEach contains an XPath expression that identifies the elements we'd like to loop over. The transformer attribute lists the classes that are available when mapping the remote XML to the Solr fields.

The field elements contain the mapping between the Atom document and the Solr index fields. The column attribute determines the name of the index field, xpath the node to use in the remote XML document. You can use advanced XPath constructs like selecting an element by one of its attributes and mapping a different attribute of it; the XPathEntityProcessor only supports a limited subset of XPath, but it is sufficient for expressions like this. E.g. /feed/entry/link[@rel='alternate']/@href points to the element that holds an alternative representation of a blog post entry:

<feed ...> 
  ...
  <entry> 
    ...
    <link rel='alternate' type='text/html' href='http://fhopf.blogspot.com/2012/03/testing-akka-actors-from-java.html' title='Testing Akka actors from Java'/>
    ...
  </entry>
...
</feed>

For the column last_modified we are transforming the remote date format to the internal Solr representation using the DateFormatTransformer. Note that the hour pattern has to be HH, the 24-hour clock, or timestamps after noon would be misparsed. I am not sure yet if this is the correct solution, as it seems to me I'm losing the timezone information. For the text field we are first removing all HTML elements contained in the blog post using the HTMLStripTransformer. Finally, the type field contains a hardcoded value that is set using the TemplateTransformer.
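
One idea for preserving the offset: starting with Java 7, SimpleDateFormat (which the DateFormatTransformer uses internally) supports the XXX pattern for ISO 8601 timezones like -08:00, so a variant like this might work, though I haven't tried it yet:

<!-- XXX parses ISO 8601 offsets such as -08:00; requires Java 7 -->
<field column="last_modified" xpath="/feed/entry/updated"
    dateTimeFormat="yyyy-MM-dd'T'HH:mm:ss.SSSXXX" locale="en"/>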

To have everything in one place, let's see what the schema for our index looks like:

<field name="url" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="type" type="string" indexed="true" stored="false"/>

Finally, how do you trigger the data import? There is an option described in the Solr wiki, but a simple solution might be enough for you. I am using a shell script that is triggered by a cron job. These are its contents:

#!/bin/bash
curl 'localhost:8983/solr/dataimport?command=full-import'
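
The handler understands some more commands. Note that full-import by default cleans the index first; pass clean=false if you want to keep the existing documents. You can also watch the progress of a running import with the status command:

curl 'localhost:8983/solr/dataimport?command=status'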

The Data Import Handler is really easy to set up, and you can use it to import quite a lot of data sources into your index. If you need more advanced crawling features you might want to have a look at Apache ManifoldCF, a connector framework for plugging content repositories into search engines like Apache Solr.