Friday, May 11, 2012

Content Extraction with Apache Tika

Sometimes you need access to the content of documents, be it that you want to analyze it, store it in a database or index it for searching. Different formats like Word documents, PDFs and HTML documents need different treatment. Apache Tika is a project that combines several open source projects for reading content from a multitude of file formats and makes the textual content as well as some metadata available through a uniform API. I will show two ways to leverage the power of Tika for your projects.

Accessing Tika programmatically

First, Tika can of course be used as a library. Surprisingly, the user docs on the website explain a lot of the functionality you might be interested in when writing custom parsers for Tika, but don't show directly how to use it.

I am using Maven again, so I add a dependency for the most recent version:
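For tika-parsers the dependency looks roughly like this (the version shown is an assumption, use a current release):

```xml
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.1</version>
</dependency>
```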


tika-parsers also pulls in all the other projects that are used, so be patient while Maven fetches all the transitive dependencies.

Let's see what some test code for extracting data from a PDF document called slides.pdf, available on the classpath, looks like.

Parser parser = new PDFParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
InputStream content = getClass().getResourceAsStream("/slides.pdf");
parser.parse(content, handler, metadata, new ParseContext());
assertEquals("Solr Vortrag", metadata.get(Metadata.TITLE));

First, we need to instantiate a Parser that is capable of reading the format, in this case PDFParser, which uses Apache PDFBox for extracting the content. The parse method expects some parameters that configure the parsing process as well as an InputStream that contains the data of the document. After parsing has finished, Metadata contains all the metadata for the document, e.g. the title or the author.

Tika uses XHTML as the internal representation for all parsed content. This XHTML document can be processed by a SAX ContentHandler. A custom implementation BodyContentHandler returns all the text in the body area, which is the main content. The last parameter ParseContext can be used to configure the underlying parser instance.

The Metadata class is a Map-like structure with some common keys like the title as well as optional format-specific information. You can look at the contents with a simple loop:

for (String name: metadata.names()) {
    System.out.println(name + ": " + metadata.get(name));
}

This will produce an output similar to this:

xmpTPg:NPages: 17
Creation-Date: 2010-11-20T09:47:28Z
title: Solr Vortrag
created: Sat Nov 20 10:47:28 CET 2010
producer: 2.4
Content-Type: application/pdf
creator: Impress

The textual content of the document can be retrieved by calling the toString() method on the BodyContentHandler.
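Putting this together, a small helper for extracting the plain text of a PDF might look like this sketch (the class and method names are mine, not from Tika; the Tika jars need to be on the classpath):

```java
import java.io.InputStream;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

public class PdfTextExtractor {

    // parses the stream and returns the plain text of the body area
    public String extractText(InputStream content) throws Exception {
        Parser parser = new PDFParser();
        BodyContentHandler handler = new BodyContentHandler();
        parser.parse(content, handler, new Metadata(), new ParseContext());
        return handler.toString();
    }
}
```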

This is all fine if you know exactly that you only want to retrieve data from PDF documents. But you probably don't want to introduce a huge switch block for determining the parser to use depending on the file name or some other information. Fortunately Tika also provides an AutoDetectParser that employs different strategies for determining the content type of the document. All the code above stays the same, you just use a different parser:

Parser parser = new AutoDetectParser();

This way you don't have to know what kind of document you are currently processing; Tika will provide you with the metadata as well as the content. You can pass in additional hints for the parser, e.g. the file name or the content type, by setting them in the Metadata object.
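For example, the file name can be passed as a hint via RESOURCE_NAME_KEY, the key Tika checks for the resource name (a fragment continuing the example above):

```java
Metadata metadata = new Metadata();
// hint for the type detection, e.g. via the file extension
metadata.set(Metadata.RESOURCE_NAME_KEY, "slides.pdf");
parser.parse(content, handler, metadata, new ParseContext());
```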

Extracting content using Solr

If you are using the search server Solr you can also leverage its HTTP API for extracting the content. The default configuration has a request handler configured for /update/extract that you can send a document to; it will return the content it extracted using Tika. You just need to add the necessary libraries for the extraction. I am still using Maven, so I have to add an additional dependency:
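The Solr Cell dependency looks roughly like this (the version is an assumption and should match your Solr version):

```xml
<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-cell</artifactId>
    <version>3.6.0</version>
</dependency>
```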


This will include all of the Tika dependencies as well as all necessary third party libraries.

Solr Cell, the request handler, is normally used to index binary files directly, but you can also use it just for extraction. To transfer the content you can use any tool that speaks HTTP; with curl it might look like this:

curl -F "file=@slides.pdf" "localhost:8983/solr/update/extract?extractOnly=true&extractFormat=text"

By setting the parameter extractOnly to true we tell Solr that we don't want to index the content but want to have it extracted into the response. The result will be the standard Solr XML format that contains the body content as well as the metadata.

You can also use the Java client library SolrJ for doing the same:

ContentStreamUpdateRequest request = new ContentStreamUpdateRequest("/update/extract");
request.addFile(new File("slides.pdf"));
request.setParam("extractOnly", "true");
request.setParam("extractFormat", "text");
NamedList<Object> result = server.request(request);

The NamedList will contain entries for the body content as well as another NamedList with the metadata.
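Reading the values back is a matter of looking up the right keys; as the response further down shows, the text is stored under the file name and the metadata under the file name with a _metadata suffix:

```java
// the extracted text is keyed by the file name
String text = (String) result.get("slides.pdf");
// the metadata is another NamedList under <filename>_metadata
NamedList<?> metadata = (NamedList<?>) result.get("slides.pdf_metadata");
```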


Robert has asked in the comments what the response looks like.
Solr uses configurable response writers for marshalling the message. The default format is XML but can be influenced by passing the wt parameter with the request. A simplified standard response looks like this:

curl -F "file=@slides.pdf" "localhost:8983/solr/update/extract?extractOnly=true&extractFormat=text"
<?xml version="1.0" encoding="UTF-8"?>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1952</int></lst><str name="slides.pdf">


XML­basierte Konfiguration
Sammlung nützlicher Lucene­Module/Dismax


XML­basierte Konfiguration
Sammlung nützlicher Lucene­Module/Dismax
Java­Client SolrJ

[... more content ...]

</str><lst name="slides.pdf_metadata"><arr name="xmpTPg:NPages"><str>17</str></arr><arr name="Creation-Date"><str>2010-11-20T09:47:28Z</str></arr><arr name="title"><str>Solr Vortrag</str></arr><arr name="stream_source_info"><str>file</str></arr><arr name="created"><str>Sat Nov 20 10:47:28 CET 2010</str></arr><arr name="stream_content_type"><str>application/octet-stream</str></arr><arr name="stream_size"><str>425327</str></arr><arr name="producer"><str> 2.4</str></arr><arr name="stream_name"><str>slides.pdf</str></arr><arr name="Content-Type"><str>application/pdf</str></arr><arr name="creator"><str>Impress</str></arr></lst>

The response contains a header (including how long the processing took), the content of the file as well as the metadata that was extracted from the document.

If you pass the attribute wt and set it to json, the response is delivered as a JSON structure:

curl -F "file=@slides.pdf" "localhost:8983/solr/update/extract?extractOnly=true&extractFormat=text&wt=json"             
{"responseHeader":{"status":0,"QTime":217},"slides.pdf":"\n\n\n\n\n\n\n\n\n\n\n\nSolr Vortrag\n\n   \n\nEinfach mehr finden mit\n\nFlorian Hopf\n29.09.2010\n\n\n   \n\nSolr?\n\n\n   \n\nSolr?\n\nServer­ization of Lucene\n\n\n   \n\nApache Lucene?\n\nSearch engine library\n\n\n   \n\nApache Lucene?\n\nSearch engine library\nTextbasierter Index\n\n\n   \n\nApache Lucene?\n\nSearch engine library\nTextbasierter Index\nText Analyzer\n\n\n   \n\nApache Lucene?\n\nSearch engine library\nTextbasierter Index\nText Analyzer\nQuery Syntax \n\n\n   \n\nApache Lucene?\n\nSearch engine library\nTextbasierter Index\nText Analyzer\nQuery Syntax \nScoring\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\n\n\n   \n\nArchitektur\n\nClient SolrWebapp Lucene\nhttp\n\nKommunikation über XML, JSON, JavaBin, Ruby, ...\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\nXML­basierte Konfiguration\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\nXML­basierte Konfiguration\nFacettierung\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\nXML­basierte Konfiguration\nFacettierung\nSammlung nützlicher Lucene­Module/Dismax\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\nXML­basierte Konfiguration\nFacettierung\nSammlung nützlicher Lucene­Module/Dismax\nJava­Client SolrJ\n\n\n   \n\nDemo\n\n\n   \n\nWas noch?\nAdmin­Interface\nCaching\nSkalierung\nSpellchecker\nMore­Like­This\nData Import Handler\nSolrCell\n\n\n   \n\nRessourcen\n\n\n\n\n","slides.pdf_metadata":["xmpTPg:NPages",["17"],"Creation-Date",["2010-11-20T09:47:28Z"],"title",["Solr Vortrag"],"stream_source_info",["file"],"created",["Sat Nov 20 10:47:28 CET 2010"],"stream_content_type",["application/octet-stream"],"stream_size",["425327"],"producer",[" 2.4"],"stream_name",["slides.pdf"],"Content-Type",["application/pdf"],"creator",["Impress"]]}

There are quite a few ResponseWriters available for different languages, e.g. for Ruby; you can have a look at them in the Solr documentation.

Monday, May 7, 2012

Importing Atom feeds in Solr using the Data Import Handler

I am working on a search solution that makes some of the content I am producing available through one search interface. One of the content stores is the blog you are reading right now, which among other options makes the content available here using Atom.

Solr, my search server of choice, provides the Data Import Handler that can be used to import data on a regular basis from sources like databases via JDBC or remote XML sources, like Atom.

The Data Import Handler used to be a core part of Solr, but starting with 3.1 it is shipped as a separate jar and no longer included in the standard war. I am using Maven with overlays for development, so I have to add a dependency for it:
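The dependency looks roughly like this (again, the version is an assumption and should match your Solr version):

```xml
<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-dataimporthandler</artifactId>
    <version>3.6.0</version>
</dependency>
```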


To enable the data import handler you have to add a request handler to your solrconfig.xml. Request handlers are registered for a certain url and, as the name suggests, are responsible for handling incoming requests:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

The file data-config.xml that is referenced here contains the mapping logic as well as the endpoint to access:

<?xml version="1.0" encoding="UTF-8" ?>
<dataConfig>
    <dataSource type="URLDataSource" encoding="UTF-8" connectionTimeout="5000" readTimeout="10000"/>
    <document>
        <entity name="blog"
                url="http://fhopf.blogspot.com/feeds/posts/default"
                processor="XPathEntityProcessor"
                forEach="/feed/entry"
                transformer="DateFormatTransformer,HTMLStripTransformer,TemplateTransformer">
            <field column="title" xpath="/feed/entry/title"/>
            <field column="url" xpath="/feed/entry/link[@rel='alternate']/@href"/>
            <!-- 2012-03-07T21:35:51.229-08:00 -->
            <field column="last_modified" xpath="/feed/entry/updated"
                dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss.SSS" locale="en"/>
            <field column="text" xpath="/feed/entry/content" stripHTML="true"/>
            <field column="category" xpath="/feed/entry/category/@term"/>
            <field column="type" template="blog"/>
        </entity>
    </document>
</dataConfig>

First we configure which data source to use. This is where you would use another implementation if you were fetching documents from a database instead.

The document element describes the fields that will be stored in the index. The attributes of the entity element determine where and how to fetch the data, most importantly the url and the processor. forEach contains an XPath expression that identifies the elements we'd like to loop over. The transformer attribute specifies some classes that are available when mapping the remote XML to the Solr fields.

The field elements contain the mapping between the Atom document and the Solr index fields. The column attribute determines the name of the index field, xpath the node to use in the remote XML document. You can use advanced XPath features like selecting an element's attribute based on the value of another attribute. E.g. /feed/entry/link[@rel='alternate']/@href points to an element that contains an alternative representation of a blog post entry:

<feed ...> 
    <link rel='alternate' type='text/html' href='' title='Testing Akka actors from Java'/>

For the column last_modified we transform the remote date format into the internal Solr representation using the DateFormatTransformer. I am not sure yet if this is the correct solution, as it seems I'm losing the timezone information. For the text field we first remove all HTML elements contained in the blog post using the HTMLStripTransformer. Finally, the type field contains a hardcoded value that is set using the TemplateTransformer.

To have everything in one place, let's see what the schema for our index looks like:

<field name="url" type="string" indexed="true" stored="true" required="true"/>
<field name="title" type="text_general" indexed="true" stored="true"/>
<field name="category" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="last_modified" type="date" indexed="true" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="type" type="string" indexed="true" stored="false"/>

Finally, how can you trigger the data import? There is an option for scheduling described in the Solr wiki, but a simpler solution might be enough for you. I am using a shell script that is triggered by a cron job. These are its contents:

curl localhost:8983/solr/dataimport?command=full-import
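Assuming the script is saved as /usr/local/bin/blog-import.sh (the path is made up), a crontab entry that runs the import every night at 3 am could look like this:

```shell
# m h dom mon dow command
0 3 * * * /usr/local/bin/blog-import.sh
```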

The Data Import Handler is really easy to set up and you can use it to import quite a lot of data sources into your index. If you need more advanced crawling features you might want to have a look at Apache ManifoldCF, a connector framework for plugging content repositories into search engines like Apache Solr.