Content Extraction with Apache Tika

Sometimes you need access to the content of documents, be it to analyze it, store it in a database or index it for searching. Different formats like Word documents, PDFs and HTML documents need different treatment. Apache Tika is a project that combines several open source projects for reading content from a multitude of file formats and makes the textual content as well as some metadata available using a uniform API. I will show two ways to leverage the power of Tika for your projects.

Accessing Tika programmatically

First, Tika can of course be used as a library. Surprisingly, the user docs on the website explain a lot of the functionality that you might be interested in when writing custom parsers for Tika, but they don't show directly how to use it.

I am using Maven again, so I add a dependency for the most recent version:

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.1</version>
    <type>jar</type>
</dependency>

tika-parsers also pulls in all the other projects that are used, so be patient when Maven fetches all the transitive dependencies.

Let's see what some test code for extracting data from a PDF document called slides.pdf, which is available on the classpath, looks like.

Parser parser = new PDFParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
InputStream content = getClass().getResourceAsStream("/slides.pdf");
parser.parse(content, handler, metadata, new ParseContext());
assertEquals("Solr Vortrag", metadata.get(Metadata.TITLE));
assertTrue(handler.toString().contains("Lucene"));

First, we need to instantiate a Parser that is capable of reading the format, in this case a PDFParser that uses PDFBox for extracting the content. The parse method expects some parameters that configure the parsing process as well as an InputStream that contains the data of the document. After parsing is finished, Metadata contains all the metadata for the document, e.g. the title or the author.

Tika uses XHTML as the internal representation for all parsed content. This XHTML document can be processed by a SAX ContentHandler. The custom implementation BodyContentHandler returns all the text in the body area, which is the main content. The last parameter, ParseContext, can be used to configure the underlying parser instance.
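One thing to be aware of: the no-argument BodyContentHandler buffers the whole text in memory and applies a write limit (100,000 characters in the versions I looked at), so very large documents may get truncated. A small sketch of the alternative constructors, with extracted.txt being just an example file name:

// -1 disables the write limit, any positive value sets the maximum number of characters
BodyContentHandler unlimitedHandler = new BodyContentHandler(-1);

// or stream the extracted text to a Writer instead of buffering it in memory
BodyContentHandler streamingHandler = new BodyContentHandler(new FileWriter("extracted.txt"));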

The Metadata class consists of a Map-like structure with some common keys like the title as well as optional format-specific information. You can look at the contents with a simple loop:

for (String name: metadata.names()) {
    System.out.println(name + ": " + metadata.get(name));
}

This will produce an output similar to this:

xmpTPg:NPages: 17
Creation-Date: 2010-11-20T09:47:28Z
title: Solr Vortrag
created: Sat Nov 20 10:47:28 CET 2010
producer: OpenOffice.org 2.4
Content-Type: application/pdf
creator: Impress

The textual content of the document can be retrieved by calling the toString() method on the BodyContentHandler.

This is all fine if you know for sure that you only want to retrieve data from PDF documents. But you probably don't want to introduce a huge switch block for determining the parser to use depending on the file name or some other information. Fortunately Tika also provides an AutoDetectParser that employs different strategies for determining the content type of the document. All of the code above stays the same, you just use a different parser:

Parser parser = new AutoDetectParser();

This way you don't have to know what kind of document you are currently processing, Tika will provide you with the metadata as well as the content. You can pass in additional hints for the parser, e.g. the file name or the content type, by setting them in the Metadata object, as shown below.
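A minimal sketch of passing such hints, reusing the example document from above (the values are of course specific to slides.pdf):

Metadata metadata = new Metadata();
// hints for the content type detection: the original file name and, if known, the content type
metadata.set(Metadata.RESOURCE_NAME_KEY, "slides.pdf");
metadata.set(Metadata.CONTENT_TYPE, "application/pdf");
parser.parse(getClass().getResourceAsStream("/slides.pdf"), new BodyContentHandler(), metadata, new ParseContext());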

Extracting content using Solr

If you are using the search server Solr you can also leverage its REST API for extracting the content. The default configuration has a request handler registered for /update/extract that you can send a document to and that will return the content it extracted using Tika. You just need to add the necessary libraries for the extraction. I am still using Maven, so I have to add additional dependencies:

<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr</artifactId>
    <version>3.6.0</version>
    <type>war</type>
</dependency>
<dependency>
    <groupId>org.apache.solr</groupId>
    <artifactId>solr-cell</artifactId>
    <version>3.6.0</version>
    <type>jar</type>
</dependency>

This will include all of the Tika dependencies as well as all necessary third party libraries.
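For reference, the extraction handler is registered in solrconfig.xml; in the example configuration that ships with Solr the entry looks roughly like this (the exact defaults and field mappings differ between versions, the ones below are just illustrative):

<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler" startup="lazy">
    <lst name="defaults">
        <!-- map the extracted body content to the text field -->
        <str name="fmap.content">text</str>
        <str name="lowernames">true</str>
    </lst>
</requestHandler>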

Solr Cell, the request handler behind this endpoint, is normally used to index binary files directly, but you can also use it just for extraction. To transfer the content you can use any tool that speaks HTTP; with curl it might look like this:

curl -F "file=@slides.pdf" "localhost:8983/solr/update/extract?extractOnly=true&extractFormat=text"

By setting the parameter extractOnly to true we tell Solr that we don't want to index the content but only want it extracted into the response. The result will be the standard Solr XML format that contains the body content as well as the metadata.

You can also use the Java client library SolrJ for doing the same:

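// server is a SolrServer instance, e.g. an HttpSolrServer pointing at http://localhost:8983/solr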
ContentStreamUpdateRequest request = new ContentStreamUpdateRequest("/update/extract");
request.addFile(new File("slides.pdf"));
request.setParam("extractOnly", "true");
request.setParam("extractFormat", "text");
NamedList<Object> result = server.request(request);

The NamedList will contain an entry for the body content as well as another NamedList with the metadata.
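You can look the values up by the name of the content stream, which in this case is the file name; the metadata arrives under the same name with a _metadata suffix (see the response in the update below). A short sketch, assuming the request from above:

String text = (String) result.get("slides.pdf");
NamedList<?> fileMetadata = (NamedList<?>) result.get("slides.pdf_metadata");
// metadata values are returned as lists, e.g. [Solr Vortrag] for the title
System.out.println(fileMetadata.get("title"));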


Update


Robert has asked in the comments what the response looks like.
Solr uses configurable response writers for marshalling the message. The default format is XML, but it can be influenced by passing the wt parameter with the request. A simplified standard response looks like this:


curl -F "file=@slides.pdf" "localhost:8983/solr/update/extract?extractOnly=true&extractFormat=text"
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">1952</int></lst><str name="slides.pdf">

Features

HTTP­Schnittstelle
XML­basierte Konfiguration
Facettierung
Sammlung nützlicher Lucene­Module/Dismax

Features

HTTP­Schnittstelle
XML­basierte Konfiguration
Facettierung
Sammlung nützlicher Lucene­Module/Dismax
Java­Client SolrJ

[... more content ...]

</str><lst name="slides.pdf_metadata"><arr name="xmpTPg:NPages"><str>17</str></arr><arr name="Creation-Date"><str>2010-11-20T09:47:28Z</str></arr><arr name="title"><str>Solr Vortrag</str></arr><arr name="stream_source_info"><str>file</str></arr><arr name="created"><str>Sat Nov 20 10:47:28 CET 2010</str></arr><arr name="stream_content_type"><str>application/octet-stream</str></arr><arr name="stream_size"><str>425327</str></arr><arr name="producer"><str>OpenOffice.org 2.4</str></arr><arr name="stream_name"><str>slides.pdf</str></arr><arr name="Content-Type"><str>application/pdf</str></arr><arr name="creator"><str>Impress</str></arr></lst>
</response>

The response contains a header with some information about the processing (e.g. how long it took), the content of the file, as well as the metadata that was extracted from the document.


If you pass the parameter wt and set it to json, the response is contained in a JSON structure:


curl -F "file=@slides.pdf" "localhost:8983/solr/update/extract?extractOnly=true&extractFormat=text&wt=json"             
{"responseHeader":{"status":0,"QTime":217},"slides.pdf":"\n\n\n\n\n\n\n\n\n\n\n\nSolr Vortrag\n\n   \n\nEinfach mehr finden mit\n\nFlorian Hopf\n29.09.2010\n\n\n   \n\nSolr?\n\n\n   \n\nSolr?\n\nServer­ization of Lucene\n\n\n   \n\nApache Lucene?\n\nSearch engine library\n\n\n   \n\nApache Lucene?\n\nSearch engine library\nTextbasierter Index\n\n\n   \n\nApache Lucene?\n\nSearch engine library\nTextbasierter Index\nText Analyzer\n\n\n   \n\nApache Lucene?\n\nSearch engine library\nTextbasierter Index\nText Analyzer\nQuery Syntax \n\n\n   \n\nApache Lucene?\n\nSearch engine library\nTextbasierter Index\nText Analyzer\nQuery Syntax \nScoring\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\n\n\n   \n\nArchitektur\n\nClient SolrWebapp Lucene\nhttp\n\nKommunikation über XML, JSON, JavaBin, Ruby, ...\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\nXML­basierte Konfiguration\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\nXML­basierte Konfiguration\nFacettierung\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\nXML­basierte Konfiguration\nFacettierung\nSammlung nützlicher Lucene­Module/Dismax\n\n\n   \n\nFeatures\n\nHTTP­Schnittstelle\nXML­basierte Konfiguration\nFacettierung\nSammlung nützlicher Lucene­Module/Dismax\nJava­Client SolrJ\n\n\n   \n\nDemo\n\n\n   \n\nWas noch?\nAdmin­Interface\nCaching\nSkalierung\nSpellchecker\nMore­Like­This\nData Import Handler\nSolrCell\n\n\n   \n\nRessourcen\nhttp://lucene.apache.org/solr/\n\n\n\n","slides.pdf_metadata":["xmpTPg:NPages",["17"],"Creation-Date",["2010-11-20T09:47:28Z"],"title",["Solr Vortrag"],"stream_source_info",["file"],"created",["Sat Nov 20 10:47:28 CET 2010"],"stream_content_type",["application/octet-stream"],"stream_size",["425327"],"producer",["OpenOffice.org 2.4"],"stream_name",["slides.pdf"],"Content-Type",["application/pdf"],"creator",["Impress"]]}

There are quite a few response writers available for different formats and languages, e.g. for Ruby. You can have a look at them at the bottom of this page: http://wiki.apache.org/solr/QueryResponseWriter