Wednesday, December 19, 2012

Gradle is too Clever for my Plans

While writing this post about the Lucene Codec API I noticed something strange when running the tests with Gradle. When experimenting with a library feature I usually write unit tests that validate my expectations. This is a habit I learned from Lucene in Action, and it is also useful in real-world scenarios, e.g. to make sure that nothing breaks when you update a library.

OK, what happened? This time I did not only want the test result but also ran the test for a side effect: I wanted a Lucene index to be written to the /tmp directory so that I could have a look at it manually. This worked fine the first time, but not afterwards, e.g. after my machine was rebooted and the directory cleared.

It turns out that the Gradle developers know that a test shouldn't be used to execute stuff. So once a test has run successfully it is simply not run again until its input changes! Though it bit me this time, this is a really nice feature that speeds up your builds. And if you really need to execute the tests again, you can always run gradle cleanTest test.
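
The skipping behaviour can be illustrated with a small plain Java sketch. This is not how Gradle is implemented; it only assumes that a task remembers a fingerprint of its inputs from the last successful run and skips the work when the fingerprint is unchanged. The class UpToDateSketch and its methods are hypothetical names made up for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical illustration of an up-to-date check, not Gradle's actual code.
public class UpToDateSketch {

    // Fingerprint of the inputs of the last successful run, null if there is none.
    private byte[] lastFingerprint;

    /** Runs the task only if the inputs changed since the last successful run. */
    public boolean runIfOutOfDate(String inputs, Runnable task) {
        byte[] fingerprint = fingerprint(inputs);
        if (Arrays.equals(fingerprint, lastFingerprint)) {
            // Considered up to date: the task and its side effects are skipped.
            return false;
        }
        task.run();
        lastFingerprint = fingerprint;
        return true;
    }

    /** Forgets the previous run, comparable in spirit to running cleanTest first. */
    public void clean() {
        lastFingerprint = null;
    }

    private static byte[] fingerprint(String inputs) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(inputs.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should always be available", e);
        }
    }

    public static void main(String[] args) {
        UpToDateSketch check = new UpToDateSketch();
        Runnable writeIndex = () -> System.out.println("writing index to /tmp");
        check.runIfOutOfDate("test sources v1", writeIndex); // runs
        check.runIfOutOfDate("test sources v1", writeIndex); // skipped, inputs unchanged
        check.clean();
        check.runIfOutOfDate("test sources v1", writeIndex); // runs again
    }
}
```

The second call is exactly the situation described above: the inputs are unchanged, so the index is not written again even though the directory may have been cleared in the meantime.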

Friday, December 7, 2012

Looking at a Plaintext Lucene Index

The Lucene file format is one of the reasons why Lucene is as fast as it is. An index consists of several binary files that you can't really inspect unless you use tools like the fantastic Luke.

Starting with Lucene 4, the format of these files can be configured using the Codec API. Several implementations are provided with the release, among them the SimpleTextCodec, which can be used to write the files in plaintext for learning and debugging purposes.

To configure the Codec you just set it on the IndexWriterConfig:

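import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
// recreate the index on each execution
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
config.setCodec(new SimpleTextCodec());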

The rest of the indexing process stays exactly the same as it used to be:

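import java.util.Arrays;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// plaintextDir is a java.io.File pointing to the index directory
Directory luceneDir = FSDirectory.open(plaintextDir);
// try-with-resources closes the writer, which also commits the documents
try (IndexWriter writer = new IndexWriter(luceneDir, config)) {
    writer.addDocument(Arrays.asList(
            new TextField("title", "The title of my first document", Store.YES),
            new TextField("content", "The content of the first document", Store.NO)));

    writer.addDocument(Arrays.asList(
            new TextField("title", "The title of the second document", Store.YES),
            new TextField("content", "And this is the content", Store.NO)));
}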

After running this code the index directory contains several files. Those are not the same type of files that are created using the default codec.

ls /tmp/lucene-plaintext/
_1_0.len  _1_1.len  _1.fld  _1.inf  _1.pst  _1.si  segments_2  segments.gen

The segments_x file is the starting point (x depends on the number of times you have written to the index before and starts with 1). This is still a binary file, but it records which Codec is used to write to the index: it contains the name of the Codec that was used for writing each segment.

The rest of the index files are all plaintext. They do not contain the same information as their binary cousins. For example the .pst file represents the complete posting list, the structure you normally mean when talking about an inverted index:

field content
  term content
    doc 0
      freq 1
      pos 1
    doc 1
      freq 1
      pos 4
  term document
    doc 0
      freq 1
      pos 5
  term first
    doc 0
      freq 1
      pos 4
field title
  term document
    doc 0
      freq 1
      pos 5
    doc 1
      freq 1
      pos 5
  term first
    doc 0
      freq 1
      pos 4
  term my
    doc 0
      freq 1
      pos 3
  term second
    doc 1
      freq 1
      pos 4
  term title
    doc 0
      freq 1
      pos 1
    doc 1
      freq 1
      pos 1
END
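
To make the listing easier to read, here is a minimal plain Java sketch that builds the same kind of structure for the content field without using Lucene. It is only an approximation of what the analyzer and the codec do: the class PostingListSketch is a made-up name, the tokenizer simply splits on whitespace, and the stop word set is an assumed small subset of the words StandardAnalyzer removes by default.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical illustration of a positional posting list, not Lucene code.
public class PostingListSketch {

    // Assumption: a small subset of the default English stop words.
    private static final Set<String> STOP_WORDS = Set.of("the", "of", "and", "this", "is");

    // term -> (document id -> positions of the term in that document), both sorted
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    public void addDocument(int docId, String text) {
        String[] tokens = text.trim().toLowerCase(Locale.ROOT).split("\\s+");
        for (int position = 0; position < tokens.length; position++) {
            String term = tokens[position];
            // Stop words are not indexed, but they still consume a position.
            if (term.isEmpty() || STOP_WORDS.contains(term)) {
                continue;
            }
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position);
        }
    }

    public Map<String, Map<Integer, List<Integer>>> postings() {
        return postings;
    }

    /** Renders the postings in a layout similar to the .pst listing. */
    public String dump(String field) {
        StringBuilder out = new StringBuilder("field " + field + "\n");
        for (Map.Entry<String, Map<Integer, List<Integer>>> term : postings.entrySet()) {
            out.append("  term ").append(term.getKey()).append("\n");
            for (Map.Entry<Integer, List<Integer>> doc : term.getValue().entrySet()) {
                out.append("    doc ").append(doc.getKey()).append("\n");
                out.append("      freq ").append(doc.getValue().size()).append("\n");
                for (int position : doc.getValue()) {
                    out.append("      pos ").append(position).append("\n");
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        PostingListSketch content = new PostingListSketch();
        content.addDocument(0, "The content of the first document");
        content.addDocument(1, "And this is the content");
        System.out.print(content.dump("content"));
    }
}
```

The sketch also explains why the positions in the listing have gaps: the removed stop words are not indexed, but the position counter keeps running, so "content" ends up at position 4 in the second document.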

The content that is marked as stored resides in the .fld file:

doc 0
  numfields 1
  field 0
    name title
    type string
    value The title of my first document
doc 1
  numfields 1
  field 0
    name title
    type string
    value The title of the second document
END
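
Because the file is plaintext, the stored values can be read back with a few lines of code. The following sketch is based only on the excerpt shown above, not on the full format written by the SimpleTextCodec, and the class StoredFieldsSketch is a made-up name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for the excerpt above, not part of Lucene.
public class StoredFieldsSketch {

    /** Returns one map (field name -> stored value) per document, in document order. */
    public static List<Map<String, String>> parse(String fldContent) {
        List<Map<String, String>> docs = new ArrayList<>();
        Map<String, String> current = null;
        String name = null;
        for (String rawLine : fldContent.split("\n")) {
            String line = rawLine.trim();
            if (line.equals("END")) {
                break;
            } else if (line.startsWith("doc ")) {
                // A new document starts, its fields follow on the next lines.
                current = new LinkedHashMap<>();
                docs.add(current);
                name = null;
            } else if (line.startsWith("name ")) {
                name = line.substring("name ".length());
            } else if (line.startsWith("value ") && current != null && name != null) {
                current.put(name, line.substring("value ".length()));
            }
            // numfields, field and type lines are not needed for this sketch.
        }
        return docs;
    }

    public static void main(String[] args) {
        String fld = String.join("\n",
                "doc 0",
                "  numfields 1",
                "  field 0",
                "    name title",
                "    type string",
                "    value The title of my first document",
                "END");
        System.out.println(parse(fld).get(0).get("title"));
    }
}
```

Only the title shows up here, because the content field was indexed with Store.NO.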

If you'd like to have a look at the rest of the files, check out the code on GitHub.

The SimpleTextCodec is only an interesting byproduct. The Codec API can be used for a lot of useful things. For example, the feature to read indices of older Lucene versions is implemented using separate codecs. Also, you can mix several Codecs in an index, so reindexing on version updates should not be necessary immediately. I am sure more useful codecs will pop up in the future.