Looking at a Plaintext Lucene Index
07 Dec 2012
The Lucene file format is one of the reasons why Lucene is as fast as it is. An index consists of several binary files that you can't really inspect unless you use a tool like the fantastic Luke.
Starting with Lucene 4, the format of these files can be configured using the Codec API. Several implementations ship with the release, among them the SimpleTextCodec, which writes the files in plaintext for learning and debugging purposes.
To configure the Codec you just set it on the IndexWriterConfig:
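import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
// recreate the index on each execution
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
// write all index files in plaintext instead of the default binary format
config.setCodec(new SimpleTextCodec());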
The rest of the indexing process stays exactly the same as it used to be:
// plaintextDir is a java.io.File pointing to the index directory
Directory luceneDir = FSDirectory.open(plaintextDir);
// try-with-resources closes the writer, which also commits the documents
try (IndexWriter writer = new IndexWriter(luceneDir, config)) {
    writer.addDocument(Arrays.asList(
            new TextField("title", "The title of my first document", Store.YES),
            new TextField("content", "The content of the first document", Store.NO)));
    writer.addDocument(Arrays.asList(
            new TextField("title", "The title of the second document", Store.YES),
            new TextField("content", "And this is the content", Store.NO)));
}
After running this code the index directory contains several files. These are not the same kinds of files that the default codec creates.
ls /tmp/lucene-plaintext/
_1_0.len _1_1.len _1.fld _1.inf _1.pst _1.si segments_2 segments.gen
The segments_x file is the starting point (x depends on how many times you have written to the index before and starts with 1). It is still a binary file, but it records which codec was used to write the index: it contains the name of the Codec that was used for writing each segment.
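A side note on the x suffix: as far as I know, Lucene encodes this generation counter in base 36, so after segments_9 you would see segments_a rather than segments_10. Here is a minimal sketch in plain Java; SegmentsFileName is a made-up helper for illustration and not part of the Lucene API.

```java
public class SegmentsFileName {

    // Hypothetical helper, not a Lucene class. Assumption: the generation
    // is rendered in base 36 (Character.MAX_RADIX), digits first, then a-z.
    public static String forGeneration(long generation) {
        return "segments_" + Long.toString(generation, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        for (long generation : new long[] {1, 2, 10, 36}) {
            System.out.println(generation + " -> " + forGeneration(generation));
        }
    }
}
```

Under that assumption the index above, with its segments_2 file, is at generation 2.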
The rest of the index files are all plaintext. They do not contain the same information as their binary cousins. For example the .pst file represents the complete posting list, the structure you normally mean when talking about an inverted index:
field content
  term content
    doc 0
      freq 1
      pos 1
    doc 1
      freq 1
      pos 4
  term document
    doc 0
      freq 1
      pos 5
  term first
    doc 0
      freq 1
      pos 4
field title
  term document
    doc 0
      freq 1
      pos 5
    doc 1
      freq 1
      pos 5
  term first
    doc 0
      freq 1
      pos 4
  term my
    doc 0
      freq 1
      pos 3
  term second
    doc 1
      freq 1
      pos 4
  term title
    doc 0
      freq 1
      pos 1
    doc 1
      freq 1
      pos 1
END
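If you wonder where the pos values come from, the following sketch rebuilds the postings of the content field in plain Java. It is not Lucene's analysis chain: it only splits on whitespace and lowercases, and the stop word set is an assumed subset of what StandardAnalyzer removes by default. MiniInvertedIndex and its build method are names I made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class MiniInvertedIndex {

    // Assumption: a small subset of the default English stop words,
    // just enough for the two example documents.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "of", "and", "this", "is"));

    // Returns term -> (document id -> positions), with terms in sorted order.
    public static Map<String, Map<Integer, List<Integer>>> build(String... docs) {
        Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            String[] tokens = docs[docId].toLowerCase(Locale.ROOT).split("\\s+");
            for (int pos = 0; pos < tokens.length; pos++) {
                if (STOP_WORDS.contains(tokens[pos])) {
                    continue; // the token is dropped, but the position still counts
                }
                postings.computeIfAbsent(tokens[pos], t -> new TreeMap<>())
                        .computeIfAbsent(docId, d -> new ArrayList<>())
                        .add(pos);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        Map<String, Map<Integer, List<Integer>>> postings = build(
                "The content of the first document",
                "And this is the content");
        for (Map.Entry<String, Map<Integer, List<Integer>>> term : postings.entrySet()) {
            System.out.println("term " + term.getKey());
            for (Map.Entry<Integer, List<Integer>> doc : term.getValue().entrySet()) {
                System.out.println("  doc " + doc.getKey());
                System.out.println("    freq " + doc.getValue().size());
                for (int pos : doc.getValue()) {
                    System.out.println("    pos " + pos);
                }
            }
        }
    }
}
```

Because removed stop words still advance the position counter, "content" ends up at position 1 in the first document and at position 4 in the second, the same values as in the content field of the listing.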
The content that is marked as stored resides in the .fld file:
doc 0
  numfields 1
  field 0
    name title
    type string
    value The title of my first document
doc 1
  numfields 1
  field 0
    name title
    type string
    value The title of the second document
END
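Since the file is plain text, you can read the stored values back with a few lines of ordinary Java, without going through Lucene at all. The following is a rough sketch that only understands the line types shown in the listing and assumes every value fits on a single line; StoredFieldsReader is a hypothetical class name, not something Lucene provides.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StoredFieldsReader {

    // Returns doc id -> (field name -> stored value).
    public static Map<Integer, Map<String, String>> parse(List<String> lines) {
        Map<Integer, Map<String, String>> docs = new TreeMap<>();
        Map<String, String> current = null;
        String fieldName = null;
        for (String raw : lines) {
            String line = raw.trim(); // nested entries may be indented
            if (line.equals("END")) {
                break;
            } else if (line.startsWith("doc ")) {
                current = new LinkedHashMap<>();
                docs.put(Integer.parseInt(line.substring(4)), current);
            } else if (line.startsWith("name ")) {
                fieldName = line.substring(5);
            } else if (line.startsWith("value ") && current != null && fieldName != null) {
                current.put(fieldName, line.substring(6));
            }
            // numfields, field and type lines are not needed for this lookup
        }
        return docs;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "doc 0", "numfields 1", "field 0", "name title", "type string",
                "value The title of my first document",
                "doc 1", "numfields 1", "field 0", "name title", "type string",
                "value The title of the second document",
                "END");
        System.out.println(parse(lines).get(1).get("title"));
    }
}
```

To try it on a real index you could pass in the result of Files.readAllLines for the .fld file instead of the hard-coded list.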
If you'd like to have a look at the rest of the files, check out the code on GitHub.
The SimpleTextCodec is only an interesting byproduct. The Codec API can be used for a lot of useful things. For example, the ability to read indices written by older Lucene versions is implemented using separate codecs. Also, you can mix several codecs in one index, so reindexing on version updates should not be necessary immediately. I am sure more useful codecs will pop up in the future.