My first five Lucene mistakes
Lately I’ve been learning my way around Lucene, the popular open source information retrieval library for Java. Which is to say, I’ve been making mistakes with Lucene.
Much of the time spent with any new technology is in making mistakes, discovering mistakes, and correcting mistakes. An inexperienced programmer would view these mistakes as impediments to progress. It’s far more sensible, and more encouraging, to view these mistakes as the evidence of progress.
With that in mind, I’ll have no shame in sharing my newbie Lucene blunders, in the hope of helping someone who hasn’t yet had the opportunity to fail in all of these ways.
#1: Using a database.
As a software developer, I tend to think that data doesn’t exist unless it is in a relational database. Of course Lucene would index the data, I assumed, but that index, like the index in the back of a book, would simply tell me where to find the data. Not true. With Lucene, you can choose for each indexed field whether the field should also be stored in the index.
Obviously, the right way to store your data will depend on many things, but storing it in the index is the simplest option available. Not only does this mean that you don’t need to set up, connect to, and populate a database before beginning, but it streamlines the process of going from a text query to the resulting document.
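To make the indexed-versus-stored distinction concrete, here is a minimal sketch in plain Java. It is a toy model, not Lucene's API or file format: the class ToyStoredIndex and all of its methods are made up for illustration. Each field is indexed word by word, and its original value is kept only when the store flag is set.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: hypothetical names, not Lucene's API or on-disk format.
class ToyStoredIndex {
    // term -> ids of the documents containing it (the "back of the book" part)
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> field values kept verbatim (the "stored" part)
    private final List<Map<String, String>> stored = new ArrayList<>();

    int newDocument() {
        stored.add(new HashMap<>());
        return stored.size() - 1;
    }

    // Index every lowercased word of the value; keep the original only if store is true.
    void addField(int docId, String name, String value, boolean store) {
        for (String word : value.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(name + ":" + word, k -> new TreeSet<>()).add(docId);
            }
        }
        if (store) {
            stored.get(docId).put(name, value);
        }
    }

    Set<Integer> search(String field, String word) {
        return postings.getOrDefault(field + ":" + word.toLowerCase(), new TreeSet<>());
    }

    // Returns the stored value, or null if the field was indexed without being stored.
    String storedValue(int docId, String field) {
        return stored.get(docId).get(field);
    }

    public static void main(String[] args) {
        ToyStoredIndex index = new ToyStoredIndex();
        int id = index.newDocument();
        index.addField(id, "title", "Chapter One", true);
        index.addField(id, "body", "searchable but not stored", false);
        for (int hit : index.search("body", "searchable")) {
            System.out.println("title: " + index.storedValue(hit, "title"));
            System.out.println("body: " + index.storedValue(hit, "body"));
        }
    }
}
```

A search on the unstored body field still finds the document, and the stored title comes straight back from the index with no database in sight.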
#2: Storing numeric data in a Field
Lucene will be able to search on numeric data as text, so long as numbers aren’t removed during the indexing process, regardless of whether you add it as a Field
doc.add(new Field("chapter", String.valueOf(chapter), Field.Store.YES, Field.Index.NOT_ANALYZED));
or as a NumericField
doc.add(new NumericField("chapter", Field.Store.YES, true).setIntValue(chapter));
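The big difference is in querying a range of values. A NumericField is indexed so that numeric ranges work as expected, whereas numbers indexed as text are compared character by character, so "10" sorts before "9". Range queries can be nice.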
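Why does the text form make a poor basis for range queries? A small sketch in plain Java, with no Lucene involved, shows the effect. The class RangeComparison and its methods are hypothetical names for this illustration. It only compares strings against numbers and says nothing about how NumericField encodes values internally.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: compares text ordering with numeric ordering.
class RangeComparison {
    // Range check on the text form: terms are compared character by character.
    static List<Integer> textRange(int[] chapters, String low, String high) {
        List<Integer> hits = new ArrayList<>();
        for (int chapter : chapters) {
            String term = String.valueOf(chapter);
            if (term.compareTo(low) >= 0 && term.compareTo(high) <= 0) {
                hits.add(chapter);
            }
        }
        return hits;
    }

    // Range check on the numeric value itself.
    static List<Integer> numericRange(int[] chapters, int low, int high) {
        List<Integer> hits = new ArrayList<>();
        for (int chapter : chapters) {
            if (chapter >= low && chapter <= high) {
                hits.add(chapter);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int[] chapters = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12};
        System.out.println("as text:    " + textRange(chapters, "1", "3"));
        System.out.println("as numbers: " + numericRange(chapters, 1, 3));
    }
}
```

Asking for chapters 1 through 3 as text also drags in chapters 10, 11, and 12, because each of those strings falls between "1" and "3".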
#3: Leaving the IndexWriter open
We all know that it’s important to close resources (files, DB connections, etc.) that are no longer needed. But you can usually get away with being a little sloppy when you are working on a toy project at home with one user. Not so with the IndexWriter.
I indexed the documents that I was hoping to search, failed to call close() on the writer, and found the following files in the index directory:
_0.fdt _0.fdx write.lock
I should have guessed what this meant, but I didn’t, and when I tried to search, I received the following:
org.apache.lucene.index.IndexNotFoundException: no segments* file found in org.apache.lucene.store.MMapDirectory@/home/user/path/to/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@67e8a1f6: files: [write.lock, _0.fdx, _0.fdt]
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:712)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:75)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:462)
    ...
which clued me in to what was going on. The IndexWriter that I had long since forgotten about still owned the lock on the index and had never committed its changes. If I remember to close the writer, then the index directory has no write.lock file, and does have segments* files such as segments.gen. More importantly, my queries no longer give the above stack trace.
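The habit worth building is to make closing automatic. Below is a minimal sketch using a made-up ToyWriter, not Lucene's IndexWriter. It mimics only the file behavior described above: a write.lock appears on open, and a segments.gen file appears once the writer is closed. It assumes Java 7 or later for try-with-resources, which works with anything that implements Closeable. On older versions a try/finally block does the same job.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical stand-in for an index writer; it only imitates the files seen above.
class ToyWriter implements Closeable {
    private final Path dir;

    ToyWriter(Path dir) throws IOException {
        this.dir = dir;
        // createFile fails if the lock already exists, i.e. another writer is open.
        Files.createFile(dir.resolve("write.lock"));
    }

    void addDocument(String text) throws IOException {
        Files.write(dir.resolve("_0.fdt"), (text + "\n").getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Closing is what makes the toy index visible to readers and releases the lock.
    @Override
    public void close() throws IOException {
        Files.write(dir.resolve("segments.gen"), "committed".getBytes(StandardCharsets.UTF_8));
        Files.delete(dir.resolve("write.lock"));
    }

    static boolean isSearchable(Path dir) {
        return Files.exists(dir.resolve("segments.gen"));
    }

    // Runs one open/add/close cycle in a temporary directory and reports what it saw.
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("toy-index");
            boolean whileOpen;
            try (ToyWriter writer = new ToyWriter(dir)) {
                writer.addDocument("first document");
                whileOpen = isSearchable(dir);
            } // close() runs here, even if addDocument throws
            return "searchable while open: " + whileOpen
                    + ", after close: " + isSearchable(dir)
                    + ", lock left behind: " + Files.exists(dir.resolve("write.lock"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With the writer declared in the try header, there is no close() call to forget.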
#4: Setting text field to NOT_ANALYZED
In following an example, I used the following to add a free text field to a Document:
doc.add(new Field("text", text, Field.Store.YES, Field.Index.NOT_ANALYZED));
I didn’t give much thought to the NOT_ANALYZED, as I didn’t really care about analysis at this point … or so I thought.
But I did care that none of my queries found anything. The explanation is pretty simple: NOT_ANALYZED means that the text is not broken down into individual tokens, so the whole value is indexed as a single term. In this case it can only match exact queries, which only makes sense for fields containing very few words, or fields that you don’t care to query. For any significant text that you want to be searchable, use Field.Index.ANALYZED.
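The difference is easy to see in a sketch that treats an index as nothing more than a set of terms. This is plain Java with hypothetical names (AnalysisSketch, singleTerm, tokens). The tokenizer here is a crude stand-in that lowercases and splits on anything that is not a letter or digit. Real analyzers vary and usually do more.

```java
import java.util.HashSet;
import java.util.Set;

// Illustration only: an "index" reduced to the set of terms it would contain.
class AnalysisSketch {
    // NOT_ANALYZED style: the entire value becomes one term.
    static Set<String> singleTerm(String text) {
        Set<String> terms = new HashSet<>();
        terms.add(text);
        return terms;
    }

    // ANALYZED style, heavily simplified: lowercase, then split into words.
    static Set<String> tokens(String text) {
        Set<String> terms = new HashSet<>();
        for (String word : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                terms.add(word);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "The quick brown fox";
        System.out.println("single term matches 'fox': " + singleTerm(text).contains("fox"));
        System.out.println("single term matches whole text: " + singleTerm(text).contains(text));
        System.out.println("tokens match 'fox': " + tokens(text).contains("fox"));
    }
}
```

A one-word query finds nothing in the single-term version unless the word is the entire field value, which is exactly the silence my queries were met with.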
#5: Thinking that TopDocs contained my documents
So I understood that my query, when executed, returned a TopDocs object. I’ll return that to the caller, who can then dig in and find the desired fields, right? Wrong.
TopDocs does contain an array of ScoreDoc objects (its scoreDocs field), each of which contains the id of a relevant document. But to use that id in finding anything else about the document, you need the IndexSearcher itself. So the class that interacted with the IndexSearcher should also have the responsibility of returning something more understandable, perhaps a list of domain objects.
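One way to picture that responsibility is sketched below in plain Java. Chapter and ChapterSearch are hypothetical names, and the list inside ChapterSearch stands in for the index and its searcher. The point is only the shape of the design: raw ids stay inside the class that can resolve them, and callers receive domain objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical domain object; not part of Lucene.
class Chapter {
    final int number;
    final String title;

    Chapter(int number, String title) {
        this.number = number;
        this.title = title;
    }
}

// Hypothetical stand-in for the class that owns the searcher.
class ChapterSearch {
    // Position in this list plays the role of the document id.
    private final List<Chapter> storedDocs = new ArrayList<>();

    void add(Chapter chapter) {
        storedDocs.add(chapter);
    }

    // Like the ids in a search result: useless to anyone without access to storedDocs.
    int[] findIds(String word) {
        List<Integer> ids = new ArrayList<>();
        for (int id = 0; id < storedDocs.size(); id++) {
            if (storedDocs.get(id).title.toLowerCase().contains(word.toLowerCase())) {
                ids.add(id);
            }
        }
        int[] result = new int[ids.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = ids.get(i);
        }
        return result;
    }

    // What callers actually want: ids resolved into domain objects before returning.
    List<Chapter> find(String word) {
        List<Chapter> chapters = new ArrayList<>();
        for (int id : findIds(word)) {
            chapters.add(storedDocs.get(id));
        }
        return chapters;
    }

    public static void main(String[] args) {
        ChapterSearch search = new ChapterSearch();
        search.add(new Chapter(1, "Getting Started"));
        search.add(new Chapter(2, "Indexing Documents"));
        search.add(new Chapter(3, "Searching Documents"));
        for (Chapter chapter : search.find("documents")) {
            System.out.println(chapter.number + ": " + chapter.title);
        }
    }
}
```

In real code the lookup inside find would go through the IndexSearcher rather than a list, but the caller's view stays the same: a list of chapters, not a bag of ids.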
I hope someone finds value in my first mistakes. What newbie Lucene mistakes have you made?