Code and comments

Practical and theoretical aspects of software development

My first five Lucene mistakes



I’ve been learning my way around Lucene, the popular Java open source information retrieval software library, lately. Which is to say, I’ve been making mistakes with Lucene.

Much of the time spent with any new technology is in making mistakes, discovering mistakes, and correcting mistakes. An inexperienced programmer would view these mistakes as impediments to progress. It’s far more sensible, and more encouraging, to view these mistakes as evidence of progress.

With that in mind, I’ll have no shame in sharing my newbie Lucene blunders, in the hope of helping someone who hasn’t yet had the opportunity to fail in all of these ways.

#1: Using a database.

As a software developer, I tend to think that data doesn’t exist unless it is in a relational database. I assumed that Lucene would index the data, but that the index, like the index in the back of a book, would simply tell me where to find it. Not true. With Lucene, you can choose, for each indexed field, whether that field should be stored in the index or not.

Obviously, the right way to store your data will depend on many things, but storing it in the index is the simplest option available. Not only does this mean that you don’t need to set up, connect to, and populate a database before beginning, but it streamlines the process of going from a text query to the resulting document.
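As a sketch of what this looks like (the field name and setup here are my own, using the Lucene 3.x API that was current when I wrote this):

```java
// Index a document whose data lives entirely in the index itself.
Directory dir = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_36,
        new StandardAnalyzer(Version.LUCENE_36));
IndexWriter writer = new IndexWriter(dir, config);

Document doc = new Document();
doc.add(new Field("title", "Moby-Dick", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();

// Later, retrieve the stored field straight from the index, no database required.
IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
Document found = searcher.doc(0);
String title = found.get("title"); // the stored value comes back as-is
```

Because the field was added with Field.Store.YES, the original value comes back from the index itself.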

#2: Storing numeric data in Field, not NumericField.

Lucene can search numeric data as text, so long as the numbers aren’t stripped out during the indexing process, regardless of whether you add it as a Field

doc.add(new Field("chapter", String.valueOf(chapter), Field.Store.YES,
                  Field.Index.NOT_ANALYZED));

or as a NumericField

doc.add(new NumericField("chapter", Field.Store.YES, false).setIntValue(chapter));

The big difference comes in the ability to perform a query on a range of values. Which can be nice.
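For instance, a range query over a NumericField might look like this (a sketch against the Lucene 3.x API; the bounds and the searcher are assumed):

```java
// Find documents whose "chapter" field lies between 3 and 7, inclusive.
// This only works because the field was indexed with NumericField.
Query query = NumericRangeQuery.newIntRange("chapter", 3, 7, true, true);
TopDocs results = searcher.search(query, 10);
```

Against a text-indexed "chapter" field, a range like this would compare strings lexicographically rather than numerically.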

#3: Leaving the IndexWriter open.

We all know that it’s important to close resources (files, DB connections, etc.) that are no longer needed. But you can usually get away with being a little sloppy when you are working on a toy project at home with one user. Not so with the IndexWriter.

I indexed the documents that I was hoping to search, failed to call close() on the writer, and found the following files in the index directory:

_0.fdt
_0.fdx
write.lock

I should have guessed what this meant, but I didn’t, and when I tried to search, I received the following:

org.apache.lucene.index.IndexNotFoundException: no segments* file found in
    org.apache.lucene.store.MMapDirectory@/home/user/path/to/index
    lockFactory=org.apache.lucene.store.NativeFSLockFactory@67e8a1f6: files: [write.lock, _0.fdx, _0.fdt]
  at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:712)
  at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:75)
  at org.apache.lucene.index.IndexReader.open(IndexReader.java:462)
  ...

which clued me in to what was going on. The IndexWriter that I had long since forgotten about still held the lock on the index. If I remember to close the writer, then the index directory has no write.lock file, and instead has segments_1 and segments.gen files. More importantly, my queries no longer produce the above stack trace.
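The simple fix is to close the writer in a finally block, so it releases the lock even if indexing throws. A sketch, assuming dir, analyzer, and docs are set up elsewhere:

```java
IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_36, analyzer));
try {
    for (Document doc : docs) {
        writer.addDocument(doc);
    }
} finally {
    writer.close(); // commits the segments files and releases write.lock
}
```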

#4: Setting a text field to NOT_ANALYZED.

Working from an example, I used the following to add a free text field to a Document:

doc.add(new Field("text", text, Field.Store.YES,
                  Field.Index.NOT_ANALYZED));

I didn’t give much thought to the NOT_ANALYZED, as I didn’t really care about analysis at this point … or so I thought.

But I did care about none of my queries finding anything. It’s pretty simple: NOT_ANALYZED means that the text is not broken down into individual tokens, so the field can only match exact queries. That only makes sense for fields containing very few words, or fields that you don’t care to query. For any significant text that you want to be searchable, use ANALYZED.
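The corrected version of the snippet above is a one-character-class change:

```java
doc.add(new Field("text", text, Field.Store.YES,
                  Field.Index.ANALYZED));
```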

#5: Thinking that TopDocs contained my documents.

So I understood that my query, when executed, returned a TopDocs object. I’ll return that to the caller, who can then dig in and find the desired fields, right? Wrong. TopDocs does contain an array of ScoreDoc objects, each of which holds the id of a relevant document. But to use that id to find out anything else about the document, you need the IndexSearcher itself. So the class that interacts with the IndexSearcher should also have the responsibility of returning something more understandable, perhaps a list of domain objects.
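Something like the following sketch, where Book is a hypothetical domain class of my own and the field names are assumptions:

```java
TopDocs topDocs = searcher.search(query, 10);
List<Book> results = new ArrayList<Book>();
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    // The searcher, not TopDocs, turns a document id into a Document.
    Document doc = searcher.doc(scoreDoc.doc);
    results.add(new Book(doc.get("title"), doc.get("text")));
}
return results;
```

The caller then never needs to know that Lucene, TopDocs, or document ids exist.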

Your mistakes?

I hope someone finds value in my first mistakes. What newbie Lucene mistakes have you made?


Written by Eric Wilson

December 14, 2011 at 12:08 pm

Posted in how-to


4 Responses


  1. Hi Eric,

    Was wondering if I could interview or just ask you some questions about your career change. I saw a comment you made on Stack Exchange about how you starting learning Java in your 30s and I was planning to do the same thing and was wondering how it worked out for you. Can’t seem to find a way to email you directly so I hope this is OK.

Michael

    December 18, 2011 at 12:59 pm

  2. This was very helpful. My index contains about 7 million documents. I just realized that I need to perform a range query on a field which is supposed to be numeric but I indexed it as text! Hopefully someone will read this blog before they make the same mistake 🙂

    mike

    November 18, 2012 at 2:28 am

  3. Thanks for this. I closed the IndexWriter, but only after trying to search the indexed docs! Good catches here,

    Daniel Thornton

    August 14, 2014 at 7:24 pm

  4. I would say that using NOT_ANALYZED at all was one of the mistakes we didn’t notice until quite a while later. The problem with NOT_ANALYZED is that the QueryParser doesn’t know which fields are of that type, so you end up having to configure some fields at the QueryParser to use a KeywordAnalyzer. Over time, we realised that it was more reliable just to use ANALYZED with KeywordAnalyzer at indexing-time as well.

    trejkaz

    February 2, 2017 at 11:00 pm

