Random Etc. Notes to self. Work, play, and the rest.

A quick (less certain) note on using Lucene in Processing

In the spirit of continuing our impromptu database week on Processing Blogs (my post, toxi's post, then Florian Jenett updating the sql library), I thought I'd post another quick example using Lucene.

Last week Ryan and I needed a reliable way to search inside a data set we were working with. I had previously tried and failed to write my own useful search routine for the same data, so I wanted to take a look at Lucene instead. (This week, I might have used SQLite, but I hadn't tried it last week!).

Lucene isn't a relational database like MySQL or SQLite, although it has a few similarities with the way most database engines speed up queries using indexes. That's because Lucene is "just" the indexing part. You tell Lucene about your data, one part at a time, and then construct queries to ask it which parts of your data match. The key thing is that you keep hold of your data yourself and structure it any way you like; Lucene keeps its own representation of the data for searching. Because of this it can index much more data than you can hold in memory.
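To make that idea concrete, here's a toy inverted index in plain Java — no Lucene involved, and the class and method names are just mine — showing the kind of structure a search index maintains internally: a map from each term to the documents that contain it. The documents themselves live outside the index, just as the post describes.

```java
import java.util.*;

// Toy inverted index: maps each lowercased term to the IDs of
// the documents containing it. Lucene's real index is far more
// sophisticated (positions, scoring, on-disk storage), but the
// core idea is the same.
public class TinyIndex {
    private final Map<String, Set<Integer>> index = new HashMap<>();
    private final List<String> docs = new ArrayList<>();

    // We keep the document text ourselves; the index only
    // records which terms appear in which document IDs.
    public int add(String text) {
        int id = docs.size();
        docs.add(text);
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, t -> new HashSet<>()).add(id);
            }
        }
        return id;
    }

    // Look up which document IDs contain a single term.
    public Set<Integer> search(String term) {
        return index.getOrDefault(term.toLowerCase(), Collections.emptySet());
    }

    public String doc(int id) {
        return docs.get(id);
    }
}
```

A lookup is then just a map access, which is why an index like this stays fast no matter how long the individual documents are.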

Anyway, I put up a simple example Lucene applet here that indexes the text of The Time Machine by H.G. Wells from Project Gutenberg and lets you type queries against it. The text itself and the Lucene index it builds are both quite small (tens of KB, compressed), but the applet is around 750KB. This is because Lucene's core jar file is about 500KB, so it's more suited to standalone projects and applications.

The code is kind of documented, but Lucene was really too much for me to understand properly in just one day. Nevertheless, I hope people find it useful!


2 Comments

Hello Tom,
I was playing around with the applet and I have one question. Where is the decision made to ignore some terms? As an example, if you type pretty common words like “and”, “at”, or “the” in the search box, you will get “no matches found”. Of course there are a lot of those words in the text data. Maybe too many, then?
I also noticed that you can change the number of results by changing the value of the MAX_HITS variable in the class.
Thank you for all your nice work.

Posted by Morpholux on 1 August 2007 @ 11pm

That’s done by Lucene’s StandardAnalyzer which filters out a list of stop words (amongst other things). There are different Analyzers you can use to break down your text into searchable parts. Note that the same kind of analyzer is used in the index writer and in the query parser - I’m not sure if that’s necessary, but those are the places you should look.
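For illustration, the stop-word step on its own looks something like this — a plain-Java sketch with a made-up word list, not Lucene's actual code (StandardAnalyzer ships its own, longer list and also handles tokenization, lowercasing and more):

```java
import java.util.*;

// A minimal stop-word filter: tokenize, lowercase, and drop very
// common words so they never reach the index. If queries are run
// through the same filter, a search for "the" matches nothing --
// which is exactly the behaviour seen in the applet.
public class StopWordFilter {
    // A small sample list; Lucene's built-in list is longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(
        Arrays.asList("a", "an", "and", "at", "the", "of", "to", "in"));

    public static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }
}
```

This also suggests why the writer and the query parser should share an analyzer: if they filtered or tokenized differently, query terms would no longer line up with indexed terms.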

Posted by TomC on 2 August 2007 @ 3am