A quick (less certain) note on using Lucene in Processing

1st August 2007 @ 10:02 pm
Processing, java, Programming, lucene, Data and Code
In the spirit of continuing our impromptu database week on Processing Blogs (my post, toxi's post, then Florian Jennet updating the sql library), I thought I'd post another quick example using Lucene.

Last week Ryan and I needed a reliable way to search inside a data set we were working with. I had previously tried and failed to write my own useful search routine for the same data, so I wanted to take a look at Lucene instead. (This week, I might have used SQLite, but I hadn't tried it last week!).

Lucene isn't a relational database like MySQL or SQLite, although it has a few similarities with the way most database engines speed up queries using indexes. That's because Lucene is "just" the indexing part. You tell Lucene about your data, one part at a time, and then construct queries to ask it which parts of your data match the query. The key thing is that you keep hold of your data yourself and structure it any way you like, Lucene keeps its own representation of the data for searching. Because of this it can index much more data than you can store in memory.

Anyway, I put up an simple example Lucene applet here that indexes the text of Time Machine by H.G. Wells from Project Gutenberg and lets you type queries against it. The text itself and the Lucene index it builds are both quite small (tens of KB, compressed), but the applet is around 750KB. This is because Lucene's core jar file is about 500KB so it's more suited to standalone projects and applications.

The code is kind of documented, but Lucene was really too much for me to understand properly in just one day. Nevertheless, again, I hope people find it useful!