Configure Lucene.Net with SQL Server

Yes, I've used it for exactly what you are describing. We had two services - one for read, and one for write, but only because we had multiple readers. I'm sure we could have done it with just one service (the writer) and embedded the reader in the web app and services.

I've used lucene.net as a general database indexer, so what I got back was basically DB id's (to indexed email messages), and I've also use it to get back enough info to populate search results or such without touching the database. It's worked great in both cases, tho the SQL can get a little slow, as you pretty much have to get an ID, select an ID etc. We got around this by making a temp table (with just the ID row in it) and bulk-inserting from a file (which was the output from lucene) then joining to the message table. Was a lot quicker.

Lucene isn't perfect, and you do have to think a little outside the relational database box, because it TOTALLY isn't one, but it's very very good at what it does. Worth a look, and, I'm told, doesn't have the "oops, sorry, you need to rebuild your index again" problems that MS SQL's FTI does.

BTW, we were dealing with 20-50million emails (and around 1 million unique attachments), totaling about 20GB of lucene index I think, and 250+GB of SQL database + attachments.

Performance was fantastic, to say the least - just make sure you think about, and tweak, your merge factors (when it merges index segments). There is no issue in having more than one segment, but there can be a BIG problem if you try to merge two segments which have 1mil items in each, and you have a watcher thread which kills the process if it takes too long..... (yes, that kicked our arse for a while). So keep the max number of documents per thinggie LOW (ie, dont set it to maxint like we did!)

EDIT Corey Trager documented how to use Lucene.NET in BugTracker.NET here.

I have not done it against database yet, your question is kinda open.

If you want to search an db, and can choose to use Lucene, I also guess that you can control when data is inserted to the database. If so, there is little reason to poll the db to find out if you need to reindex, just index as you insert, or create an queue table which can be used to tell lucene what to index.

I think we don't need another indexer that is ignorant about what it is doing, and reindexing everytime, or uses resources wasteful.

I have used lucene.net also as storage engine, because it's easier to distribute and setup alternate machines with an index than a database, it's just a filesystem copy, you can index on one machine, and just copy the new files to the other machines to distribute the index. All the searches and details are shown from the lucene index, and the database is just used for editing. This setup has been proven as a very scalable solution for our needs.

Regarding the differences between sql server and lucene, the principal problem with sql server 2005 full text search is that the service is decoupled from the relational engine, so joins, orders, aggregates and filter between the full text results and the relational columns are very expensive in performance terms, Microsoft claims that this issues have been addressed in sql server 2008, integrating the full text search inside the relational engine, but I don't have tested it. They also made the whole full text search much more transparent, in previous versions the stemmers, stopwords, and several other parts of the indexing where like a black box and difficult to understand, and in the new version are easier to see how they works.

With my experience, if sql server meet your requirements, it will be the easiest way, if you expect a lot of growth, complex queries or need a big control of the full text search, you might consider working with lucene from the start because it will be easier to scale and personalise.

I used Lucene.NET along with MySQL. My approach was to store primary key of db record in Lucene document along with indexed text. In pseudo code it looks like:

Store record:

insert text, other data to the table
get latest inserted ID
create lucene document
put (ID, text) into lucene document update lucene index
Querying
search lucene index
for each lucene doc in result set load data from DB by stored record's ID

Just to note, I switched from Lucene to Sphinx due to it superb performance

Configure Lucene.Net with SQL Server

Related

Recent Posts