Full-Text Search for Database Using Lucene Search Engine
We would like to show you how to create full-text index, thus making your database searchable. MySQL has full-text indexing capabilities built in, but if you want really powerful search which scales you’ll have to do it yourself.
This post is based on real experience of optimizing index and search for 30 million documents database.
By employing these optimization techniques we were able to speed up indexing time from 15 seconds per 1000 rows, to 1 second. After optimizing queries and index structure, searches ran 7-10 times faster as well.
Lucene is great and very flexible search engine, but if not properly used it may not perform very well during indexing and search. Especially when number of rows counts to millions.
As an example we take abstract database with user info and decide on how to create optimal index structure to speed up search and indexing time.
Database structure
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | CREATE TABLE users ( ID bigint(20) UNSIGNED NOT NULL AUTO_INCREMENT, Interests varchar(255) NOT NULL DEFAULT '', PhotoExists tinyint(1) NOT NULL DEFAULT '0', TS timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP, Verified tinyint(1) NOT NULL DEFAULT '0', MYSPACE varchar(100) NOT NULL DEFAULT '', YAHOO varchar(100) NOT NULL DEFAULT '', SKYPE varchar(100) NOT NULL DEFAULT '', ICQ varchar(100) NOT NULL DEFAULT '', GOOGLE varchar(100) NOT NULL DEFAULT '', JABBER varchar(100) NOT NULL DEFAULT '', Sex char(1) NOT NULL DEFAULT '', State varchar(100) NOT NULL DEFAULT '', Lang varchar(100) NOT NULL DEFAULT '', Fullname varchar(254) NOT NULL DEFAULT '', Country varchar(100) NOT NULL DEFAULT '', City varchar(100) NOT NULL DEFAULT '', Birthday date NOT NULL DEFAULT '0000-00-00', About text NOT NULL, PRIMARY KEY (ID) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; |
Requirements
Let us define search requirements to our index:
- We need to allow user to search by particular fields, and run multi-field searches over all text fields we have
- We want to search by date range to find all users in age range, and perform YES/NO searches over boolean fields
- We also want to be able to find all users who use particular IM system or systems
- It should be possible to index millions of records
- Indexing must be fast and efficient
Here we will also have to decide on how to create optimal index structure and what to sacrifice, disk space or search time.
I am always inclined to spend more disk space, but keep searches as fast as possible.
Text Field
The main rule for optimizing search engines is to reduce number of hard drive seeks to minimum.
As you can see database has many fields and if we mirrored each field in separate lucene field, it wouldn’t be optimal.
First way to optimize index structure is to use combined field. We copy text from all text fields into one field called “content”.
The benefit of having one combined field is that we keep all text we need to search with OR logic in one place. It increases index size, but will pay off big during search because searches will run many times faster.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | // Create new document doc = new Document(); doc.add(new Field("ID",rs.getString("ID"), Field.Store.YES,Field.Index.NO)); // Join all searchable field into one String joinedText = ""; for (int j = 0; j < fields_combined.length; j++) { value = rs.getString(fields_combined[j]); if(value.length() > 0){ joinedText += value + "\n"; } } // Add text to combined search field doc.add(new Field("content", joinedText, Field.Store.NO, Field.Index.TOKENIZED)); writer.addDocument(doc); |
Boolean Fields
We need to be able to find all users who use particular instant messenger and for this we will need to index this information as well. But instead of creating separate field for each IM (for example icq:Y or jabber:N) and then using query like
+icq:Y +jabber:Y
we will create one field called IM, and save unique codes for each client separated with space.
This way for user who has ICQ and Yahoo we’ll index string “ICQ YAH” in field “IM”
Search query will be rewritten and will look like IM:ICQ for one term search, or IM:(ICQ YAH) for multiple term search.
This also gives us all freedom to perform boolean searches as we want.
As a result we have 1 field instead of 6, much less drive seeks and much faster indexing and search.
Optimization Techniques Summary
- Most obvious optimization - always call writer.optimize() after index update. This will merge index into one segment, and only one Searcher will be created during search.
- Use combined search field for all text fields instead (or on the top) of indexing them separately and searching with complex query like
field1:query OR field2:query ... OR fieldN:query - Reducing number of field make indexing and search much faster. Use combined field instead or on the top of separate fields if needed
- Do not use compound file format during indexing.
writer.setUseCompoundFile(false);If you want to keep index in single file, turn it on when indexing is complete. - Do not create Searcher and Analyzer for each search. When you create Searcher, it reads small dictionary into memory to speed up term lookups. It takes time. Keep Searcher opened and reuse it for every search. When index changes make sure to recreate you Searcher.
- Increase heap size for the java virtual machine
- Use quotes “” only to perform exact phrase searches. field:term is much better than field:”term”
Merge Factor
Lucene’s IndexWriter has parameter called Merge Factor. This is a number of segments writer keeps in memory (RAMDirectory) before flushing them to disk. Lucene documentation claims that if you want to speed up indexing, you should increase Merge Factor. Though it sounds logical, there is not much to gain here. And drawback with using big Merge Factor is that there could be too many open files. Keeping it to default 10 or 20 turns out to be the most reasonable choice.