Optimization of MySQL search using "like" and wildcards
Solution 1:
How long are your strings?
If they are relatively short (e.g. English words; avg_len=5) and you have database storage to spare, try this approach:
- For each word that you want to store in the table, instead take every possible suffix of that word. In other words, you keep stripping the first character until nothing is left. For example, the word
value
gives:value
alue
lue
ue
e
- Store each of these suffixes in the database.
- You can now search for substrings using
LIKE 'alu%'
(which will find 'alu' as part of 'value').
By storing all suffixes, you have removed the need for the leading wildcard (allowing an index to be used for fast lookup), at the cost of storage space.
Storage Cost
The number of characters required to store a word becomes word_len*word_len / 2
, i.e. quadratic in the word length, on a per-word basis. Here is the factor of increase for various word sizes:
- 3-letter word:
(3*3/2) / 3 = 1.5
- 5-letter word:
(5*5/2) / 5 = 2.5
- 7-letter word:
(7*7/2) / 7 = 3.5
- 12-letter word:
(12*12/2) / 12 = 6
The number of rows required to store a word increases from 1 to word_len
. Be mindful of this overhead. Additional columns should be kept to a minimum to avoid storing large amounts of redundant data. For instance, a page number on which the word was originally found should be fine (think unsigned smallint), but extensive metadata on the word should be stored in a separate table on a per-word basis, rather than for each suffix.
Considerations
There is a trade-off in where we split 'words' (or fragments). As a real-world example: what do we do with hyphens? Do we store the adjective five-letter
as one word or two?
The trade-off is as follows:
- Anything that is broken up cannot be found as a single element. If we store
five
andletter
separately, searching forfive-letter
orfiveletter
will fail. - Anything that is not broken up will take more storage space. Remember, the storage requirement increases quadratically in the word length.
For convenience, you might want to remove the hyphen and store fiveletter
. The word can now be found by searching five
, letter
, and fiveletter
. (If you strip hyphens from any search query as well, users can still successfully find five-letter
.)
Finally, there are ways of storing suffix arrays that do not incur much overhead, but I am not yet sure if they translate well to databases.
Solution 2:
Two ways:
(1) use an in-memory table so it goes very fast.
(2) cook up a better index and search algorithm than foo LIKE '%bar%'
. It's not possible to make any suggestions about this without knowing more about your problem.
As you have pointed out, the %bar% pattern guarantees a table-scan for every lookup, which nullifies any possible search ingenuity in the database software.