How to wisely combine shingles and edgeNgram to provide flexible full text search?
This is an interesting use case. Here's my take:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "my_edge_ngram_analyzer": {
          "tokenizer": "my_edge_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "my_reverse_edge_ngram_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "reverse", "substring", "reverse"]
        },
        "lowercase_keyword": {
          "type": "custom",
          "filter": ["lowercase"],
          "tokenizer": "keyword"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "25"
        },
        "my_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "25"
        }
      },
      "filter": {
        "substring": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 25
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_ngram_analyzer",
          "fields": {
            "starts_with": {
              "type": "string",
              "analyzer": "my_edge_ngram_analyzer"
            },
            "ends_with": {
              "type": "string",
              "analyzer": "my_reverse_edge_ngram_analyzer"
            },
            "exact_case_insensitive_match": {
              "type": "string",
              "analyzer": "lowercase_keyword"
            }
          }
        }
      }
    }
  }
}
- my_ngram_analyzer is used to split every text into small pieces; how large the pieces are depends on your use case (for testing purposes, I chose 25 chars). The lowercase filter is there since you said case-insensitive. Basically, this is the analyzer used for substringof('table 1', name). The query is simple:
{
  "query": {
    "term": {
      "text": {
        "value": "table 1"
      }
    }
  }
}
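To make the matching concrete, here is a rough Python sketch (my own illustration, not Elasticsearch code) of the tokens the nGram tokenizer plus the lowercase filter would emit. The term query above matches because "table 1" itself ends up as one of the indexed tokens:

```python
def ngrams(text, min_gram=2, max_gram=25):
    # Emulate the nGram tokenizer: every substring of the whole input
    # whose length is between min_gram and max_gram characters.
    # Since the tokenizer sees the raw input, substrings spanning
    # spaces are included too.
    text = text.lower()  # the lowercase filter
    return [text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)]

tokens = ngrams("Name of Table 1")
# "table 1" is among the tokens, so a term query for "table 1" matches
```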
- my_edge_ngram_analyzer is used to split the text starting from the beginning, and this is specifically used for the startswith(name, 'table 1') use case. Again, the query is simple:
{
  "query": {
    "term": {
      "text.starts_with": {
        "value": "table 1"
      }
    }
  }
}
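The edge n-gram case is easier to picture: the tokenizer emits only prefixes of the input. A minimal Python sketch (again just an illustration, not an Elasticsearch API):

```python
def edge_ngrams(text, min_gram=2, max_gram=25):
    # Emulate the edgeNGram tokenizer: prefixes of the input,
    # anchored at the start, from min_gram up to max_gram characters.
    text = text.lower()  # the lowercase filter
    return [text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)]

tokens = edge_ngrams("Table 1 top")
# "table 1" is one of the prefixes, so startswith matches
```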
- I found this the most tricky part - the one for endswith(name, 'table 1'). For this I defined my_reverse_edge_ngram_analyzer, which uses a keyword tokenizer together with lowercase and an edgeNGram filter preceded and followed by a reverse filter. What this analyzer basically does is split the text into edgeNGrams, but the edge is the end of the text, not the start (as with the regular edgeNGram). The query:
{
  "query": {
    "term": {
      "text.ends_with": {
        "value": "table 1"
      }
    }
  }
}
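The reverse/edgeNGram/reverse trick can be hard to visualize, so here is a Python sketch of the filter chain (my own illustration): prefixes of the reversed string, reversed back, are exactly the suffixes of the original string.

```python
def reverse_edge_ngrams(text, min_gram=2, max_gram=25):
    # keyword tokenizer: the whole input is one token
    rev = text.lower()[::-1]  # lowercase filter, then reverse filter
    # edgeNGram filter on the reversed token: its prefixes
    grams = [rev[:n] for n in range(min_gram, min(max_gram, len(rev)) + 1)]
    # second reverse filter: flip each gram back, yielding suffixes
    return [g[::-1] for g in grams]

tokens = reverse_edge_ngrams("my table 1")
# "table 1" is a suffix of "my table 1", so endswith matches
```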
- For the name eq 'table 1' case, a simple keyword tokenizer together with a lowercase filter should do it. The query:
{
  "query": {
    "term": {
      "text.exact_case_insensitive_match": {
        "value": "table 1"
      }
    }
  }
}
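For completeness, this last analyzer reduces to almost nothing in the same Python sketch style: the keyword tokenizer emits the whole input as a single token, which lowercase then normalizes, so the term query is an exact case-insensitive comparison.

```python
def lowercase_keyword(text):
    # keyword tokenizer: one token for the whole input; lowercase filter applied
    return [text.lower()]

# the term query compares the lowercased search value against this single token
```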
Regarding query_string, this changes the solution a bit, because I was counting on term not analyzing the input text and matching it exactly against one of the terms in the index. But this can be "simulated" with query_string if the appropriate analyzer is specified for it. The solution would be a set of queries like the following (always using that analyzer, changing only the field name):
{
  "query": {
    "query_string": {
      "query": "text.starts_with:(\"table 1\")",
      "analyzer": "lowercase_keyword"
    }
  }
}