How to wisely combine shingles and edgeNgram to provide flexible full text search?
This is an interesting use case. Here's my take:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_ngram_analyzer": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "my_edge_ngram_analyzer": {
          "tokenizer": "my_edge_ngram_tokenizer",
          "filter": ["lowercase"]
        },
        "my_reverse_edge_ngram_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase", "reverse", "substring", "reverse"]
        },
        "lowercase_keyword": {
          "type": "custom",
          "filter": ["lowercase"],
          "tokenizer": "keyword"
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "25"
        },
        "my_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "2",
          "max_gram": "25"
        }
      },
      "filter": {
        "substring": {
          "type": "edgeNGram",
          "min_gram": 2,
          "max_gram": 25
        }
      }
    }
  },
  "mappings": {
    "test_type": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "my_ngram_analyzer",
          "fields": {
            "starts_with": {
              "type": "string",
              "analyzer": "my_edge_ngram_analyzer"
            },
            "ends_with": {
              "type": "string",
              "analyzer": "my_reverse_edge_ngram_analyzer"
            },
            "exact_case_insensitive_match": {
              "type": "string",
              "analyzer": "lowercase_keyword"
            }
          }
        }
      }
    }
  }
}
- my_ngram_analyzer is used to split every text into small pieces; how large the pieces are depends on your use case (for testing purposes, I chose 25 chars). The lowercase filter is there since you said case-insensitive. Basically, this is the analyzer used for substringof('table 1', name). The query is simple:
{
  "query": {
    "term": {
      "text": {
        "value": "table 1"
      }
    }
  }
}
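To make the matching concrete, here is a rough Python sketch (my own illustration, not Elasticsearch code) of the tokens the nGram tokenizer plus the lowercase filter would emit. The term query above matches because "table 1" itself ends up as one of the indexed tokens:

```python
def ngrams(text, min_gram=2, max_gram=25):
    # Emulate the nGram tokenizer: every substring of the whole input
    # whose length is between min_gram and max_gram characters.
    # Since the tokenizer sees the raw input, substrings spanning
    # spaces are included too.
    text = text.lower()  # the lowercase filter
    return [text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)]

tokens = ngrams("Name of Table 1")
# "table 1" is among the tokens, so a term query for "table 1" matches
```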
- my_edge_ngram_analyzer is used to split the text starting from the beginning, and this is specifically used for the startswith(name, 'table 1') use case. Again, the query is simple:
{
  "query": {
    "term": {
      "text.starts_with": {
        "value": "table 1"
      }
    }
  }
}
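The edge n-gram case is easier to picture: the tokenizer emits only prefixes of the input. A minimal Python sketch (again just an illustration, not an Elasticsearch API):

```python
def edge_ngrams(text, min_gram=2, max_gram=25):
    # Emulate the edgeNGram tokenizer: prefixes of the input,
    # anchored at the start, from min_gram up to max_gram characters.
    text = text.lower()  # the lowercase filter
    return [text[:n] for n in range(min_gram, min(max_gram, len(text)) + 1)]

tokens = edge_ngrams("Table 1 top")
# "table 1" is one of the prefixes, so startswith matches
```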
- I found this the most tricky part - the one for endswith(name, 'table 1'). For this I defined my_reverse_edge_ngram_analyzer, which uses a keyword tokenizer together with lowercase and an edgeNGram filter preceded and followed by a reverse filter. What this analyzer basically does is split the text into edgeNGrams, but the edge is the end of the text, not the start (as with the regular edgeNGram). The query:
{
  "query": {
    "term": {
      "text.ends_with": {
        "value": "table 1"
      }
    }
  }
}
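The reverse/edgeNGram/reverse trick can be hard to visualize, so here is a Python sketch of the filter chain (my own illustration): prefixes of the reversed string, reversed back, are exactly the suffixes of the original string.

```python
def reverse_edge_ngrams(text, min_gram=2, max_gram=25):
    # keyword tokenizer: the whole input is one token
    rev = text.lower()[::-1]  # lowercase filter, then reverse filter
    # edgeNGram filter on the reversed token: its prefixes
    grams = [rev[:n] for n in range(min_gram, min(max_gram, len(rev)) + 1)]
    # second reverse filter: flip each gram back, yielding suffixes
    return [g[::-1] for g in grams]

tokens = reverse_edge_ngrams("my table 1")
# "table 1" is a suffix of "my table 1", so endswith matches
```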
- For the name eq 'table 1' case, a simple keyword tokenizer together with a lowercase filter should do it. The query:
{
  "query": {
    "term": {
      "text.exact_case_insensitive_match": {
        "value": "table 1"
      }
    }
  }
}
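For completeness, this last analyzer reduces to almost nothing in the same Python sketch style: the keyword tokenizer emits the whole input as a single token, which lowercase then normalizes, so the term query is an exact case-insensitive comparison.

```python
def lowercase_keyword(text):
    # keyword tokenizer: one token for the whole input; lowercase filter applied
    return [text.lower()]

# the term query compares the lowercased search value against this single token
```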
Regarding query_string, this changes the solution a bit, because I was counting on term not analyzing the input text and matching it exactly against one of the terms in the index. But this can be "simulated" with query_string if the appropriate analyzer is specified for it. The solution would be a set of queries like the following (always using that analyzer, changing only the field name):
{
  "query": {
    "query_string": {
      "query": "text.starts_with:(\"table 1\")",
      "analyzer": "lowercase_keyword"
    }
  }
}