Elasticsearch - query_string with wildcards
It is true indeed that http://www.google.com will be tokenized by the standard analyzer into http and www.google.com, and thus google.com will not be found.
So the standard analyzer alone will not help here; we need a token filter that correctly transforms URL tokens. If your text field only contained URLs, another option would have been the UAX URL email tokenizer, but since the field can contain any other text (e.g. user comments), that won't work.
Fortunately, there's a new plugin around called analysis-url which provides a URL token filter, and this is exactly what we need (after a small modification I begged for, thanks @jlinn ;-) )
First, you need to install the plugin:
bin/plugin install https://github.com/jlinn/elasticsearch-analysis-url/releases/download/v2.2.0/elasticsearch-analysis-url-2.2.0.zip
Then, we can start playing. We need to create the proper analyzer for your text field:
curl -XPUT localhost:9200/urls -d '{
  "settings": {
    "analysis": {
      "filter": {
        "url_host": {
          "type": "url",
          "part": "host",
          "url_decode": true,
          "passthrough": true
        }
      },
      "analyzer": {
        "url_host": {
          "filter": [
            "url_host"
          ],
          "tokenizer": "whitespace"
        }
      }
    }
  },
  "mappings": {
    "url": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "url_host"
        }
      }
    }
  }
}'
With this analyzer and mapping, we can properly index the host you want to be able to search for. For instance, let's analyze the string blabla bla http://www.google.com blabla using our new analyzer:
curl -XGET 'localhost:9200/urls/_analyze?analyzer=url_host&pretty' -d 'blabla bla http://www.google.com blabla'
We'll get the following tokens:
{
  "tokens" : [ {
    "token" : "blabla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "bla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "www.google.com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "google.com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "com",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "blabla",
    "start_offset" : 0,
    "end_offset" : 0,
    "type" : "word",
    "position" : 5
  } ]
}
As you can see, the http://www.google.com part is tokenized into www.google.com, google.com (i.e. what you expected) and com.
So now if your searchString is google.com, you'll be able to find all the documents which have a text field containing google.com (or www.google.com).
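For completeness, a quick end-to-end check could look like the sketch below (the document id 1 and the sample comment text are made up for illustration; it assumes a local ES 2.x node with the plugin installed and the index created as above):

```shell
# Index a sample document whose text field contains a URL (hypothetical content)
curl -XPUT 'localhost:9200/urls/url/1' -d '{
  "text": "blabla bla http://www.google.com blabla"
}'

# Refresh so the document is immediately searchable
curl -XPOST 'localhost:9200/urls/_refresh'

# Search for the bare host with a query_string query; thanks to the
# url_host analyzer, this matches the document above
curl -XGET 'localhost:9200/urls/url/_search?pretty' -d '{
  "query": {
    "query_string": {
      "default_field": "text",
      "query": "google.com"
    }
  }
}'
```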