How to fuzzy match email or telephone by Elasticsearch?

An easy way to do this is to create a custom analyzer which makes use of the n-gram token filter for emails (=> see below index_email_analyzer and search_email_analyzer + email_url_analyzer for exact email matching) and edge-ngram token filter for phones (=> see below index_phone_analyzer and search_phone_analyzer).

The full index definition is available below.

PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_url_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": [ "trim" ]
        },
        "index_phone_analyzer": {
          "type": "custom",
          "char_filter": [ "digit_only" ],
          "tokenizer": "digit_edge_ngram_tokenizer",
          "filter": [ "trim" ]
        },
        "search_phone_analyzer": {
          "type": "custom",
          "char_filter": [ "digit_only" ],
          "tokenizer": "keyword",
          "filter": [ "trim" ]
        },
        "index_email_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "name_ngram_filter", "trim" ]
        },
        "search_email_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "trim" ]
        }
      },
      "char_filter": {
        "digit_only": {
          "type": "pattern_replace",
          "pattern": "\\D+",
          "replacement": ""
        }
      },
      "tokenizer": {
        "digit_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "1",
          "max_gram": "15",
          "token_chars": [ "digit" ]
        }
      },
      "filter": {
        "name_ngram_filter": {
          "type": "ngram",
          "min_gram": "1",
          "max_gram": "20"
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "index_email_analyzer",
          "search_analyzer": "search_email_analyzer"
        },
        "phone": {
          "type": "string",
          "analyzer": "index_phone_analyzer",
          "search_analyzer": "search_phone_analyzer"
        }
      }
    }
  }
}

Now, let's dissect it one bit after another.

For the phone field, the idea is to index phone values with index_phone_analyzer, which uses an edge-ngram tokenizer in order to index all prefixes of the phone number. So if your phone number is 1362435647, the following tokens will be produced: 1, 13, 136, 1362, 13624, 136243, 1362435, 13624356, 13624356, 136243564, 1362435647.

Then when searching we use another analyzer search_phone_analyzer which will simply take the input number (e.g. 136) and match it against the phone field using a simple match or term query:

POST myindex
{ 
    "query": {
        "term": 
            { "phone": "136" }
    }
}

For the email field, we proceed in a similar way, in that we index the email values with the index_email_analyzer, which uses an ngram token filter, which will produce all possible tokens of varying length (between 1 and 20 chars) that can be taken from the email value. For instance: [email protected] will be tokenized to j, jo, joh, ..., gmail.com, ..., [email protected].

Then when searching, we'll use another analyzer called search_email_analyzer which will take the input and try to match it against the indexed tokens.

POST myindex
{ 
    "query": {
        "term": 
            { "email": "@gmail.com" }
    }
}

The email_url_analyzer analyzer is not used in this example but I've included it just in case you need to match on the exact email value.

How to fuzzy match email or telephone by Elasticsearch?

Related

Recent Posts