UTF8 encoding is longer than the max length 32766

Solution 1:

So you are running into an issue with the maximum size for a single term. When you set a field to not_analyzed it will be treated as one single term. The maximum size for a single term in the underlying Lucene index is 32766 bytes, which I believe is hard-coded.

Your two primary options are to either change the type to binary or to continue to use string but set the index type to "no".
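As a minimal sketch of both variants in raw JSON (the index, type, and field names here are placeholders, and this uses the pre-2.x string mapping syntax that matches the not_analyzed terminology above):

curl -XPUT 'localhost:9200/my_index' -d '{
  "mappings": {
    "my_type": {
      "properties": {
        "my_binary_field":    { "type": "binary" },
        "my_unindexed_field": { "type": "string", "index": "no" }
      }
    }
  }
}'

With "index": "no" the value is still returned in _source but can no longer be searched on.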

Solution 2:

If you really want not_analyzed on the property, because you want to do some exact filtering, then you can use "ignore_above": 256.

Here is an example of how I use it in PHP:

'mapping'    => [
    'type'   => 'multi_field',
    'path'   => 'full',
    'fields' => [
        '{name}' => [
            'type'     => 'string',
            'index'    => 'analyzed',
            'analyzer' => 'standard',
        ],
        'raw' => [
            'type'         => 'string',
            'index'        => 'not_analyzed',
            'ignore_above' => 256,
        ],
    ],
],

In your case you probably want to do as John Petrone suggested and set "index": "no", but for anyone else finding this question later, like me, by searching on that exception, your options are:

  • set "index": "no"
  • set "index": "analyze"
  • set "index": "not_analyzed" and "ignore_above": 256

It depends on if and how you want to filter on that property.
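For anyone not using the PHP client, a rough sketch of those three options as a raw JSON mapping (pre-2.x string syntax; the field names are placeholders, so adapt them to your own mapping):

curl -XPUT 'localhost:9200/my_index/_mapping/my_type' -d '{
  "properties": {
    "not_searchable": { "type": "string", "index": "no" },
    "fulltext_only":  { "type": "string", "index": "analyzed" },
    "exact_filter":   { "type": "string", "index": "not_analyzed", "ignore_above": 256 }
  }
}'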

Solution 3:

There is a better option than the one John posted, because with that solution you can't search on the value anymore.

Back to the problem:

The problem is that by default a field value is indexed as a single term (the complete string). If that term/string is longer than 32766 bytes it can't be stored in Lucene.

Older versions of Lucene only register a warning when terms are too long (and ignore the value). Newer versions throw an exception. See the bug fix: https://issues.apache.org/jira/browse/LUCENE-5472

Solution:

The best option is to define a (custom) analyzer on the field with the long string value. The analyzer can split the long string into smaller strings/terms, which fixes the problem of too-long terms.

Don't forget to also add an analyzer to the "_all" field if you are using that functionality.

Analyzers can be tested with the REST API. http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html
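As a rough sketch (index, type, field, and analyzer names are placeholders), a custom analyzer can be defined in the index settings, applied to the long field, and then tested with the _analyze endpoint mentioned above:

curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "long_value_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "long_value": {
          "type": "string",
          "analyzer": "long_value_analyzer"
        }
      }
    }
  }
}'

curl 'localhost:9200/my_index/_analyze?analyzer=long_value_analyzer&text=some+very+long+value'

Note that a tokenizer only helps if the value actually contains separators it can split on; for one giant unbroken token you may still need a truncate or length filter (see Solution 5 below).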

Solution 4:

I needed to change the index part of the mapping to no instead of not_analyzed. That way the value is not indexed. It remains available in the returned document (from a search, a get, …) but I can't query it.

Solution 5:

One way of handling tokens that are over the Lucene limit is to use the truncate token filter, similar to ignore_above for keyword fields. To demonstrate, I'm using a length of 5. Elasticsearch suggests using ignore_above = 32766 / 4 = 8191, since UTF-8 characters may occupy at most 4 bytes. https://www.elastic.co/guide/en/elasticsearch/reference/6.3/ignore-above.html

curl -H'Content-Type:application/json' localhost:9200/_analyze -d'{
  "filter" : [{"type": "truncate", "length": 5}],
  "tokenizer": {
    "type":    "pattern"
  },
  "text": "This movie \n= AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}'

Output:

{
  "tokens": [
    {
      "token": "This",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "movie",
      "start_offset": 5,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "AAAAA",
      "start_offset": 14,
      "end_offset": 52,
      "type": "word",
      "position": 2
    }
  ]
}
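The _analyze call above is ad hoc; to apply the same idea at index time you could bake the truncate filter into a custom analyzer in the index settings, roughly like this (index, analyzer, and field names are placeholders, the 6.x _doc/text syntax is assumed, and 8191 follows the ignore_above guidance quoted above):

curl -XPUT -H'Content-Type:application/json' localhost:9200/my_index -d'{
  "settings": {
    "analysis": {
      "filter": {
        "term_truncate": { "type": "truncate", "length": 8191 }
      },
      "analyzer": {
        "truncating_analyzer": {
          "type": "custom",
          "tokenizer": "pattern",
          "filter": ["term_truncate"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_text": { "type": "text", "analyzer": "truncating_analyzer" }
      }
    }
  }
}'

Any token longer than 8191 characters is then cut down before it reaches Lucene, so oversized values no longer trigger the exception.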