UTF8 encoding is longer than the max length 32766
Solution 1:
So you are running into an issue with the maximum size for a single term. When you set a field to not_analyzed it will be treated as one single term, and the maximum size for a single term in the underlying Lucene index is 32766 bytes, which I believe is hard coded.
Your two primary options are to either change the type to binary or to continue to use string but set the index type to "no".
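For example, a minimal sketch of the "index": "no" option (the names my_index, my_type and big_field are placeholders for this illustration, adjust them to your own mapping; use "type": "binary" instead if you go the binary route):
# hypothetical names: my_index / my_type / big_field
curl -XPUT 'localhost:9200/my_index/_mapping/my_type' -d '{
  "properties": {
    "big_field": {
      "type": "string",
      "index": "no"
    }
  }
}'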
Solution 2:
If you really want not_analyzed on the property because you want to do some exact filtering, then you can use "ignore_above": 256.
Here is an example of how I use it in PHP:
'mapping' => [
    'type' => 'multi_field',
    'path' => 'full',
    'fields' => [
        '{name}' => [
            'type' => 'string',
            'index' => 'analyzed',
            'analyzer' => 'standard',
        ],
        'raw' => [
            'type' => 'string',
            'index' => 'not_analyzed',
            'ignore_above' => 256,
        ],
    ],
],
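With a mapping like that, exact filtering can then go against the raw sub-field. A rough sketch of such a query (my_index and title are invented names for this illustration):
# hypothetical names: my_index / title.raw
curl -XPOST 'localhost:9200/my_index/_search' -d '{
  "query": {
    "term": { "title.raw": "Exact Title To Match" }
  }
}'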
In your case you probably want to do as John Petrone told you and set "index": "no", but for anyone else finding this question after searching on that Exception, like me, your options are (a sketch of each follows below):
- set "index": "no"
- set "index": "analyzed"
- set "index": "not_analyzed" and "ignore_above": 256
It depends on if and how you want to filter on that property.
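Here is roughly what those three options look like side by side as a plain JSON mapping, for readers not using the PHP client (the index, type and field names are made up for the example):
# hypothetical names: my_index / my_type / not_queryable, full_text, exact_match
curl -XPUT 'localhost:9200/my_index/_mapping/my_type' -d '{
  "properties": {
    "not_queryable": { "type": "string", "index": "no" },
    "full_text":     { "type": "string", "index": "analyzed" },
    "exact_match":   { "type": "string", "index": "not_analyzed", "ignore_above": 256 }
  }
}'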
Solution 3:
There is a better option than the one John posted, because with that solution you can't search on the value anymore.
Back to the problem:
The problem is that, by default, field values are used as a single term (the complete string). If that term/string is longer than 32766 bytes it can't be stored in Lucene.
Older versions of Lucene only register a warning when terms are too long (and ignore the value); newer versions throw an exception. See the bugfix: https://issues.apache.org/jira/browse/LUCENE-5472
Solution:
The best option is to define a (custom) analyzer on the field with the long string value. The analyzer can split the long string into smaller strings/terms, which fixes the problem of too-long terms.
Don't forget to also add an analyzer to the "_all" field if you are using that functionality.
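A rough sketch of what that can look like (all names here, long_text_index, my_type, long_text_analyzer and big_field, are placeholders; the analyzer shown is just the standard tokenizer plus lowercase, pick whatever suits your data):
# hypothetical names: long_text_index / my_type / long_text_analyzer / big_field
curl -XPUT 'localhost:9200/long_text_index' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "long_text_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "_all": { "analyzer": "long_text_analyzer" },
      "properties": {
        "big_field": { "type": "string", "analyzer": "long_text_analyzer" }
      }
    }
  }
}'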
Analyzers can be tested with the REST API: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-analyze.html
Solution 4:
I needed to change the "index" part of the mapping to "no" instead of "not_analyzed". That way the value is not indexed: it remains available in the returned document (from a search, a get, …) but I can't query it.
Solution 5:
One way of handling tokens that are over the Lucene limit is to use the truncate token filter, similar to ignore_above for keywords. Elasticsearch suggests using ignore_above = 32766 / 4 = 8191, since UTF-8 characters may occupy at most 4 bytes:
https://www.elastic.co/guide/en/elasticsearch/reference/6.3/ignore-above.html
To demonstrate the truncate filter, I'm using a length of 5.
curl -H'Content-Type:application/json' localhost:9200/_analyze -d'{
  "filter": [{"type": "truncate", "length": 5}],
  "tokenizer": {
    "type": "pattern"
  },
  "text": "This movie \n= AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"
}'
Output:
{
  "tokens": [
    {
      "token": "This",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "movie",
      "start_offset": 5,
      "end_offset": 10,
      "type": "word",
      "position": 1
    },
    {
      "token": "AAAAA",
      "start_offset": 14,
      "end_offset": 52,
      "type": "word",
      "position": 2
    }
  ]
}
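To apply this at index time rather than only in _analyze, the truncate filter can be wired into a custom analyzer in the index settings. A sketch along those lines (the index, filter, analyzer and field names are made up for this example; the length of 8191 follows the ignore_above reasoning above, since 8191 characters occupy at most 32764 bytes in UTF-8):
# hypothetical names: my_index / truncate_to_limit / truncating_analyzer / description
curl -H'Content-Type: application/json' -XPUT 'localhost:9200/my_index' -d'{
  "settings": {
    "analysis": {
      "filter": {
        "truncate_to_limit": { "type": "truncate", "length": 8191 }
      },
      "analyzer": {
        "truncating_analyzer": {
          "type": "custom",
          "tokenizer": "pattern",
          "filter": ["truncate_to_limit"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "description": { "type": "text", "analyzer": "truncating_analyzer" }
      }
    }
  }
}'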