Elasticsearch: immense term (whose UTF8 encoding is longer than the max length 32766)


[2020-06-29T22:33:06,994][WARN ][logstash.outputs.elasticsearch] Could not index event to Elasticsearch. {:status=>400, :action=>["update", {:_id=>"w_2890", :_index=>"allworks_20190825", :_type=>"_doc", :routing=>nil, :retry_on_conflict=>1}, #<LogStash::Event:0x17e7bc27>], :response=>{"update"=>{"_index"=>"allworks_20190825", "_type"=>"_doc", "_id"=>"w_2890", "status"=>400, "error"=>{"type"=>"illegal_argument_exception", "reason"=>"Document contains at least one immense term in field="content" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: "[10, -27, -113, -92, -25, -79, -115, -26, -100, -119, -28, -70, -111, -17, -68, -116, -25, -101, -72, -28, -68, -96, -27, -101, -101, -25, -91, -98, -23, -99]...", original message: bytes can be at most 32766 in length; got 38080", "caused_by"=>{"type"=>"max_bytes_length_exceeded_exception", "reason"=>"bytes can be at most 32766 in length; got 38080"}}}}}
 

When indexing data, you may run into an error like the following:

java.lang.IllegalArgumentException: Document contains at least one immense term in field="reqParams.data" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: "[123, 34, 98, 114, 111, 97, 100, 99, 97, 115, 116, 73, 100, 34, 58, 49, 52, 48, 56, 49, 57, 57, 57, 56, 56, 44, 34, 116, 121, 112]...", original message: bytes can be at most 32766 in length
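
The "prefix of the first immense term" in these messages is a list of signed Java byte values, so it can be turned back into text to see which content is causing the problem. The following is a minimal sketch; the helper name decode_term_prefix is made up for illustration and is not part of Elasticsearch or Logstash.

```python
def decode_term_prefix(signed_bytes):
    """Turn the signed Java byte values from the error message back into text."""
    # Java bytes range from -128 to 127; masking yields the unsigned 0..255 value.
    raw = bytes(b & 0xFF for b in signed_bytes)
    # The prefix is truncated, so the last character may be incomplete;
    # errors="replace" substitutes U+FFFD for any partial sequence.
    return raw.decode("utf-8", errors="replace")


# Prefix copied from the java.lang.IllegalArgumentException message above.
prefix = [123, 34, 98, 114, 111, 97, 100, 99, 97, 115, 116, 73, 100, 34, 58,
          49, 52, 48, 56, 49, 57, 57, 57, 56, 56, 44, 34, 116, 121, 112]

print(decode_term_prefix(prefix))  # {"broadcastId":1408199988,"typ
```

Here the offending value starts with a JSON object, which suggests that a whole serialized payload is being indexed as one single term.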

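Cause analysis: the document contains a huge term that exceeds the maximum Lucene can handle (32766 bytes), so Lucene refuses to index it and throws an exception. The error message is explicit: the term is too large, more than 32766 bytes once UTF-8 encoded. A term is the smallest unit used for searching, and an overly long term is generally of little value: who would ever search for an exact match on a 100-character keyword? Normally the user enters a key phrase, the search engine tokenizes it into a series of terms, matches those terms against the inverted index of the existing documents, scores the hits, and returns the results. Terms are therefore usually short, and a term of around 32766 bytes is useless for search even if it could be stored. So if such an overlong value can simply be left out of the index, the immense term problem is solved.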
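
Because the limit is measured in UTF-8 bytes and not in characters, a value can trip it with far fewer than 32766 characters. The check below is only a client-side sketch for spotting such values before they are sent; the names LUCENE_MAX_TERM_BYTES, utf8_length and is_immense_term are assumptions for illustration.

```python
LUCENE_MAX_TERM_BYTES = 32766  # limit quoted in the error message


def utf8_length(value: str) -> int:
    """Number of bytes the string occupies once UTF-8 encoded."""
    return len(value.encode("utf-8"))


def is_immense_term(value: str) -> bool:
    """True if the value, indexed as one term, would exceed the byte limit."""
    return utf8_length(value) > LUCENE_MAX_TERM_BYTES


ascii_value = "a" * 32766  # 1 byte per character
cjk_value = "古" * 12000   # 3 bytes per character

print(len(ascii_value), utf8_length(ascii_value), is_immense_term(ascii_value))
# 32766 32766 False
print(len(cjk_value), utf8_length(cjk_value), is_immense_term(cjk_value))
# 12000 36000 True
```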

Solution: Elasticsearch already provides a fix for this, namely setting the ignore_above property when creating the mapping, as follows:

{
  "mappings": {
    "tweet": {
      "properties": {
        "message": {
          "type": "text",
          "fields": {
            "full": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}

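In this mapping the message field is analyzed (tokenized) for full-text search, while the message.full sub-field is indexed as a single untokenized term. When a value is longer than 256 characters, message.full skips it entirely: the value is not truncated to its first 256 characters, it is simply not indexed in that sub-field, although it still remains in _source. Note that ignore_above counts characters while the Lucene limit of 32766 counts bytes, so for text with many non-ASCII characters a value of at most 32766 / 4 = 8191 is safe.
Note: ignore_above is meant for untokenized fields (keyword, formerly not_analyzed) and should not be overused.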
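
Since ignore_above is a character count and the Lucene limit is a byte count, the largest setting that is safe in the worst case can be derived from the fact that a UTF-8 character takes at most 4 bytes. The sketch below computes that value and builds the field mapping as a plain dictionary; the function names are hypothetical, and sending the mapping to a cluster is left out.

```python
import json

LUCENE_MAX_TERM_BYTES = 32766  # byte limit from the error message
MAX_UTF8_BYTES_PER_CHAR = 4    # worst case for a single UTF-8 character


def safe_ignore_above(max_term_bytes=LUCENE_MAX_TERM_BYTES):
    """Largest character count that cannot exceed the byte limit."""
    return max_term_bytes // MAX_UTF8_BYTES_PER_CHAR


def keyword_subfield_mapping(field, subfield, ignore_above):
    """Properties for a text field with an untokenized keyword sub-field."""
    return {
        "properties": {
            field: {
                "type": "text",
                "fields": {
                    subfield: {"type": "keyword", "ignore_above": ignore_above}
                },
            }
        }
    }


limit = safe_ignore_above()
print(limit)  # 8191
print(json.dumps(keyword_subfield_mapping("message", "full", limit), indent=2))
```

A smaller value such as 256 is usually the better choice when the sub-field is only used for exact matching, sorting, or aggregations; 8191 is merely the upper bound that stays below the byte limit.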

 

 

https://github.com/DimonHo/DH_Note/issues/4
