Elasticsearch: why does a synonym filter change start_offset?
The synonym rule is configured as follows:
托尼-克罗斯=>托尼-克罗斯,克罗斯,托尼克罗斯,托尼,tk
The index settings are as follows:
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_synonym": {
            "type": "synonym",
            "synonyms_path": "my_synonym.txt",
            "lenient": "true"
          }
        },
        "analyzer": {
          "my_ik_analyzer": {
            "filter": [
              "my_synonym"
            ],
            "type": "custom",
            "tokenizer": "my_ik_token"
          }
        },
        "tokenizer": {
          "my_ik_token": {
            "type": "ik_max_word"
          }
        }
      }
    }
  }
}
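To reproduce the two results compared in this question, the same text can be sent to the `_analyze` API twice, once naming only the tokenizer and once naming the full analyzer. The following is a minimal sketch that only builds the request URL and JSON body; the host `http://localhost:9200`, the index name `my_index`, and the helper name `build_analyze_request` are assumptions for illustration and do not come from the original post.

```python
import json


def build_analyze_request(index, text, analyzer=None, tokenizer=None,
                          host="http://localhost:9200"):
    """Build (url, json_body) for an index-scoped _analyze call.

    host and index are hypothetical placeholders; adjust them to your cluster.
    Exactly one of analyzer / tokenizer must be given.
    """
    if (analyzer is None) == (tokenizer is None):
        raise ValueError("pass exactly one of analyzer or tokenizer")
    body = {"text": text}
    if analyzer is not None:
        body["analyzer"] = analyzer
    else:
        body["tokenizer"] = tokenizer
    # ensure_ascii=False keeps the Chinese text readable in the payload
    return f"{host}/{index}/_analyze", json.dumps(body, ensure_ascii=False)


# Tokenizer only, then tokenizer plus synonym filter
tokenizer_req = build_analyze_request("my_index", "托尼-克罗斯", tokenizer="my_ik_token")
analyzer_req = build_analyze_request("my_index", "托尼-克罗斯", analyzer="my_ik_analyzer")
```

Each pair can then be sent as a POST with `Content-Type: application/json` using any HTTP client. Calling `_analyze` on the index (rather than the cluster root) is what should make the custom names `my_ik_token` and `my_ik_analyzer` resolvable.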
Analyzing 托尼-克罗斯 with the tokenizer alone (my_ik_token) gives:
{
  "tokens": [
    {
      "token": "托尼",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "克罗斯",
      "start_offset": 3,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "罗斯",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 2
    }
  ]
}
With the analyzer that adds the synonym filter (my_ik_analyzer), the result is:
{
  "tokens": [
    {
      "token": "托尼",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "克罗斯",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "托尼",
      "start_offset": 0,
      "end_offset": 2,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "托尼",
      "start_offset": 0,
      "end_offset": 6,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "tk",
      "start_offset": 0,
      "end_offset": 6,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "克罗斯",
      "start_offset": 3,
      "end_offset": 6,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "罗斯",
      "start_offset": 3,
      "end_offset": 6,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "尼克",
      "start_offset": 3,
      "end_offset": 6,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "罗斯",
      "start_offset": 4,
      "end_offset": 6,
      "type": "SYNONYM",
      "position": 2
    },
    {
      "token": "克罗斯",
      "start_offset": 4,
      "end_offset": 6,
      "type": "SYNONYM",
      "position": 2
    },
    {
      "token": "罗斯",
      "start_offset": 4,
      "end_offset": 6,
      "type": "SYNONYM",
      "position": 3
    }
  ]
}
As you can see, 克罗斯 appears twice, and in one of those occurrences the start_offset and end_offset are wrong.
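One way to make the observation concrete is to compare each token against the slice of the input text that its offsets point to. The sketch below does this for a few tokens copied from the my_ik_analyzer result; the helper name `offset_mismatches` is hypothetical. It assumes the offsets can be used directly as Python string indices, which holds here because every character is in the Basic Multilingual Plane. A mismatch is only a prompt for inspection: tokens emitted by a synonym filter may carry the offsets of the input span they were matched against rather than of their own text.

```python
def offset_mismatches(text, tokens):
    """Return (token, start, end, covered_text) for every token whose
    [start_offset, end_offset) slice of `text` differs from the token."""
    mismatches = []
    for tok in tokens:
        start, end = tok["start_offset"], tok["end_offset"]
        covered = text[start:end]
        if covered != tok["token"]:
            mismatches.append((tok["token"], start, end, covered))
    return mismatches


TEXT = "托尼-克罗斯"

# A subset of the tokens from the my_ik_analyzer output in the question
ANALYZER_TOKENS = [
    {"token": "托尼", "start_offset": 0, "end_offset": 2, "position": 0},
    {"token": "克罗斯", "start_offset": 0, "end_offset": 2, "position": 0},
    {"token": "tk", "start_offset": 0, "end_offset": 6, "position": 0},
    {"token": "克罗斯", "start_offset": 3, "end_offset": 6, "position": 1},
    {"token": "克罗斯", "start_offset": 4, "end_offset": 6, "position": 2},
]

for token, start, end, covered in offset_mismatches(TEXT, ANALYZER_TOKENS):
    print(f"{token!r} has offsets [{start}, {end}) which cover {covered!r}")
```

Running the same check against the tokenizer-only output reports nothing, which suggests the shifted offsets are introduced at the synonym filter stage and not by the tokenizer.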