ElasticSearch Analyzer and Tokenizer for Emails

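I could not find a perfect solution for the following case, either on Google or in the ES documentation, and I hope someone here can help.

Suppose five documents are indexed, each storing one or more email addresses in the "email" field: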

1. {"email": "john.doe@gmail.com"}

2. {"email": "john.doe@gmail.com, john.doe@outlook.com"}

3. {"email": "hello-john.doe@outlook.com"}

4. {"email": "john.doe@outlook.com"}

5. {"email": "john@yahoo.com"}

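I want to satisfy the following search scenarios:

[search -> documents returned]

"john.doe@gmail.com" -> 1,2

"john.doe@outlook.com" -> 2,4

"john@yahoo.com" -> 5

"john.doe" -> 1,2,3,4

"john" -> 1,2,3,4,5

"gmail.com" -> 1,2

"outlook.com" -> 2,3,4

The first three matches are mandatory; for the others, the more precise the better. I have tried different combinations of index/search analyzers, tokenizers, and filters, and I have also experimented with match queries, but I have not found an ideal solution. Any ideas are welcome, and there are no restrictions on the mapping, the analyzer, or the kind of query to use. Thanks.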

Answer:

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "email": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": [
            "([^@]+)",
            "(\\p{L}+)",
            "(\\d+)",
            "@(.+)",
            "([^-@]+)"
          ]
        }
      },
      "analyzer": {
        "email": {
          "tokenizer": "uax_url_email",
          "filter": [
            "email",
            "lowercase",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "emails": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "email"
        }
      }
    }
  }
}
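
To see why this analyzer helps, it can be useful to look at the tokens the "email" filter emits for a single address. The following is only a rough Python approximation of the pattern_capture, lowercase, and unique filters, not Elasticsearch itself. The function name `capture_tokens` is hypothetical, and because Python's `re` module does not support `\p{L}`, the class `[^\W\d_]` is swapped in as a stand-in for "any letter".

```python
import re

# Patterns copied from the "email" filter in the settings above.
# Assumption: [^\W\d_]+ approximates (\p{L}+), which Python's `re` cannot parse.
PATTERNS = [
    r"([^@]+)",      # local part and domain, split at "@"
    r"([^\W\d_]+)",  # every run of letters
    r"(\d+)",        # every run of digits
    r"@(.+)",        # everything after "@"
    r"([^-@]+)",     # pieces between "-" and "@"
]


def capture_tokens(token):
    """Approximate pattern_capture + lowercase + unique for one input token."""
    tokens = {token}  # preserve_original keeps the unmodified token
    for pattern in PATTERNS:
        for match in re.finditer(pattern, token):
            tokens.add(match.group(1))
    # lowercase, then rely on the set for uniqueness
    return {t.lower() for t in tokens}


if __name__ == "__main__":
    for t in sorted(capture_tokens("hello-john.doe@outlook.com")):
        print(t)
```

Under these assumptions, document 3 is indexed under tokens such as `john.doe`, `john`, and `outlook.com`, but not under `john.doe@outlook.com`, which matches the requirement that a search for that full address should return only documents 2 and 4.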

POST /test/emails/_bulk
{"index":{"_id":"1"}}
{"email": "john.doe@gmail.com"}
{"index":{"_id":"2"}}
{"email": "john.doe@gmail.com, john.doe@outlook.com"}
{"index":{"_id":"3"}}
{"email": "hello-john.doe@outlook.com"}
{"index":{"_id":"4"}}
{"email": "john.doe@outlook.com"}
{"index":{"_id":"5"}}
{"email": "john@yahoo.com"}

GET /test/emails/_search
{
  "query": {
    "term": {
      "email": "john.doe@gmail.com"
    }
  }
}
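
The term query above compares the search string against the indexed tokens as-is, without analyzing it. As a sanity check, the seven scenarios from the question can be replayed against a small in-memory model of the index. This is a sketch under stated assumptions, not a substitute for running the requests: `analyze` and `term_search` are hypothetical helpers, splitting on commas and whitespace stands in for the `uax_url_email` tokenizer (adequate for these five inputs only), and `[^\W\d_]` again replaces `\p{L}`.

```python
import re

# Same patterns as the "email" filter; [^\W\d_]+ approximates (\p{L}+).
PATTERNS = [r"([^@]+)", r"([^\W\d_]+)", r"(\d+)", r"@(.+)", r"([^-@]+)"]

# The five documents from the bulk request.
DOCS = {
    1: "john.doe@gmail.com",
    2: "john.doe@gmail.com, john.doe@outlook.com",
    3: "hello-john.doe@outlook.com",
    4: "john.doe@outlook.com",
    5: "john@yahoo.com",
}


def analyze(text):
    """Rough stand-in for the "email" analyzer defined in the settings."""
    tokens = set()
    # Assumption: comma/whitespace splitting imitates uax_url_email here.
    for raw in re.split(r"[,\s]+", text.strip()):
        if not raw:
            continue
        tokens.add(raw)  # preserve_original
        for pattern in PATTERNS:
            tokens.update(m.group(1) for m in re.finditer(pattern, raw))
    return {t.lower() for t in tokens}  # lowercase + unique


# Map each document id to the set of tokens it would be indexed under.
INDEX = {doc_id: analyze(value) for doc_id, value in DOCS.items()}


def term_search(term):
    """Return ids of documents whose token set contains the exact term."""
    return sorted(doc_id for doc_id, tokens in INDEX.items() if term in tokens)


if __name__ == "__main__":
    for query in ["john.doe@gmail.com", "john.doe@outlook.com", "john@yahoo.com",
                  "john.doe", "john", "gmail.com", "outlook.com"]:
        print(query, "->", term_search(query))
```

In this model all seven scenarios return the document ids listed in the question. Note that a term query does not lowercase its input, so the search string should already be lowercase. A match query would presumably run the search string through the same analyzer and also search for fragments such as `com`, which is likely why it returns too many documents.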

Source: utcz.com/qa/416330.html
