How can I fuzzy-match an email or phone number with Elasticsearch?

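I want to do fuzzy matching on email addresses or phone numbers in Elasticsearch. For example:

match all emails ending with @gmail.com

or

match all phone numbers starting with 136.

I know I can use a wildcard query: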

{
  "query": {
    "wildcard": {
      "email": "*gmail.com"
    }
  }
}

But the performance is very poor. I also tried a regexp query:

{"query": {"regexp": {"email": {"value": "*163\.com*"} } } }

But it does not work.
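Two likely reasons for the failure can be illustrated in Python: the body is not valid JSON, because \. is not a legal JSON escape, and the pattern begins with a bare *, which has nothing to repeat. Python's json and re modules only stand in here for Elasticsearch's JSON parser and Lucene's regexp syntax, which are separate implementations, so this is an illustration rather than a reproduction. The corrected pattern at the end is an assumption about what was intended.

```python
import json
import re

# The request body from the question, kept character for character.
raw_body = r'{"query": {"regexp": {"email": {"value": "*163\.com*"} } } }'


def json_error(text):
    """Return the JSON parse error message, or None if the text parses."""
    try:
        json.loads(text)
        return None
    except json.JSONDecodeError as exc:
        return exc.msg


def regex_error(pattern):
    """Return the regex compile error message, or None if it compiles."""
    try:
        re.compile(pattern)
        return None
    except re.error as exc:
        return str(exc)


# 1. "\." is not a valid JSON escape; the backslash has to be doubled.
print(json_error(raw_body))

# 2. Even after parsing, a leading "*" has no preceding expression to repeat.
print(regex_error(r"*163\.com*"))

# 3. Assumed intent: any characters followed by a literal "163.com".
fixed_body = r'{"query": {"regexp": {"email": {"value": ".*163\\.com"}}}}'
fixed_pattern = json.loads(fixed_body)["query"]["regexp"]["email"]["value"]

# Lucene regexps are anchored to the whole term, hence fullmatch here.
print(bool(re.fullmatch(fixed_pattern, "someone@163.com")))  # True
```

Note that a regexp query is matched against the indexed terms, not against the original field value, so the result also depends on how the field is analyzed.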

Is there a better way?

curl -XGET localhost:9200/user_data

{

"user_data": {

"aliases": {},

"mappings": {

"user_data": {

"properties": {

"address": {

"type": "string"

},

"age": {

"type": "long"

},

"comment": {

"type": "string"

},

"created_on": {

"type": "date",

"format": "dateOptionalTime"

},

"custom": {

"properties": {

"key": {

"type": "string"

},

"value": {

"type": "string"

}

}

},

"gender": {

"type": "string"

},

"name": {

"type": "string"

},

"qq": {

"type": "string"

},

"tel": {

"type": "string"

},

"updated_on": {

"type": "date",

"format": "dateOptionalTime"

}

}

}

},

"settings": {

"index": {

"creation_date": "1458832279465",

"uuid": "Fbmthc3lR0ya51zCnWidYg",

"number_of_replicas": "1",

"number_of_shards": "5",

"version": {

"created": "1070299"

}

}

},

"warmers": {}

}

}

Mapping:

{

"settings": {

"analysis": {

"analyzer": {

"index_phone_analyzer": {

"type": "custom",

"char_filter": [ "digit_only" ],

"tokenizer": "digit_edge_ngram_tokenizer",

"filter": [ "trim" ]

},

"search_phone_analyzer": {

"type": "custom",

"char_filter": [ "digit_only" ],

"tokenizer": "keyword",

"filter": [ "trim" ]

},

"index_email_analyzer": {

"type": "custom",

"tokenizer": "standard",

"filter": [ "lowercase", "name_ngram_filter", "trim" ]

},

"search_email_analyzer": {

"type": "custom",

"tokenizer": "standard",

"filter": [ "lowercase", "trim" ]

}

},

"char_filter": {

"digit_only": {

"type": "pattern_replace",

"pattern": "\\D+",

"replacement": ""

}

},

"tokenizer": {

"digit_edge_ngram_tokenizer": {

"type": "edgeNGram",

"min_gram": "3",

"max_gram": "15",

"token_chars": [ "digit" ]

}

},

"filter": {

"name_ngram_filter": {

"type": "ngram",

"min_gram": "3",

"max_gram": "20"

}

}

}

},

"mappings" : {

"user_data" : {

"properties" : {

"name" : {

"type" : "string",

"analyzer" : "ik"

},

"age" : {

"type" : "integer"

},

"gender": {

"type" : "string"

},

"qq" : {

"type" : "string"

},

"email" : {

"type" : "string",

"analyzer": "index_email_analyzer",

"search_analyzer": "search_email_analyzer"

},

"tel" : {

"type" : "string",

"analyzer": "index_phone_analyzer",

"search_analyzer": "search_phone_analyzer"

},

"address" : {

"type": "string",

"analyzer" : "ik"

},

"comment" : {

"type" : "string",

"analyzer" : "ik"

},

"created_on" : {

"type" : "date",

"format" : "dateOptionalTime"

},

"updated_on" : {

"type" : "date",

"format" : "dateOptionalTime"

},

"custom": {

"type" : "nested",

"properties" : {

"key" : {

"type" : "string"

},

"value" : {

"type" : "string"

}

}

}

}

}

}

}

Answer:

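One simple approach is to create custom analyzers: an n-gram token filter for emails (see index_email_analyzer and search_email_analyzer below, plus email_url_analyzer for exact email matching) and an edge-ngram tokenizer for phone numbers (see index_phone_analyzer and search_phone_analyzer below).

The full index definition is given below.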

PUT myindex

{

"settings": {

"analysis": {

"analyzer": {

"email_url_analyzer": {

"type": "custom",

"tokenizer": "uax_url_email",

"filter": [ "trim" ]

},

"index_phone_analyzer": {

"type": "custom",

"char_filter": [ "digit_only" ],

"tokenizer": "digit_edge_ngram_tokenizer",

"filter": [ "trim" ]

},

"search_phone_analyzer": {

"type": "custom",

"char_filter": [ "digit_only" ],

"tokenizer": "keyword",

"filter": [ "trim" ]

},

"index_email_analyzer": {

"type": "custom",

"tokenizer": "standard",

"filter": [ "lowercase", "name_ngram_filter", "trim" ]

},

"search_email_analyzer": {

"type": "custom",

"tokenizer": "standard",

"filter": [ "lowercase", "trim" ]

}

},

"char_filter": {

"digit_only": {

"type": "pattern_replace",

"pattern": "\\D+",

"replacement": ""

}

},

"tokenizer": {

"digit_edge_ngram_tokenizer": {

"type": "edgeNGram",

"min_gram": "1",

"max_gram": "15",

"token_chars": [ "digit" ]

}

},

"filter": {

"name_ngram_filter": {

"type": "ngram",

"min_gram": "1",

"max_gram": "20"

}

}

}

},

"mappings": {

"your_type": {

"properties": {

"email": {

"type": "string",

"analyzer": "index_email_analyzer",

"search_analyzer": "search_email_analyzer"

},

"phone": {

"type": "string",

"analyzer": "index_phone_analyzer",

"search_analyzer": "search_phone_analyzer"

}

}

}

}

}

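Now, let's dissect it piece by piece.

For the phone field, the idea is to index phone values with index_phone_analyzer, which uses an edge-ngram tokenizer to index every prefix of the phone number. So if your phone number is 1362435647, the following tokens are produced: 1, 13, 136, 1362, 13624, 136243, 1362435, 13624356, 136243564, 1362435647.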
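The prefix generation described above can be approximated in a few lines of Python. This is only a sketch of what index_phone_analyzer does (the digit_only character filter followed by the edge-ngram tokenizer), not Elasticsearch code; the function name and the sample number formatting are made up for illustration.

```python
import re


def index_phone_tokens(value, min_gram=1, max_gram=15):
    """Approximate index_phone_analyzer: strip non-digits, then emit prefixes."""
    digits = re.sub(r"\D+", "", value)       # digit_only char filter
    longest = min(max_gram, len(digits))     # prefixes never exceed max_gram
    return [digits[:n] for n in range(min_gram, longest + 1)]


tokens = index_phone_tokens("136-243-5647")
print(tokens)
# ['1', '13', '136', '1362', '13624', '136243', '1362435',
#  '13624356', '136243564', '1362435647']
```

Because separators are removed before tokenizing, differently formatted versions of the same number produce the same prefixes.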

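Then, at search time, we use another analyzer, search_phone_analyzer, which simply takes the input number (e.g. 136) and matches it against the phone field using a simple match or term query. Only a match query runs the input through search_phone_analyzer; a term query looks the input up as-is, so it must already consist of digits only: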

POST myindex/_search
{
  "query": {
    "term": { "phone": "136" }
  }
}
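To see why a plain term lookup is enough once the prefixes are indexed, here is a toy inverted index in Python. It is a simplified model of the mechanism, not of Elasticsearch internals, and the sample phone numbers and helper names are invented for the example.

```python
import re
from collections import defaultdict


def phone_prefixes(value, min_gram=1, max_gram=15):
    """Digit-only prefixes, mimicking index_phone_analyzer."""
    digits = re.sub(r"\D+", "", value)
    return [digits[:n] for n in range(min_gram, min(max_gram, len(digits)) + 1)]


def build_index(phones):
    """Map every prefix token to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, phone in phones.items():
        for token in phone_prefixes(phone):
            index[token].add(doc_id)
    return index


def term_lookup(index, term):
    """Exact token lookup with no analysis, like a term query."""
    return sorted(index.get(term, set()))


# Made-up sample data: doc 3 contains 136, but not at the start.
phones = {1: "1362435647", 2: "136 0000 1111", 3: "1871362000"}
index = build_index(phones)

print(term_lookup(index, "136"))   # [1, 2]
print(term_lookup(index, "136-"))  # [] because the literal "136-" was never indexed
```

The search becomes a single dictionary lookup instead of a scan over all terms, which is where the speed-up over a wildcard query comes from; the price is a larger index.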

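For the email field, we proceed in a similar way: email values are indexed with index_email_analyzer, which uses an ngram token filter to produce every possible token of varying length (between 1 and 20 characters) that can be taken from the tokenized email value. For example, for john@gmail.com the standard tokenizer first emits john and gmail.com (it splits on the @), and the filter then expands each of them into its n-grams: j, jo, joh, john, ..., g, gm, ..., gmail.com.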
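The n-gram expansion of a single token can be sketched as follows. This models only the name_ngram_filter step (min_gram 1, max_gram 20) on one already-tokenized value; the function name is hypothetical and the order of the output is not meant to mirror Elasticsearch's.

```python
def ngrams(token, min_gram=1, max_gram=20):
    """All substrings of token whose length lies between min_gram and max_gram."""
    grams = []
    for start in range(len(token)):
        longest = min(max_gram, len(token) - start)
        for size in range(min_gram, longest + 1):
            grams.append(token[start:start + size])
    return grams


grams = ngrams("gmail.com")
print(len(grams))   # 45
print(grams[:5])    # ['g', 'gm', 'gma', 'gmai', 'gmail']
```

A nine-character token already yields 45 n-grams, so min_gram 1 makes the index considerably larger; raising min_gram (the mapping in the question uses 3) trades away very short search inputs for a smaller index.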

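Then, at search time, we use another analyzer, search_email_analyzer, which tokenizes and lowercases the input without generating n-grams, so that it can be compared directly with the indexed tokens. The search analyzer only runs for analyzed queries such as match; a term query would look up the literal string @gmail.com, which is never indexed because the standard tokenizer drops the @. A match query is therefore used here:

POST myindex/_search
{
  "query": {
    "match": { "email": "@gmail.com" }
  }
}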
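The email search flow can be modelled roughly in Python. Everything here is an approximation: rough_tokens is a crude stand-in for the standard tokenizer plus lowercase filter (the real tokenizer follows Unicode word-break rules and can split differently, for example around digits), and the sample addresses and helper names are invented.

```python
import re
from collections import defaultdict


def rough_tokens(text):
    """Crude stand-in for standard tokenizer + lowercase: splits on '@', keeps inner dots."""
    return re.findall(r"[a-z0-9]+(?:\.[a-z0-9]+)*", text.lower())


def ngrams(token, min_gram=1, max_gram=20):
    """Set of all substrings of token with length between min_gram and max_gram."""
    return {token[i:i + n]
            for i in range(len(token))
            for n in range(min_gram, min(max_gram, len(token) - i) + 1)}


def build_index(emails):
    """Index side: tokenize, then expand every token into n-grams."""
    index = defaultdict(set)
    for doc_id, email in emails.items():
        for token in rough_tokens(email):
            for gram in ngrams(token):
                index[gram].add(doc_id)
    return index


def match(index, text):
    """Search side: tokenize without n-grams, then look each token up (OR semantics)."""
    hits = set()
    for token in rough_tokens(text):
        hits |= index.get(token, set())
    return sorted(hits)


# Made-up sample data.
emails = {1: "john@gmail.com", 2: "amy@163.com", 3: "bob@notgmail.com"}
index = build_index(emails)

print(match(index, "@gmail.com"))               # [1, 3]
print(sorted(index.get("@gmail.com", set())))   # [] since the literal string is not a token
```

Document 3 matches as well because n-grams are taken at every position, so this behaves as a "contains" search rather than a strict "ends with"; if the distinction matters, an additional check on the exact domain would be needed.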

The email_url_analyzer analyzer is not used in this example, but I have included it in case you need to match the exact email value.

