Extracting keywords (multi-word) from text using Elasticsearch

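I have an index full of keywords, and I want to use those keywords to extract the keywords that occur in an input text.

Below is a sample of the keyword index. Note that a keyword can also consist of multiple words; each one is essentially a unique tag.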

{
  "hits": {
    "total": 2000,
    "hits": [
      {
        "id": 1,
        "keyword": "thousand eyes"
      },
      {
        "id": 2,
        "keyword": "facebook"
      },
      {
        "id": 3,
        "keyword": "superdoc"
      },
      {
        "id": 4,
        "keyword": "quora"
      },
      {
        "id": 5,
        "keyword": "your story"
      },
      {
        "id": 6,
        "keyword": "Surgery"
      },
      {
        "id": 7,
        "keyword": "lending club"
      },
      {
        "id": 8,
        "keyword": "ad roll"
      },
      {
        "id": 9,
        "keyword": "the honest company"
      },
      {
        "id": 10,
        "keyword": "Draft kings"
      }
    ]
  }
}

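Now, if the input text is "I saw the news of lending club on facebook, your story and quora", the search result should be the keywords lending club, facebook, your story and quora. Also, the search should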

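Answer:

There is really only one way to do this: index your data as keywords, and analyze the search text with shingles.

Here is a reproduction:

First, we create two custom analyzers, one based on the keyword tokenizer and one that produces shingles: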

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "asciifolding",
            "lowercase"
          ]
        },
        "my_analyzer_shingle": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "asciifolding",
            "lowercase",
            "shingle"
          ]
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "keyword": {
          "type": "string",
          "index_analyzer": "my_analyzer_keyword",
          "search_analyzer": "my_analyzer_shingle"
        }
      }
    }
  }
}
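
To make the role of the shingle filter more concrete, here is a minimal Python sketch of what it emits for a list of word tokens. The function name shingles and its parameters are hypothetical and only mirror the filter's min_shingle_size, max_shingle_size and output_unigrams settings. As far as I know, the built-in shingle filter defaults to a size of 2 for both bounds and also emits the single words, so that is what the sketch assumes; it is an approximation, not the Lucene implementation.

```python
def shingles(tokens, min_size=2, max_size=2, output_unigrams=True):
    """Approximate the terms a shingle filter would emit for word tokens.

    Assumption: defaults mirror the built-in filter (sizes 2..2, unigrams on).
    """
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        # Join n consecutive tokens starting at position i.
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


words = ["the", "honest", "company"]
print(shingles(words))
print(shingles(words, max_size=3))
```

Under these assumed defaults, a three-word keyword such as the honest company never appears as a single search term, so it would probably not match. If that matters, a custom filter of type shingle with max_shingle_size set to 3 (or to the longest keyword length you expect) may be needed in place of the built-in one.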

Now, let's create some sample documents from the data you provided:

POST /test/your_type/1
{
  "id": 1,
  "keyword": "thousand eyes"
}

POST /test/your_type/2
{
  "id": 2,
  "keyword": "facebook"
}

POST /test/your_type/3
{
  "id": 3,
  "keyword": "superdoc"
}

POST /test/your_type/4
{
  "id": 4,
  "keyword": "quora"
}

POST /test/your_type/5
{
  "id": 5,
  "keyword": "your story"
}

POST /test/your_type/6
{
  "id": 6,
  "keyword": "Surgery"
}

POST /test/your_type/7
{
  "id": 7,
  "keyword": "lending club"
}

POST /test/your_type/8
{
  "id": 8,
  "keyword": "ad roll"
}

POST /test/your_type/9
{
  "id": 9,
  "keyword": "the honest company"
}

POST /test/your_type/10
{
  "id": 10,
  "keyword": "Draft kings"
}

Finally, the query that runs the search:

POST /test/your_type/_search
{
  "query": {
    "match": {
      "keyword": "I saw the news of lending club on facebook, your story and quora"
    }
  }
}

And this is the result:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.009332742,
    "hits": [
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "2",
        "_score": 0.009332742,
        "_source": {
          "id": 2,
          "keyword": "facebook"
        }
      },
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "7",
        "_score": 0.009332742,
        "_source": {
          "id": 7,
          "keyword": "lending club"
        }
      },
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "4",
        "_score": 0.009207102,
        "_source": {
          "id": 4,
          "keyword": "quora"
        }
      },
      {
        "_index": "test",
        "_type": "your_type",
        "_id": "5",
        "_score": 0.0014755741,
        "_source": {
          "id": 5,
          "keyword": "your story"
        }
      }
    ]
  }
}
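
If you need the matches as a plain list of keywords, a small helper along these lines could pull them out of the response. This is only a sketch: it assumes the response has already been parsed into a Python dict (for example from the JSON body returned by whatever client you use), and the name extract_keywords is hypothetical. The sample dict is a trimmed copy of the result shown above.

```python
def extract_keywords(response):
    """Return the matched keywords in the order Elasticsearch ranked them."""
    hits = response.get("hits", {}).get("hits", [])
    return [hit["_source"]["keyword"] for hit in hits]


# Trimmed copy of the search result shown above.
response = {
    "hits": {
        "total": 4,
        "hits": [
            {"_id": "2", "_score": 0.009332742, "_source": {"id": 2, "keyword": "facebook"}},
            {"_id": "7", "_score": 0.009332742, "_source": {"id": 7, "keyword": "lending club"}},
            {"_id": "4", "_score": 0.009207102, "_source": {"id": 4, "keyword": "quora"}},
            {"_id": "5", "_score": 0.0014755741, "_source": {"id": 5, "keyword": "your story"}},
        ],
    }
}

print(extract_keywords(response))
```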

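So what does it do behind the scenes?

  1. It indexes each of your documents as a whole keyword (the keyword tokenizer emits the entire string as a single token). I also added the asciifolding filter, which normalizes letters (i.e. é becomes e), and the lowercase filter (for case-insensitive search). So, for example, Draft kings is indexed as draft kings.
  2. The search analyzer applies the same filters, except that its tokenizer emits individual word tokens and the shingle filter then builds shingles (combinations of adjacent tokens) on top of them. Those shingles are what match the keywords indexed in the first step.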
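
The two steps above can be imitated in plain Python to see why the match works. This is a rough simulation under stated assumptions, not what Elasticsearch actually runs: fold stands in for asciifolding plus lowercase using Unicode decomposition, a \w+ regular expression stands in for the standard tokenizer, and the names fold, index_term, search_terms and extract are all hypothetical.

```python
import re
import unicodedata

KEYWORDS = [
    "thousand eyes", "facebook", "superdoc", "quora", "your story",
    "Surgery", "lending club", "ad roll", "the honest company", "Draft kings",
]


def fold(text):
    """Rough stand-in for the asciifolding + lowercase filters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def index_term(keyword):
    """Step 1: the whole keyword becomes one normalized token."""
    return fold(keyword)


def search_terms(text, max_size=2):
    """Step 2: word tokens plus shingles of up to max_size adjacent words."""
    words = re.findall(r"\w+", fold(text))
    terms = set(words)
    for n in range(2, max_size + 1):
        for i in range(len(words) - n + 1):
            terms.add(" ".join(words[i:i + n]))
    return terms


def extract(text, keywords=KEYWORDS, max_size=2):
    """Return the keywords whose indexed term equals one of the search terms."""
    terms = search_terms(text, max_size)
    return [k for k in keywords if index_term(k) in terms]


print(extract("I saw the news of lending club on facebook, your story and quora"))
```

Unlike the real query, this sketch returns matches in index order and does no scoring; it only shows that an exact term produced at index time has to equal a term produced at search time.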

