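Extracting (multi-word) keywords from text using Elasticsearch
I have an index full of keywords, and based on those keywords I want to extract the keywords from an input text.
Below is a sample of the keyword index. Note that a keyword can also consist of several words; essentially, each keyword is a unique tag.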
{ "hits": {
"total": 2000,
"hits": [
{
"id": 1,
"keyword": "thousand eyes"
},
{
"id": 2,
"keyword": "facebook"
},
{
"id": 3,
"keyword": "superdoc"
},
{
"id": 4,
"keyword": "quora"
},
{
"id": 5,
"keyword": "your story"
},
{
"id": 6,
"keyword": "Surgery"
},
{
"id": 7,
"keyword": "lending club"
},
{
"id": 8,
"keyword": "ad roll"
},
{
"id": 9,
"keyword": "the honest company"
},
{
"id": 10,
"keyword": "Draft kings"
}
]
}
}
Now, if the input text is …, the search result should be …. In addition, the search should …
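Answer:
There is only one real way to do this. You have to index your data as keywords and search it with an analyzer that produces shingles:
See this reproduction:
First, we create two custom analyzers, one based on the keyword tokenizer and one that produces shingles: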
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer_keyword": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"asciifolding",
"lowercase"
]
},
"my_analyzer_shingle": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"asciifolding",
"lowercase",
"shingle"
]
}
}
}
},
"mappings": {
"your_type": {
"properties": {
"keyword": {
"type": "string",
"index_analyzer": "my_analyzer_keyword",
"search_analyzer": "my_analyzer_shingle"
}
}
}
}
}
Now let's create some sample documents from the data you provided:
POST /test/your_type/1
{
"id": 1,
"keyword": "thousand eyes"
}
POST /test/your_type/2
{
"id": 2,
"keyword": "facebook"
}
POST /test/your_type/3
{
"id": 3,
"keyword": "superdoc"
}
POST /test/your_type/4
{
"id": 4,
"keyword": "quora"
}
POST /test/your_type/5
{
"id": 5,
"keyword": "your story"
}
POST /test/your_type/6
{
"id": 6,
"keyword": "Surgery"
}
POST /test/your_type/7
{
"id": 7,
"keyword": "lending club"
}
POST /test/your_type/8
{
"id": 8,
"keyword": "ad roll"
}
POST /test/your_type/9
{
"id": 9,
"keyword": "the honest company"
}
POST /test/your_type/10
{
"id": 10,
"keyword": "Draft kings"
}
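As a side note, the ten separate requests above could also be sent in one call to the `_bulk` endpoint. The following is only a minimal sketch of how such a request body might be assembled; the helper name `build_bulk_body` is hypothetical, and the index and type names are the ones used in this answer.

```python
import json

KEYWORDS = [
    "thousand eyes", "facebook", "superdoc", "quora", "your story",
    "Surgery", "lending club", "ad roll", "the honest company", "Draft kings",
]


def build_bulk_body(keywords, index="test", doc_type="your_type"):
    """Build a newline-delimited _bulk body (hypothetical helper).

    Each document takes two lines: an action line naming the target
    index, type and id, followed by the document source itself.
    """
    lines = []
    for doc_id, keyword in enumerate(keywords, start=1):
        action = {"index": {"_index": index, "_type": doc_type, "_id": str(doc_id)}}
        lines.append(json.dumps(action))
        lines.append(json.dumps({"id": doc_id, "keyword": keyword}))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


body = build_bulk_body(KEYWORDS)
print(body)
```

The resulting text would be sent as the body of `POST /_bulk`.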
Finally, the query that runs the search:
POST /test/your_type/_search
{
"query": {
"match": {
"keyword": "I saw the news of lending club on facebook, your story and quora"
}
}
}
This is the result:
{ "took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 0.009332742,
"hits": [
{
"_index": "test",
"_type": "your_type",
"_id": "2",
"_score": 0.009332742,
"_source": {
"id": 2,
"keyword": "facebook"
}
},
{
"_index": "test",
"_type": "your_type",
"_id": "7",
"_score": 0.009332742,
"_source": {
"id": 7,
"keyword": "lending club"
}
},
{
"_index": "test",
"_type": "your_type",
"_id": "4",
"_score": 0.009207102,
"_source": {
"id": 4,
"keyword": "quora"
}
},
{
"_index": "test",
"_type": "your_type",
"_id": "5",
"_score": 0.0014755741,
"_source": {
"id": 5,
"keyword": "your story"
}
}
]
}
}
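To turn this response into a plain list of extracted keywords, only the `_source.keyword` field of each hit is needed. A minimal sketch, assuming the response has already been received as JSON text (the copy below is trimmed to the fields that are used, and `keywords_from_response` is a hypothetical helper name):

```python
import json

# Trimmed copy of the response shown above (only the fields used here).
response_text = """
{
  "hits": {
    "total": 4,
    "hits": [
      {"_id": "2", "_source": {"id": 2, "keyword": "facebook"}},
      {"_id": "7", "_source": {"id": 7, "keyword": "lending club"}},
      {"_id": "4", "_source": {"id": 4, "keyword": "quora"}},
      {"_id": "5", "_source": {"id": 5, "keyword": "your story"}}
    ]
  }
}
"""


def keywords_from_response(response):
    """Collect the matched keywords in the order the hits were returned."""
    return [hit["_source"]["keyword"] for hit in response["hits"]["hits"]]


found = keywords_from_response(json.loads(response_text))
print(found)
```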
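So what does it do behind the scenes?
- It indexes your documents as whole keywords (the entire string is emitted as a single token). I have also added the asciifolding filter, so letters are normalized (i.e. é becomes e), and the lowercase filter (for case-insensitive search). So, for example, Draft kings is indexed as draft kings.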
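- The search analyzer uses the same logic, except that its tokenizer emits word tokens and a shingle filter builds shingles (combinations of adjacent tokens) on top of them; these shingles match the keywords indexed in the first step.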
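The two steps above can be imitated in a few lines of Python to see why the match works. This is only a rough sketch, not what Elasticsearch runs internally: the regular expression and the Unicode normalization merely approximate the standard tokenizer and the asciifolding filter, and all function names are hypothetical.

```python
import re
import unicodedata


def fold(text):
    """Approximate asciifolding + lowercase: strip diacritics, then lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def keyword_analyze(text):
    """Index side: the whole string becomes a single token."""
    return [fold(text)]


def shingle_analyze(text, max_shingle_size=2):
    """Search side: single words plus shingles of up to max_shingle_size words."""
    words = re.findall(r"\w+", fold(text))
    tokens = []
    for size in range(1, max_shingle_size + 1):
        for start in range(len(words) - size + 1):
            tokens.append(" ".join(words[start:start + size]))
    return tokens


def extract_keywords(text, keywords, max_shingle_size=2):
    """Return the keywords whose single index token is among the search tokens."""
    search_tokens = set(shingle_analyze(text, max_shingle_size))
    return [kw for kw in keywords if keyword_analyze(kw)[0] in search_tokens]


KEYWORDS = [
    "thousand eyes", "facebook", "superdoc", "quora", "your story",
    "Surgery", "lending club", "ad roll", "the honest company", "Draft kings",
]
TEXT = "I saw the news of lending club on facebook, your story and quora"

print(extract_keywords(TEXT, KEYWORDS))
```

Note that the shingle filter by default emits single words and two-word shingles only, which the sketch mirrors with `max_shingle_size=2`. A three-word keyword such as the honest company would therefore need a shingle filter configured with a larger `max_shingle_size`.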