使用ElasticSearch进行文件名搜索

Z时代
2024-01-10
分类：问答

我想使用ElasticSearch搜索文件名（而不是文件的内容）。因此，我需要找到文件名的一部分（完全匹配，没有模糊搜索）。

示例：

我有以下名称的文件：

My_first_file_created_at_2012.01.13.doc My_second_file_created_at_2012.01.13.pdf Another file.txt And_again_another_file.docx foo.bar.txt

现在，我要搜索2012.01.13以获取前两个文件。

搜索file或ile应返回除最后一个文件名以外的所有文件名。

如何使用ElasticSearch做到这一点？

这是我测试过的，但始终返回零结果：

curl -X DELETE localhost:9200/files
curl -X PUT    localhost:9200/files -d '
{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "filename_analyzer" : {
            "type" : "custom",
            "tokenizer" : "lowercase",
            "filter"    : ["filename_stop", "filename_ngram"]
          }
        },
        "filter" : {
          "filename_stop" : {
            "type" : "stop",
            "stopwords" : ["doc", "pdf", "docx"]
          },
          "filename_ngram" : {
            "type" : "nGram",
            "min_gram" : 3,
            "max_gram" : 255
          }
        }
      }
    }
  },
  "mappings": {
    "files": {
      "properties": {
        "filename": {
          "type": "string",
          "analyzer": "filename_analyzer"
        }
      }
    }
  }
}
'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
curl -X POST "http://localhost:9200/files/_refresh"
FILES='
http://localhost:9200/files/_search?q=filename:2012.01.13
'
for file in ${FILES}
do
  echo; echo; echo ">>> ${file}"
  curl "${file}&pretty=true"
done

回答：

您粘贴的内容存在各种问题：

创建索引时，请指定：

"mappings": {
    "files": {

但实际上您的类型file不是files。如果您检查了映射，您将立即看到：

curl -XGET 'http://127.0.0.1:9200/files/_mapping?pretty=1'
# {
#    "files" : {
#       "files" : {
#          "properties" : {
#             "filename" : {
#                "type" : "string",
#                "analyzer" : "filename_analyzer"
#             }
#          }
#       },
#       "file" : {
#          "properties" : {
#             "filename" : {
#                "type" : "string"
#             }
#          }
#       }
#    }
# }

您已经指定了lowercase令牌生成器，但是它删除了不是字母的任何内容（请参阅docs），因此您的数字已被完全删除。

您可以使用analytics API进行检查：

curl -XGET 'http://127.0.0.1:9200/_analyze?pretty=1&text=My_file_2012.01.13.doc&tokenizer=lowercase' # { # "tokens" : [ # { # "end_offset" : 2, # "position" : 1, # "start_offset" : 0, # "type" : "word", # "token" : "my" # }, # { # "end_offset" : 7, # "position" : 2, # "start_offset" : 3, # "type" : "word", # "token" : "file" # }, # { # "end_offset" : 22, # "position" : 3, # "start_offset" : 19, # "type" : "word", # "token" : "doc" # } # ] # }

您在索引分析器和搜索分析器中都包括了ngram令牌过滤器。这对于索引分析器很好，因为您希望对ngram进行索引。但是，当您搜索时，您想搜索的是完整字符串，而不是每个ngram。

例如，如果您"abcd"使用长度为1到4的ngram进行索引，则最终将得到以下标记：

a b c d ab bc cd abc bcd

但是，如果您搜索"dcba"（不匹配），并且还使用ngrams分析搜索词，则实际上是在搜索：

d c b a dc cb ba dbc cba

因此a，b，c和d将匹配！

首先，您需要选择正确的分析仪。您的用户可能会搜索单词，数字或日期，但可能不会期望ile匹配file。相反，使用edge ngrams

可能会更有用，它会将ngram锚定到每个单词的开头（或结尾）。

另外，为什么要排除docx等等？用户肯定会想要搜索文件类型吗？

因此，通过删除不是字母或数字的任何内容（使用模式tokenizer），将每个文件名分成较小的令牌：

My_first_file_2012.01.13.doc
=> my first file 2012 01 13 doc

然后对于索引分析器，我们还将在每个标记上使用边缘ngram：

my     => m my
first  => f fi fir firs first
file   => f fi fil file
2012   => 2 20 201 201
01     => 0 01
13     => 1 13
doc    => d do doc

我们创建索引如下：

curl -XPUT 'http://127.0.0.1:9200/files/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "filename_search" : {
               "tokenizer" : "filename",
               "filter" : ["lowercase"]
            },
            "filename_index" : {
               "tokenizer" : "filename",
               "filter" : ["lowercase","edge_ngram"]
            }
         },
         "tokenizer" : {
            "filename" : {
               "pattern" : "[^\\p{L}\\d]+",
               "type" : "pattern"
            }
         },
         "filter" : {
            "edge_ngram" : {
               "side" : "front",
               "max_gram" : 20,
               "min_gram" : 1,
               "type" : "edgeNGram"
            }
         }
      }
   },
   "mappings" : {
      "file" : {
         "properties" : {
            "filename" : {
               "type" : "string",
               "search_analyzer" : "filename_search",
               "index_analyzer" : "filename_index"
            }
         }
      }
   }
}
'

现在，测试我们的分析仪是否正常工作：

curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_search' [results snipped] "token" : "my" "token" : "first" "token" : "file" "token" : "2012" "token" : "01" "token" : "13" "token" : "doc"

curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_index' "token" : "m" "token" : "my" "token" : "f" "token" : "fi" "token" : "fir" "token" : "firs" "token" : "first" "token" : "f" "token" : "fi" "token" : "fil" "token" : "file" "token" : "2" "token" : "20" "token" : "201" "token" : "2012" "token" : "0" "token" : "01" "token" : "1" "token" : "13" "token" : "d" "token" : "do" "token" : "doc"

OK-似乎工作正常。因此，让我们添加一些文档：

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }' curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }' curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }' curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }' curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }' curl -X POST "http://localhost:9200/files/_refresh"

并尝试搜索：

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d ' { "query" : { "text" : { "filename" : "2012.01" } } } ' # { # "hits" : { # "hits" : [ # { # "_source" : { # "filename" : "My_second_file_created_at_2012.01.13.pdf" # }, # "_score" : 0.06780553, # "_index" : "files", # "_id" : "PsDvfFCkT4yvJnlguxJrrQ", # "_type" : "file" # }, # { # "_source" : { # "filename" : "My_first_file_created_at_2012.01.13.doc" # }, # "_score" : 0.06780553, # "_index" : "files", # "_id" : "ER5RmyhATg-Eu92XNGRu-w", # "_type" : "file" # } # ], # "max_score" : 0.06780553, # "total" : 2 # }, # "timed_out" : false, # "_shards" : { # "failed" : 0, # "successful" : 5, # "total" : 5 # }, # "took" : 4 # }

成功！

我意识到搜索2012.01将使两者匹配2012.01.12，2012.12.01因此我尝试将查询更改为使用文本短语查询。但是，这没有用。事实证明，边缘ngram过滤器会增加每个ngram的位置计数（而我本以为每个ngram的位置将与单词开头相同）。

在点（3）中提到的问题使用时以上只是一个问题query_string，field或text查询它试图匹配任何令牌。但是，对于text_phrase查询，它将尝试以正确的顺序匹配所有令牌。

为了演示该问题，请使用不同的日期为另一个文档建立索引：

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_third_file_created_at_2012.12.01.doc" }' curl -X POST "http://localhost:9200/files/_refresh"

并执行与上述相同的搜索：

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d ' { "query" : { "text" : { "filename" : { "query" : "2012.01" } } } } ' # { # "hits" : { # "hits" : [ # { # "_source" : { # "filename" : "My_third_file_created_at_2012.12.01.doc" # }, # "_score" : 0.22097087, # "_index" : "files", # "_id" : "xmC51lIhTnWplOHADWJzaQ", # "_type" : "file" # }, # { # "_source" : { # "filename" : "My_first_file_created_at_2012.01.13.doc" # }, # "_score" : 0.13137488, # "_index" : "files", # "_id" : "ZUezxDgQTsuAaCTVL9IJgg", # "_type" : "file" # }, # { # "_source" : { # "filename" : "My_second_file_created_at_2012.01.13.pdf" # }, # "_score" : 0.13137488, # "_index" : "files", # "_id" : "XwLNnSlwSeyYtA2y64WuVw", # "_type" : "file" # } # ], # "max_score" : 0.22097087, # "total" : 3 # }, # "timed_out" : false, # "_shards" : { # "failed" : 0, # "successful" : 5, # "total" : 5 # }, # "took" : 5 # }

第一个结果的日期2012.12.01不是最匹配的日期2012.01。因此，仅匹配该确切短语，我们可以执行以下操作：

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d ' { "query" : { "text_phrase" : { "filename" : { "query" : "2012.01", "analyzer" : "filename_index" } } } } ' # { # "hits" : { # "hits" : [ # { # "_source" : { # "filename" : "My_first_file_created_at_2012.01.13.doc" # }, # "_score" : 0.55737644, # "_index" : "files", # "_id" : "ZUezxDgQTsuAaCTVL9IJgg", # "_type" : "file" # }, # { # "_source" : { # "filename" : "My_second_file_created_at_2012.01.13.pdf" # }, # "_score" : 0.55737644, # "_index" : "files", # "_id" : "XwLNnSlwSeyYtA2y64WuVw", # "_type" : "file" # } # ], # "max_score" : 0.55737644, # "total" : 2 # }, # "timed_out" : false, # "_shards" : { # "failed" : 0, # "successful" : 5, # "total" : 5 # }, # "took" : 7 # }

或者，如果您仍然要匹配所有3个文件（因为用户可能会记住文件名中的某些单词，但顺序错误），则可以运行两个查询，但要以正确的顺序增加文件名的重要性：

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d ' { "query" : { "bool" : { "should" : [ { "text_phrase" : { "filename" : { "boost" : 2, "query" : "2012.01", "analyzer" : "filename_index" } } }, { "text" : { "filename" : "2012.01" } } ] } } } ' # [Fri Feb 24 16:31:02 2012] Response: # { # "hits" : { # "hits" : [ # { # "_source" : { # "filename" : "My_first_file_created_at_2012.01.13.doc" # }, # "_score" : 0.56892186, # "_index" : "files", # "_id" : "ZUezxDgQTsuAaCTVL9IJgg", # "_type" : "file" # }, # { # "_source" : { # "filename" : "My_second_file_created_at_2012.01.13.pdf" # }, # "_score" : 0.56892186, # "_index" : "files", # "_id" : "XwLNnSlwSeyYtA2y64WuVw", # "_type" : "file" # }, # { # "_source" : { # "filename" : "My_third_file_created_at_2012.12.01.doc" # }, # "_score" : 0.012931341, # "_index" : "files", # "_id" : "xmC51lIhTnWplOHADWJzaQ", # "_type" : "file" # } # ], # "max_score" : 0.56892186, # "total" : 3 # }, # "timed_out" : false, # "_shards" : { # "failed" : 0, # "successful" : 5, # "total" : 5 # }, # "took" : 4 # }

以上是使用ElasticSearch进行文件名搜索的全部内容，来源链接： utcz.com/qa/412782.html

使用ElasticSearch进行文件名搜索

回答：

其他人也看了：