Searching for file names with Elasticsearch

I want to use Elasticsearch to search file names (not the contents of files). I need to find a part of the file name (exact match, no fuzzy search).

Example:

I have files with the following names:

My_first_file_created_at_2012.01.13.doc

My_second_file_created_at_2012.01.13.pdf

Another file.txt

And_again_another_file.docx

foo.bar.txt

Now I want to search for 2012.01.13 and get the first two files.

A search for file or ile should return all of the file names except the last one.

How do I accomplish this with Elasticsearch?

This is what I have tested, but it always returns zero results:

curl -X DELETE localhost:9200/files

curl -X PUT localhost:9200/files -d '
{
   "settings" : {
      "index" : {
         "analysis" : {
            "analyzer" : {
               "filename_analyzer" : {
                  "type" : "custom",
                  "tokenizer" : "lowercase",
                  "filter" : ["filename_stop", "filename_ngram"]
               }
            },
            "filter" : {
               "filename_stop" : {
                  "type" : "stop",
                  "stopwords" : ["doc", "pdf", "docx"]
               },
               "filename_ngram" : {
                  "type" : "nGram",
                  "min_gram" : 3,
                  "max_gram" : 255
               }
            }
         }
      }
   },
   "mappings": {
      "files": {
         "properties": {
            "filename": {
               "type": "string",
               "analyzer": "filename_analyzer"
            }
         }
      }
   }
}
'

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'

curl -X POST "http://localhost:9200/files/_refresh"

FILES='
http://localhost:9200/files/_search?q=filename:2012.01.13
'

for file in ${FILES}
do
  echo; echo; echo ">>> ${file}"
  curl "${file}&pretty=true"
done

Answer:

There are various problems with what you have pasted:

1) When you create the index, you specify:

   "mappings": {
      "files": {

but actually your type is file, not files. If you had checked the mapping, you would have seen that immediately:

curl -XGET 'http://127.0.0.1:9200/files/_mapping?pretty=1'

# {
#    "files" : {
#       "files" : {
#          "properties" : {
#             "filename" : {
#                "type" : "string",
#                "analyzer" : "filename_analyzer"
#             }
#          }
#       },
#       "file" : {
#          "properties" : {
#             "filename" : {
#                "type" : "string"
#             }
#          }
#       }
#    }
# }

2) You have specified the lowercase tokenizer, but it removes anything that isn't a letter (see the docs), so your numbers are being removed completely.

You can check this with the analyze API:

curl -XGET 'http://127.0.0.1:9200/_analyze?pretty=1&text=My_file_2012.01.13.doc&tokenizer=lowercase'

# {
#    "tokens" : [
#       {
#          "end_offset" : 2,
#          "position" : 1,
#          "start_offset" : 0,
#          "type" : "word",
#          "token" : "my"
#       },
#       {
#          "end_offset" : 7,
#          "position" : 2,
#          "start_offset" : 3,
#          "type" : "word",
#          "token" : "file"
#       },
#       {
#          "end_offset" : 22,
#          "position" : 3,
#          "start_offset" : 19,
#          "type" : "word",
#          "token" : "doc"
#       }
#    ]
# }
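The tokenizer's behaviour is easy to reproduce outside Elasticsearch. The sketch below (plain Python, purely illustrative, using an ASCII-only letter class as a stand-in for the real tokenizer's definition of "letter") splits on anything that isn't a letter and shows that the digits disappear:

```python
import re

def lowercase_tokenizer(text):
    # Rough simulation of Elasticsearch's "lowercase" tokenizer:
    # split on any run of non-letters, then lowercase each token.
    # Digits act as separators, so they are discarded entirely.
    return [t.lower() for t in re.split(r"[^a-zA-Z]+", text) if t]

print(lowercase_tokenizer("My_file_2012.01.13.doc"))
# → ['my', 'file', 'doc'] — the date is gone
```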

3) You have included the ngram token filter in both the index analyzer and the search analyzer. That's fine for the index analyzer, because you want the ngrams to be indexed. But when you search, you want to search on the full string, not on each ngram.

For instance, if you index "abcd" with ngrams of length 1 to 4, you will end up with these tokens:

a b c d ab bc cd abc bcd

But if you search for "dcba" (which shouldn't match) and you also analyze your search terms with ngrams, then you are actually searching on:

d c b a dc cb ba dcb cba

So a, b, c and d will match, and abcd is returned!
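A tiny Python sketch (illustrative only, not Elasticsearch's implementation) of why this happens: analyzing both sides with the same ngram filter leaves them sharing the single-letter grams.

```python
def ngrams(s, min_gram=1, max_gram=4):
    # All character n-grams of s, like Elasticsearch's nGram filter
    return {s[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(s) - n + 1)}

indexed = ngrams("abcd")  # what goes into the index
query = ngrams("dcba")    # the search string, also ngram-analyzed
print(sorted(indexed & query))
# → ['a', 'b', 'c', 'd'] — shared grams, so "dcba" finds "abcd"
```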

First, you need to choose the right analyzer. Your users will probably search for words, numbers or dates, but they probably won't expect ile to match file. Instead, it will probably be more useful to use edge ngrams, which anchor the ngram to the start (or end) of each word.

Also, why exclude docx etc? Surely a user may well want to search on the file type?

So let's break each filename up into smaller tokens by removing anything that isn't a letter or a digit (using the pattern tokenizer):

My_first_file_2012.01.13.doc
=> my first file 2012 01 13 doc
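That tokenization step can be simulated like so (a sketch only; Python's re module lacks \p{L}, so an ASCII letter class stands in for it here):

```python
import re

def filename_tokenizer(text):
    # Split on runs of anything that is not a letter or a digit,
    # mimicking the pattern tokenizer "[^\\p{L}\\d]+" (ASCII letters only here)
    return [t.lower() for t in re.split(r"[^a-zA-Z0-9]+", text) if t]

print(filename_tokenizer("My_first_file_2012.01.13.doc"))
# → ['my', 'first', 'file', '2012', '01', '13', 'doc']
```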

Then for the index analyzer, we also apply edge ngrams to each of those tokens:

my    => m my
first => f fi fir firs first
file  => f fi fil file
2012  => 2 20 201 2012
01    => 0 01
13    => 1 13
doc   => d do doc
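The edge-ngram expansion above is equally simple to sketch (illustrative Python, not the Elasticsearch implementation):

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    # Front-anchored edge n-grams, like the edgeNGram filter described above
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

for tok in "my first file 2012 01 13 doc".split():
    print(tok, "=>", " ".join(edge_ngrams(tok)))
# e.g. first => f fi fir firs first
```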

We create the index as follows:

curl -XPUT 'http://127.0.0.1:9200/files/?pretty=1' -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "filename_search" : {
               "tokenizer" : "filename",
               "filter" : ["lowercase"]
            },
            "filename_index" : {
               "tokenizer" : "filename",
               "filter" : ["lowercase","edge_ngram"]
            }
         },
         "tokenizer" : {
            "filename" : {
               "pattern" : "[^\\p{L}\\d]+",
               "type" : "pattern"
            }
         },
         "filter" : {
            "edge_ngram" : {
               "side" : "front",
               "max_gram" : 20,
               "min_gram" : 1,
               "type" : "edgeNGram"
            }
         }
      }
   },
   "mappings" : {
      "file" : {
         "properties" : {
            "filename" : {
               "type" : "string",
               "search_analyzer" : "filename_search",
               "index_analyzer" : "filename_index"
            }
         }
      }
   }
}
'

Now, test that the analyzers are working correctly:

curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_search'

[results snipped]
"token" : "my"
"token" : "first"
"token" : "file"
"token" : "2012"
"token" : "01"
"token" : "13"
"token" : "doc"

curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_index'

"token" : "m"
"token" : "my"
"token" : "f"
"token" : "fi"
"token" : "fir"
"token" : "firs"
"token" : "first"
"token" : "f"
"token" : "fi"
"token" : "fil"
"token" : "file"
"token" : "2"
"token" : "20"
"token" : "201"
"token" : "2012"
"token" : "0"
"token" : "01"
"token" : "1"
"token" : "13"
"token" : "d"
"token" : "do"
"token" : "doc"

OK - that seems to be working correctly. So let's add some docs:

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'

curl -X POST "http://localhost:9200/files/_refresh"

And try a search:

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d '
{
   "query" : {
      "text" : {
         "filename" : "2012.01"
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.06780553,
#             "_index" : "files",
#             "_id" : "PsDvfFCkT4yvJnlguxJrrQ",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.06780553,
#             "_index" : "files",
#             "_id" : "ER5RmyhATg-Eu92XNGRu-w",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.06780553,
#       "total" : 2
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 4
# }

Success!

I realise that a search for 2012.01 would match both 2012.01.12 and 2012.12.01, so I tried changing the query to use a text-phrase query instead. However, this didn't work. It turns out that the edge-ngram filter increments the position count for each ngram (whereas I would have thought that the position of each ngram would be the same as for the start of the word).

The issue mentioned in point (3) above is only a problem when using a query_string, field, or text query, which tries to match ANY token. However, for a text_phrase query, it tries to match ALL of the tokens, in the correct order.
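The difference can be sketched like this (illustrative Python over word-level tokens, ignoring the edge-ngram position quirk mentioned above):

```python
def any_token_match(query, doc):
    # Like a "text" query: any shared token is a hit
    return any(q in doc for q in query)

def phrase_match(query, doc):
    # Like a "text_phrase" query: all tokens, adjacent and in order
    n = len(query)
    return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))

first = ["my", "first", "file", "created", "at", "2012", "01", "13", "doc"]
third = ["my", "third", "file", "created", "at", "2012", "12", "01", "doc"]
query = ["2012", "01"]

print(any_token_match(query, first), phrase_match(query, first))  # True True
print(any_token_match(query, third), phrase_match(query, third))  # True False
```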

To demonstrate the issue, index another doc with a different date:

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_third_file_created_at_2012.12.01.doc" }'

curl -X POST "http://localhost:9200/files/_refresh"

And run the same search as above:

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d '
{
   "query" : {
      "text" : {
         "filename" : {
            "query" : "2012.01"
         }
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_third_file_created_at_2012.12.01.doc"
#             },
#             "_score" : 0.22097087,
#             "_index" : "files",
#             "_id" : "xmC51lIhTnWplOHADWJzaQ",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.13137488,
#             "_index" : "files",
#             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.13137488,
#             "_index" : "files",
#             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.22097087,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 5
# }

The first result has the date 2012.12.01, which isn't the best match for 2012.01. So to match only that exact phrase, we can do:

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d '
{
   "query" : {
      "text_phrase" : {
         "filename" : {
            "query" : "2012.01",
            "analyzer" : "filename_index"
         }
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.55737644,
#             "_index" : "files",
#             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.55737644,
#             "_index" : "files",
#             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.55737644,
#       "total" : 2
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 7
# }

Or, if you still want to match all 3 files (because the user might remember some of the words in the filename, but in the wrong order), you can run both queries, but increase the importance of the filename that is in the correct order:

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1' -d '
{
   "query" : {
      "bool" : {
         "should" : [
            {
               "text_phrase" : {
                  "filename" : {
                     "boost" : 2,
                     "query" : "2012.01",
                     "analyzer" : "filename_index"
                  }
               }
            },
            {
               "text" : {
                  "filename" : "2012.01"
               }
            }
         ]
      }
   }
}
'

# [Fri Feb 24 16:31:02 2012] Response:
# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.56892186,
#             "_index" : "files",
#             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.56892186,
#             "_index" : "files",
#             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_third_file_created_at_2012.12.01.doc"
#             },
#             "_score" : 0.012931341,
#             "_index" : "files",
#             "_id" : "xmC51lIhTnWplOHADWJzaQ",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.56892186,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 4
# }
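The ranking effect of this bool/should combination can be sketched as follows (illustrative Python; the weights are made up for the demonstration and are not Lucene's scoring formula):

```python
def phrase_match(query, doc):
    # All query tokens, adjacent and in order
    n = len(query)
    return any(doc[i:i + n] == query for i in range(len(doc) - n + 1))

def score(query, doc, boost=2.0):
    # Two "should" clauses add up: a boosted exact-phrase clause,
    # plus an any-token clause that keeps partial matches in the results
    s = boost if phrase_match(query, doc) else 0.0
    s += 1.0 if any(t in doc for t in query) else 0.0
    return s

docs = {
    "My_first_file_created_at_2012.01.13.doc": ["2012", "01", "13"],
    "My_third_file_created_at_2012.12.01.doc": ["2012", "12", "01"],
}
query = ["2012", "01"]
for name, tokens in sorted(docs.items(), key=lambda kv: -score(query, kv[1])):
    print(round(score(query, tokens), 1), name)
# the in-order date scores 3.0, the out-of-order date only 1.0
```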

Source: utcz.com/qa/412782.html
