How to count unique documents by a nested field in Elasticsearch?

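I am trying to count documents that have a unique value in a nested field (and also fetch the documents themselves). Fetching the unique documents seems to work. However, when I execute the count request, I get the following error: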

Suppressed: org.elasticsearch.client.ResponseException: method [POST], host [http://localhost:9200], URI [/package/_count?ignore_throttled=true&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true], status line [HTTP/1.1 400 Bad Request] {"error":{"root_cause":[{"type":"parsing_exception","reason":"request does not support [collapse]","line":1,"col":216}],"type":"parsing_exception","reason":"request does not support [collapse]","line":1,"col":216},"status":400}

Code:

// Nested query on "attachment"; the inner bool query stays empty unless a template name is given
BoolQueryBuilder innerTemplNestedBuilder = QueryBuilders.boolQuery();
NestedQueryBuilder templatesNestedQuery = QueryBuilders.nestedQuery("attachment", innerTemplNestedBuilder, ScoreMode.None);
BoolQueryBuilder mainQueryBuilder = QueryBuilders.boolQuery().must(templatesNestedQuery);

if (!isEmpty(templateName)) {
    innerTemplNestedBuilder.filter(QueryBuilders.termQuery("attachment.name", templateName));
}

SearchSourceBuilder searchSourceBuilder = SearchSourceBuilder.searchSource()
        .collapse(new CollapseBuilder("attachment.uuid"))
        .query(mainQueryBuilder);

// THE NEXT LINE CAUSES THE ERROR
long count = client.count(new CountRequest("package").source(searchSourceBuilder), RequestOptions.DEFAULT).getCount();

// THIS WORKS
SearchResponse searchResponse = client.search(
        new SearchRequest(
                new String[] {"package"},
                searchSourceBuilder.timeout(new TimeValue(20, TimeUnit.SECONDS)).from(offset).size(limit)
        ).indices("package").searchType(SearchType.DFS_QUERY_THEN_FETCH),
        RequestOptions.DEFAULT
);

return ....;

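The overall intent of this method is to fetch one page of documents together with the total number of such documents. There may already be another way to satisfy this need. If I try to get the count with a cardinality aggregation, the result is zero, and it looks like it does not work on nested fields.

Count request: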

{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "query": {
              "bool": {
                "adjust_pure_negative": true,
                "boost": 1.0
              }
            },
            "path": "attachment",
            "ignore_unmapped": false,
            "score_mode": "none",
            "boost": 1.0
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  },
  "collapse": {
    "field": "attachment.uuid"
  }
}

How the mapping is created:

curl -X DELETE "localhost:9200/package?pretty"

curl -X PUT "localhost:9200/package?include_type_name=true&pretty" -H 'Content-Type: application/json' -d '{
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 1
  }
}'

curl -X PUT "localhost:9200/package/_mappings?pretty" -H 'Content-Type: application/json' -d'
{
  "dynamic": false,
  "properties" : {
    "attachment": {
      "type": "nested",
      "properties": {
        "uuid" : { "type" : "keyword" },
        "name" : { "type" : "text" }
      }
    },
    "uuid" : {
      "type" : "keyword"
    }
  }
}
'

The query generated by the code should look like this:

curl -X POST "localhost:9200/package/_count?&pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "query": {
              "bool": {
                "adjust_pure_negative": true,
                "boost": 1.0
              }
            },
            "path": "attachment",
            "ignore_unmapped": false,
            "score_mode": "none",
            "boost": 1.0
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1.0
    }
  },
  "collapse": {
    "field": "attachment.uuid"
  }
}'

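Answer:

Collapse can only be used in the _search context, not in _count.

Secondly, what is your query even supposed to do? You have a lot of redundant parameters in there, such as boost: 1. You might as well write: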

POST /package/_count?&pretty
{
  "query": {
    "bool": {
      "must": [
        {
          "nested": {
            "path": "attachment",
            "query": {
              "match_all": {}
            }
          }
        }
      ]
    }
  }
}

which does not really do anything :) It merely matches every document that has at least one attachment.


Answer:

Assume there are 3 documents, 2 of which have the same attachment.uuid value:

[
  {
    "attachment": {
      "uuid": "04144e14-62c3-11ea-bc55-0242ac130003"
    }
  },
  {
    "attachment": {
      "uuid": "04144e14-62c3-11ea-bc55-0242ac130003"
    }
  },
  {
    "attachment": {
      "uuid": "100b9632-62c3-11ea-bc55-0242ac130003"
    }
  }
]

To get a terms breakdown of the uuids, run

GET package/_search
{
  "size": 0,
  "aggs": {
    "nested_uniques": {
      "nested": {
        "path": "attachment"
      },
      "aggs": {
        "subagg": {
          "terms": {
            "field": "attachment.uuid"
          }
        }
      }
    }
  }
}

which yields

...
{
  "aggregations": {
    "nested_uniques": {
      "doc_count": 3,
      "subagg": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
          {
            "key": "04144e14-62c3-11ea-bc55-0242ac130003",
            "doc_count": 2
          },
          {
            "key": "100b9632-62c3-11ea-bc55-0242ac130003",
            "doc_count": 1
          }
        ]
      }
    }
  }
}
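
To make the bucket output concrete, here is a minimal plain-Java sketch that reproduces the same breakdown in memory for the three sample uuids. It does not talk to Elasticsearch, and the class and method names are hypothetical. The number of buckets equals the number of unique uuids, with the caveat that a terms aggregation returns only the top 10 buckets by default unless its size is raised.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical illustration class; it only mimics the shape of the terms output.
public class TermsBreakdownSketch {

    // Groups uuid values into "buckets" of key -> doc_count,
    // which is what the terms sub-aggregation reports per distinct value.
    static Map<String, Long> termsBuckets(List<String> uuids) {
        return uuids.stream()
                .collect(Collectors.groupingBy(u -> u, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // The attachment.uuid values of the three sample documents.
        List<String> uuids = List.of(
                "04144e14-62c3-11ea-bc55-0242ac130003",
                "04144e14-62c3-11ea-bc55-0242ac130003",
                "100b9632-62c3-11ea-bc55-0242ac130003");

        Map<String, Long> buckets = termsBuckets(uuids);
        buckets.forEach((key, count) -> System.out.println(key + " -> " + count));

        // The bucket count is the number of unique uuids: 2 for this sample.
        System.out.println("unique uuids: " + buckets.size());
    }
}
```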


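Answer:

To get the number of unique uuids itself, you can use a scripted_metric aggregation inside the nested aggregation: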

GET package/_search
{
  "size": 0,
  "aggs": {
    "nested_uniques": {
      "nested": {
        "path": "attachment"
      },
      "aggs": {
        "scripted_uniques": {
          "scripted_metric": {
            "init_script": "state.my_map = [:];",
            "map_script": """
              if (doc.containsKey('attachment.uuid')) {
                state.my_map[doc['attachment.uuid'].value.toString()] = 1;
              }
            """,
            "combine_script": """
              def sum = 0;
              for (c in state.my_map.entrySet()) {
                sum += 1
              }
              return sum
            """,
            "reduce_script": """
              def sum = 0;
              for (agg in states) {
                sum += agg;
              }
              return sum;
            """
          }
        }
      }
    }
  }
}

which returns

...
{
  "aggregations": {
    "nested_uniques": {
      "doc_count": 3,
      "scripted_uniques": {
        "value": 2
      }
    }
  }
}

and this scripted_uniques: 2 is exactly what you are after.
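
One caveat is worth spelling out: combine_script runs once per shard and reduce_script then adds up the per-shard results, so the sum equals the number of unique uuids only when no uuid occurs on more than one shard (which holds for the single-shard index in the question). The plain-Java sketch below simulates that flow with hypothetical in-memory "shards" and contrasts it with a reduce step that merges the key sets before counting. It illustrates the logic only; it is not Elasticsearch code, and the union variant is an assumption that is not part of the answer.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation of the map/combine/reduce phases of the scripted metric.
public class ScriptedMetricSketch {

    // map + combine as in the answer: each shard reports how many distinct uuids it saw.
    static int combineCount(List<String> shardUuids) {
        return new HashSet<>(shardUuids).size();
    }

    // reduce as in the answer: add up the per-shard counts.
    static int reduceBySum(List<List<String>> shards) {
        int sum = 0;
        for (List<String> shard : shards) {
            sum += combineCount(shard);
        }
        return sum;
    }

    // Alternative reduce (assumption): merge the per-shard key sets, then count once.
    static int reduceByUnion(List<List<String>> shards) {
        Set<String> all = new HashSet<>();
        for (List<String> shard : shards) {
            all.addAll(shard);
        }
        return all.size();
    }

    public static void main(String[] args) {
        List<List<String>> oneShard = List.of(List.of("u1", "u1", "u2"));
        List<List<String>> twoShards = List.of(List.of("u1", "u2"), List.of("u1"));

        System.out.println(reduceBySum(oneShard));    // 2: correct on a single shard
        System.out.println(reduceBySum(twoShards));   // 3: "u1" is counted on both shards
        System.out.println(reduceByUnion(twoShards)); // 2: duplicates across shards collapse
    }
}
```

On a multi-shard index, the equivalent change would be to return the map itself from combine_script and merge the keys in reduce_script before counting.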


Note: I solved this use case with a nested scripted_metric aggregation, but if you know a cleaner approach, I would be glad to learn it!
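
One candidate for a cleaner approach might be a cardinality aggregation wrapped in a nested aggregation. Run at the top level, cardinality sees no values for attachment.uuid because nested objects are indexed as separate hidden documents, which would explain the zero mentioned in the question. The sketch below only builds the request body as a string, so it runs without an Elasticsearch client; the class name, method name, and aggregation names are hypothetical. Keep in mind that cardinality is an approximate count, although it is expected to be close to exact for small numbers of distinct values.

```java
// Hypothetical helper that assembles a _search body; it sends nothing over the network.
public class NestedCardinalitySketch {

    // Wraps a cardinality aggregation on `field` in a nested aggregation on `nestedPath`.
    // "size": 0 suppresses hits, so only the aggregation result comes back.
    static String body(String nestedPath, String field) {
        return "{"
                + "\"size\":0,"
                + "\"aggs\":{\"nested_uniques\":{"
                + "\"nested\":{\"path\":\"" + nestedPath + "\"},"
                + "\"aggs\":{\"unique_uuids\":{"
                + "\"cardinality\":{\"field\":\"" + field + "\"}"
                + "}}}}}";
    }

    public static void main(String[] args) {
        // Intended to be sent as the body of: GET package/_search
        System.out.println(body("attachment", "attachment.uuid"));
    }
}
```

The unique count would then be read from aggregations.nested_uniques.unique_uuids.value in the response, and the page of collapsed hits can still come from the existing _search request.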
