How to get latest values for each group with an Elasticsearch query?
Solution 1:
You can use a top_hits
aggregation that groups on the country
field, returns 1 doc per group, and orders the docs by the collected date descending:
POST /test/_search?search_type=count
{
"aggs": {
"group": {
"terms": {
"field": "country"
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1,
"sort": [
{
"collected": {
"order": "desc"
}
}
]
}
}
}
}
}
}
Solution 2:
For those like user1892775 who run into "Fielddata is disabled on text fields by default...", you can create a multi field (https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html). So you might have mapping like:
"mapping": {
"properties": {
"country": {"type": "string", "fields": {"raw": {"type": "string", "index": "not_analyzed"}}}
}
Then your query would look like
POST /test/_search?search_type=count
{
"aggs": {
"group": {
"terms": {
"field": "country.raw"
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1,
"sort": [
{
"collected": {
"order": "desc"
}
}
]
}
}
}
}
}
}
(Note the use of country.raw)
Solution 3:
The answer marked correct worked great for me. Here is how I added some extra filters. This is version 7.4 on AWS.
The field I'm grouping by is a keyword field named tags.
-
For each group (tag), get top 3 documents sorted by date_uploaded descending.
-
Also show the total amount of documents within each group (tag).
-
Only consider non-deleted documents belonging to user 22.
-
Only return 10 groups (tags), sorted alphabetically.
-
For each document, return its ID (book_id) and date_uploaded. (Default is that all info is returned.)
-
Size:0 keeps the query from returning lots of info about all the documents.
{'query': {'bool': {'filter': [{'terms': {'user_id': [22]}}, {'terms': {'deleted': ['false']}}]}}, 'size': 0, "aggs": { "group": { "terms": { "field": "tags.keyword", "size":10, "order":{ "_key": "asc" } }, "aggs": { "group_docs": { "top_hits": { "size": 3, "_source":["book_id","date_uploaded"], "sort": [ {"date_uploaded": { "order": "desc" }}] } } } } }
}
Here is how to get each group (tag in my case) and the document matches for each group.
query_results = ... result of query
buckets = query_results["aggregations"]["group"]["buckets"]
for bucket in buckets:
tag = bucket["key"]
tag_doc_count = bucket["doc_count"]
print tag, tag_total_doc_count
tag_hits = bucket["group_docs"]["hits"]["hits"]
for hit in tag_hits:
source = hit["_source"]
print source["book_id"], source["date_uploaded"]
FYI, the "group" term can be named anything. Just make sure to use the same name when getting buckets from your query results.