How to get latest values for each group with an Elasticsearch query?

Solution 1:

You can use a top_hits aggregation that groups on the country field, returns 1 doc per group, and orders the docs by the collected date descending:

POST /test/_search?search_type=count
{
    "aggs": {
        "group": {
            "terms": {
                "field": "country"
            },
            "aggs": {
                "group_docs": {
                    "top_hits": {
                        "size": 1,
                        "sort": [
                            {
                                "collected": {
                                    "order": "desc"
                                }
                            }
                        ]
                    }
                }
            }
        }
    }
}

Solution 2:

For those like user1892775 who run into "Fielddata is disabled on text fields by default...", you can create a multi field (https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html). So you might have mapping like:

"mapping": {
    "properties": {
      "country": {"type": "string", "fields": {"raw": {"type": "string", "index": "not_analyzed"}}}
}

Then your query would look like

POST /test/_search?search_type=count
{
    "aggs": {
    "group": {
        "terms": {
            "field": "country.raw"
        },
        "aggs": {
            "group_docs": {
                "top_hits": {
                    "size": 1,
                    "sort": [
                        {
                            "collected": {
                                "order": "desc"
                            }
                        }
                    ]
                }
            }
        }
    }
  }
}

(Note the use of country.raw)

Solution 3:

The answer marked correct worked great for me. Here is how I added some extra filters. This is version 7.4 on AWS.

The field I'm grouping by is a keyword field named tags.

For each group (tag), get top 3 documents sorted by date_uploaded descending.
Also show the total amount of documents within each group (tag).
Only consider non-deleted documents belonging to user 22.
Only return 10 groups (tags), sorted alphabetically.
For each document, return its ID (book_id) and date_uploaded. (Default is that all info is returned.)

Size:0 keeps the query from returning lots of info about all the documents.

  {'query': {'bool': {'filter': [{'terms': {'user_id': [22]}}, {'terms': {'deleted': ['false']}}]}},
   'size': 0,
   "aggs": {
      "group": {
          "terms": {
              "field": "tags.keyword",
              "size":10,
              "order":{ "_key": "asc" }
          },  
          "aggs": {
              "group_docs": {
                  "top_hits": {
                      "size": 3,
                      "_source":["book_id","date_uploaded"],
                      "sort": [ {"date_uploaded": { "order": "desc" }}]
                  }
              }
          }
      }
  }

}

Here is how to get each group (tag in my case) and the document matches for each group.

query_results = ... result of query
buckets = query_results["aggregations"]["group"]["buckets"]
for bucket in buckets:
    tag = bucket["key"]
    tag_doc_count = bucket["doc_count"]
    print tag, tag_total_doc_count
    tag_hits = bucket["group_docs"]["hits"]["hits"]
    for hit in tag_hits:
        source = hit["_source"]
        print source["book_id"], source["date_uploaded"]

FYI, the "group" term can be named anything. Just make sure to use the same name when getting buckets from your query results.

How to get latest values for each group with an Elasticsearch query?

Solution 1:

Solution 2:

Solution 3:

Related

Recent Posts