How to get latest values for each group with an Elasticsearch query?

Solution 1:

You can use a top_hits aggregation that groups on the country field, returns 1 doc per group, and orders the docs by the collected date descending:

POST /test/_search?search_type=count
{
    "aggs": {
        "group": {
            "terms": {
                "field": "country"
            },
            "aggs": {
                "group_docs": {
                    "top_hits": {
                        "size": 1,
                        "sort": [
                            {
                                "collected": {
                                    "order": "desc"
                                }
                            }
                        ]
                    }
                }
            }
        }
    }
}

Solution 2:

For those like user1892775 who run into "Fielddata is disabled on text fields by default...", you can create a multi field (https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html). So you might have mapping like:

"mapping": {
    "properties": {
      "country": {"type": "string", "fields": {"raw": {"type": "string", "index": "not_analyzed"}}}
}

Then your query would look like

POST /test/_search?search_type=count
{
    "aggs": {
    "group": {
        "terms": {
            "field": "country.raw"
        },
        "aggs": {
            "group_docs": {
                "top_hits": {
                    "size": 1,
                    "sort": [
                        {
                            "collected": {
                                "order": "desc"
                            }
                        }
                    ]
                }
            }
        }
    }
  }
}

(Note the use of country.raw)

Solution 3:

The answer marked correct worked great for me. Here is how I added some extra filters. This is version 7.4 on AWS.

The field I'm grouping by is a keyword field named tags.

  • For each group (tag), get top 3 documents sorted by date_uploaded descending.

  • Also show the total amount of documents within each group (tag).

  • Only consider non-deleted documents belonging to user 22.

  • Only return 10 groups (tags), sorted alphabetically.

  • For each document, return its ID (book_id) and date_uploaded. (Default is that all info is returned.)

  • Size:0 keeps the query from returning lots of info about all the documents.

      {'query': {'bool': {'filter': [{'terms': {'user_id': [22]}}, {'terms': {'deleted': ['false']}}]}},
       'size': 0,
       "aggs": {
          "group": {
              "terms": {
                  "field": "tags.keyword",
                  "size":10,
                  "order":{ "_key": "asc" }
              },  
              "aggs": {
                  "group_docs": {
                      "top_hits": {
                          "size": 3,
                          "_source":["book_id","date_uploaded"],
                          "sort": [ {"date_uploaded": { "order": "desc" }}]
                      }
                  }
              }
          }
      }
    

    }

Here is how to get each group (tag in my case) and the document matches for each group.

query_results = ... result of query
buckets = query_results["aggregations"]["group"]["buckets"]
for bucket in buckets:
    tag = bucket["key"]
    tag_doc_count = bucket["doc_count"]
    print tag, tag_total_doc_count
    tag_hits = bucket["group_docs"]["hits"]["hits"]
    for hit in tag_hits:
        source = hit["_source"]
        print source["book_id"], source["date_uploaded"]

FYI, the "group" term can be named anything. Just make sure to use the same name when getting buckets from your query results.