How to limit contained associations per record/group?

I have a Model, Articles, which hasMany Abstracts. I want to load the 10 latest Articles, and for each Article, the Abstract with the highest number of points. My function looks like this:

public function getArticles($category, $viewName) {
            $subArticles = $this->Articles->findByCategory($category)->contain([
                    'Abstracts' => function ($q) {
                            return $q
                                    ->select(['body', 'points', 'article_id'])
                                    ->where(['Abstracts.approved' => true])
                                    ->limit(10)
                                    ->order(['Abstracts.points' => 'DESC']);
                    }
            ])
            ->limit(10)
            ->order(['Articles.created' => 'DESC']) ;
            $this->set( $viewName . 'Articles', $subArticles );
    }

The result that I get is not what I intend though. Looking through the SQL, first CakePHP is getting the articles.id of everything in the category (fine). Then, CakePHP goes into the Abstracts table, using those 10 articles.id's it just found, and asks for the 10 Abstracts with the highest votes (that belong to those Articles).

The problem is that I want 1 Abstract for each Article, not the 10 Abstracts belonging to any Article in that category. How can I fix this? Thanks!

EDIT

ndm suggested that this was a duplicate of Using limit() on contained model so I attempted the solution there. Namely, I added this to my Model:

 $this->hasOne('TopAbstract', [
            'className' => 'Abstracts',
            'foreignKey' => 'abstract_id',
            'strategy' => 'select',
            'sort' => ['TopAbstract.points' => 'DESC'],
            'conditions' => function ($e, $query) {
            $query->limit(1);
            return $e;
    } ]);

And then I try to find the Articles byCategory, with contain(['TopAbstract']), only this kills my SQL. It dies a horrible death:

Error: SQLSTATE[HY000]: General error: 1 near ")": syntax error

Debug doesn't even show the query that killed it, so I'm not sure how to debug this one?

EDIT

Talking to myself a bit, but the error is definitely in the 'conditions' part of the hasOne. I take that out, and it works fine. Can't find an example of how this is supposed to look on the interwebs.. anyone have any idea?


Solution 1:

What you are looking for, is a solution to the greatest-n-per-group problem. You didn't mention any specific RDBMS, but nonetheless see also http://dev.mysql.com/doc/refman/5.6/en/example-maximum-column-group-row.html

A library solution

For those who are a little bit adventurous, I've developed some custom associations that transparently integrate into the ORM layer, and allow for basic limit per group for hasMany and belongsToMany relations: https://github.com/icings/partitionable.

Using them, the solution for the question would be to set up an association like this:

$this
    ->partitionableHasMany('TopAbstracts')
    ->setClassName('Abstracts')
    ->setLimit(1)
    ->setSort([
        'Abstracts.points' => 'DESC',
        'Abstracts.id' => 'ASC',
    ]);

TopAbstracts could then be contained just like any other association.

Custom solutions on association level

So let's give this a try, here's three options that can be applied on association level (defining the conditions could also be moved into custom finders), however you might consider them as not that "straightforward".


Select strategy - Using a join on a grouping, max-value subquery

$this->hasOne('TopAbstracts', [
    'className' => 'Abstracts',
    'strategy' => 'select',
    'conditions' => function (\Cake\Database\Expression\QueryExpression $exp, \Cake\ORM\Query $query) {
        $query->innerJoin(
            [
                'AbstractsFilter' => $query
                    ->connection()
                    ->newQuery()
                    ->select(['article_id', 'points' => $query->func()->max('points')])
                    ->from('abstracts')
                    ->group('article_id')
            ],
            [
                'TopAbstracts.article_id = AbstractsFilter.article_id',
                'TopAbstracts.points = AbstractsFilter.points'
            ]
        );
        return [];
    }
]);

This will select the top abstracts via a join query that is based on the max points, it will look something like

SELECT
    TopAbstracts.id AS `TopAbstracts__id`, ...
FROM
    abstracts TopAbstracts
INNER JOIN (
        SELECT
            article_id, (MAX(points)) AS `points`
        FROM
            abstracts
        GROUP BY
            article_id
    )
    AbstractsFilter ON (
        TopAbstracts.article_id = AbstractsFilter.article_id
        AND
        TopAbstracts.points = AbstractsFilter.points
    )
WHERE
    TopAbstracts.article_id in (1,2,3,4,5,6,7,8, ...)

Select strategy - Using left self-join filtering

$this->hasOne('TopAbstracts', [
    'className' => 'Abstracts',
    'strategy' => 'select',
    'conditions' => function (\Cake\Database\Expression\QueryExpression $exp, \Cake\ORM\Query $query) {
        $query->leftJoin(
            ['AbstractsFilter' => 'abstracts'],
            [
                'TopAbstracts.article_id = AbstractsFilter.article_id',
                'TopAbstracts.points < AbstractsFilter.points'
            ]);
        return $exp->add(['AbstractsFilter.id IS NULL']);
    }
]);

This will use a self-join that filters based on the rows that don't have a.points < b.points, it will look something like

SELECT
    TopAbstracts.id AS `TopAbstracts__id`, ...
FROM 
    abstracts TopAbstracts
LEFT JOIN
    abstracts AbstractsFilter ON (
        TopAbstracts.article_id = AbstractsFilter.article_id
        AND
        TopAbstracts.points < AbstractsFilter.points
    )
WHERE
    (AbstractsFilter.id IS NULL AND TopAbstracts.article_id in (1,2,3,4,5,6,7,8, ...))

Join strategy - Using a subquery for the join condition

$this->hasOne('TopAbstracts', [
    'className' => 'Abstracts',
    'foreignKey' => false,
    'conditions' => function (\Cake\Database\Expression\QueryExpression $exp, \Cake\ORM\Query $query) {
        $subquery = $query
            ->connection()
            ->newQuery()
            ->select(['SubTopAbstracts.id'])
            ->from(['SubTopAbstracts' => 'abstracts'])
            ->where(['Articles.id = SubTopAbstracts.article_id'])
            ->order(['SubTopAbstracts.points' => 'DESC'])
            ->limit(1);

        return $exp->add(['TopAbstracts.id' => $subquery]);
    }
]);

This will use a correlated subquery that uses a rather specific select with simple ordering and limiting to pick the top comment. Note that the foreignKey option is set to false in order to avoid an additional Articles.id = TopAbstracts.article_id condition to be compiled into the join conditions.

The query will look something like

SELECT
    Articles.id AS `Articles__id`, ... ,
    TopAbstracts.id AS `TopAbstracts__id`, ...
FROM
    articles Articles
LEFT JOIN
    abstracts TopAbstracts ON (
        TopAbstracts.id = (
            SELECT
                SubTopAbstracts.id
            FROM
                abstracts SubTopAbstracts
            WHERE
                Articles.id = SubTopAbstracts.article_id
            ORDER BY
                SubTopAbstracts.points DESC
            LIMIT
                1
        )
    )

All these 3 options will query and inject the records without any hackery, it's just not very "straightforward".


A manual approach

For the sake of completeness, it is of course always possible to manually load the associcated records and format the results appropriately, for example using result formatters, see for example CakePHP Entity contain without foreign key


Select strategy and reversed ordering

Just for reference, one of the weird solutions I stumbled over initially. This one really shouldn't be used!

This will select all associated abstracts, and then the ORM will iterate over them and for each article pick the first one with a matching article_id value. So in theory, when ordered descing on points, the ORM should pick the one with he most points.

While I would have expected this to work out of the box, it seems that the ORM iterates over the results in reversed order, which will cause the wrong rows to be picked. In order to get this working, the query needs to use the opposite order that would normally need to be used, ie ASC instead of DESC.

$this->hasOne('TopAbstracts', [
    'className' => 'Abstracts',
    'foreignKey' => 'abstract_id',
    'strategy' => 'select',
    'conditions' => function (\Cake\Database\Expression\QueryExpression $exp, \Cake\ORM\Query $query) {
        $query->order(['TopAbstracts.points' => 'ASC']);
        return [];
    }
]);

Also the function needs to return an empty array instead of the expression like shown in the linked answer, as this will cause invalid SQL to be compiled. Both of these behaviours, the reversed order iterating and the invalid SQL might be bugs.

While this will work, it will always select all associated abstracts, not only the top ones, which might be considered rather inefficient, and look something like

SELECT
    Articles.id AS `Articles__id`, ...
FROM
    articles Articles
SELECT
    TopAbstracts.id AS `TopAbstracts__id`, ...
FROM
    abstracts TopAbstracts
WHERE
    TopAbstracts.article_id in (1,2,3,4,5,6,7,8, ...)
ORDER BY
    TopAbstracts.points ASC