Entity Framework Include OrderBy random generates duplicate data

When I retrieve a list of items from a database including some children (via .Include), and order the randomly, EF gives me an unexpected result.. I creates/clones addition items..

To explain myself better, I've created a small and simple EF CodeFirst project to reproduce the problem. First i shall give you the code for this project.

The project

Create a basic MVC3 project and add the EntityFramework.SqlServerCompact package via Nuget.
That adds the latest versions of the following packages:

EntityFramework v4.3.0
SqlServerCompact v4.0.8482.1
EntityFramework.SqlServerCompact v4.1.8482.2
WebActivator v1.5

The Models and DbContext

using System.Collections.Generic;
using System.Data.Entity;

namespace RandomWithInclude.Models
{
    public class PeopleContext : DbContext
    {
        public DbSet<Person> Persons { get; set; }
        public DbSet<Address> Addresses { get; set; }
    }

    public class Person
    {
        public int ID { get; set; }
        public string Name { get; set; }

        public virtual ICollection<Address> Addresses { get; set; }
    }

    public class Address
    {
        public int ID { get; set; }
        public string AdressLine { get; set; }

        public virtual Person Person { get; set; }
    }
}

The DB Setup and Seed data: EF.SqlServerCompact.cs

using System.Collections.Generic;
using System.Data.Entity;
using System.Data.Entity.Infrastructure;
using RandomWithInclude.Models;

[assembly: WebActivator.PreApplicationStartMethod(typeof(RandomWithInclude.App_Start.EF), "Start")]

namespace RandomWithInclude.App_Start
{
    public static class EF
    {
        public static void Start()
        {
            Database.DefaultConnectionFactory = new SqlCeConnectionFactory("System.Data.SqlServerCe.4.0");
            Database.SetInitializer(new DbInitializer());
        }
    }
    public class DbInitializer : DropCreateDatabaseAlways<PeopleContext>
    {
        protected override void Seed(PeopleContext context)
        {
            var address1 = new Address {AdressLine = "Street 1, City 1"};
            var address2 = new Address {AdressLine = "Street 2, City 2"};
            var address3 = new Address {AdressLine = "Street 3, City 3"};
            var address4 = new Address {AdressLine = "Street 4, City 4"};
            var address5 = new Address {AdressLine = "Street 5, City 5"};
            context.Addresses.Add(address1);
            context.Addresses.Add(address2);
            context.Addresses.Add(address3);
            context.Addresses.Add(address4);
            context.Addresses.Add(address5);
            var person1 = new Person {Name = "Person 1", Addresses = new List<Address> {address1, address2}};
            var person2 = new Person {Name = "Person 2", Addresses = new List<Address> {address3}};
            var person3 = new Person {Name = "Person 3", Addresses = new List<Address> {address4, address5}};
            context.Persons.Add(person1);
            context.Persons.Add(person2);
            context.Persons.Add(person3);
        }
    }
}

The controller: HomeController.cs

using System;
using System.Data.Entity;
using System.Linq;
using System.Web.Mvc;
using RandomWithInclude.Models;

namespace RandomWithInclude.Controllers
{
    public class HomeController : Controller
    {
        public ActionResult Index()
        {
            var db = new PeopleContext();
            var persons = db.Persons
                                .Include(p => p.Addresses)
                                .OrderBy(p => Guid.NewGuid());

            return View(persons.ToList());
        }
    }
}

The View: Index.cshtml

@using RandomWithInclude.Models
@model IList<Person>

<ul>
    @foreach (var person in Model)
    {
        <li>
            @person.Name
        </li>
    }
</ul>

this should be all, and you application should compile :)

The problem

As you can see, we have 2 straightforward models (Person and Address) and Person can have multiple Addresses.
We seed the generated database 3 persons and 5 addresses.
If we get all the persons from the database, including the addresses and randomize the results and just print out the names of those persons, that's where it all goes wrong.

As a result, i sometimes get 4 persons, sometimes 5 and sometimes 3, and i expect 3. Always.
e.g.:

Person 1
Person 3
Person 1
Person 3
Person 2

So.. it's copying/cloning data! And that's not cool..
It just seems that EF looses track of what addresses are a child of which person..

The generated SQL query is this:

SELECT 
    [Project1].[ID] AS [ID], 
    [Project1].[Name] AS [Name], 
    [Project1].[C2] AS [C1], 
    [Project1].[ID1] AS [ID1], 
    [Project1].[AdressLine] AS [AdressLine], 
    [Project1].[Person_ID] AS [Person_ID]
FROM ( SELECT 
    NEWID() AS [C1], 
    [Extent1].[ID] AS [ID], 
    [Extent1].[Name] AS [Name], 
    [Extent2].[ID] AS [ID1], 
    [Extent2].[AdressLine] AS [AdressLine], 
    [Extent2].[Person_ID] AS [Person_ID], 
    CASE WHEN ([Extent2].[ID] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C2]
    FROM  [People] AS [Extent1]
    LEFT OUTER JOIN [Addresses] AS [Extent2] ON [Extent1].[ID] = [Extent2].[Person_ID]
)  AS [Project1]
ORDER BY [Project1].[C1] ASC, [Project1].[ID] ASC, [Project1].[C2] ASC

Workarounds

If i remove the .Include(p =>p.Addresses) from the query, everything goes fine. but of course the addresses aren't loaded and accessing that collection will make a new call to the database every time.
I can first get the data from the database and randomize later by just adding a .ToList() before the .OrderBy.. like this: var persons = db.Persons.Include(p => p.Addresses).ToList().OrderBy(p => Guid.NewGuid());

Does anybody have any idea of why it is happening like this?
Might this be a bug in the SQL generation?

As one can sort it out by reading AakashM answer and Nicolae Dascalu answer, it strongly seems Linq OrderBy requires a stable ranking function, which NewID/Guid.NewGuid is not.

So we have to use another random generator that would be stable inside a single query.

To achieve this, before each querying, use a .Net Random generator to get a random number. Then combine this random number with a unique property of the entity to get randomly sorted. And to 'randomize' a bit the result, checksum it. (checksum is a SQL Server function that compute a hash; original idea founded on this blog.)

Assuming Person Id is an int, you could write your query this way :

// Random instances should be stored and reused, not instanciated at each usage.
// But beware, it is not thread safe. If you want to share it between threads, you
// would have to use locks, see its documentation.
// https://docs.microsoft.com/en-us/dotnet/api/system.random.
// But using locks is a bad idea for scalability, especially in a Web context.
var randomGenerator = new Random();
// ...

var rnd = randomGenerator.NextDouble();
var persons = db.Persons
    .Include(p => p.Addresses)
    .OrderBy(p => SqlFunctions.Checksum(p.Id * rnd));

Like the NewGuid hack, this is very probably not a good random generator with a good distribution and so on. But it does not cause entities to get duplicated in results.

Beware:
If your query ordering does not guarantees uniqueness of your entities ranking, you must complement it for guarantying it. By example, if you use a non-unique property of your entities for the checksum call, then add something like .ThenBy(p => p.Id) after the OrderBy.
If your ranking is not unique for your queried root entity, its included children may get mixed with children of other entities having the same ranking. And then the bug will stay here.

Note:
I would prefer use .Next() method to get an int then combine it through a xor (^) to an entity int unique property, rather than using a double and multiply it. But SqlFunctions.Checksum unfortunately does not provide an overload for int data type, though the SQL server function is supposed to support it. You may use a cast to overcome this, but for keeping it simple I finally had chosen to go with the multiply.

tl;dr: There's a leaky abstraction here. To us, Include is a simple instruction to stick a collection of things onto each single returned Person row. But EF's implementation of Include is done by returning a whole row for each Person-Address combo, and reassembling at the client. Ordering by a volatile value causes those rows to become shuffled, breaking apart the Person groups that EF is relying on.

When we have a look at ToTraceString() for this LINQ:

 var people = c.People.Include("Addresses");
 // Note: no OrderBy in sight!

we see

SELECT 
[Project1].[Id] AS [Id], 
[Project1].[Name] AS [Name], 
[Project1].[C1] AS [C1], 
[Project1].[Id1] AS [Id1], 
[Project1].[Data] AS [Data], 
[Project1].[PersonId] AS [PersonId]
FROM ( SELECT 
    [Extent1].[Id] AS [Id], 
    [Extent1].[Name] AS [Name], 
    [Extent2].[Id] AS [Id1], 
    [Extent2].[PersonId] AS [PersonId], 
    [Extent2].[Data] AS [Data], 
    CASE WHEN ([Extent2].[Id] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C1]
    FROM  [Person] AS [Extent1]
    LEFT OUTER JOIN [Address] AS [Extent2] ON [Extent1].[Id] = [Extent2].[PersonId]
)  AS [Project1]
ORDER BY [Project1].[Id] ASC, [Project1].[C1] ASC

So we get n rows for each A, plus 1 row for each P without any As.

Adding an OrderBy clause, however, puts the thing-to-order-by at the start of the ordered columns:

var people = c.People.Include("Addresses").OrderBy(p => Guid.NewGuid());

gives

SELECT 
[Project1].[Id] AS [Id], 
[Project1].[Name] AS [Name], 
[Project1].[C2] AS [C1], 
[Project1].[Id1] AS [Id1], 
[Project1].[Data] AS [Data], 
[Project1].[PersonId] AS [PersonId]
FROM ( SELECT 
    NEWID() AS [C1], 
    [Extent1].[Id] AS [Id], 
    [Extent1].[Name] AS [Name], 
    [Extent2].[Id] AS [Id1], 
    [Extent2].[PersonId] AS [PersonId], 
    [Extent2].[Data] AS [Data], 
    CASE WHEN ([Extent2].[Id] IS NULL) THEN CAST(NULL AS int) ELSE 1 END AS [C2]
    FROM  [Person] AS [Extent1]
    LEFT OUTER JOIN [Address] AS [Extent2] ON [Extent1].[Id] = [Extent2].[PersonId]
)  AS [Project1]
ORDER BY [Project1].[C1] ASC, [Project1].[Id] ASC, [Project1].[C2] ASC

So in your case, where the ordered-by-thing is not a property of a P, but is instead volatile, and therefore can be different for different P-A records of the same P, the whole thing falls apart.

I'm not sure where on the working-as-intended ~~~ cast-iron bug continuum this behaviour falls. But at least now we know about it.

I dont think there is an issue in query generation, but there is definately an issue when EF tries to convert rows into object.

It looks like there is an inherent assumption here that data for the same person in a joined statement will be returned grouped together order by or not.

for example the result of a joined query will always be

P.Id P.Name  A.Id A.StreetLine
1    Person 1 10    --- 
1    Person 1 11
2    Person 2 12
3    Person 3 13
3    Person 3 14

even if you order by some other column, same person would always appear one after the other.

this assumption is mostly true for any joined query.

But there is a deeper issue here i think. OrderBy is for when you want data in certain order ( as opposite to random), so that assumption does seem reasonable.

i think you should really get data out and then randomize it according to some other means in your code

From theory: To sort a list of items, the compare function should be stable relative to items; this means that for any 2 items x, y the result of x< y should be the same as many time is queried(called).

I think the issue is related to misunderstanding of specification(documentation) of OrderBy method: keySelector - A function to extract a key from an element.

EF didn't mention explicitly if the provided function should return the same value for same object as many times is called (in your case returns different/random values), but I think the "key" term that they used in documentation implicitly suggested this.

Entity Framework Include OrderBy random generates duplicate data

The project

The problem

Workarounds

Related

Recent Posts