The JPA hashCode() / equals() dilemma

There have been some discussions here about JPA entities and which hashCode()/equals() implementation should be used for JPA entity classes. Most (if not all) of them depend on Hibernate, but I'd like to discuss them JPA-implementation-neutrally (I am using EclipseLink, by the way).

Each possible implementation has its own advantages and disadvantages regarding:

  • hashCode()/equals() contract conformity (immutability) for List/Set operations
  • Whether identical objects (e.g. from different sessions, dynamic proxies from lazily-loaded data structures) can be detected
  • Whether entities behave correctly in detached (or non-persisted) state

As far as I can see, there are three options:

  1. Do not override them; rely on Object.equals() and Object.hashCode()
    • hashCode()/equals() work
    • cannot identify identical objects, problems with dynamic proxies
    • no problems with detached entities
  2. Override them, based on the primary key
    • hashCode()/equals() are broken
    • correct identity (for all managed entities)
    • problems with detached entities
  3. Override them, based on the Business-Id (non-primary key fields; what about foreign keys?)
    • hashCode()/equals() are broken
    • correct identity (for all managed entities)
    • no problems with detached entities
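The "cannot identify identical objects" point of option 1 can be reproduced without any JPA provider. The following is a minimal sketch; `Tag` and `DefaultIdentityDemo` are hypothetical names, and creating two instances with the same id only imitates loading the same row in two different sessions.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical entity that keeps Object.equals()/hashCode() (option 1).
class Tag {
    private final long id;

    Tag(long id) { this.id = id; }

    long getId() { return id; }
}

class DefaultIdentityDemo {
    // Imitates loading the row with id 7 twice, e.g. in two sessions.
    static int sizeAfterAddingSameRowTwice() {
        Set<Tag> tags = new HashSet<>();
        tags.add(new Tag(7));
        tags.add(new Tag(7));
        return tags.size();
    }

    public static void main(String[] args) {
        // Reference equality treats the two instances as different elements.
        System.out.println(sizeAfterAddingSameRowTwice()); // prints 2
    }
}
```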

My questions are:

  1. Did I miss an option and/or pro/con point?
  2. What option did you choose and why?



UPDATE 1:

By "hashCode()/equals() are broken", I mean that successive hashCode() invocations may return differing values, which is (when correctly implemented) not broken in the sense of the Object API documentation, but which causes problems when trying to retrieve a changed entity from a Map, Set or other hash-based Collection. Consequently, JPA implementations (at least EclipseLink) will not work correctly in some cases.
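A minimal sketch of that failure mode, assuming an entity whose hashCode() is derived from a generated primary key. `Invoice` and `ChangingHashDemo` are hypothetical names; no JPA is involved, and `setId()` only imitates what a provider does when the entity is persisted.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for an entity using option 2 (primary key based).
class Invoice {
    private Long id; // null until "persisted"

    void setId(Long id) { this.id = id; }

    Long getId() { return id; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Invoice)) return false;
        Invoice other = (Invoice) o;
        return id != null && id.equals(other.getId());
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(id); // 0 while id is null
    }
}

class ChangingHashDemo {
    // Returns whether the set still finds the entity after its id was assigned.
    static boolean foundAfterIdAssigned() {
        Set<Invoice> invoices = new HashSet<>();
        Invoice invoice = new Invoice();
        invoices.add(invoice);  // stored under hash 0
        invoice.setId(42L);     // hash code is now 42
        return invoices.contains(invoice);
    }

    public static void main(String[] args) {
        System.out.println(foundAfterIdAssigned()); // prints false
    }
}
```

The entity is still in the set, but the lookup uses the new hash code and therefore misses it.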

UPDATE 2:

Thank you for your answers; most of them are of remarkable quality.
Unfortunately, I am still unsure which approach will be the best for a real-life application, or how to determine the best approach for my application. So, I'll keep the question open and hope for some more discussions and/or opinions.


Solution 1:

Read this very nice article on the subject: Don't Let Hibernate Steal Your Identity.

The conclusion of the article goes like this:

Object identity is deceptively hard to implement correctly when objects are persisted to a database. However, the problems stem entirely from allowing objects to exist without an id before they are saved. We can solve these problems by taking the responsibility of assigning object IDs away from object-relational mapping frameworks such as Hibernate. Instead, object IDs can be assigned as soon as the object is instantiated. This makes object identity simple and error-free, and reduces the amount of code needed in the domain model.

Solution 2:

I always override equals()/hashCode() and implement them based on the business id. It seems the most reasonable solution to me. See the following link.

To sum all this stuff up, here is a listing of what will or won't work with the different ways to handle equals/hashCode: [image: comparison table, not reproduced here]

EDIT:

To explain why this works for me:

  1. I don't usually use hash-based collections (HashMap/HashSet) in my JPA applications. If I must, I prefer to create a UniqueList solution.
  2. I think changing the business id at runtime is not a best practice for any database application. In the rare cases where there is no other solution, I'd give it special treatment, such as removing the element from the hash-based collection and putting it back.
  3. In my model, I set the business id in the constructor and don't provide a setter for it. I let the JPA implementation change the field instead of the property.
  4. The UUID solution seems to be overkill. Why use a UUID if you have a natural business id? I would enforce the uniqueness of the business id in the database anyway. Why then have THREE indexes for each table in the database?
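Point 3 might look roughly like the sketch below. `Customer`, `customerNumber` and `displayName` are hypothetical names, and the JPA annotations are left out so the snippet compiles on its own; in a real entity the provider would populate `customerNumber` through field access.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class Customer {
    private String customerNumber; // business id: set once, no setter
    private String displayName;    // mutable, not part of equality

    // No-arg constructor for the persistence provider.
    protected Customer() { }

    Customer(String customerNumber, String displayName) {
        this.customerNumber = Objects.requireNonNull(customerNumber);
        this.displayName = displayName;
    }

    String getCustomerNumber() { return customerNumber; }

    String getDisplayName() { return displayName; }

    void setDisplayName(String displayName) { this.displayName = displayName; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Customer)) return false;
        return Objects.equals(getCustomerNumber(), ((Customer) o).getCustomerNumber());
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(getCustomerNumber());
    }
}

class BusinessKeyDemo {
    // Changing a non-key field must not affect set membership.
    static boolean stillFoundAfterRename() {
        Set<Customer> customers = new HashSet<>();
        Customer customer = new Customer("C-1001", "Old name");
        customers.add(customer);
        customer.setDisplayName("New name");
        return customers.contains(customer)
                && customers.contains(new Customer("C-1001", "Other instance"));
    }

    public static void main(String[] args) {
        System.out.println(stillFoundAfterRename()); // prints true
    }
}
```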

Solution 3:

I personally have already used all three of these strategies in different projects. And I must say that option 1 is, in my opinion, the most practicable in a real-life app. In my experience, breaking hashCode()/equals() conformity leads to many crazy bugs, because you will repeatedly end up in situations where the result of equality changes after an entity has been added to a collection.

But there are further options (also with their pros and cons):


a) hashCode/equals based on a set of immutable, non-null, constructor-assigned fields

(+) all three criteria are guaranteed

(-) field values must be available to create a new instance

(-) complicates handling if you must change one of them


b) hashCode/equals based on a primary key that is assigned by the application (in the constructor) instead of JPA

(+) all three criteria are guaranteed

(-) you cannot take advantage of simple, reliable ID generation strategies like DB sequences

(-) complicated if new entities are created in a distributed environment (client/server) or app server cluster


c) hashCode/equals based on a UUID assigned by the constructor of the entity

(+) all three criteria are guaranteed

(-) overhead of UUID generation

(-) there may be a small risk that the same UUID is generated twice, depending on the algorithm used (this can be detected by a unique index in the DB)

Solution 4:

We usually have two IDs in our entities:

  1. Is for persistence layer only (so that persistence provider and database can figure out relationships between objects).
  2. Is for our application needs (equals() and hashCode() in particular)

Take a look:

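import static java.lang.String.format;

import java.util.UUID;
import javax.persistence.Entity;
import javax.persistence.Id;
import javax.persistence.PrePersist;

@Entity
public class User {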

    @Id
    private int id;  // Persistence ID
    private UUID uuid; // Business ID

    // assuming all fields are subject to change
    // If we forbid users change their email or screenName we can use these
    // fields for business ID instead, but generally that's not the case
    private String screenName;
    private String email;

    // I don't put UUID generation in constructor for performance reasons. 
    // I call setUuid() when I create a new entity
    public User() {
    }

    // This method is only called when a brand new entity is added to 
    // persistence context - I add it as a safety net only but it might work 
    // for you. In some cases (say, when I add this entity to some set before 
    // calling em.persist()) setting a UUID might be too late. If I get a log 
    // output it means that I forgot to call setUuid() somewhere.
    @PrePersist
    public void ensureUuid() {
        if (getUuid() == null) {
            log.warn(format("User's UUID wasn't set on time. " 
                + "uuid: %s, name: %s, email: %s",
                getUuid(), getScreenName(), getEmail()));
            setUuid(UUID.randomUUID());
        }
    }

    // equals() and hashCode() rely on non-changing data only. Thus we 
    // guarantee that no matter how field values are changed we won't 
    // lose our entity in hash-based Sets.
    @Override
    public int hashCode() {
        return getUuid().hashCode();
    }

    // Note that I don't use direct field access inside my entity classes and
    // call getters instead. That's because Persistence provider (PP) might
    // want to load entity data lazily. And I don't use 
    //    this.getClass() == other.getClass() 
    // for the same reason. In order to support laziness PP might need to wrap
    // my entity object in some kind of proxy, i.e. subclassing it.
    @Override
    public boolean equals(final Object obj) {
        if (this == obj)
            return true;
        if (!(obj instanceof User))
            return false;
        return getUuid().equals(((User) obj).getUuid());
    }

    // Getters and setters follow
}
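The two comments above equals() (call getters, prefer instanceof over getClass()) can be checked with a hand-written stand-in for a lazy proxy. This is only a sketch: `Account` and `AccountProxy` are hypothetical, and real proxies are generated by the persistence provider and may behave differently in detail.

```java
import java.util.UUID;

class Account {
    private final UUID uuid;

    Account(UUID uuid) { this.uuid = uuid; }

    UUID getUuid() { return uuid; }

    // Style used in the entity above: instanceof plus getters.
    boolean equalsByInstanceof(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof Account)) return false;
        return getUuid().equals(((Account) obj).getUuid());
    }

    // Strict class comparison: rejects any subclass, including proxies.
    boolean equalsByGetClass(Object obj) {
        if (this == obj) return true;
        if (obj == null || getClass() != obj.getClass()) return false;
        return getUuid().equals(((Account) obj).getUuid());
    }

    // Direct field access: reads the proxy's own (empty) field.
    boolean equalsByFieldAccess(Object obj) {
        if (this == obj) return true;
        if (!(obj instanceof Account)) return false;
        return uuid != null && uuid.equals(((Account) obj).uuid);
    }
}

// Imitates a lazy proxy: a subclass whose own state is empty and whose
// getter delegates to the real instance.
class AccountProxy extends Account {
    private final Account target;

    AccountProxy(Account target) {
        super(null);
        this.target = target;
    }

    @Override
    UUID getUuid() { return target.getUuid(); }
}

class ProxyEqualityDemo {
    public static void main(String[] args) {
        Account real = new Account(UUID.randomUUID());
        Account proxy = new AccountProxy(real);
        System.out.println(real.equalsByInstanceof(proxy));  // prints true
        System.out.println(real.equalsByGetClass(proxy));    // prints false
        System.out.println(real.equalsByFieldAccess(proxy)); // prints false
    }
}
```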

EDIT: to clarify my point regarding calls to setUuid() method. Here's a typical scenario:

User user = new User();
// user.setUuid(UUID.randomUUID()); // I should have called it here
user.setScreenName("Master Yoda");
user.setEmail("[email protected]");

jediSet.add(user); // here's the bug: we forgot to set the UUID and
                   // we won't find Yoda in the Jedi set

em.persist(user); // ensureUuid() was called and printed the log for me.

jediCouncilSet.add(user); // Ok, we got a UUID now

When I run my tests and see the log output I fix the problem:

User user = new User();
user.setUuid(UUID.randomUUID());

Alternatively, one can provide a separate constructor:

@Entity
public class User {

    @Id
    private int id;  // Persistence ID
    private UUID uuid; // Business ID

    ... // fields

    // Constructor for Persistence provider to use
    public User() {
    }

    // Constructor I use when creating new entities
    public User(UUID uuid) {
        setUuid(uuid);
    }

    ... // rest of the entity.
}

So my example would look like this:

User user = new User(UUID.randomUUID());
...
jediSet.add(user); // no bug this time

em.persist(user); // and no log output

I use a default constructor and a setter, but you may find the two-constructor approach more suitable.

Solution 5:

If you want to use equals()/hashCode() for your Sets, in the sense that the same entity can only be in there once, then there is only one option: Option 2. That's because a primary key for an entity by definition never changes (if somebody does update it, it's not the same entity anymore).

You should take that literally: since your equals()/hashCode() are based on the primary key, you must not use these methods until the primary key is set. So you shouldn't put entities in a set until they're assigned a primary key. (Yes, UUIDs and similar concepts may help to assign primary keys early.)
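One way to enforce this rule in code, rather than by convention, is to fail fast when the methods are used too early. This is only a sketch of that idea and not something the referenced documentation prescribes; `Shipment` and `EarlyUseGuardDemo` are hypothetical names, and `setId()` stands in for the provider assigning the key.

```java
import java.util.HashSet;
import java.util.Set;

class Shipment {
    private Long id; // assigned on persist in a real entity

    void setId(Long id) { this.id = id; }

    Long getId() { return id; }

    // Rejects any use of equality before the primary key exists.
    private Long requireId() {
        Long current = getId();
        if (current == null) {
            throw new IllegalStateException("primary key not assigned yet");
        }
        return current;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Shipment)) return false;
        return requireId().equals(((Shipment) o).getId());
    }

    @Override
    public int hashCode() {
        return requireId().hashCode();
    }
}

class EarlyUseGuardDemo {
    // Returns true if adding an entity without a key is rejected.
    static boolean addBeforeIdFails() {
        Set<Shipment> shipments = new HashSet<>();
        try {
            shipments.add(new Shipment());
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(addBeforeIdFails()); // prints true
    }
}
```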

Now, it's theoretically also possible to achieve that with Option 3, even though so-called "business keys" have the nasty drawback that they can change: "All you'll have to do is delete the already inserted entities from the set(s), and re-insert them." That is true, but it also means that in a distributed system you'll have to make sure that this is done absolutely everywhere the data has been inserted (and that the update is performed before other things occur). You'll need a sophisticated update mechanism, especially if some remote systems aren't currently reachable...
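For a single in-memory set, the remove-and-re-insert step could be sketched as follows. `Product`, `sku` and `changeSku` are hypothetical names; the important detail is that the removal has to happen while the old key (and thus the old hash code) is still in place.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

class Product {
    private String sku; // business key that may change

    Product(String sku) { this.sku = sku; }

    String getSku() { return sku; }

    void setSku(String sku) { this.sku = sku; }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Product)) return false;
        return Objects.equals(getSku(), ((Product) o).getSku());
    }

    @Override
    public int hashCode() {
        return Objects.hashCode(getSku());
    }
}

class ReinsertDemo {
    // Changes the key of a product that may be a member of the given set.
    static void changeSku(Set<Product> products, Product product, String newSku) {
        boolean wasMember = products.remove(product); // uses the old hash
        product.setSku(newSku);
        if (wasMember) {
            products.add(product); // re-inserted under the new hash
        }
    }

    public static void main(String[] args) {
        Set<Product> products = new HashSet<>();
        Product product = new Product("A-1");
        products.add(product);
        changeSku(products, product, "B-2");
        System.out.println(products.contains(product)); // prints true
    }
}
```

This covers exactly one set in one JVM; every other collection holding the entity needs the same treatment, which is the coordination problem described above.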

Option 1 can only be used if all the objects in your sets are from the same Hibernate session. The Hibernate documentation makes this very clear in chapter 13.1.3, "Considering object identity":

Within a Session the application can safely use == to compare objects.

However, an application that uses == outside of a Session might produce unexpected results. This might occur even in some unexpected places. For example, if you put two detached instances into the same Set, both might have the same database identity (i.e., they represent the same row). JVM identity, however, is by definition not guaranteed for instances in a detached state. The developer has to override the equals() and hashCode() methods in persistent classes and implement their own notion of object equality.

It continues to argue in favor of Option 3:

There is one caveat: never use the database identifier to implement equality. Use a business key that is a combination of unique, usually immutable, attributes. The database identifier will change if a transient object is made persistent. If the transient instance (usually together with detached instances) is held in a Set, changing the hashcode breaks the contract of the Set.

This is true if you

  • cannot assign the id early (e.g. by using UUIDs)
  • and yet you absolutely want to put your objects in sets while they're in transient state.

Otherwise, you're free to choose Option 2.

Then it mentions the need for a relative stability:

Attributes for business keys do not have to be as stable as database primary keys; you only have to guarantee stability as long as the objects are in the same Set.

This is correct. The practical problem I see with this is: if you can't guarantee absolute stability, how will you be able to guarantee stability "as long as the objects are in the same Set"? I can imagine some special cases (like using sets only for a conversation and then throwing them away), but I would question the general practicability of this.


Short version:

  • Option 1 can only be used with objects within a single session.
  • If you can, use Option 2. (Assign PK as early as possible, because you can't use the objects in sets until the PK is assigned.)
  • If you can guarantee relative stability, you can use Option 3. But be careful with this.