Managing highly repetitive code and documentation in Java

Highly repetitive code is generally a bad thing, and there are design patterns that can help minimize this. However, sometimes it's simply inevitable due to the constraints of the language itself. Take the following example from java.util.Arrays:

/**
 * Assigns the specified long value to each element of the specified
 * range of the specified array of longs.  The range to be filled
 * extends from index <tt>fromIndex</tt>, inclusive, to index
 * <tt>toIndex</tt>, exclusive.  (If <tt>fromIndex==toIndex</tt>, the
 * range to be filled is empty.)
 *
 * @param a the array to be filled
 * @param fromIndex the index of the first element (inclusive) to be
 *        filled with the specified value
 * @param toIndex the index of the last element (exclusive) to be
 *        filled with the specified value
 * @param val the value to be stored in all elements of the array
 * @throws IllegalArgumentException if <tt>fromIndex &gt; toIndex</tt>
 * @throws ArrayIndexOutOfBoundsException if <tt>fromIndex &lt; 0</tt> or
 *         <tt>toIndex &gt; a.length</tt>
 */
public static void fill(long[] a, int fromIndex, int toIndex, long val) {
    rangeCheck(a.length, fromIndex, toIndex);
    for (int i=fromIndex; i<toIndex; i++)
        a[i] = val;
}

The above snippet appears in the source code 8 times, with very little variation in the documentation/method signature but exactly the same method body, one for each of the root array types int[], short[], char[], byte[], boolean[], double[], float[], and Object[].

I believe that unless one resorts to reflection (which is an entirely different subject in itself), this repetition is inevitable. I understand that as a utility class, such high concentration of repetitive Java code is highly atypical, but even with the best practice, repetition does happen! Refactoring doesn't always work because it's not always possible (the obvious case is when the repetition is in the documentation).

Obviously maintaining this source code is a nightmare. A slight typo in the documentation, or a minor bug in the implementation, is multiplied by however many repetitions was made. In fact, the best example happens to involve this exact class:

Google Research Blog - Extra, Extra - Read All About It: Nearly All Binary Searches and Mergesorts are Broken (by Joshua Bloch, Software Engineer)

The bug is a surprisingly subtle one, occurring in what many thought to be just a simple and straightforward algorithm.

    // int mid =(low + high) / 2; // the bug
    int mid = (low + high) >>> 1; // the fix

The above line appears 11 times in the source code!

So my questions are:

  • How are these kinds of repetitive Java code/documentation handled in practice? How are they developed, maintained, and tested?
    • Do you start with "the original", and make it as mature as possible, and then copy and paste as necessary and hope you didn't make a mistake?
    • And if you did make a mistake in the original, then just fix it everywhere, unless you're comfortable with deleting the copies and repeating the whole replication process?
    • And you apply this same process for the testing code as well?
  • Would Java benefit from some sort of limited-use source code preprocessing for this kind of thing?
    • Perhaps Sun has their own preprocessor to help write, maintain, document and test these kind of repetitive library code?

A comment requested another example, so I pulled this one from Google Collections: com.google.common.base.Predicates lines 276-310 (AndPredicate) vs lines 312-346 (OrPredicate).

The source for these two classes are identical, except for:

  • AndPredicate vs OrPredicate (each appears 5 times in its class)
  • "And(" vs Or(" (in the respective toString() methods)
  • #and vs #or (in the @see Javadoc comments)
  • true vs false (in apply; ! can be rewritten out of the expression)
  • -1 /* all bits on */ vs 0 /* all bits off */ in hashCode()
  • &= vs |= in hashCode()

Solution 1:

For people that absolutely need performance, boxing and unboxing and generified collections and whatnot are big no-no's.

The same problem happens in performance computing where you need the same complex to work both for float and double (say some of the method shown in Goldberd's "What every computer scientist should know about floating-point numbers" paper).

There's a reason why Trove's TIntIntHashMap runs circles around Java's HashMap<Integer,Integer> when working with a similar amount of data.

Now how are Trove collection's source code written?

By using source code instrumentation of course :)

There are several Java libraries for higher performance (much higher than the default Java ones) that use code generators to create the repeated source code.

We all know that "source code instrumentation" is evil and that code generation is crap, but still that's how people who really know what they're doing (i.e. the kind of people that write stuff like Trove) do it :)

For what it is worth we generate source code that contains big warnings like:

/*
 * This .java source file has been auto-generated from the template xxxxx
 * 
 * DO NOT MODIFY THIS FILE FOR IT SHALL GET OVERWRITTEN
 * 
 */

Solution 2:

If you absolutely must duplicate code, follow the great examples you've given and group all of that code in one place where it's easy to find and fix when you have to make a change. Document the duplication and, more importantly, the reason for the duplication so that everyone who comes after you is aware of both.

Solution 3:

From Wikipedia Don't Repeat Yourself (DRY) or Duplication is Evil (DIE)

In some contexts, the effort required to enforce the DRY philosophy may be greater than the effort to maintain separate copies of the data. In some other contexts, duplicated information is immutable or kept under a control tight enough to make DRY not required.

There is probably no answer or technique to prevent problems like that.

Solution 4:

Even fancy pants languages like Haskell have repetitive code (see my post on haskell and serialization)

It seems there are three choices to this problem:

  1. Use reflection and lose performance
  2. Use preprocessing like Template Haskell or Caml4p equivalent for your language and live with nastiness
  3. Or my personal favorite use macros if your language supports it (scheme, and lisp)

I consider the macros different than preprocessing because the macros are usually in the same language that the target is where as preprocessing is a different language.

I think Lisp/Scheme macros would solve many of these problems.

Solution 5:

I get that Sun has to document like this for the Java SE library code and maybe other 3rd party library writers do as well.

However, I think it is an utter waste to copy and paste documentation throughout a file like this in code that is only used in house. I know many people will disagree because it will make their in house JavaDocs look less clean. However, the trade off is that is makes their code more clean which, in my opinion, is more important.