"Normalization" vs. "canonicalization"

It seems both normalization and canonicalization are used to describe the effort to transform from an arbitrary form to a unique form. Is there any difference between the two words? Why is there XML normalize-space but not canonicalize-space? And a file name in a canonical-form but not a file name in a normalized-form?

I cannot help you with the specifics of these terms as they relate to software engineering specifically, but as far as these terms are used generally, there is a marked difference between normalization and canonicalization. Understanding these differences may help you figure out the difference in your particular context.

Normalization is the process of standardizing something according to a set of predefined rules.

Canonicalization is a bastardization of canonize, which means to make canonical, or to include in a canon of knowledge.

So for your XML file, you are bringing it up to a standard by normalizing the whitespace content. You are not including an XML file in a larger body of work or set of rules when you do this, so you are not canonizing it.

On the other hand, a file name designated as canonical to indicate how it is listed in the canon, or body of work in which it is included.

A file name could (and theoretically does) have a normalized form as well: up to some number of alphanumeric characters plus a file extension of no more than four characters, or whatever the actual rule is.

I am familiar with normalization in the context of data structures. Normal forms are consistent with a set of rules that specifies how the data should be organized into tables in order to make them as efficient as possible (broadly speaking).

Canonicalization on the other hand is more context specific; your mileage may vary, as they say. An example adapted from the related link that FumbleFingers posted would be something like:

We can "normalize" the expression x - 3 + 12 = y by reducing it to x + 9 = y, or 9 + x = y, or even y - 9 = x, but its "canonical" form would be y = x + 9.

Canonical representations have been part of the standard language of mathematics for a very long time, as this NGram shows... enter image description here

...so canonicalization isn't really a programming world bastardization of canonize as has been suggested. It's just that computers have many different ways of storing & accessing "the same" data, so the word gets used more in IT than in pure or even applied mathematics.

As @Mitch pointed out in the first comment, In the context of software engineering, they are synonymous. Arguing about whether the IT world "should" use canonicalisation or normalisation is like arguing about whether computer files are stored in folders or directories.

(Sorry about the disconcerting switch from US to UK spelling. I was "quoting" at first, but I just couldn't keep repeating what to me as a Brit is an orthographic abomination!)

LATER: As regards the exact meaning of canonicalisation - it's long been used by mathematician, chemists, etc. Here are definitions of canonical form and normal form, from which you may conclude that canonical is effectively a "subset" of normal, in that there may be more different normal forms of an expression/formula than there are canonical forms.

Note - this 1928 reference clearly implies there may be more than one method of canonicalisation, so you can't just say a "canonical" form is one that adheres to "the recognised standard". It's arguably a more exacting term than normalisation, but in the end this is probably a pendantic distinction.

From a computer perspective (especially from a file name/path handling), normalization is a subset of canonicalization. They generally have different purposes: normalization is generally used for reduction of redundancy, canonicalization for the purpose of identity comparison/categorization.

Take this path a/../b/./c.txt for example.
The normalized version is b/c.txt, which can be used to display a more straightforward name.
The canonicalized version is /home/joe/b/c.txt (assuming /home/joe/ to be the current directory). This allows 2 paths to be compared for identity.

Both normalization and canonicalization are relative to your problem space that needs to solved and modeled appropriately. For example, b/c.txt may be a perfect canonical value in certain contexts (say for a program that deals only with paths in the joe's user directory).
Or /home/joe/b/c.txt may not be canonical enough for situations what deal with distributed file names, where something like //m1/home/joe/b/c.txt is more appropriate.

As for the XML domain space: there are two problems to be resolved there:

redundant data, for example a value containing multiple spaces, or different kind of whitespace, where "normalize" is the more appropriate term/operation.
comparing two XML documents. Both <a n1="v1" n2="v2"/> and <a n2="v2" n1="v2"/> are minimal (no extra 'stuff' in them), but cannot be straightforwardly compared. Hence, to compare them, we need to canonicalize them, for example by listing attributes ordered by their name, thus making the comparison straightforward. This is a simplification, the real XML rules are more complicated, since they need to handle a lot of other cases (for example <a/> vs <a></a>, or <a n="v"/> vs <a n="v" />, etc).

Lastly, though the original canonicalization intent is mainly for identity comparison, it tends to be used beyond that -- sometimes is better to always canonicalize the data before storage because it makes it easy to handle it in the future.

"Normalization" vs. "canonicalization"

Related

Recent Posts