What are the options for storing hierarchical data in a relational database?

Good Overviews

Generally speaking, you're making a decision between fast read times (for example, nested set) or fast write times (adjacency list). Usually, you end up with a combination of the options below that best fit your needs. The following provides some in-depth reading:

  • One more Nested Intervals vs. Adjacency List comparison: the best comparison of Adjacency List, Materialized Path, Nested Set, and Nested Interval I've found.
  • Models for hierarchical data: slides with good explanations of tradeoffs and example usage
  • Representing hierarchies in MySQL: very good overview of Nested Set in particular
  • Hierarchical data in RDBMSs: a most comprehensive and well-organized set of links I've seen, but not much in the way of explanation

Options

Ones I am aware of and general features:

  1. Adjacency List:
  • Columns: ID, ParentID
  • Easy to implement.
  • Cheap node moves, inserts, and deletes.
  • Expensive to find the level, ancestry & descendants, path
  • Avoid N+1 via Common Table Expressions in databases that support them
  1. Nested Set (a.k.a Modified Preorder Tree Traversal)
  • Columns: Left, Right
  • Cheap ancestry, descendants
  • Very expensive O(n/2) moves, inserts, deletes due to volatile encoding
  1. Bridge Table (a.k.a. Closure Table /w triggers)
  • Uses separate join table with ancestor, descendant, depth (optional)
  • Cheap ancestry and descendants
  • Writes costs O(log n) (size of the subtree) for insert, updates, deletes
  • Normalized encoding: good for RDBMS statistics & query planner in joins
  • Requires multiple rows per node
  1. Lineage Column (a.k.a. Materialized Path, Path Enumeration)
  • Column: lineage (e.g. /parent/child/grandchild/etc...)
  • Cheap descendants via prefix query (e.g. LEFT(lineage, #) = '/enumerated/path')
  • Writes costs O(log n) (size of the subtree) for insert, updates, deletes
  • Non-relational: relies on Array datatype or serialized string format
  1. Nested Intervals
  • Like nested set, but with real/float/decimal so that the encoding isn't volatile (inexpensive move/insert/delete)
  • Has real/float/decimal representation/precision issues
  • Matrix encoding variant adds ancestor encoding (materialized path) for "free", but with the added trickiness of linear algebra.
  1. Flat Table
  • A modified Adjacency List that adds a Level and Rank (e.g. ordering) column to each record.
  • Cheap to iterate/paginate over
  • Expensive move and delete
  • Good Use: threaded discussion - forums / blog comments
  1. Multiple lineage columns
  • Columns: one for each lineage level, refers to all the parents up to the root, levels down from the item's level are set to NULL
  • Cheap ancestors, descendants, level
  • Cheap insert, delete, move of the leaves
  • Expensive insert, delete, move of the internal nodes
  • Hard limit to how deep the hierarchy can be

Database Specific Notes

MySQL

  • Use session variables for Adjacency List

Oracle

  • Use CONNECT BY to traverse Adjacency Lists

PostgreSQL

  • ltree datatype for Materialized Path

SQL Server

  • General summary
  • 2008 offers HierarchyId data type that appears to help with the Lineage Column approach and expand the depth that can be represented.

My favorite answer is as what the first sentence in this thread suggested. Use an Adjacency List to maintain the hierarchy and use Nested Sets to query the hierarchy.

The problem up until now has been that the coversion method from an Adjacecy List to Nested Sets has been frightfully slow because most people use the extreme RBAR method known as a "Push Stack" to do the conversion and has been considered to be way to expensive to reach the Nirvana of the simplicity of maintenance by the Adjacency List and the awesome performance of Nested Sets. As a result, most people end up having to settle for one or the other especially if there are more than, say, a lousy 100,000 nodes or so. Using the push stack method can take a whole day to do the conversion on what MLM'ers would consider to be a small million node hierarchy.

I thought I'd give Celko a bit of competition by coming up with a method to convert an Adjacency List to Nested sets at speeds that just seem impossible. Here's the performance of the push stack method on my i5 laptop.

Duration for     1,000 Nodes = 00:00:00:870 
Duration for    10,000 Nodes = 00:01:01:783 (70 times slower instead of just 10)
Duration for   100,000 Nodes = 00:49:59:730 (3,446 times slower instead of just 100) 
Duration for 1,000,000 Nodes = 'Didn't even try this'

And here's the duration for the new method (with the push stack method in parenthesis).

Duration for     1,000 Nodes = 00:00:00:053 (compared to 00:00:00:870)
Duration for    10,000 Nodes = 00:00:00:323 (compared to 00:01:01:783)
Duration for   100,000 Nodes = 00:00:03:867 (compared to 00:49:59:730)
Duration for 1,000,000 Nodes = 00:00:54:283 (compared to something like 2 days!!!)

Yes, that's correct. 1 million nodes converted in less than a minute and 100,000 nodes in under 4 seconds.

You can read about the new method and get a copy of the code at the following URL. http://www.sqlservercentral.com/articles/Hierarchy/94040/

I also developed a "pre-aggregated" hierarchy using similar methods. MLM'ers and people making bills of materials will be particularly interested in this article. http://www.sqlservercentral.com/articles/T-SQL/94570/

If you do stop by to take a look at either article, jump into the "Join the discussion" link and let me know what you think.


Adjacency Model + Nested Sets Model

I went for it because I could insert new items to the tree easily (you just need a branch's id to insert a new item to it) and also query it quite fast.

+-------------+----------------------+--------+-----+-----+
| category_id | name                 | parent | lft | rgt |
+-------------+----------------------+--------+-----+-----+
|           1 | ELECTRONICS          |   NULL |   1 |  20 |
|           2 | TELEVISIONS          |      1 |   2 |   9 |
|           3 | TUBE                 |      2 |   3 |   4 |
|           4 | LCD                  |      2 |   5 |   6 |
|           5 | PLASMA               |      2 |   7 |   8 |
|           6 | PORTABLE ELECTRONICS |      1 |  10 |  19 |
|           7 | MP3 PLAYERS          |      6 |  11 |  14 |
|           8 | FLASH                |      7 |  12 |  13 |
|           9 | CD PLAYERS           |      6 |  15 |  16 |
|          10 | 2 WAY RADIOS         |      6 |  17 |  18 |
+-------------+----------------------+--------+-----+-----+
  • Every time you need all children of any parent you just query the parent column.
  • If you needed all descendants of any parent you query for items which have their lft between lft and rgt of parent.
  • If you needed all parents of any node up to the root of the tree, you query for items having lft lower than the node's lft and rgt bigger than the node's rgt and sort the by parent.

I needed to make accessing and querying the tree faster than inserts, that's why I chose this

The only problem is to fix the left and right columns when inserting new items. well I created a stored procedure for it and called it every time I inserted a new item which was rare in my case but it is really fast. I got the idea from the Joe Celko's book, and the stored procedure and how I came up with it is explained here in DBA SE https://dba.stackexchange.com/q/89051/41481