How to represent a data tree in SQL?

I've bookmarked this slidshare about SQL-Antipatterns, which discusses several alternatives: http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back?src=embed

The recommendation from there is to use a Closure Table (it's explained in the slides).

Here is the summary (slide 77):

                  | Query Child | Query Subtree | Modify Tree | Ref. Integrity
Adjacency List    |    Easy     |     Hard      |    Easy     |      Yes
Path Enumeration  |    Easy     |     Easy      |    Hard     |      No
Nested Sets       |    Hard     |     Easy      |    Hard     |      No
Closure Table     |    Easy     |     Easy      |    Easy     |      Yes

The easiest implementation is adjacency list structure:

id  parent_id  data

However, some databases, particularly MySQL, have some issues in handling this model, because it requires an ability to run recursive queries which MySQL lacks.

Another model is nested sets:

id lft rgt data

where lft and rgt are arbitrary values that define the hierarchy (any child's lft, rgt should be within any parent's lft, rgt)

This does not require recursive queries, but it slower and harder to maintain.

However, in MySQL this can be improved using SPATIAL abitilies.

See these articles in my blog:

Adjacency list vs. nested sets: PostgreSQL
Adjacency list vs. nested sets: SQL Server
Adjacency list vs. nested sets: Oracle
Adjacency list vs. nested sets: MySQL

for more detailed explanations.

I'm suprised that nobody mentioned the materialized path solution, which is probably the fastest way of working with trees in standard SQL.

In this approach, every node in the tree has a column path, where the full path from the root to the node is stored. This involves very simple and fast queries.

Have a look at the example table node:

+---------+-------+
| node_id | path  |
+---------+-------+
| 0       |       |
| 1       | 1     |
| 2       | 2     |
| 3       | 3     |
| 4       | 1.4   |
| 5       | 2.5   |
| 6       | 2.6   |
| 7       | 2.6.7 |
| 8       | 2.6.8 |
| 9       | 2.6.9 |
+---------+-------+

In order to get the children of node x, you can write the following query:

SELECT * FROM node WHERE path LIKE CONCAT((SELECT path FROM node WHERE node_id = x), '.%')

Keep in mind, that the column path should be indexed, in order to perform fast with the LIKE clause.

If you are using PostgreSQL you can use ltree, a package in the contrib extension (comes by default) which implements the tree data structure.

From the docs:

CREATE TABLE test (path ltree);
INSERT INTO test VALUES ('Top');
INSERT INTO test VALUES ('Top.Science');
INSERT INTO test VALUES ('Top.Science.Astronomy');
INSERT INTO test VALUES ('Top.Science.Astronomy.Astrophysics');
INSERT INTO test VALUES ('Top.Science.Astronomy.Cosmology');
INSERT INTO test VALUES ('Top.Hobbies');
INSERT INTO test VALUES ('Top.Hobbies.Amateurs_Astronomy');
INSERT INTO test VALUES ('Top.Collections');
INSERT INTO test VALUES ('Top.Collections.Pictures');
INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy');
INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy.Stars');
INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy.Galaxies');
INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy.Astronauts');
CREATE INDEX path_gist_idx ON test USING GIST (path);
CREATE INDEX path_idx ON test USING BTREE (path);

You can do queries like:

ltreetest=> SELECT path FROM test WHERE path <@ 'Top.Science';
                path
------------------------------------
 Top.Science
 Top.Science.Astronomy
 Top.Science.Astronomy.Astrophysics
 Top.Science.Astronomy.Cosmology
(4 rows)

It depends on how you will be querying and updating the data. If you store all the data in one row, it's basically a single unit that you can't query into or partially update without rewriting all the data.

If you want to store each element as a row, you should first read Managing Hierarchical Data in MySQL (MySQL specific, but the advice holds for many other databases too).

If you're only ever accessing an entire tree, the adjacency list model makes it difficult to retrieve all nodes under the root without using a recursive query. If you add an extra column that links back to the head then you can do SELECT * WHERE head_id = @id and get the whole tree in one non-recursive query, but it denormalizes the database.

Some databases have custom extensions that make storing and retrieving heirarchical data easier, for example Oracle has CONNECT BY.

How to represent a data tree in SQL?

Related

Recent Posts