Storing deep hierarchies in a database

Introducing nested sets: part one

Image for post
Image for post
Photo by Blake Weyland on Unsplash

The problem we’re trying to solve

Relational databases are great for storing a classic parent/child relationship. Take the example of continents:

continent
id name
1 Europe
2 America

and countries:

country
id name continent_id
1 France 1
2 United Kingdom 1
3 Sweden 1
4 Canada 2

Each country belongs within a specified continent (let’s just pretend it’s that simple for this example!), specified by its id in the continent_id field. The query to fetch details for a given country is straightforward:

SELECT
continent.name AS continent,
country.name AS country
FROM
continent,
country
WHERE
country.continent_id = continent.id
AND country.id = '2';

and produces the output:

+-----------+----------------+
| continent | country |
+-----------+----------------+
| Europe | United Kingdom |
+-----------+----------------+
1 row in set (0.00 sec)

But what if you want to store other location granularities, or other hierarchical data such as generic subject categories, with a variable or unbounded number of levels?

A naive solution

A default approach might be to generalise and store a ‘place’ id as a reference to a place’s parent:

place
id name parent_place
1 Europe NULL
2 France 1
3 United Kingdom 1
4 England 3
5 Wales 3
6 Sweden 1

We can then use a join each time we want to ascend the hierarchy, like so:

SELECT
p1.name AS place1,
p2.name AS place2,
p3.name AS place3
FROM
place p1,
place p2,
place p3
WHERE
p1.id = '4'
AND p2.id = p1.parent_place
AND p3.id = p2.parent_place;

And we get results such as:

+---------+----------------+--------+
| place1 | place2 | place3 |
+---------+----------------+--------+
| England | United Kingdom | Europe |
+---------+----------------+--------+
1 row in set (0.00 sec)

This is fine if we have a fixed depth that we know ahead of time, but it doesn’t scale well, and it’s impractical if we have an unbounded depth.

In such cases, we can simply keep re-executing a query, feeding it the id of each ancestor until we get a NULL parent_place back, but that could be a lot of queries for a potentially deep hierarchy such as a subject categorisation.

All we’ve really done with this approach is to shove everything into a single table, without actually solving the original problem. How can we represent a generic hierarchy with unbounded depth, and still make the data accessible?

Enter ‘nested sets’.

Nested sets

Nested sets solve exactly this problem — and even bring along some bonuses for the ride! Here’s the setup:

location
id name lft rgt
1 Europe 1 14
2 France 2 3
3 United Kingdom 4 11
4 England 5 6
5 Wales 7 8
6 Scotland 9 10
7 Sweden 12 13

First, it’s very important to note that the two columns we’ve introduced — lft and rgt — do not reference other locations. In fact, they stand for ‘left’ and ‘right’, and they’re all (well, mostly) we need to store a full, unbounded hierarchy and be able to answer questions such as:

  1. What are my ancestors?
  2. What are my descendants?
  3. What are my siblings, in order?

each with a single straightforward query. That sounds too simple, right —how have we managed to pack all that information in, with the addition of just a single column? It turns out it’s all in the properties of lft and rgt and how we maintain those properties when we update the data.

Take a look at the following diagram which represents the data in the previous table:

Image for post
Image for post
Each box represents a record in our table, with its lft and rgt values displayed

This should make it far more obvious what the cryptic ‘left’ and ‘right’ nature of those two new columns is. They are, almost literally, points on a line which we can imagine our entries laid out along, from left to right. In particular, the following properties are obeyed:

  • If an item has no parent (e.g. Europe), its lft value is 1
  • If an item is the first child (e.g. England), its lft value is its parent’s lft value + 1
  • If an item is not the first child (e.g. Sweden), its lft value is its previous sibling’s rgt value + 1
  • If an item has children (e.g. United Kingdom), its rgt value is its last child’s rgt value + 1
  • If an item doesn’t have any children (e.g. France), its rgt value is its own lft value + 1

You can think of each record as a box, ‘containing’ all of its descendants — in order — and being ‘contained within’ all of its ancestors. Hence ‘nested sets’.

It turns out these properties even allow us to trivially answer a question such as “How many descendants do I have?” — it turns out this question isn’t particularly useful, but it’s interesting that we can answer it, given that an answer would be unfathomable using the naive approach presented earlier.

Interestingly, we can’t actually easily answer questions such as “how many children do I have?” or even “what are my children?”, but we’ll deal with that issue in due course (don’t worry, it’s easy to fix!). For now, let’s see how we can answer the original questions we promised to.

If you look at the above diagram and take Wales as an example, it should be obvious that you can answer this question conceptually by ‘looking upwards’ and selecting the locations that ‘contain’ Wales. In other words, those whose lft is less than Wales’ lft, and those whose rgt is greater than Wales’ rgt:

SELECT
ancestor.id,
ancestor.name
FROM
location ancestor,
location original
WHERE
original.lft > ancestor.lft
AND original.rgt < ancestor.rgt
AND original.id = 5;

This gives the output:

+----+----------------+
| id | name |
+----+----------------+
| 1 | Europe |
| 3 | United Kingdom |
+----+----------------+
2 rows in set (0.00 sec)

In exactly the opposite direction, descendants are ‘contained within’ a location, so their lft and rgt values must be between the original location’s values. To fetch all of Europe’s descendants:

SELECT
descendant.id,
descendant.name
FROM
location descendant,
location original
WHERE
original.lft < descendant.lft
AND original.rgt > descendant.rgt
AND original.id = 1;

which gives us:

+----+----------------+
| id | name |
+----+----------------+
| 2 | France |
| 3 | United Kingdom |
| 4 | England |
| 5 | Wales |
| 6 | Scotland |
| 7 | Sweden |
+----+----------------+
6 rows in set (0.00 sec)

We can even answer that unusual question with an even more unusual query:

SELECT
(location.rgt - location.lft - 1) DIV 2 AS num_descendants
FROM
location
WHERE
location.id = 3;

This query gives the result:

+-----------------+
| num_descendants |
+-----------------+
| 3 |
+-----------------+
1 row in set (0.00 sec)

The method works because a record’s ‘width’ is directly related to the number of records that fit ‘inside’ it i.e. its descendants.

Follow up

This article has been quite lengthy, and while there’s more to cover, I don’t want to overwhelm with information at this stage. Stay tuned for part two which will be available shortly; it will cover the “what are my children” question and how to keep all those lft and rgt values up-to-date when updating the table.

Supporting code — for MySQL — is available in the ‘nested-sets-part-one’ repository in GitHub. All the table setup is contained in setup.sql, whilst each query in this article is contained in a separate file. See the README if you’re unsure on how to run it.

Technologist & writer, Bobby is currently working on several projects including a management dashboard for static websites and an education portal. bobbyjack.me

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store