Storing hierarchies in a single table, paper

Posts   
 
    
Otis avatar
Otis
LLBLGen Pro Team
Posts: 39613
Joined: 17-Aug-2003
# Posted on: 20-Jun-2008 10:02:31   

I think a lot of you will be interested in this. Storing a hierarchy or tree in a single table is a requirement seen often in a project. SQL however doesn't have constructs to fetch hierarchies, sort on them and filter on them. Several different approaches have been brought forward, see this link to a thread on our own forum: http://www.llblgen.com/tinyforum/Messages.aspx?ThreadID=3208

Today I ran into a paper about new research on this subject: http://arxiv.org/abs/0806.3115

It's full of math, so you've been warned wink , but most of it is used as proof that what's been brought forward actually is correct (so you can just assume that and read about the idea). The basic idea is to use a rational number as the key, based on the left and right key. It's a brilliant idea, and gives very easy SQL to fetch, sort etc. a hierarchy. It also gives math to move subtrees from one node to the other, unfotunately not SQL.

Frans Bouma | Lead developer LLBLGen Pro
psandler
User
Posts: 540
Joined: 22-Feb-2005
# Posted on: 20-Jun-2008 16:50:33   

The problem I have with the nested set approach is that you are basically bastardizing the relational model by storing your entity relationships as metadata. That means that moving a complex tree from one parent to another could require a massive amount of updates to the metadata that stores all the relationships.

In the adjacency model, to move a subtree you only have to update a single value--the FK (parent) of the child you want to move.

To me the adjacency model is more "correct", in that relationships between nodes are stored as actual relationships in the database, and are enforced with referential integrity. It actually suprised/surprises me that Celko advocates this approach, since he is such a purist about the relational model (read his views on the Entity-Attribute-Value model for example).

All that said, I understand that the nested set approach may be more pragmatic in many cases. The concept in the paper you posted certainly looks like an improvement on the concept. The math is way over my head, of course. simple_smile

I do think SQL Server's CTE feature can be used to do adjacency model fetches. A basic example at http://www.4guysfromrolla.com/webtech/071906-1.shtml. I have not tinkered with this at all, so I don't claim that it solves the whole problem (sorting and filtering for example).

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39613
Joined: 17-Aug-2003
# Posted on: 20-Jun-2008 16:59:43   

psandler wrote:

The problem I have with the nested set approach is that you are basically bastardizing the relational model by storing your entity relationships as metadata. That means that moving a complex tree from one parent to another could require a massive amount of updates to the metadata that stores all the relationships.

In the adjacency model, to move a subtree you only have to update a single value--the FK (parent) of the child you want to move.

To me the adjacency model is more "correct", in that relationships between nodes are stored as actual relationships in the database, and are enforced with referential integrity. It actually suprised/surprises me that Celko advocates this approach, since he is such a purist about the relational model (read his views on the Entity-Attribute-Value model for example).

I think that having a single table with a 'ParentID' FK pointing to its own PK is still checked by an FK constraint (so it does fall into the boundaries of a relational model), however it's unsuitable for doing set operations. The tree solution Celko put forward does offer the ability to perform set operations on it, so it's more SQL friendly, however it has problems with updates.

So I think we're not talking about the same thing wink . This paper is about having a hierarchy of data (or better: a semantical hierarchy of data), like the textbook examples: Employees which have an FK to themselves for Manager.). Typically, people store these kind of hierarchies in a single table, but can't do anything with it when it comes to hierarchies: it's a pain to obtain the whole hierarchy below employee X for example.

I think what you're talking about is having a db inside a db, where a couple of tables are defined to be able to add columns etc. on the fly, indeed storing meta-data inside a table, which is a big problem waiting to happen.

All that said, I understand that the nested set approach may be more pragmatic in many cases. The concept in the paper you posted certainly looks like an improvement on the concept. The math is way over my head, of course. simple_smile

Oh, the math is also way over my head too, the beauty about these proofs is that you can skip them, and just assume a math-nerd has checked it already wink .Though admitted, the stuff is still in the Math corner and not ready to be applied onto a DB right away without digging into the material.

I do think SQL Server's CTE feature can be used to do adjacency model fetches. A basic example at http://www.4guysfromrolla.com/webtech/071906-1.shtml. I have not tinkered with this at all, so I don't claim that it solves the whole problem (sorting and filtering for example).

Yes CTEs can be used to fetch hierarchies as well, but of course it's very complicated to setup. Having special crafted PK/FK values is IMHO a simpler approach and it works everywhere wink

Frans Bouma | Lead developer LLBLGen Pro
psandler
User
Posts: 540
Joined: 22-Feb-2005
# Posted on: 21-Jun-2008 15:42:47   

Otis wrote:

I think that having a single table with a 'ParentID' FK pointing to its own PK is still checked by an FK constraint (so it does fall into the boundaries of a relational model), however it's unsuitable for doing set operations. The tree solution Celko put forward does offer the ability to perform set operations on it, so it's more SQL friendly, however it has problems with updates.

So I think we're not talking about the same thing wink . This paper is about having a hierarchy of data (or better: a semantical hierarchy of data), like the textbook examples: Employees which have an FK to themselves for Manager.). Typically, people store these kind of hierarchies in a single table, but can't do anything with it when it comes to hierarchies: it's a pain to obtain the whole hierarchy below employee X for example.

I'm not clear on where you think we're talking about two different things. I'm guessing from your comment above about PK/FK that we understand the model differently. In your understanding of this model, you would store BOTH the PK/FK and the rational numbers in each row, so the table is actually self-referencing (EDIT: changed this to "self-referencing")?

(If that guess is wrong, you can likely skip the rest of this simple_smile ).

The model Celko initially proposed does not use a self-referencing table (at least in the discussion you linked to in the original thread), and this is the design I was basing my comments on. Here is the table he proposed:

Personnel emp lft rgt ====================== 'Albert' 1 12 'Bert' 2 3 'Chuck' 4 11 'Donna' 5 6 'Eddie' 7 8 'Fred' 9 10

So I was under the impression that the model proposes no PK/FK constraint, ONLY the left/right (or rational) numbers that describe the relationship. Doing it without a PK/FK column, I would consider that "storing the relationship as metadata". If there were a PK/FK column, it's probably closer to "denormalization". And if it's denormalization, it comes with all the potential problems of such--every time a row is updated, the tree has to be checked for cascading updates (disclaimer: I don't understand a word of the "Moving Subtrees" section of that paper wink ).

Just to be clear, I'm not poo-pooing the paper or the idea. It's obviously an improvement on the nested set model, and (as I said) it's probably a good choice in many scenarios.

Otis wrote:

I think what you're talking about is having a db inside a db, where a couple of tables are defined to be able to add columns etc. on the fly, indeed storing meta-data inside a table, which is a big problem waiting to happen.

Yes, this is what I meant regarding the Entity-Attribute-Value (EAV) model. I was only mentioning that as an example of where Celko gets on his soapbox about bastardizing the relational model.

Phil

Otis avatar
Otis
LLBLGen Pro Team
Posts: 39613
Joined: 17-Aug-2003
# Posted on: 25-Jun-2008 09:33:50   

Ah! Yes I now understand what you meant. simple_smile Indeed you're right, celko was talking about multiple tables to store a hierarchy of entity instances instead of the classic textbook example of 1 table which references itself.

Frans Bouma | Lead developer LLBLGen Pro