Internet2 Middleware Initiative S. Cantor
Internet2 Draft The Ohio State University
Expires: June 23, 2004 December 23, 2003

RDBMS Hierarchical Data Modeling and Searching
draft-cantor-rdbms-trees-00

Status of this Memo

This document is an Internet2 Draft and is in compliance with relevant Internet2 document standards.

Internet2 Drafts are working documents of Internet2, its areas, and its working groups.

Internet2 Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet2 Drafts as reference material or to cite them other than as "work in progress."

This Internet2 Draft will expire on June 23, 2004.

This document is an individual submission to the MACE-Dir WG of the Internet2 Middleware Initiative. Comments should be sent to the mace-dir@internet2.edu mailing list.

Abstract

This document describes requirements and one solution for modeling and searching hierarchical data structures in a relational database management system.


Requirements

SQL is a set-based calculus that does not map well to hierachical data. In fact, relational databases are typically contrasted with the hierachical database model that preceded it in data management. For most purposes, the relational approach is more powerful and flexible, but one important exception is in algorithms that lend themselves to recursion. Such problems can be very effectively modeled with trees, and conversely, trees can very effectively be processed with recursive algorithms.

Sometimes the data model of a particular problem is necessarily hierarchical. A common example is a file system, which is usually built as a tree of directories and files. It is routine to see this metaphor within an application, for example when providing users with the ability to organize data into a file system-like set of logical folders. Likewise, content metadata often contains vocabularies that are organized into trees of categories, subcategories, etc.

The principal requirement for any approach that maps trees into a relational data model is to make as many tree operations as possible achievable with a single SQL statement instead of requiring a loop or recursion. Some databases support SQL extensions that offer recursive functionality to operate on tree-like data, but these extensions are non-portable and do not really address the fundamental efficiency problem of executing multiple statements.

Two particularly common operations that involve a particular node in a tree are deriving the path of nodes from the root of the tree to a given node, and deriving the children of a node (either the immediate children or all descendants). Another important operation is a comparison operation to determine whether a given node is a descendant of another given node. Taken together, these also permit basic set union and intersection operations on subtrees.


Solution

The following solution permits any of these operations in a single non-recursive SQL statement that operates on indexed numeric values. Insertion to a specific location also takes constant time. Maintaining a sorted tree would require logarithmic time to locate the proper insertion point.

Overview

The basic structure proposed is called a min-max tree, and can be found in many references on data structures and are often suggested in SQL texts. A min-max tree describes each node in terms of a minimum and maximum value, and requires as an invariant property that any given child must have a minimum value greater than or equal to and a maximum value less than or equal to those of its parent. This solution further proposes that a node contain an indicator of its depth in the tree to permit sorting of a path.

Most importantly, this solution requires that the minimum and maximum values be floating point numbers equal to very large or small powers of two. This requirement permits the tree to grow to fairly arbitrary size and depth, while amortizing the cost of insertions by preallocating room for a set number of children per node. Insertion into a parent with room requires a single row insert, while insertion into a parent without room requires that each existing child be updated in a single statement, followed by a single row insert.

Table Structure

The following DDL statements are from a Microsoft Transact-SQL script that creates a tree called stuff. Similar statements with some adjustments should work in any comparable relational database.

CREATE TABLE [dbo].[stuff] (
	[stuff_id] [int] IDENTITY NOT NULL ,
	[parent_id] [int] NULL ,
        [name] [varchar(50)] NOT NULL ,
	[minval] [float] NOT NULL ,
	[maxval] [float] NOT NULL ,
	[tree_level] [tinyint] NOT NULL ,
	[sort_key] [smallint] NOT NULL 
) ON [PRIMARY]
GO

CREATE CLUSTERED INDEX [IX_stuff_parentid_sortkey] ON [dbo].[stuff]([parent_id], [sort_key])
GO

ALTER TABLE [dbo].[stuff] ADD 
	CONSTRAINT [DF_stuff_sort_key] DEFAULT (0) FOR [sort_key],
	CONSTRAINT [PK_stuff] PRIMARY KEY NONCLUSTERED ([stuff_id])
GO

CREATE INDEX [IX_stuff_name] ON [dbo].[stuff]([name])
GO

CREATE INDEX [IX_stuff_minval] ON [dbo].[stuff]([minval])
GO

CREATE INDEX [IX_stuff_maxval] ON [dbo].[stuff]([maxval])
GO

Worthy of note is that we still see the common relational strategy of a parent pointer in each record. This facilitates simple queries for the immediate children of a node, which makes identifying siblings easy.

An optional feature is a sort key that enables children of a node to be inserted in any order but subsequently sorted based on user input. The primary key remains an identity column used for ease of foreign key creation in dependent tables, but the clustered index is placed on the parent pointer and sort key.

Finally, the integral sizes used for tree level and sorting can be adjusted as desired to reflect the expected depth and breadth of the tree. A user-maintained tree would not normally be expected to exceed 255 levels in depth, for example.

Initialization

For insertion to take place as described in the following section, a single initialization row (the root) must be placed into the table. This could of course be done dynamically within the insertion procedure, but inserting the row in advance is advisable to insure that the root record has a key value of 1. This permits simple identification of the root, in case it must be excluded from queries.

The important aspect of the initialization process, apart from the key value, is to properly bootstrap the min-max interval of the tree as a whole. This can be essentially any difference that is a power of two, including a large negative and large positive number, but for illustration, we use zero and a number that approaches without exceeding the capacity of a SQL float value:

insert stuff (name, minval, maxval, tree_level) values (
    "<root>",
    0.0,
    power(convert(float,2.0),200),
    0
)

Note the conversion to floating point within the power function, which is needed to obtain a high precision value instead of a numeric scaled value. The use of a power of two takes advantage of the fact that IEEE floating point values are stored as powers of two. Such values can be multiplied or divided by powers of two without losing any precision, no matter how large or small the result.

Insertion

The only real "trick" to this design is in maintaining the precision of the min/max values during insertion of new nodes. Without this, the basic math behind the tree structure can be maintained, but underflow or overflow occur very rapidly when establishing the sub-intervals within each child node. With this precision, range "splitting" can occur many, many times without requiring any control over the tree's balance. This splitting process is required once the initial number of children allocated to a node is reached, to make room for a new set. The exact number of children allocated per split can be set as needed.

While some programming languages provide IEEE floating point support that would enable this work to occur outside the database, the most portable approach is to use a stored procedure to insert new nodes into the tree. Most scripting environments with database support can invoke stored procedures, and any decent database supports them, as well as cursors, which are needed to implement the insertion algorithm.

It may be possible to use the SQL EXEC function to construct a table-independent stored procedure that can operate without regard for the names of the table or columns. The following Transact-SQL example does not illustrate this, but simply shows how insertion and range splitting is accompished:

CREATE procedure sp_stuff_insert (
    @parent int,
    @name varchar(50),
    @id int output
)
as 
begin
    declare @newmin float,@newmax float,@prevmax float
    select @newmin=null
    select @newmax=null
    declare @parentmin float,@parentmax float,@level tinyint,@childmin float,@childmax float
    select @parentmin=minval,@parentmax=maxval,@level=tree_level from stuff
        where stuff_id=@parent
    select @prevmax=@parentmin
    declare children cursor
        for select minval,maxval from stuff where parent_id=@parent order by minval
        for read only
    open children
    fetch children into @childmin,@childmax
    if (@@FETCH_STATUS=-1) begin
        select @newmin=@parentmin
        select @newmax=@newmin+((@parentmax-@parentmin)/8.0)
    end else begin
        while (@@FETCH_STATUS=0) begin
            if (@childmin<>@prevmax) begin
                select @newmin=@prevmax
                select @newmax=@newmin+(@childmax-@childmin)
                break
            end
            select @prevmax=@childmax
            fetch children into @childmin,@childmax
        end
        if (@newmin is null) begin
            if (@prevmax=@parentmax) begin
                update stuff
                    set minval=@parentmin+((minval-@parentmin)/2.0),
                        maxval=@parentmin+((maxval-@parentmin)/2.0)
                    where minval>=@parentmin and maxval<=@parentmax
                        and stuff_id<>@parent
                select @newmin=@parentmin+((@parentmax-@parentmin)/2.0)
                select @newmax=@newmin+((@childmax-@childmin)/2.0)
            end else begin
                select @newmin=@childmax
                select @newmax=@newmin+(@childmax-@childmin)
            end
        end    
    end
    close children
    deallocate children
    insert stuff (parent_id,name,tree_level,minval,maxval)
        values (@parent,@name,@level+1,@newmin,@newmax)
    select @id=@@IDENTITY
end

The procedure appears somewhat complex, but is efficient because the insertion point is known in advance. The purpose of the algorithm is to determine whether room exists to insert a new child without adjusting other records. The example uses an initial allocation of 8, as seen by the fact that when no children exist, the first child receives an interval size which is one-eighth the parent's interval. Any power of two will do.

Another feature of this procedure is that if nodes have been deleted from the tree, gaps will exist; this process will locate such gaps and repopulate them with the new record, avoiding fragmentation. Only if no gaps exist, and the existing children completely cover the parent's range, will a split happen. A split simply cuts in half the boundary values of each child below the parent, relative to the parent. This doubles the number of children available, rapidly making room for many children when a branch is heavily populated. The cost of this operation is a single update applying to every child node at or below the insertion point, which is an indexed operation, as we will see next.

Searching

All this effort would be wasted except for the efficiency we will find in searching and subsetting the tree. With a traditional parent pointer design, any attempt to query for more than just the immediate children of a node requires a loop to recurse over each level of the tree. This is especially ugly if the whole tree must be retrieved in sorted fashion.

In contrast, notice the simple way that a subtree rooted at a given node (@root) can be isolated, while maintaining an ordering that permits a partially sorted traversal:

select stuff_id,name from stuff
    where minval >= (select minval from stuff where stuff_id=@root)
        and maxval <= (select maxval from stuff where stuff_id=@root)
    order by tree_level,parent_id,sort_key

The primary technique in any subsetting query is to constrain the boundary values of the result set to fall within the range of the root. The subqueries are against the primary key, and the main query is on a pair of indexed columns.


Here is a depth-first traversal that ignores the user-specified sort key:

select stuff_id,name from stuff
    where minval >= (select minval from stuff where stuff_id=@root)
        and maxval <= (select maxval from stuff where stuff_id=@root)
    order by minval,tree_level

Another useful operation that looks similar returns the path to @node from the root, from the top down:

select stuff_id,name from stuff
    where minval <= (select minval from stuff where stuff_id=@node)
        and maxval >= (select maxval from stuff where stuff_id=@node)
    order by tree_level

By reversing the relations, we return the nodes within whose range the candidate node falls, which by definition are all ancestors.


Another useful example shows how to produce a result set that facilitates a compact "view" of the tree, with a single branch partly expanded based on user input. All nodes along the path to the selection, plus all their siblings, are displayed in such a view. Thus, we combine the previous example with a second query to access the siblings at each point along the path:

select stuff_id,tree_level,name from stuff
    where parent_id in (
        select stuff_id,name from stuff
            where minval <= (select minval from stuff where stuff_id=@node)
        and maxval >= (select maxval from stuff where stuff_id=@node)
    order by tree_level ) order by tree_level,sort_key

Author's Contact Information

  Scott Cantor
  The Ohio State University
  1121 Kinnear Road
  Columbus, OH 43212
  US
Phone:  +1 614 247-6147
EMail:  cantor.2@osu.edu