|
This document is an Internet2 Draft and is in compliance with relevant Internet2 document standards.
Internet2 Drafts are working documents of Internet2, its areas, and its working groups.
Internet2 Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet2 Drafts as reference material or to cite them other than as "work in progress."
This Internet2 Draft will expire on June 23, 2004.
This document is an individual submission to the MACE-Dir WG of the Internet2 Middleware Initiative. Comments should be sent to the mace-dir@internet2.edu mailing list.
This document describes requirements and one solution for modeling and searching hierarchical data structures in a relational database management system.
SQL is a set-based calculus that does not map well to hierachical data. In fact, relational databases are typically contrasted with the hierachical database model that preceded it in data management. For most purposes, the relational approach is more powerful and flexible, but one important exception is in algorithms that lend themselves to recursion. Such problems can be very effectively modeled with trees, and conversely, trees can very effectively be processed with recursive algorithms.
Sometimes the data model of a particular problem is necessarily hierarchical. A common example is a file system, which is usually built as a tree of directories and files. It is routine to see this metaphor within an application, for example when providing users with the ability to organize data into a file system-like set of logical folders. Likewise, content metadata often contains vocabularies that are organized into trees of categories, subcategories, etc.
The principal requirement for any approach that maps trees into a relational data model is to make as many tree operations as possible achievable with a single SQL statement instead of requiring a loop or recursion. Some databases support SQL extensions that offer recursive functionality to operate on tree-like data, but these extensions are non-portable and do not really address the fundamental efficiency problem of executing multiple statements.
Two particularly common operations that involve a particular node in a tree are deriving the path of nodes from the root of the tree to a given node, and deriving the children of a node (either the immediate children or all descendants). Another important operation is a comparison operation to determine whether a given node is a descendant of another given node. Taken together, these also permit basic set union and intersection operations on subtrees.
The following solution permits any of these operations in a single non-recursive SQL statement that operates on indexed numeric values. Insertion to a specific location also takes constant time. Maintaining a sorted tree would require logarithmic time to locate the proper insertion point.
Overview
The basic structure proposed is called a min-max tree, and can be found in many references on data structures and are often suggested in SQL texts. A min-max tree describes each node in terms of a minimum and maximum value, and requires as an invariant property that any given child must have a minimum value greater than or equal to and a maximum value less than or equal to those of its parent. This solution further proposes that a node contain an indicator of its depth in the tree to permit sorting of a path.
Most importantly, this solution requires that the minimum and maximum values be floating point numbers equal to very large or small powers of two. This requirement permits the tree to grow to fairly arbitrary size and depth, while amortizing the cost of insertions by preallocating room for a set number of children per node. Insertion into a parent with room requires a single row insert, while insertion into a parent without room requires that each existing child be updated in a single statement, followed by a single row insert.
Table Structure
The following DDL statements are from a Microsoft Transact-SQL script that creates a tree called stuff. Similar statements with some adjustments should work in any comparable relational database.
CREATE TABLE [dbo].[stuff] ( [stuff_id] [int] IDENTITY NOT NULL , [parent_id] [int] NULL , [name] [varchar(50)] NOT NULL , [minval] [float] NOT NULL , [maxval] [float] NOT NULL , [tree_level] [tinyint] NOT NULL , [sort_key] [smallint] NOT NULL ) ON [PRIMARY] GO CREATE CLUSTERED INDEX [IX_stuff_parentid_sortkey] ON [dbo].[stuff]([parent_id], [sort_key]) GO ALTER TABLE [dbo].[stuff] ADD CONSTRAINT [DF_stuff_sort_key] DEFAULT (0) FOR [sort_key], CONSTRAINT [PK_stuff] PRIMARY KEY NONCLUSTERED ([stuff_id]) GO CREATE INDEX [IX_stuff_name] ON [dbo].[stuff]([name]) GO CREATE INDEX [IX_stuff_minval] ON [dbo].[stuff]([minval]) GO CREATE INDEX [IX_stuff_maxval] ON [dbo].[stuff]([maxval]) GOWorthy of note is that we still see the common relational strategy of a parent pointer in each record. This facilitates simple queries for the immediate children of a node, which makes identifying siblings easy.
An optional feature is a sort key that enables children of a node to be inserted in any order but subsequently sorted based on user input. The primary key remains an identity column used for ease of foreign key creation in dependent tables, but the clustered index is placed on the parent pointer and sort key.
Finally, the integral sizes used for tree level and sorting can be adjusted as desired to reflect the expected depth and breadth of the tree. A user-maintained tree would not normally be expected to exceed 255 levels in depth, for example.
Initialization
For insertion to take place as described in the following section, a single initialization row (the root) must be placed into the table. This could of course be done dynamically within the insertion procedure, but inserting the row in advance is advisable to insure that the root record has a key value of 1. This permits simple identification of the root, in case it must be excluded from queries.
The important aspect of the initialization process, apart from the key value, is to properly bootstrap the min-max interval of the tree as a whole. This can be essentially any difference that is a power of two, including a large negative and large positive number, but for illustration, we use zero and a number that approaches without exceeding the capacity of a SQL float value:
insert stuff (name, minval, maxval, tree_level) values ( "<root>", 0.0, power(convert(float,2.0),200), 0 )Note the conversion to floating point within the power function, which is needed to obtain a high precision value instead of a numeric scaled value. The use of a power of two takes advantage of the fact that IEEE floating point values are stored as powers of two. Such values can be multiplied or divided by powers of two without losing any precision, no matter how large or small the result.
Insertion
The only real "trick" to this design is in maintaining the precision of the min/max values during insertion of new nodes. Without this, the basic math behind the tree structure can be maintained, but underflow or overflow occur very rapidly when establishing the sub-intervals within each child node. With this precision, range "splitting" can occur many, many times without requiring any control over the tree's balance. This splitting process is required once the initial number of children allocated to a node is reached, to make room for a new set. The exact number of children allocated per split can be set as needed.
While some programming languages provide IEEE floating point support that would enable this work to occur outside the database, the most portable approach is to use a stored procedure to insert new nodes into the tree. Most scripting environments with database support can invoke stored procedures, and any decent database supports them, as well as cursors, which are needed to implement the insertion algorithm.
It may be possible to use the SQL EXEC function to construct a table-independent stored procedure that can operate without regard for the names of the table or columns. The following Transact-SQL example does not illustrate this, but simply shows how insertion and range splitting is accompished:
CREATE procedure sp_stuff_insert ( @parent int, @name varchar(50), @id int output ) as begin declare @newmin float,@newmax float,@prevmax float select @newmin=null select @newmax=null declare @parentmin float,@parentmax float,@level tinyint,@childmin float,@childmax float select @parentmin=minval,@parentmax=maxval,@level=tree_level from stuff where stuff_id=@parent select @prevmax=@parentmin declare children cursor for select minval,maxval from stuff where parent_id=@parent order by minval for read only open children fetch children into @childmin,@childmax if (@@FETCH_STATUS=-1) begin select @newmin=@parentmin select @newmax=@newmin+((@parentmax-@parentmin)/8.0) end else begin while (@@FETCH_STATUS=0) begin if (@childmin<>@prevmax) begin select @newmin=@prevmax select @newmax=@newmin+(@childmax-@childmin) break end select @prevmax=@childmax fetch children into @childmin,@childmax end if (@newmin is null) begin if (@prevmax=@parentmax) begin update stuff set minval=@parentmin+((minval-@parentmin)/2.0), maxval=@parentmin+((maxval-@parentmin)/2.0) where minval>=@parentmin and maxval<=@parentmax and stuff_id<>@parent select @newmin=@parentmin+((@parentmax-@parentmin)/2.0) select @newmax=@newmin+((@childmax-@childmin)/2.0) end else begin select @newmin=@childmax select @newmax=@newmin+(@childmax-@childmin) end end end close children deallocate children insert stuff (parent_id,name,tree_level,minval,maxval) values (@parent,@name,@level+1,@newmin,@newmax) select @id=@@IDENTITY endThe procedure appears somewhat complex, but is efficient because the insertion point is known in advance. The purpose of the algorithm is to determine whether room exists to insert a new child without adjusting other records. The example uses an initial allocation of 8, as seen by the fact that when no children exist, the first child receives an interval size which is one-eighth the parent's interval. Any power of two will do.
Another feature of this procedure is that if nodes have been deleted from the tree, gaps will exist; this process will locate such gaps and repopulate them with the new record, avoiding fragmentation. Only if no gaps exist, and the existing children completely cover the parent's range, will a split happen. A split simply cuts in half the boundary values of each child below the parent, relative to the parent. This doubles the number of children available, rapidly making room for many children when a branch is heavily populated. The cost of this operation is a single update applying to every child node at or below the insertion point, which is an indexed operation, as we will see next.
Searching
All this effort would be wasted except for the efficiency we will find in searching and subsetting the tree. With a traditional parent pointer design, any attempt to query for more than just the immediate children of a node requires a loop to recurse over each level of the tree. This is especially ugly if the whole tree must be retrieved in sorted fashion.
In contrast, notice the simple way that a subtree rooted at a given node (@root) can be isolated, while maintaining an ordering that permits a partially sorted traversal:
select stuff_id,name from stuff where minval >= (select minval from stuff where stuff_id=@root)
and maxval <= (select maxval from stuff where stuff_id=@root)
order by tree_level,parent_id,sort_keyThe primary technique in any subsetting query is to constrain the boundary values of the result set to fall within the range of the root. The subqueries are against the primary key, and the main query is on a pair of indexed columns.
Here is a depth-first traversal that ignores the user-specified sort key:
select stuff_id,name from stuff where minval >= (select minval from stuff where stuff_id=@root)
and maxval <= (select maxval from stuff where stuff_id=@root)
order by minval,tree_level
Another useful operation that looks similar returns the path to @node from the root, from the top down:
select stuff_id,name from stuff where minval <= (select minval from stuff where stuff_id=@node)
and maxval >= (select maxval from stuff where stuff_id=@node)
order by tree_levelBy reversing the relations, we return the nodes within whose range the candidate node falls, which by definition are all ancestors.
Another useful example shows how to produce a result set that facilitates a compact "view" of the tree, with a single branch partly expanded based on user input. All nodes along the path to the selection, plus all their siblings, are displayed in such a view. Thus, we combine the previous example with a second query to access the siblings at each point along the path:
select stuff_id,tree_level,name from stuff where parent_id in ( select stuff_id,name from stuff where minval <= (select minval from stuff where stuff_id=@node)
and maxval >= (select maxval from stuff where stuff_id=@node)
order by tree_level ) order by tree_level,sort_key
| Scott Cantor | |
| The Ohio State University | |
| 1121 Kinnear Road | |
| Columbus, OH 43212 | |
| US | |
| Phone: | +1 614 247-6147 |
| EMail: | cantor.2@osu.edu |