NSF Middleware Initiative: Released for Public Review

Document:
rpr-nmi-edit-mace_dir-metadirectories_practices-1.0.html

Expires: October 2002

Richard Jones
University of Colorado

Tom Barton
University of Memphis

Keith Hazelton
University of Wisconsin

Brendan Bellina
University of Notre Dame

Eileen Shepard
Boston College

Ann West
EDUCAUSE/Internet2

Copyright © 2002 by UCAID and/or the respective authors

May/2002

Comments to: nmi-support@nsf-middleware.org
 

 

Metadirectory Practices for Enterprise Directories
in Higher Education


Abstract

This document outlines a set of metadirectory (or meta-directory) issues that should be considered in the deployment of enterprise directories and offers accompanying best practices for higher education.  The model outlined in this document was assembled from the authors’ experiences, discussions among MACE-dir participants, and through interviews with those at other institutions.

 

Table of Contents

Metadirectory Practices for Enterprise Directories in Higher Education

Abstract

Table of Contents

1       Introduction

2       Conventions used in this document

3       Terminology & Usage Examples

I. University of Maryland, Baltimore County

II. University of Alabama, Birmingham

III. Boston College

4       Planning for the Enterprise Directory

4.1        Metadirectory Processes

4.2        The Join: Directory Sources, Identity Matching, and Registry Building

4.2.1     Directory Sources

4.2.1.1     Issues with Directory Sources

4.2.2     Identity Matching

4.2.3     Building the Registry

Recommendations to consider:

4.3        Intelligence

4.3.1     Unique Identifiers

4.3.2     Technical Expression of Institutional Policy and Procedures

4.3.3     Operational Design Requirements

Recommendations to consider:

4.4        Consumers of directory information

4.4.1     Supplying Multiple Consumers

4.4.2     Authentication and Authorization

4.4.3     Support for Microsoft Active Directory

4.4.4     Resource Provisioning

Recommendations to consider:

4.5        Addenda

4.5.1     University of Memphis Finite State Machine Provisioning Model

5       Documents

6       Advice To Implementers

7       Acknowledgments

8       References

9       Contact Information

 

1        Introduction

This document concentrates on the architectural issues that confront the enterprise directory architect, designer, and implementer.  The authors deliberately avoided going down the many tempting side paths, instead attempting to provide ways to think about metadirectory processes that should help to keep the big picture in view while navigating some of the common challenges.  Readers who have done an enterprise directory implementation at least once will certainly recognize the issues, and may find value in comparing their practices with the ones sketched here.

The Burton Group coined the term "meta-directory" in the July 1998 paper "Enterprise Directory Infrastructure: Meta-directory Concepts and Functions" as a technology or class of functionality required to build an enterprise directory infrastructure.  This definition applies directly to the construction of enterprise directories at institutions of higher learning, which often requires the collection and transformation of data about resources from systems of record such as human resources information systems and student information systems.


Unlike directories that may be designed to meet the needs of a single application, such as a campus white pages directory or an email directory, enterprise directories are designed to meet both the immediate needs of existing applications and the future needs of applications and services yet to come.  To that end, enterprise directories require a greater focus on the processes by which data migrates from systems of record into an enterprise directory to be accessed by applications and/or provided to other services, systems, and application directories. A general diagram of an enterprise directory architecture is shown below.


2        Conventions used in this document

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC-2119.

 

3        Terminology & Usage Examples

ACI - Access Control Instruction

The mechanism by which you define access is called access control. When the server receives a request, it uses the authentication information provided by the user in the bind operation, and the access control instructions (ACIs) defined in the server to allow or deny access to directory information. The server can allow or deny permissions such as read, write, search, and compare. The permission level granted to a user may be dependent on the authentication information provided. (Taken from iPlanet Directory Server documentation.)

ACL - Access Control List

A list of Access Control Instructions (ACIs).

authentication (authN)

Authentication is the process of establishing whether or not a real-world subject is who or what its identifier says it is. Identity can be proven by something you know, like a password; something you have, as with smart-cards, challenge-response mechanisms, or public-key certificates; or something you are, as with positive photo identification, fingerprints, and biometrics. (For more on this topic, see Internet-2 Middleware Authentication website <http://middleware.internet2.edu/core/authentication.shtml>.)

authorization (authZ)

The determination that a request can be honored. (For more on this topic, see Internet-2 Middleware Authorization website <http://middleware.internet2.edu/core/authorization.shtml>.)

CSO Nameserver (aka ph and qi)

A directory server product created at the University of Illinois and widely adopted at higher education institutions for white pages functions. CSO Nameserver has little support for privacy, personalization, and security and was not adopted by vendors except for Qualcomm Eudora, a fairly popular email client.

data providers

Systems of record which provide data to other systems and processes.

data consumers

Systems and processes which retrieve or receive data from the enterprise directory.

directory

A specialized database that may contain information about an institution’s membership, groups, roles, devices, systems, services, locations, and other resources.

enterprise directory

A core middleware architecture which may provide common authentication, authorization, and attribute services to electronic services offered by an institution.  See the "Middleware Business Case" http://middleware.internet2.edu/earlyadopters/draft-internet2-ea-mw-business-case-00.pdf for a thorough discussion of the values provided by enterprise directory services.

enterprise directory infrastructure

The infrastructure required to support and maintain an enterprise directory. This may include multiple directory hardware components as well as the processes by which data flows into and out of the directory service.

ETL

Tools which handle data extraction, transformation, and loading are called ETL tools.  These tools are common in the data warehousing industry, but are not yet commonplace in the directory/identity management industry.  In the directory industry metadirectory products and directory integration products such as Metamerge address this need.

GUID

Globally Unique Identifier.  A guid is a unique ID intended to identify a single person for the entire period of their intersection with an institution’s electronic services.  The guid is intended to function as the primary key for an individual within an institution’s enterprise directory.  Guids may be assigned according to an algorithm or constructed from source system IDs.  Guids should not be changed, reassigned, or retired. (The term "global" in this context means "within an institution", not across all institutions.)

intelligence

The processes that move data: a) from source/owner systems to the registry; and b) from the registry to one or more consumer systems (presumably one of which is a directory server) in conjunction with logic that is applied during the movement of the data, e.g. identity management, business rules, application of data formatting standards, etc.

Join

The Join is the process by which disparate identifiers from multiple source systems are extracted and examined, producing a single master record of identifiers for each individual entity which can be used as a link back to the source system records.

Kerberos

Kerberos is a network authentication system for use on physically insecure networks, based on the key distribution model presented by Needham and Schroeder. It allows entities communicating over networks to prove their identity to each other while preventing eavesdropping or replay attacks. It also provides for data stream integrity (detection of modification) and secrecy (preventing unauthorized reading) using cryptography systems such as DES. (For more on this topic, see the Kerberos FAQ list http://www.cis.ohio-state.edu/hypertext/faq/usenet/kerberos-faq/user/faq.html.)

LDAP directory

A directory that supports Lightweight Directory Access Protocol (LDAP). LDAP is a widely adopted IETF standard directory access protocol well suited to the authentication and authorization needs of modern application architectures. iPlanet Directory Server, OpenLDAP, Novell eDirectory, and Microsoft Active Directory are examples of LDAP directories. (For more on this topic, see RFC1487 http://www.ietf.org/rfc/rfc1487.txt, RFC1777 http://www.ietf.org/rfc/rfc1777.txt, RFC2251 http://www.ietf.org/rfc/rfc2251.txt, and the LDAP Roadmap http://www.kingsmountain.com/ldapRoadmap.shtml.)

metadirectory (also meta-directory)

The processes by which source data is captured, transformed, and presented in an enterprise directory.

RDBMS

Relational Database Management System. Examples of relational databases include Oracle, Microsoft SQL Server, Sybase, and Red Brick.

registry

The system in which identity of resources is resolved. This often refers to a database component of the enterprise directory.

alt. def.: Data taken from multiple source/owner systems to which "intelligence" has been applied in preparation for feed to one or more directories, applications, or other consumer systems. Further "intelligence" may be applied as part of any individual feeding process. Registry data may be housed in a relational database, indexed files, or a directory server.

 

Usage Examples

The following usage examples describe actual implementations using the above terminology.

I. University of Maryland, Baltimore County

Source data systems are HR and SIS systems with data stored in Oracle RDBMS.  Database triggers record updates in a change log. Perl scripts query that change log and apply the updates to records in an iPlanet Directory Server (iDS) LDAP v3 server.  Perl scripts also query the iDS change log for updates and accordingly update Microsoft Active Directory, the Remedy trouble ticket system, and NIS.

In this example, the collection of Perl scripts and database triggers makes up the intelligence function (presumably there is more going on here than merely moving data around). The iPlanet directory server functions as the registry.

II. University of Alabama, Birmingham

Source systems include HR, the student system, and a private health system. Mainframe programs generate extracts which are dumped to a qi/ph/CSO Nameserver server. Data is then pushed from the qi/ph server to an iPlanet Directory Server LDAP v3 server, and from there to Microsoft Active Directory.  The mainframe programs which generate the extracts and the scripts which update qi/ph, iDS, and Active Directory make up the intelligence function. The qi/ph server effectively functions as the registry in this example.

III. Boston College

The registry is the source data system for identifiers and the single point of entry for all systems, including PeopleSoft HR. All new BC users are added first to the "corporate database" (VSAM files) which effectively functions as the registry. A set of unique identifiers is generated for use by all systems, obviating the need for identity reconciliation. Student and/or HR systems "activate" the user, which marks the user for inclusion in multiple feeder processes which populate iPlanet Directory Server LDAP v3 server and other consumer systems (email, voicemail, NT, Radius). Certain transactions trigger near-realtime updates whereas others are applied in batch nightly. In most cases, updates are shipped to the consumer systems via FTP and applied by scripts and/or C programs run on those systems. Intelligence consists of those programs (batch or online transactions) which enter a user onto the system initially, near-realtime and bulk update routines, and the various programs which feed the data to consumer systems.

 

4        Planning for the Enterprise Directory

4.1    Metadirectory Processes

This model consists of three major processes. First is the process of consolidating data from all systems of record, such as human resources information systems, student information systems, email address tables, UNIX account information, campus telephone directories, physical office locations, etc., and "joining" the information to produce a single master record for each individual.  The identity matching process resolves records that appear to be related to an individual, determining definitively whether they are or are not.  The resultant collection of resolved master records, referred to as a "registry", may be stored in a single data store (database table, indexed file, etc.). In essence, this process reviews all of the relevant institutional sources of data and joins them together.

The second process, termed "intelligence" in this document, manages how data is inserted into, modified in, and deleted from the registry based upon the business rules of the institution. This process is mindful of both the data-providing source systems and the applications that will consume the transformed data.

The third process considers all the applications and systems that will use this enterprise infrastructure (the consumers of directory information) and provisions them accordingly.  For example, directory-enabled applications such as calendaring may look to an LDAP directory presentation of the data.  Non-directory-enabled applications may require special presentations, perhaps of just a few of the attributes.  Resource provisioning and account management systems track additions, removals and changes of status and perform tasks accordingly.

With implementation and acceptance of these processes and confidence in the enterprise directory, a metadirectory process can be leveraged to update core business systems.  The ownership of certain data associated with the system of record may migrate to the enterprise directory.  Data owned by the enterprise directory may be pushed back into the systems that provide data to the enterprise directory. These issues are just beginning to be addressed at many campuses, and many others have yet to reach this stage of enterprise directory development.

It should be noted that this document considers issues primarily relating to information about people. This is the most common starting place when building an enterprise directory infrastructure, and it is where the most experience exists and where best practices have been most clearly defined and published.  This document does not consider information such as organizational entities, physical information, computer and network nodes, or other types of resource data which may be stored in enterprise directories.   The implementation described in this document is intentionally "vanilla" and represents the most common practice, given current experience.

 

4.2    The Join: Directory Sources, Identity Matching, and Registry Building

The Join process copies data from institutional information sources, creates a resolved entry for individuals, and moves it into a registry to service application requests.  Because institutions often have multiple systems of entry for people information, and identity policies may not be consistent across all systems, the main difficulty is reconciling records from the multiple sources into a single record for each person.

Some institutions have enacted policies and processes to minimize the need for identity resolution through a join process by providing a single institution id regardless of the type of affiliation.  This proves useful in environments in which people may simultaneously have multiple affiliations such as student and staff or migrate among affiliations over time such as undergraduate student and graduate student.  This does require a central system to supply system-wide identifiers and identity management policies which should be applied to all systems which create people records. A system-wide identifier can then be used to link all systems which contain people information.

 

4.2.1   Directory Sources

Institutions maintain authoritative sources for their data. Two common and important authoritative sources for people data are the human resource (HR) and student information systems. The HR system may provide faculty and staff information including name, job title, Social Security Number (for U.S. institutions), and home address, usually keyed by employee number. (Of course, in reality the U.S. Social Security Administration maintains the authoritative source for SSN.) The student system (whether home-grown, PeopleSoft, SIS, SCT Banner, or something else) generally contains student information such as name, class level, major, and permanent home address, usually keyed by a student ID.

Any system which masters information about people may be a valid source for directory information, including systems which track alumni, vendors, donors, parents, and other affiliates.  Campus housing, telephone, facilities management, and email systems may also provide important contact information which may be useful to have in an enterprise directory.

There may be other sources that are updated occasionally, such as those holding data about campus contractors or loosely-affiliated agencies.  The library system may supply information about general public patrons, and the medical center may provide data about affiliated doctors or even patients.  To provide services such as computing access, recreation center access or meals to summer conference participants or campus guests, administrators may supply information about them as well.

Some directory information may be updated asynchronously, i.e., in near real-time. Information from the types of sources mentioned in the preceding three paragraphs might be provided in batch extracts, or might (also) be available via transactions. In addition, the registry or other parts of the enterprise directory infrastructure themselves authoritatively house certain directory information, and applications to update that information may serve as asynchronous data providers.

The physical implementation of directory sources may include application databases, indexed files, flat-files, operating system directories, and web sites. Any system which stores information about people who are affiliated with your institution should be considered a possible directory source.

 

4.2.1.1     Issues with Directory Sources

Identifying the authoritative campus source for each attribute is a critical component in designing the enterprise directory infrastructure. One of the biggest challenges reported by institutions that have built these infrastructures is that some portion of the data in their source systems is out of date, contains mistakes, and/or is not consistently formatted. Most institutions prefer not to fix bad data within the enterprise directory and instead develop a policy stipulating that corrections must be applied at the source.

Another common problem is deciding which pieces of source data to use in the enterprise directory.  For example, one institution surveyed stored thirty-four address types in their student information system and had to establish a project to determine which one was appropriate to move to the directory. Another example is deciding how to order multiple titles, so that the president's information is displayed primarily as "president" or "officer" rather than "faculty".  These decisions require applying knowledge of the business, policies, and the source systems, all of which are different at each institution.

One of the most valuable outcomes of building an enterprise directory infrastructure is simplifying attributes.  For example, determining which address is the appropriate one, adding it to the directory and supplying it to all applications that need it, is a great time saver for developers and application integrators.  In a similar fashion, computed values that reduce many attributes to a few are also useful.  For example, several categories of faculty (associate professor, lecturer, etc.), staff (part-time, professional), or students (graduate, undergraduate) stored in the systems of records can be aggregated under a new attribute, reducing both the complexity and runtime of queries.  Another example involves student fees and deciding whether all the fees are paid to determine library access.  While there may be many different kinds and amounts of student fees tracked in the student information system, the enterprise directory could provide a single attribute summarizing fee status for the library application to use.
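
The aggregation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a prescribed schema: the category codes, the attribute names `affiliation` and `feesPaid`, and the function names are all hypothetical and would be replaced by an institution's own values.

```python
# Hypothetical mapping from detailed source-system categories to a
# single summary value; the codes are illustrative, not a standard.
AFFILIATION_BY_CATEGORY = {
    "associate professor": "faculty",
    "lecturer": "faculty",
    "part-time": "staff",
    "professional": "staff",
    "graduate": "student",
    "undergraduate": "student",
}


def summarize_affiliations(categories):
    """Reduce detailed categories to a sorted list of summary affiliations."""
    found = {AFFILIATION_BY_CATEGORY[c] for c in categories
             if c in AFFILIATION_BY_CATEGORY}
    return sorted(found)


def fees_paid(balances):
    """Return True when no tracked fee has an outstanding balance."""
    return all(amount <= 0 for amount in balances.values())


# A registry entry then carries two simple attributes instead of many.
entry = {
    "affiliation": summarize_affiliations(["lecturer", "graduate"]),
    "feesPaid": fees_paid({"tuition": 0, "lab": 0, "parking": 25}),
}
```

A library application would then test only `feesPaid` rather than interpreting each fee type tracked in the student information system.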

It is common to transform data before putting values into a registry.  Standardizing formats and attribute contents, regularizing upper/lower case in names, and removing duplicate names coming from different systems are the most common transformations.  Directory planners should keep in mind that data transformation can require a significant investment in time and energy.
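
As a rough sketch of such transformations, the functions below regularize case, standardize one format, and remove duplicates. The function names and the ten-digit telephone format are assumptions chosen for the example; the simple casing rule shown here would mishandle names such as "McDonald", which is one reason this work takes real effort.

```python
import re


def normalize_name(raw):
    """Collapse whitespace and apply simple title casing to a name."""
    words = re.sub(r"\s+", " ", raw).strip().split(" ")
    return " ".join(w[:1].upper() + w[1:].lower() for w in words if w)


def normalize_phone(raw):
    """Keep digits only and format a 10-digit number as NNN-NNN-NNNN."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) != 10:
        return None  # leave unusual values for correction at the source
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


def dedupe_names(names):
    """Drop names that are identical after normalization, keeping order."""
    seen, result = set(), []
    for name in map(normalize_name, names):
        if name not in seen:
            seen.add(name)
            result.append(name)
    return result
```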

 

4.2.2   Identity Matching

The most challenging problem is matching up a person’s identities and deciding when records from different sources apply to the same or different individuals.  Typically, most overlap occurs between the HR and student systems where a staff member enrolls in a course or a student holds a part-time staff position.  A common strategy is to compile a list of attributes and use them as a basis for comparison.  Attributes that infrequently change are often used, such as SSN, formal name, date of birth, and permanent home address.  While every institution uses its own approach and criteria, the common goal is to automate as much of this as possible, avoiding the potentially high costs of reviewing and resolving mismatches manually. David Wasley of the University of California system reports that their matching logic fails to resolve about 1,000 out of 200,000 people, a success rate of 99.5%. Of course, the ability to code matching rules successfully depends on the quality of data you are trying to match. Systems for which data is not entered with care, or for which it is acceptable to falsify or omit data, will have greater problems reconciling identity.

A variation on the strategy of exact-matching tries to address the problem of non-synchronized identity attributes by using a point matching scheme where each possible attribute match adds points to a cumulative matching total, and if the total is greater than some calibrated number, the two records are considered to be a likely match.  This approach or similar fuzzy-logic techniques may offer some hope for institutions that have not enforced consistent data entry across their systems.
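
A point matching scheme of this kind might look like the following sketch. The attribute names, weights, and threshold are invented for illustration; each institution would need to calibrate its own values against its own data.

```python
# Illustrative weights and threshold; real values must be calibrated
# against an institution's own data.
WEIGHTS = {"ssn": 50, "date_of_birth": 25, "formal_name": 15, "home_address": 10}
THRESHOLD = 60


def match_score(a, b):
    """Sum the weights of attributes that are present and equal in both records."""
    score = 0
    for attr, weight in WEIGHTS.items():
        if a.get(attr) and a.get(attr) == b.get(attr):
            score += weight
    return score


def likely_match(a, b, threshold=THRESHOLD):
    """Treat two records as a likely match when the total exceeds the threshold."""
    return match_score(a, b) > threshold


# Example records with made-up values.
hr = {"ssn": "000-00-0000", "date_of_birth": "1980-01-02",
      "formal_name": "Pat Q. Example", "home_address": "1 Main St"}
sis_same = {"ssn": "000-00-0000", "date_of_birth": "1980-01-02",
            "formal_name": "Patricia Example", "home_address": "22 Oak Ave"}
sis_weak = {"ssn": None, "date_of_birth": "1980-01-02",
            "formal_name": "Pat Q. Example", "home_address": "22 Oak Ave"}
```

With these weights, `hr` and `sis_same` agree on SSN and date of birth and exceed the threshold, while `hr` and `sis_weak` agree only on date of birth and name and would be left for manual review.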

In general the effectiveness of identity matching is controlled by the consistency, quality, and the amount of data.  The more information supplied from each source system, the more successful the coding of matching rules. It is harder to perform good matching with only name and address, easier with the addition of date of birth, gender, SSN, and easiest if they all match.

It may also be important to note what steps were taken to verify the authenticity of identification information when people were entered into source systems. Government-issued identification, such as picture IDs, SSN cards, birth certificates, and passports, often provides an adequate degree of verification, and often student and HR systems have policies to ensure proper identification.  Other campus systems may allow access to services without strong identification verification, for example, the campus library may register patrons with little or no verification. It is particularly important to consider the strength of identity verification policies if the enterprise directory will be used to supply authentication services.

Once two or more records are matched and given a unique identifier, updates can be handled automatically when the source data changes. However, if different sources map to a single attribute in the registry, then programming logic must determine which source is the authoritative one.  On one campus, for example, different sources contained different gender information for the same person.  It is important that for any single attribute in a single registry entry there is a single system of record.
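
One way to express the single-system-of-record rule is a table that names the authoritative source for each registry attribute. The sketch below assumes hypothetical attribute and system names, and simply ignores values arriving from a non-authoritative source; a real implementation would probably log them for review.

```python
# Hypothetical declaration of the single system of record per attribute.
SYSTEM_OF_RECORD = {
    "gender": "HR",
    "formal_name": "HR",
    "major": "SIS",
}


def apply_update(entry, source, attribute, value):
    """Apply an update only if it comes from the attribute's system of record.

    Returns True when the registry entry was changed.
    """
    if SYSTEM_OF_RECORD.get(attribute) != source:
        return False  # ignore (or log) values from non-authoritative sources
    if entry.get(attribute) == value:
        return False  # nothing to change
    entry[attribute] = value
    return True
```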

 

4.2.3   Building the Registry

Building the registry entails extracting, transforming, and loading (ETL) the data into the registry.  While many institutions manage these processes with Perl scripts or Java applications, the no-cost license for Metamerge (est. 1998) available to higher education makes this an attractive option to institutions starting now and those who wish to move away from scripted solutions.

The recommended method for storing and managing the registry data is to use a relational database.  The size of the institution and amount of data to be stored in the registry are factors to consider.  Registries can be "fat" or "thin", depending on how much data is put into the registry. If the source systems are capable of being accessed by a variety of applications (perhaps using LDAP or SQL) and are highly available themselves, then building a thin registry with just enough data to perform the identity resolution might make sense, since the applications or consumers can get the identity from the registry and other data from source datastores.  Most campuses, however, choose to build a fat registry in order to supply information to consumers. That simplifies application and consumer requirements and avoids any issues, technological or procedural, with direct access to source systems.

Recommendations to consider:

·        The complications of the join function can be mitigated by instituting a common entry point for people. This will require more modifications in data entry systems but may be justified by cleaner registry content.  As more vendor products become directory-enabled a central entry point becomes more feasible.

·        Model the data in both source systems and the registry to ensure that attributes in the registry are populated from the definitive owning source and attributes are not over-loaded.

·        Use aggregate and summary attributes in the registry to support and simplify common queries.

·        Understand the identification, entry, and verification policies and processes used on source entry systems at your institution.

 

4.3    Intelligence

Some aspects of metadirectory operation are not specifically source- or consumer-oriented, but rather are concerned with architecture and operation of the registry itself. These considerations reflect how the institution’s business rules and policies are implemented in the metadirectory, hence the use of the term "intelligence".  Steve Carmody of Brown University identifies this function as the "lizard brain" - a form of primitive intelligence expected to evolve to a more advanced level in the future. 

 

4.3.1   Unique Identifiers

The registry database assigns a key or globally unique identifier (guid) for each entry. [See Early Harvest: Identifiers, Authentication, and Directories: Best Practices for Higher Education <http://middleware.internet2.edu/docs/internet2-mi-best-practices-00.html> for a thorough discussion of related issues.] This process should be carefully architected so that guids are never reassigned or revoked, and satisfy the requirement of being unique within a broad scope. In most implementations, the registry creates and owns this guid, typically a long integer or alphanumeric string. However, some campuses employ a unique identifier based on the internal IDs generated by their administrative system. The important thing is that the guid should be constructed so that it is least likely to have to be changed.

The persistence of the registry entry is also a critical design factor. In the simplest metadirectory model, source data is captured and transformed, and consumer systems are reloaded all at once, effectively making consumers into mirror images of the source systems. A registry need exist only transiently in this model. However, this approach is not common and does not scale as the need for additional and more complex or computed data increases. The standard approach is for registry entries to persist for at least as long as the person to which the entry is bound is to be provided services. It may also be the case that a registry entry is deactivated but never deleted so that some history remains to be used in the case that a person who has left returns, to aid in long-term auditing, or to help ensure the uniqueness of guids and other identifiers. The registry can also be used to enforce the institutional policy on reuse of UIDs and logins, if one exists.
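
The persistence rule can be illustrated with a toy in-memory registry in which entries are deactivated but never removed, so a guid can never be handed out twice. The class and method names are hypothetical, and the use of random UUIDs is an assumption for the example; the document only requires a long integer or alphanumeric string.

```python
import uuid


class Registry:
    """Toy in-memory registry; a real one would use a persistent store."""

    def __init__(self):
        self._entries = {}  # guid -> entry; entries are never removed

    def create(self, attributes):
        """Assign a new guid that has never been used and store the entry."""
        guid = uuid.uuid4().hex
        while guid in self._entries:  # guard against reuse, however unlikely
            guid = uuid.uuid4().hex
        self._entries[guid] = {"active": True, **attributes}
        return guid

    def deactivate(self, guid):
        """Mark the entry inactive; its guid stays reserved."""
        self._entries[guid]["active"] = False

    def reactivate(self, guid):
        """Restore a returning person under their original guid."""
        self._entries[guid]["active"] = True

    def is_active(self, guid):
        return self._entries[guid]["active"]

    def __contains__(self, guid):
        return guid in self._entries
```

Because there is deliberately no delete operation, a person who leaves and later returns can be reactivated under the same guid.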

The registry’s structure must be designed to support data flows from sources to consumers, and possibly to other destinations, including flows terminating in the registry itself. At minimum a registry entry must store the guid and all the source and consumer systems’ identifiers for an individual’s entry, so that new information arriving from a source system can be associated with the proper registry entry and be used to update the proper entries in consumer systems. (The co-locating of source system identifiers together in a registry entry is the outcome of performing the join operation discussed in the preceding section.) Keeping all binding information together in the registry enables a metadirectory design that permits all source and consumer systems to maintain their "native" fundamental keys or identifiers. It also does not require the external systems to be extended to include the registry’s guid for binding purposes.
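
A minimal sketch of such an entry follows. It assumes hypothetical system names ("HR", "SIS", "email") and shows how keeping every system's native identifier in the registry entry lets an update arriving from one system be routed to the right records elsewhere, without any external system having to store the guid.

```python
from dataclasses import dataclass, field


@dataclass
class RegistryEntry:
    guid: str
    # system name -> that system's native identifier for this person
    identifiers: dict = field(default_factory=dict)


class IdentifierIndex:
    """Look up registry entries by (system, native identifier)."""

    def __init__(self):
        self._by_native = {}

    def bind(self, entry, system, native_id):
        """Record the outcome of the join: this native id belongs to this entry."""
        entry.identifiers[system] = native_id
        self._by_native[(system, native_id)] = entry

    def find(self, system, native_id):
        """Return the entry bound to a source system's identifier, or None."""
        return self._by_native.get((system, native_id))


def consumer_targets(entry, consumers):
    """Return the native identifiers to update in each consumer system."""
    return {c: entry.identifiers[c] for c in consumers if c in entry.identifiers}
```

For instance, an update arriving from the HR system keyed by employee number is resolved with `find("HR", employee_number)`, and `consumer_targets` then yields the keys to use when updating each consumer.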

However, if the guid is created in the registry, the initial design might specify that that is the only place it will appear, and that it is only used internally to guarantee uniqueness inside the registry. Experience has shown that new consumers of directory information will request or even demand to include the guid in their representation of metadirectory information or application development.  It is best to plan for this possibility from the start.

Allowing the guid to be used in applications may seem harmless, but it can create problems.  One problem is that if metadirectory needs require the format of the guid to be modified, the modification will be complicated by the use of the guid in application databases. A second problem is that application usage may create pressure to modify guid values, pressure which would not exist if the guid remained internal to the metadirectory. A third problem is that releasing the guid to application systems means that specific guid values and the people to whom they are assigned may well become known to a larger group than just metadirectory administrators. Because guids are assigned for the lifetime of electronic service, this could become a problem in the event that a person requests privacy under FERPA or other privacy frameworks.

One alternative to allowing the guid to proliferate is to promote the use of a different identifier for this purpose. One institution advocates that new system developers use a Publicly Visible Identifier (PVid) as a foreign person key. This, coupled with never releasing guid values, will reduce the chances that a person’s guid will need to be changed over time.  Whichever identifier is chosen, the registry bears a responsibility to maintain an accessible historical change log.  Identifier changes can rarely be completely eliminated.  If there are loosely connected systems using an identifier that changes in the registry, they will need such an historical change table to update their systems when they notice that the old value is no longer in use in the registry.  This underscores the fact that whichever identifier is used, it should not be reassigned, or identity mapping between the registry and the connected systems can become corrupted.

 

4.3.2   Technical Expression of Institutional Policy and Procedures

The registry is the one place in which data is stored in a manner independent of source or consumer system constraints. Source data is transformed when it is added, and consumer data is derived when it is used. For example, standardized telephone numbers, addresses, and names might be stored in the registry once, so that one does not need data tailored for each consumer. This may require, however, storing each discrete part of an address separately so that the complete address can be reformulated properly into the formats required by different consumer systems.
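
The idea of storing discrete address parts once and deriving each consumer's format on demand might look like the sketch below. The field names and the two output formats are illustrative assumptions; the "$" line separator follows the LDAP postalAddress syntax, and the sketch omits the escaping that syntax requires for "$" and "\" inside a line.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    """Address stored in the registry as discrete parts (hypothetical field names)."""
    street: str
    city: str
    state: str
    postal_code: str


def as_single_line(addr: Address) -> str:
    """Format for a consumer that wants one display line."""
    return f"{addr.street}, {addr.city}, {addr.state} {addr.postal_code}"


def as_ldap_postal_address(addr: Address) -> str:
    """Format for an LDAP postalAddress value, whose lines are separated by '$'."""
    return "$".join([addr.street, f"{addr.city}, {addr.state} {addr.postal_code}"])


addr = Address("101 Main St", "Springfield", "TN", "38100")
display_line = as_single_line(addr)
ldap_value = as_ldap_postal_address(addr)
```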

The registry must also contain information needed to support the reconciliation of data from disparate source systems, if one or more of those systems are not available during the registry-update processing. For example, a business rule might mandate that a person’s name is taken from the HR system, if the individual also has an entry in the Student System. However, when the HR system is unavailable, data sufficient to complete the entry must be kept in the registry to offset this. Two solutions might be to (1) maintain an attribute that indicates the registry’s source of the name value or (2) maintain attributes that are copies of the values in the different source systems (SIS-name and HR-name, for example).
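
The second solution can be sketched as follows: the registry keeps a copy of the name from each source, and the business rule is re-evaluated over the stored copies whenever any source delivers an update. The precedence order, dictionary layout, and function names are assumptions made for illustration only.

```python
from typing import Dict, Optional, Tuple

# Hypothetical business rule: prefer the HR name, fall back to the student system.
NAME_PRECEDENCE = ("HR", "SIS")


def resolve_name(source_names: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (name, source) chosen by precedence from the stored per-source copies."""
    for source in NAME_PRECEDENCE:
        name = source_names.get(source)
        if name:
            return name, source
    return None


def apply_update(entry: dict, source: str, name: str) -> None:
    """Store the source's copy, then recompute the registry's name and its origin."""
    entry["source_names"][source] = name
    resolved = resolve_name(entry["source_names"])
    if resolved is not None:
        entry["name"], entry["name_source"] = resolved


entry = {"source_names": {}, "name": None, "name_source": None}
apply_update(entry, "HR", "Jane Q. Doe")
# A later run in which only the student system is available:
apply_update(entry, "SIS", "Jane Doe")
```

Because the HR copy is retained in the registry, the later SIS-only run does not displace the HR name.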

More sophisticated metadirectory designs may also address resource provisioning. Registry entries will need to contain data necessary to implement provisioning business rules.  System accounts, email properties, group memberships, and values of certain attributes may all need to be controlled by business rules governing the provisioning policy. Extracts of registry information may also need to be prepared for export to non-live consumer systems within the provisioning umbrella. In addition, notification of the state or impending change of state of resources may need to be sent to account holders or others. Attributes describing the "state" of an entry and its associated resources and attributes containing the independent variables needed to evaluate provisioning rules, such as major affiliation and time at which the entry assumed its present state, will need to be incorporated into the registry. An example of stateful provisioning as implemented at the University of Memphis is detailed in the Addenda section.

 

4.3.3   Operational Design Requirements

Several factors affect the design of how information will flow from the registry to the consumers. In the simplest case, consumers are updated by rebuilding them from scratch. In more sophisticated instances, consumers are rarely rebuilt and, instead, updates flow to them. In these instances, several design options should be considered. Should data be pushed or pulled, or should there be a publish-subscribe architecture? Should updates flow asynchronously (as they occur, in real time) or only in pre-scheduled batches?

Closely related to consumer data flow is the capability of the metadirectory to be used to recover from accidents. Systems and live-ware (i.e., people) will fail to operate perfectly from time to time, and the metadirectory may blithely move bad data into consumer holdings, causing a corresponding impact to services from the users’ perspective. The potential for a spectacularly grand failure exists, and the risk must be minimized. A batch flow model could afford a metadirectory operator the chance to review proposed updates before they are released. For example, a threshold can be established so that batch updates are permitted to proceed automatically if the number of changes in the update is less than the configured threshold, otherwise an operator is notified so that manual review can be performed. In asynchronous flow designs, a stateful approach to provisioning can help to smooth over such episodes. This might be accomplished by designing the finite state machine that governs automated provisioning so that transition to states in which users’ service is reduced or eliminated can occur only after a time threshold or grace period is exceeded. That grace period provides a window of time in which to recover from the failure without direct user impact. An even more robust solution would be a design in which all changes to the registry are serialized, a registry changelog is maintained, and these are used to support rollback and replay operations for consumer updating.
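
The batch threshold described above can be sketched in a few lines. The function and parameter names are hypothetical; how changes are applied and how an operator is notified are left to caller-supplied functions.

```python
from typing import Callable, List


def release_batch(
    changes: List[dict],
    threshold: int,
    apply_change: Callable[[dict], None],
    notify_operator: Callable[[str], None],
) -> bool:
    """Apply the batch automatically only if it is smaller than the threshold.

    Returns True if the batch was applied, False if it was held for manual review.
    """
    if len(changes) < threshold:
        for change in changes:
            apply_change(change)
        return True
    notify_operator(
        f"Batch of {len(changes)} changes meets or exceeds threshold {threshold}; "
        "held for manual review."
    )
    return False


applied: List[dict] = []
messages: List[str] = []

small = [{"op": "modify", "id": n} for n in range(3)]
large = [{"op": "delete", "id": n} for n in range(500)]

small_released = release_batch(small, 100, applied.append, messages.append)
large_released = release_batch(large, 100, applied.append, messages.append)
```

A batch of 500 deletes, for instance after a source system exports an empty file by mistake, is held rather than propagated.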

The metadirectory infrastructure is yet another in the suite of technologies operated by central IT, and operational needs both ordinary and peculiar to this technology must be anticipated. Activity and exception logging at strategic points in the flow of data into, through, and out of the metadirectory is crucial to backtracking some types of service issues (was the user’s access denied because of policy executed correctly by the metadirectory, or did something misfire?). Helpdesk staff will need general information on metadirectory status in order to anticipate potential changes to integrated services. In addition, they’ll need a tool with which to query details concerning individual users to enable them to help diagnose specific service issues. There may be needs for standard or ad hoc reporting. Standard reports might detail how many of which types of changes to the registry occurred over the previous 24 hours, provide detailed reporting of changes to certain subpopulations of interest (which faculty accounts are in transition to a disabled state?), and circulate exceptions to appropriate operational staff. Ad hoc reporting needs can sometimes be met by relying on a metadirectory consumer’s database, but it may prove most convenient to report from the registry itself for some purposes. In fact, that may be an early indicator of a new consumer needing to become integrated with the metadirectory. For example, some of the information needed to print the campus telephone directory may reside in the registry, and it may become only a matter of semantics whether you call the corresponding report a report or a consumer flow of occasional use.

There is one further requirement for metadirectory logging that is peculiar to this technology. The primary function of the metadirectory is to manage the relationship between all of the key identifiers that are associated with people in various databases maintained by the enterprise. Changes to this relationship must be audited. That is, logs must be kept when someone is issued a new identifier or re-issued an old one, or if an identifier is re-assigned from one person to another (your design should likely prohibit this, but it may still occur). It must always be possible to determine who had which identifier at which point in time, for all time. This information is needed to help maintain the integrity of some consumer systems, and may be needed to maintain the integrity of the registry itself. In addition, this information is required to be able to audit users’ actions across the suite of services provided through consumers integrated with the metadirectory.
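
One way to meet the "who had which identifier at which point in time" requirement is to keep a history of assignment intervals that is never deleted. The sketch below is a minimal in-memory illustration; the class names and sample identifiers are hypothetical, and a production log would live in durable storage.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Assignment:
    identifier: str
    guid: str
    start: datetime
    end: Optional[datetime] = None  # None means the assignment is still current


class IdentifierLog:
    """History of identifier assignments; rows are closed, never deleted."""

    def __init__(self) -> None:
        self._rows: List[Assignment] = []

    def assign(self, identifier: str, guid: str, when: datetime) -> None:
        """Record that the identifier belongs to guid from 'when' onward."""
        self.release(identifier, when)  # close any open interval first
        self._rows.append(Assignment(identifier, guid, when))

    def release(self, identifier: str, when: datetime) -> None:
        """Record that the identifier stopped being held at 'when'."""
        for row in self._rows:
            if row.identifier == identifier and row.end is None:
                row.end = when

    def holder_at(self, identifier: str, when: datetime) -> Optional[str]:
        """Return the guid that held the identifier at the given time, if any."""
        for row in self._rows:
            if (row.identifier == identifier and row.start <= when
                    and (row.end is None or when < row.end)):
                return row.guid
        return None


log = IdentifierLog()
log.assign("jdoe", "g-0001", datetime(2000, 9, 1))
log.release("jdoe", datetime(2001, 6, 1))
# A re-assignment to another person, which a design should normally prohibit,
# is still recorded so that later audits remain accurate.
log.assign("jdoe", "g-0042", datetime(2002, 1, 15))
```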

Recommendations to consider:

·        The registry id, or guid, SHOULD NOT be reused or changed over time. It is less likely to require changes if: it is sized sufficiently for growth; is not constructed from id’s that are likely to change, such as surname or department; has no embedded meaning; is not a vendor-controlled attribute; and is not provided to applications.

·        A publicly visible id (PVid) MAY be used as a foreign key to the guid, giving applications an identifier to use so that the guid need not be published or propagated.

·        Maintain a history log of id changes.

·        Understand and plan for the requirements of a provisioning system, defining needed attributes in the registry.

·        Plan for errors in metadirectory processes and use operational thresholds and reporting to minimize risks and impacts of errors.

 

4.4     Consumers of directory information

This section describes consumers of the enterprise directory data and the processes that present the data to them from the registry.  In this sense, it is forward looking to applications that are either directory-enabled now or will be in the future.  It is the area that primarily justifies building an enterprise directory infrastructure and is the payoff for that investment. Because this is the area of most active continuing development in the metadirectory infrastructure, it is the least settled and structured. This section covers the most common, vanilla, consumer functions that have appeared to date.

It should be noted that while earlier sections have discussed the registry concept without  specific focus on implemented technologies, this section is focused specifically on LDAP, which is at this time the clear directory technology of choice.  If an application is said to be "directory-enabled" it means that the application is capable of interfacing with an LDAP directory for at least authentication and possibly as a data store as well.

 

4.4.1   Supplying Multiple Consumers

The most common presentation of the data in the registry is an LDAP directory which is quite often initially used for white pages lookups and later for other directory-enabled applications.  Initially, it was commonly thought that in building a white pages directory, the institution was also building the enterprise directory infrastructure.  Now it is seen as only one consumer of enterprise directory information, although almost certainly the first to be deployed, and usually the one used to justify the initial investment.  People information is the usual starting point for both the registry and the LDAP directory because it has immediate applications such as web searches and underpins other kinds of data.  The success of deploying an LDAP directory with people information will naturally give rise to demands for other related kinds of data in the registry.

Why isn’t a single LDAP directory presenting the data in the registry enough?  As directory-enabled applications have come along, questions of whether to extend the schema of a master directory or build a special purpose directory tailored to the application have arisen.  No one answer fits all situations, but most agree that some extensions are appropriate for the master directory and others work best with a special purpose version.  Questions of what the application wants to store in the directory play into this.  Directories are a natural place to store user preferences, so adding schema extensions to hold user preferences for the interface of, for example, a calendaring application would usually be seen as appropriate, but to put the actual calendar data in the directory would not, and most calendar products follow this division.  Data that is frequently updated and/or application-specific is probably best stored in an application specific directory or application database, rather than an enterprise directory.

Similarly, depending on the design of directories, a special purpose version might be needed to hold different access permissions (ACL’s) -- perhaps to allow users to update fields, while the master directory remains "read-only."

Directory replication strategies are often implemented to address performance requirements.  How directory data is replicated influences directory designs, both in terms of what campus information is to be loaded into the replicants and the replication features of the particular LDAP software used.  It may also make sense to have a directory replicant outside of the campus firewall that contains only publicly accessible contact information. That way, even if the replicant is compromised, private contact information is not exposed.

It is important to make certain that private information is not propagated into replicant directories, application directories, or operating system directories (such as Microsoft Active Directory) without also providing the necessary safeguards to prevent the information from being accessible to inappropriate persons. The campus policies and procedures used to protect private data in database systems may need to be extended to address directories as well. Campus data stewards/owners should participate in the review of projects which involve consumers of registry data.

 

4.4.2   Authentication and Authorization

How the campus handles authentication (authN) can dictate choices in directory deployment.  Although an LDAP directory can store the user password and thus be used to authenticate against, many institutions choose to build on technologies designed specifically for authentication, such as Kerberos. Since many applications may be directory-enabled but not Kerberos-enabled, it is convenient to use an LDAP directory authentication plug-in to pass authentication credentials to a Kerberos domain rather than storing them in the directory. In that way, users can authenticate against either the LDAP directory or the Kerberos domain, and in both cases the Kerberos server validates the credentials.  Several schools, including the University of Notre Dame and Georgetown University, have written such plug-ins, and work is currently being done in coordination with Sun to develop a more generic plug-in to be offered freely to higher education iPlanet Directory Server users.  Other ways to authenticate, such as the WebISO project, also make an independent authN database, such as Kerberos, more attractive.  Indeed, from the metadirectory infrastructure, the Kerberos realm can be viewed as just another consumer of metadirectory information (the Kerberos principal attribute, if not the password).

Authorization (authZ) is usually initially thought of in terms of LDAP directory groups and roles, although it is broader than that and includes elective provisioning discussed below.  For an excellent discussion of directory groups as well as good advice, see MACE Best Practices for Directory Groups <http://middleware.internet2.edu/dir/groups/draft-internet2-mace-dir-groups-best-practices-01.html>.  "All faculty," "all students" and similar groups can be created via metadirectory processes once the more or less arduous work of agreeing on business rule-based definitions for them has been completed.  Such coarse-grained affiliation groups will likely find use in a broad range of consumer systems for authorization decisions and other purposes.
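
Once the definitions are agreed on, deriving coarse-grained affiliation groups from registry data is mechanically simple, as the following sketch suggests. The affiliation values, group names, and entry layout are hypothetical stand-ins for locally agreed business rules.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

# Hypothetical rule table: affiliation value -> group name.
GROUP_RULES = {
    "faculty": "all-faculty",
    "student": "all-students",
    "staff": "all-staff",
}


def build_groups(entries: Iterable[dict]) -> Dict[str, Set[str]]:
    """Derive group memberships from each entry's affiliations."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for entry in entries:
        for affiliation in entry["affiliations"]:
            group = GROUP_RULES.get(affiliation)
            if group is not None:
                groups[group].add(entry["uid"])
    return dict(groups)


people = [
    {"uid": "jdoe", "affiliations": ["faculty"]},
    {"uid": "sroe", "affiliations": ["student", "staff"]},
    {"uid": "guest1", "affiliations": ["visitor"]},  # matches no rule
]
groups = build_groups(people)
```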

 

4.4.3   Support for Microsoft Active Directory

For most sites, implementation of Microsoft’s Active Directory will heighten awareness of the need for metadirectory functionality.  AD cannot (at least at present) be referred to another LDAP directory for user account data. In principle, the idea of two or more LDAP directories with the same set/subset of user information seems to be at odds with the idea of an enterprise directory. However, as this paper makes clear, reality dictates that the need to maintain multiple consumer directories will exist for the foreseeable future. That said, most campuses will not want Active Directory to be the principal directory or authoritative source for people information, but would rather have it integrated into the enterprise directory infrastructure as another consumer system.

The initial load of Active Directory will in many cases come from an upgrade process where NT user accounts are migrated to AD. Unlike feeding AD from scratch from the registry, this allows sites to preserve a user’s SID history and liberates NT administrators from having to re-ACL permissions on shares in conjunction with the migration. An initial process to perform reconciliation with AD user entries and registry entries will need to occur. This should be a one-time effort after which future adds/deletes/modifies would be fed from the registry. One way of performing this initial reconciliation process would be to produce LDIF output of both the user registry and the Active Directory and write scripts to identify discrepancies and apply updates as needed.  Metamerge is another tool of choice for feeding updates to Active Directory, both because of the power of scripting available within it and because there is an AD connector.
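
The script-based reconciliation mentioned above could take roughly the following shape once both exports have been parsed into dictionaries. The sketch assumes that entries on both sides can be keyed by a shared login name and that LDIF parsing has already been done; the attribute names and sample data are illustrative only.

```python
from typing import Dict, List, Tuple

Entries = Dict[str, Dict[str, str]]  # login name -> attribute name -> value


def reconcile(
    registry: Entries, ad: Entries, attrs: List[str]
) -> Tuple[List[str], List[str], Dict[str, Dict[str, str]]]:
    """Compare registry entries with AD entries.

    Returns (missing_in_ad, only_in_ad, to_modify), where to_modify maps a
    login name to the registry values that differ from what AD holds.
    """
    missing_in_ad = sorted(set(registry) - set(ad))
    only_in_ad = sorted(set(ad) - set(registry))
    to_modify: Dict[str, Dict[str, str]] = {}
    for key in sorted(set(registry) & set(ad)):
        diffs = {
            a: registry[key].get(a, "")
            for a in attrs
            if registry[key].get(a, "") != ad[key].get(a, "")
        }
        if diffs:
            to_modify[key] = diffs
    return missing_in_ad, only_in_ad, to_modify


registry_entries = {
    "jdoe": {"cn": "Jane Doe", "mail": "jdoe@example.edu"},
    "sroe": {"cn": "Sam Roe", "mail": "sroe@example.edu"},
}
ad_entries = {
    "jdoe": {"cn": "Jane Doe", "mail": "jane.doe@example.edu"},
    "oldacct": {"cn": "Old Account", "mail": ""},
}
missing_in_ad, only_in_ad, to_modify = reconcile(
    registry_entries, ad_entries, ["cn", "mail"]
)
```

Entries found only in AD are reported rather than deleted, since some may be service or administrative accounts that never appear in the registry.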

One technique worth considering for assuring reliable mapping between registry entries and AD entries is to track the Active Directory GUID (Global Unique Identifier) in the registry.  Active Directory generates a GUID for all objects in the directory, including user objects, at the point the object is created in the directory. The GUID is guaranteed to be unique and is never changed, even if the userid or DN is changed (see previous discussion of GUID).  Tracking this field would require additional scripting to extract the GUID from AD at the point a user is created.
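
When the GUID is read from AD over LDAP (the objectGUID attribute), it arrives as a 16-byte binary value in Microsoft's mixed-endian layout. The sketch below shows one way to turn that value into the familiar string form for storage in the registry using Python's standard uuid module. How the raw value is fetched from AD is outside the scope of this sketch, and the sample bytes are made up.

```python
import uuid


def ad_guid_to_str(raw: bytes) -> str:
    """Convert a 16-byte AD objectGUID value to its canonical string form."""
    if len(raw) != 16:
        raise ValueError("objectGUID must be exactly 16 bytes")
    # bytes_le interprets the first three fields as little-endian,
    # which matches the layout AD uses for this attribute.
    return str(uuid.UUID(bytes_le=raw))


# Made-up sample value, as it might be returned by an LDAP search.
raw_value = bytes.fromhex("78563412341278561234567812345678")
guid_string = ad_guid_to_str(raw_value)
```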

In keeping with the recommendation that Active Directory be treated as another consumer system, campuses should consider loading only minimal people information to the Active Directory, e.g. common name, uid, etc. Bear in mind that Windows 2000 desktop systems will query the Active Directory by default for "Find People" lookups. FERPA regulations and local privacy policies should be kept in mind when deciding what data about users should be included in AD.

A major issue with AD integration is what to do about passwords -- how to connect passwords in the LDAP directory or Kerberos database with the AD password.  Some campuses use their Kerberos realm to control logging into Active Directory, using Microsoft’s passthrough authentication capability, and set a random password in AD that is unknown to the person, and unused. This works only if there are no down-level clients to be supported.  Other campuses develop strategies to sync the password in AD and the principal password, whether in Kerberos or another LDAP directory, by adding code to the agent used when a user changes a password.

The MIT Project Pismere site http://web.mit.edu/pismere contains additional technical information and documentation on Microsoft Active Directory issues.

 

4.4.4   Resource Provisioning

Resource provisioning is the automated handling of the tasks associated with the establishment, modification, and deletion of resources and entitlements provided to people as they join or leave an organization or undergo changes in affiliation or status.  Since most institutions have already built pieces of this kind of infrastructure, higher education institutions may choose to extend and improve their existing processes by leveraging the enterprise directory information, rather than considering commercial provisioning products.  Indeed, the Burton Group predicts a convergence of resource provisioning products and general metadirectory products.  For campuses, the challenge is to directory-enable existing procedures, such as creating email accounts following the creation or update of a directory entry by the student information system.

Campuses considering these resource provisioning functions are addressing the problems of how to code and use the campus business rules.  That is, when a person becomes a "student," what services should he/she receive, including computer accounts, disk quota sizes, home page location, etc., and how do the various services get provisioned? Institutions that have already addressed this have typically used Perl scripts, but campuses starting now are investigating using Metamerge software.

Architecturally, there are some common functions that need to be implemented for various provisioning tasks:  Create user, delete user, change name, enable/disable account.  One key element is correctly identifying the consumer entry you are dealing with, since the consumer directory may not store the registry guid.  More generally, how do you accommodate binding entries between the metadirectory (or registry) and consumers to ensure you are dealing with the same person in both?

Next is what to use to code the business rules.  While it has been suggested that XML is now the method of choice, no one is known to have an XML-based provisioning system in production at the time of this writing.  In contrast to aggregating the rules and policy in one place so it can be used by various agents, using Metamerge typically means the logic is contained in the scripting of the connectors of the Metamerge assembly lines.

In higher ed, many services are elective, based on status.  Thus faculty and staff may be eligible to use a campus calendaring system, but not required.  So directory information such as status must be presented to a consumer. For example, at the time a web application is used to activate an end-user’s calendar, the necessary schema-extension attributes must be added to the user’s directory entry and perhaps copied into the registry.

Recommendations to consider:

·        While white-pages applications may be the initial purpose of an enterprise directory, they are rarely the end. Plan ahead.

·        Develop a policy for determining when extending the enterprise directory schema is appropriate and when it is not.  Be aware that application developers may want to store attributes in the enterprise directory inappropriately at times, as well as store attributes in an application directory that should be in the enterprise directory.  Policies developed beforehand with the approval of the CIO or IT-head can help to navigate these issues.

·        With the growth of the enterprise directory it will become necessary to copy directory information into replicants, application directories, and operating system directories (such as Microsoft Active Directory). Work with campus data stewards/owners to institute policies and processes to protect the privacy rights of campus members.

·        While LDAP directories can be used for authentication, consider the use of authentication technologies such as MIT’s Kerberos. Authentication plug-in’s to allow authentication credentials to be passed from an LDAP directory to a Kerberos domain MAY be used to support non-Kerberos-enabled applications that are directory-enabled.

·        Understand and plan for the requirements of a provisioning system.

 

4.5    Addenda

The following addenda are provided as aids and are not necessarily examples of "best practices".

4.5.1   University of Memphis Finite State Machine Provisioning Model

Provided by Tom Barton, University of Memphis

Basic account provisioning is guided by a finite state machine with 9 states. The model relies on the following data: state, substate, date the present state was reached, date by which the present state might end (called the expiration date), major affiliation (faculty/staff or student), and a multivalued attribute holding the identifiers of resources being managed for this account. Managed resources include shell accounts, IMAP/POP/HTTP mailbox service, campus-wide computing cluster access, and a variety of directory enabled application and web services that use an LDAP directory for access control, or that use the LDAP directory to determine eligibility for service.

The basic state meanings and transition rules are as follows.

A. Inception. New accounts are transiently in this state while being created.

1. Expected. A manually built account for a person not yet appearing in source systems which is permitted to exist only for a specified period of time. They better show up before then.

2. Active. The typical state of an account of a member of the community in good standing.

3. Grace. A period of 180 days after all affiliation ceases during which all resources and access persist. At three points during this state the person is emailed about their impending loss of access. The substate attribute is used to track how many notifications have been sent and so when another is needed. 

4. Limbo. A 30 day period in which nothing changes. Transition into limbo is accomplished by disabling all access, but leaving all resources intact. This is a failsafe against deleting the resources of someone who believes they still need service, or who has been an exception of some type. 

5. Slide. A variable length period during which resources are deleted. Slide ends upon notification of deletion of the last resource reaching the metadirectory. Among the resources deleted are all but a specified set of attributes appearing in consumer directories.

6. Shelf. A period of 2 years during which the netid is held so that it can't be reassigned.

7. Death. The netid is removed from the registry entry, leaving just fundamental identifiers for posterity.

8. Exempt. An account in state 8 is not touched by the automated provisioning process. 

Possible state transitions are depicted below.

        1 --------+
        |         |
        v         v
   A -> 2 -> 3 -> 4 -> 5 -> 6 -> 7
        ^    |    |    |    |
        |    v    v    v    v
        +----+    |    |    |
        +---------+    |    |
        +--------------+    |
        +-------------------+

Accounts making the transition A -> 2 or 6 -> 2, or entering state 1, have resources automatically built for them, including computer cluster account, mailbox, and attributes making them eligible for various services depending upon their major affiliations. Self-serve web and card-swipe applications mediate enrollment in eligible services.

1 -> 2 occurs when the expected person appears in a source system before the expiration time for the registry entry.

1 -> 4 occurs if the expected person fails to appear before the expiration time for the registry entry.

2 -> 3 occurs when the last major affiliation is removed from the registry entry. 

3 -> 4 occurs 180 days after state 3 was attained. Disable messages are sent to shell account server and NOS directory. LDAP Directory ACL’s hide objects in state 4 from most service DN’s so that access to those services is denied.

4 -> 5 occurs 30 days after state 4 was attained. Delete messages are sent to shell account server and NOS directory, script on mailbox server removes mailbox, many attributes vanish from LDAP entry.

5 -> 6 occurs after the last resource (shell account, NOS account, mailbox) is removed from the entry’s resource list.

3 -> 2 occurs if the person shows up in a source system before the expiration date for the entry.

4 -> 2 occurs if the person shows up in a source system before the expiration date for the entry. Enable messages are sent to the shell account server and NOS directory.

5 -> 2 occurs if the person shows up in a source system before the last resource is removed from their list. Create messages are (supposed to be) sent to supported systems not in their account list. This transition occurs rarely and is presumed to remain buggy.
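
The states and transitions above can be expressed as a small table-driven state machine. The following Python sketch is an illustration written for this document, not the University of Memphis code. The names are hypothetical, the 6 -> 7 timing is left out, and side effects such as disable or delete messages are omitted.

```python
from dataclasses import dataclass, replace
from datetime import date, timedelta
from typing import Optional, Tuple

# Transitions permitted by the diagram and rules above.
ALLOWED = {
    "A": {"2"}, "1": {"2", "4"}, "2": {"3"}, "3": {"2", "4"},
    "4": {"2", "5"}, "5": {"2", "6"}, "6": {"2", "7"},
    "7": set(), "8": set(),  # 8 (Exempt) is never touched automatically
}
GRACE_DAYS = 180  # length of state 3
LIMBO_DAYS = 30   # length of state 4


@dataclass(frozen=True)
class Account:
    state: str
    since: date                     # date the present state was reached
    expires: Optional[date] = None  # used here only for state 1 (Expected)
    resources: Tuple[str, ...] = () # identifiers of managed resources


def next_state(acct: Account, today: date, has_affiliation: bool) -> str:
    """Evaluate the transition rules; return the state the account should be in."""
    s = acct.state
    days = (today - acct.since).days
    if s == "1":
        if has_affiliation:
            return "2"
        if acct.expires is not None and today >= acct.expires:
            return "4"
    elif s == "2":
        if not has_affiliation:
            return "3"
    elif s in ("3", "4", "5", "6"):
        if has_affiliation:
            return "2"
        if s == "3" and days >= GRACE_DAYS:
            return "4"
        if s == "4" and days >= LIMBO_DAYS:
            return "5"
        if s == "5" and not acct.resources:
            return "6"
    return s


def step(acct: Account, today: date, has_affiliation: bool) -> Account:
    """Apply at most one transition, refusing any move the table does not allow."""
    target = next_state(acct, today, has_affiliation)
    if target == acct.state:
        return acct
    if target not in ALLOWED[acct.state]:
        raise ValueError(f"illegal transition {acct.state} -> {target}")
    return replace(acct, state=target, since=today)


start = date(2002, 1, 10)
active = Account("2", date(2002, 1, 1), resources=("mailbox",))
grace = step(active, start, has_affiliation=False)
still_grace = step(grace, start + timedelta(days=179), has_affiliation=False)
limbo = step(grace, start + timedelta(days=180), has_affiliation=False)
restored = step(limbo, limbo.since + timedelta(days=5), has_affiliation=True)
```

Keeping the permitted transitions in one table makes it straightforward to check the code against the diagram.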

 

5        Documents

Identifiers, Authentication, and Directories: Best Practices for Higher Education <http://middleware.internet2.edu/docs/internet2-mi-best-practices-00.html>

MACE Best Practices for Directory Groups <http://middleware.internet2.edu/dir/groups/draft-internet2-mace-dir-groups-best-practices-01.html>

 

6        Advice To Implementers

Internet2 Middleware (MACE) website <http://middleware.internet2.edu/>

Internet2 MACE Authentication website <http://middleware.internet2.edu/core/authentication.shtml>

Internet2 MACE Authorization website <http://middleware.internet2.edu/core/authorization.shtml>

 

7        Acknowledgments

The authors wish to thank the members of the MACE-Dir group for their tireless review and valuable input to the structure and content of this document.

 

8        References

Internet2 MACE Authentication website <http://middleware.internet2.edu/core/authentication.shtml>

Internet2 MACE Authorization website <http://middleware.internet2.edu/core/authorization.shtml>

 

9        Contact Information

Brendan Bellina (Editor)
University of Notre Dame
Email: bbellina@nd.edu

Richard Jones
University of Colorado

Tom Barton
University of Memphis
Email: tbarton@memphis.edu

Keith Hazelton
University of Wisconsin

Brendan Bellina
University of Notre Dame

 Eileen Shepard
Boston College

Ann West
EDUCAUSE/Internet2