NSF Middleware Initiative

Robert Banz

draft-internet2-mace-dir-inter-domain-data-exchange-00.html

University of Maryland,  Baltimore County

Copyright © 2002 by UCAID and/or the respective authors

October 2002

Comments to: nmi-support@nsf-middleware.org

 

Development of this document was supported with funding from the University of Maryland, Baltimore County, Internet2, and the NSF Middleware Initiative (Cooperative Agreement No. ANI-0123937).

 

Inter-Domain Data Exchange

Abstract

 

This paper is an attempt to record thoughts on a type of problem in which data elements must be transported, stored, or used across independent administrative domains. Numerous issues arose in discussions of this problem space within the Internet2 MACE-Dir working group. We have tried to present those deliberations in this paper.

For additional information and related topics and resources see the following sites:

Internet2 Middleware Initiative:

http://middleware.internet2.edu/

MACE:

http://middleware.internet2.edu/MACE/

EDIT:

http://www.nmi-edit.org/

NMI:

http://www.nsf-middleware.org/

 

 

Table of Contents

1          Introduction

2          Terminology

3          Scenarios

3.1       Multi-Campus University System

3.2       Funding Agencies

3.3       Visiting Medical Professional Records Access

3.4       Personal Address Book

3.5       Third-Party Authorizations

4          Requirements

4.1       Object Identity Mapping

4.1.1        Intra-Domain Identity Mapping

4.1.2        Inter-Domain Identity Mapping

4.2       Data Plus Associated Metadata

4.3       Data Transport

4.4       Shared Language

4.5       Federated Access Control

4.6       Resource Discovery

4.7       Data Integrity

5          Techniques and Alternative Solutions

5.1       Batch Updates

5.2       Real-Time Joins

5.3       Federated Access Control Methods

5.3.1        Shibboleth

5.3.2        Liberty Alliance

5.3.3        WS-Security

5.4       X.501 Knowledge References

5.5       Representing Attribute Meta-Data

5.6       DNS Service Records

5.7       The Data Router

5.8       Techniques for "Zero Knowledge" Identity Mapping

6          Documents

7          Advice To Implementers

8          Acknowledgments

9          References

10        Contact Information

A         The Data Router

A.1      Assumptions about our Environment

A.1.1       An Object-Relational View

A.1.2       Transport: Another Assumption

A.2      The Duties of a Data Router

A.2.1       "And other duties as assigned."

B          Stitched Directories Proposal

 

1        Introduction

In today’s interconnected world, the need to access data across multiple administrative domains is increasing.  Students and faculty alike require the appearance of a seamless environment between local and remotely hosted services. Grant funding organizations require updated contact information for research faculty.  More recently, the INS has required universities to keep it apprised of enrollment data regarding international students.  Campuses wish to pool resources, a step that requires federated identity management that crosses existing administrative lines.

Many of these problems require techniques beyond the typical metadirectory processes that are currently in use within campus infrastructures.  While some of those processes are discussed briefly here, readers are encouraged to become familiar with Metadirectory Practices for Enterprise Directories in Higher Education (http://middleware.internet2.edu/dir/metadirectories/internet2-mace-dir-metadirectories-practices-200210.htm), as many concepts discussed in depth in that document are built upon here.

The discussion within the MACE-Dir working group from which this paper is abstracted was initiated by Michael Gettes’ “Stitched Directories” proposal. The original “Stitched Directories” proposal is included here as Appendix B.

There is also quite a bit of overlap between this problem space and what is called Federated Identity Management. The Burton Group defines this term as “the use of agreements, standards and technologies to make identity and entitlements portable across autonomous identity domains” [DB02].

Several scenarios involving inter-domain data exchange are presented in section 3 below. Issues bearing on the requirements that a solution must satisfy are collected in section 4. In section 5 some techniques are described that help to further illuminate the problem space. 

The MACE-Dir Working Group considers this a work in progress and warmly invites readers to participate in its further development. The first step would be to send an expression of interest to mace-dir-comments@internet2.edu.  This version of the document is published as an Internet2 Draft and will expire in April 2003.

 

2        Terminology

AuthN:  Authentication

AuthZ:  Authorization

Data Sink:  An endpoint of data flow, opposite of a Data Source.

Data Source:  The source of a data flow, opposite of a Data Sink.           

Local Identity:  The attributes, entitlements, etc., which are associated with an identity specific to a single administrative or functional domain.

Metadirectory:  An architectural element that executes Metadirectory Processes.

Metadirectory Processes:  The processes by which source data is captured, transformed, and presented in an enterprise directory. [TB02]

 

3        Scenarios

3.1    Multi-Campus University System

This scenario describes a multi-campus university system in which each campus is independent insofar as its computing facilities and identity management procedures are concerned.  However, the campuses face growing demands to support integrated services: they share some facilities, have a shared library system, and have a high degree of multiple appointments among both faculty and students.

The shared library system directors, in particular, have expressed their desire for a single interface for accessing and retrieving the various bits of identity information for the members of this multi-institution community regardless of their originating campus. Historically, the ‘branches’ of the library system located at each campus have worked out interfaces to their campus business systems.  The form of these linkages varies from automated data feeds to manual input.

At first glance, this is a classic metadirectory scenario.  In fact, most of the required procedures are those that you would use in a single institution to map identifiers and ‘join’ identities (see the object identity mapping requirement in Section 4 below).  Two features of this scenario take it beyond a classic metadirectory problem.  First, existing campus-level infrastructures remain complete and independent of each other. Second, there is a need to capture metadata reflecting the originating sources of the data being provided and associating the person and their roles to the appropriate member institution(s) (see the data plus associated metadata requirement in Section 4).  As with all inter-domain data exchanges, there is a requirement to specify a transport mechanism and a need for parties at the data sources and the data sinks to have a shared language covering both information syntax and semantics (see the transport and shared language requirements).

3.2    Funding Agencies

Many funding agencies, such as the National Institutes of Health, maintain databases of current and potential Principal Investigators (PIs) who may be applying for or currently working on projects funded by that agency.  Other organizations provide services to notify potential researchers of pending solicitations.  Both of these functions require a way of searching current research faculty information at hundreds of universities. Of particular interest are the contact information, current projects, and research interests of potential grant recipients.  In addition to sharing all the requirements of the multi-campus system scenario, this scenario requires that the agency’s access to institutional data be appropriately controlled and limited (see the federated access control requirement).

3.3    Visiting Medical Professional Records Access

This scenario describes access to resources including protected health information by visiting health care providers, where they would not have or need a local network user ID at the visited facility. For example, a visiting physician (or other provider such as a nurse or laboratory technician) is given access to specific patient records at a health care facility. 

The visiting care provider (visitor) needs temporary network access at the visited facility to securely log into their home facility for identification and authorization.  The home facility consults the business agreement with the visited facility and presents the visitor’s access authorization based upon the rules contained in the agreement.  The visited facility grants appropriate access based upon the visitor’s presented credentials and records details of all the visitor’s transactions and data access for later audit if required.

3.4    Personal Address Book

Most people have some sort of "personal address book," either stored in their desktop email client, on a PDA or scribbled on little scraps of paper stuffed into their wallet.  This is not a situation we commonly think about in the directory space, nor would we go about solving this problem with typical metadirectory functionality.  However, the problem is an interesting one, and it invites exploration of techniques to manage affiliations with data sinks that spend much of their time in an off-line state.

The "holy grail" for personal address book management would be to have the capability for on demand synchronization of personal information that may have its authoritative source housed in various organizations’ directories.  The usage scenario would be similar to that of "syncing" a PDA with a desktop system, except that instead of a single desktop machine, the PDA syncs with various directories scattered across the Internet.

The requirement unique to this scenario is the need for resource discovery.  That is, there needs to be some mechanism by which the PDA client can discover where to go to find authoritative, up-to-date contact information for the individual entries in its address book.

3.5    Third-Party Authorizations

Jane Doe is a faculty member at Foo.edu. She is also a member of a professional organization, Bar.org. Both Foo and Bar have contracts with vendor X providing their members with access to specific licensed resources. Foo.edu has licenses for products A, B, and C for use by its faculty, staff and students. Bar.org has licenses for products B, C, and D for its members. When Jane sits at her desk and accesses vendor X, she would like to be given access to all of the products to which she is entitled: A, B, C and D.  The content provider may wish to receive authZ information only; for example, patron anonymity is important in library-like situations.

This scenario requires a method for federated access control, and a method of associating attributes from multiple identity authorities.  First, a system needs to allow the content provider X to receive authentication and/or authorization assertions from Foo.edu.  The special challenge of this scenario lies in how to get a trustworthy assertion of the person’s membership in Bar.org to the content provider (see the data integrity requirement). Two directions can be taken here.  In the first, the authorization assertion is made by Foo.edu, provided there is a mechanism by which Bar.org supplies Foo.edu with the information and trusts Foo.edu to make such assertions on its behalf.  The second approach would rely on Foo.edu systems knowing that, to access X, Jane’s authZ process needs to make a pass through Bar.org to pick up additional authZ assertions before reaching site X.
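As a minimal sketch of the desired outcome only (not of any federation protocol), the entitlements asserted by each authority can be modelled as sets and combined at the content provider.  The names ASSERTIONS and combined_entitlements are hypothetical; the product letters come from the scenario above.  No personal identifier appears anywhere, in keeping with the authZ-only case.

```python
# Hypothetical in-memory stand-in for the assertions received by vendor X.
# Each identity authority asserts only the products it has licensed.
ASSERTIONS = {
    "foo.edu": {"A", "B", "C"},
    "bar.org": {"B", "C", "D"},
}


def combined_entitlements(assertions, trusted_authorities):
    """Union the entitlements asserted by authorities the vendor trusts."""
    granted = set()
    for authority, products in assertions.items():
        if authority in trusted_authorities:
            granted |= products
    return granted


print(sorted(combined_entitlements(ASSERTIONS, {"foo.edu", "bar.org"})))
```

How the Bar.org assertion reaches vendor X in a trustworthy way is exactly the open question of this scenario; the sketch assumes it has already arrived.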

 

4        Requirements

4.1    Object Identity Mapping

Object identity mapping, for the sake of this discussion, is the procedure by which it is determined that two or more digital objects represent information relating to the same real-world subject.  In current implementations, the subject being referred to is most often a person.  However, there will be situations where the subjects being mapped between directories are not people, but may be computers or "data objects" such as those stored in a digital library.  For each type of object, there will be identifier attributes used to uniquely indicate a particular instance.  For example, a person’s name can be used to assert who someone is.  Of course one name, such as "John Smith," will map to any number of people. 

4.1.1   Intra-Domain Identity Mapping

From experience in dealing with identity mapping in reasonably manageable populations, such as a college campus, we know that a name is not a good identifier.  A good identifier has the property of being globally unique across the population and, even better, persistent through time.  As discussed in Identifiers, Authentication and Directories: Best Practices for Higher Education (http://middleware.internet2.edu/docs/internet2-mi-best-practices-00.html), and Metadirectory Practices for Enterprise Directories in Higher Education (http://middleware.internet2.edu/dir/metadirectories/internet2-mace-dir-metadirectories-practices-200210.htm), it is clear that most of these "good" identifiers tend to exist only in one database or another, which makes them generally bad candidates for any kind of inter-domain mapping.  Some campuses have taken to the practice of issuing a "globally unique identifier" for members of their population generated by their enterprise directory. However, such an identifier is only for use within the campus community, and holds no meaning outside of it. Thus it cannot be used for mapping when moving outside of the institution.

In practice, intra-campus mapping has traditionally relied on identifiers such as a "Student ID" number or "Social Security Number"[1], and while the appropriateness of using such identifiers for mapping can be argued, for most purposes it has been sufficient.  Sufficient is defined here as: it works most of the time, and for the few times it doesn’t work the resulting mis-mapping can either be easily fixed or completely ignored.  However, some institutions have gone further than using these identifiers to do mapping, and check other attributes of an individual, such as their name, address, date of birth, or phone number, to verify the strength of their mapping decision, or potentially reject it.
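The practice of matching on a primary identifier and then checking other attributes can be sketched roughly as follows.  This is an illustration under assumed names: the record layout, the attribute names student_id, name and date_of_birth, and the three outcome labels are all hypothetical, and a real institution's rules would be set by policy.

```python
def normalize(value):
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(str(value).lower().split())


def mapping_decision(record_a, record_b, key="student_id",
                     corroborating=("name", "date_of_birth"), min_agree=1):
    """Return 'match', 'review', or 'no-match' for two person records."""
    # The primary identifier must be present and equal in both records.
    if not record_a.get(key) or record_a.get(key) != record_b.get(key):
        return "no-match"
    # Count corroborating attributes that both records carry and agree on.
    agree = sum(
        1 for attr in corroborating
        if attr in record_a and attr in record_b
        and normalize(record_a[attr]) == normalize(record_b[attr])
    )
    # Too little corroboration flags the pair for a human to look at.
    return "match" if agree >= min_agree else "review"


sis = {"student_id": "1001", "name": "Jane Doe", "date_of_birth": "1980-01-02"}
hr = {"student_id": "1001", "name": "jane  doe", "date_of_birth": "1980-01-02"}
print(mapping_decision(sis, hr))
```

Returning a separate "review" outcome, rather than silently accepting or rejecting, reflects the observation above that the few mis-mappings should be easy to find and fix.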

4.1.2   Inter-Domain Identity Mapping

Before discussing the intricacies of inter-domain identity mapping, there are some basic questions that need to be asked:

·        Does our directory need identity mapping at all?

·        What is the scope requirement for identity mapping: all identities, or a subset?

·        Does the identity mapping need to be automatic?

·        Do data source and data sink share identifying information?

The first question simplifies everything: if it is unnecessary to provide any identity mapping function on the target directory, all of these problems can be ignored.  Many situations may not need any mapping functions at all.  Scenarios exist for which it is acceptable, even potentially desirable, that multiple entries be made for the same individual if they enter the directory from different institutions.  This is the tack that the Directory of Directories for Higher Education (http://middleware.internet2.edu/dodhe/) has taken.  The benefits gained from not mapping include simplifying the data processing needed and adding to the privacy of the individual by not identifying their potential multiple affiliations.  On the other hand, providing a mapping function could be integral to assuring that an individual is given all of the resources to which they are entitled, if the directory is to be used for access control purposes, or to providing a one-stop shop for finding all of an individual’s contact information.

The second question is a follow-up to the first.  The "scope" in which you are performing identity mapping may also vary depending on your need.  For example, you may only need to identity-map faculty and staff, and students may not need federated identities.  Additionally, the "scope" could be defined as "Folks from these institutions get their identities mapped, but if they are from any other, they don’t."

A potential solution to this mapping problem for the multi-campus system scenario would be to create a "bridge" identity management system that would map identities existing at the institutional level to a single person object, creating a multi-campus person registry.  This would provide a single-point for campuses to do lookups against when situations arise where ‘campus A’ is receiving, or requesting data from ‘campus B’, in which the primary identifiers at the campuses differ.[2]  By also exposing some basic affiliation data in this registry, the library system would have a single data source to feed their identity management system (and also find the most current billing information for someone with overdue fines!)
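A rough sketch of such a "bridge" registry follows.  The class name BridgeRegistry, its methods, and the campus and identifier values are hypothetical, and the decision that two campus identities belong to the same person is assumed to be made elsewhere and passed in through the same_as argument.

```python
import itertools


class BridgeRegistry:
    """Maps (campus, local identifier) pairs to one registry-wide person key."""

    def __init__(self):
        self._next = itertools.count(1)
        self._by_local = {}       # (campus, local_id) -> person key
        self._affiliations = {}   # person key -> set of (campus, affiliation)

    def register(self, campus, local_id, affiliation, same_as=None):
        """Add a campus identity; same_as names an already registered
        (campus, local_id) pair known to be the same person."""
        if same_as is not None:
            person = self._by_local[same_as]   # KeyError if never registered
        else:
            person = self._by_local.get((campus, local_id))
            if person is None:
                person = next(self._next)
        self._by_local[(campus, local_id)] = person
        self._affiliations.setdefault(person, set()).add((campus, affiliation))
        return person

    def translate(self, campus, local_id, target_campus):
        """Return the identifier the same person holds at target_campus, or None."""
        person = self._by_local.get((campus, local_id))
        if person is None:
            return None
        for (other_campus, other_id), key in self._by_local.items():
            if key == person and other_campus == target_campus:
                return other_id
        return None

    def affiliations(self, campus, local_id):
        """List (campus, affiliation) pairs for the person behind this identity."""
        person = self._by_local.get((campus, local_id))
        return sorted(self._affiliations.get(person, set()))


reg = BridgeRegistry()
reg.register("campus-a", "A123", "faculty")
reg.register("campus-b", "B987", "student", same_as=("campus-a", "A123"))
print(reg.translate("campus-a", "A123", "campus-b"))
```

The translate call corresponds to the lookup a campus would perform when the primary identifiers differ, and affiliations is the kind of basic affiliation data the library system could consume.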

When it is decided that some identity mapping needs to be done, the question of making the mapping an automatic process is now on the table.  Some of the same advantages and disadvantages of doing mapping at all are also applicable to this decision.  If the mapping is to be manual, one option is to assign staff at an organization to apply appropriate mapping logic to each entry.  This is probably not feasible, as the staff costs could be overwhelming, and even more error could be introduced into the mapping decision.  However, a potentially more appropriate use for "human driven" mapping is one where the individual being "mapped" is involved in the decision.  This is the road taken by the Liberty Alliance in implementing their "reduced sign-on" system.[3]  Simplified, this approach requires the individual to authenticate to both data sources, "proving" that the identities listed there represent them, and allowing backend processes to "link" these identities.  The privacy benefits of this are obvious, as the decision to link identities is completely in the individual's hands, and in the Liberty Alliance case, there are procedures in the design for "breaking" the mapping at the request of the individual.  For applications such as reduced sign-on, and for the scenario where one may want to augment institutional entitlements to resources with those provided by other affiliations, the voluntary mapping described here would fit well.

When it comes to automating an identity mapping process, the strongest drivers of success are policy matters, which may also intersect with personal privacy and governmental regulations.  The key factors are the policies regarding what identifying information is available in the data sources, and what restrictions are in place on making this data available to the mapping process.  This may be problematic, as the attributes that are most useful for mapping, such as an individual’s SSN or date of birth, are also typically the ones that are most closely guarded by an individual and her affiliated institutions.  However, if the inter-institutional directory being constructed is of a nature that lends itself to this information being available, the best practices that are in place in the intra-institutional environment may be applied successfully in this space.  When this information is not directly available, some alternative methods of mapping may need to be considered.  One alternative is described later, in section 5.8.

4.2    Data Plus Associated Metadata

For data items to be successfully exchanged and used by multiple domains, it is often insufficient to send a raw attribute-value pair.  The data item may need to be bundled with associated metadata that provides the context for and characteristics of the information conveyed.  In the multi-campus scenario, to use a simple example, the affiliations of faculty and student need to be associated with the campus that asserts them.
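One way to picture the bundling of a value with its metadata is a small record type.  This is only a sketch: the field names asserted_by and asserted_at, and the example campus domains, are assumptions made for illustration rather than part of any schema discussed in this paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttributeAssertion:
    name: str                          # attribute name, e.g. "affiliation"
    value: str                         # the raw value
    asserted_by: str                   # domain that vouches for the value
    asserted_at: Optional[str] = None  # ISO 8601 timestamp, if the source records one


def affiliations_by_campus(assertions):
    """Group affiliation values under the campus that asserted each one."""
    grouped = {}
    for item in assertions:
        if item.name == "affiliation":
            grouped.setdefault(item.asserted_by, []).append(item.value)
    return grouped


person = [
    AttributeAssertion("affiliation", "faculty", "campus-a.example.edu"),
    AttributeAssertion("affiliation", "student", "campus-b.example.edu"),
    AttributeAssertion("mail", "jdoe@campus-a.example.edu", "campus-a.example.edu"),
]
print(affiliations_by_campus(person))
```

With a bare attribute-value pair, the two affiliations would be indistinguishable in origin; carrying the asserting campus alongside each value keeps them apart.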

4.3    Data Transport

It is difficult to decide which task is more daunting in exchanging data between differing administrative domains: the policy aspects or the technical hurdles.  While the managers of the data can argue policy questions until the end of the world, it is often assumed that the technical folks will be able to hash out a solution in minutes.  Of course this is rarely the case. To make matters worse, the enterprises and systems that may be exchanging data are quite disparate in their nature.  Add to that the potentially infinite combinations of these relationships in which a single site may need to be involved, and you face a significant management challenge.  Until the day comes that there is a single standard for everything, all that can be offered are recommendations as to how such problems may be made more manageable.

The methods of transport available to get data from point "A" to point "B" are highly dependent upon the specific problem to be solved.  Solutions can range from simply FTP-ing a file from one site to another on a daily basis, to complex message-oriented exchange techniques that implement close-to-real-time data updates, to relying on the remote repository to request updates as it needs them.  To add one more decision into the mix, the question arises of whether to "populate" the remote directory at all or, alternatively, to use techniques that allow the user/application to gather the information in "real time" from the various data sources.

The transport may also be responsible for providing security for the data, including basic access control, transport level encryption, and mutual authentication of the transport endpoints.  There seems to be very little that can be done to limit the methods of data transport that may be required.  However, we recommend leveraging existing methods described in the Metadirectories document, as well as some techniques that are discussed in section 5.

4.4    Shared Language

By "shared language," we refer to both the syntax and meaning of an attribute/value combination.  This is a difficult but solvable problem.  Given the work done on LDAP, XML schemas, and other such information interchange standards, there most likely exists a specification that spells out a way to represent at least some of the data you wish to share with others.  It is our recommendation that you rely upon relevant standards when representing your attributes to external organizations, and resist the temptation to make data available directly in the format that best suits their needs.

Failing that, make it a community of interest project to define a shared language, for example, by collaboratively defining a directory schema with specifications for the semantics and intended uses of the attributes and values.  This is precisely what has been done within Internet2 with such object classes as eduPerson and commObject. 

By going this route, standards will be strengthened through use, and data transformation code that may need to be written for a single relationship may be reusable, instead of being a one-off effort.  The receiving party of the data should then be able to massage the data into any site/application specific format that is suitable, or, hopefully, choose to use the data as sent, in a standards-compliant form.

When exchanging person-related data, which is likely to be most of what is to be exchanged, we suggest as a first step following the recommendations conveyed in the eduPerson specification, which make suggestions as to the use, syntax and meaning of common attributes including those from inetOrgPerson.

4.5    Federated Access Control

In the funding agency scenario detailed above, the home institutions of Principal Investigators will not want to make all the data they hold about such people available to funding agencies.  There will be a select, negotiated set of attributes to which the funding agency systems should be granted access.  The visiting medical professional scenario also requires that the visitor’s home institution give only selective access to its records about the subject.  The federated access control solutions discussed in section 5 address this requirement.
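The idea of a select, negotiated set of attributes can be sketched as a simple release filter.  The policy table, the requester name agency.example.gov, and the attribute researchInterests are hypothetical values chosen for illustration; the federated access control methods in section 5 express such policies in their own ways.

```python
# Hypothetical release policy: requesting party -> attributes it may see.
RELEASE_POLICY = {
    "agency.example.gov": {"cn", "mail", "telephoneNumber", "researchInterests"},
}


def release(entry, requester, policy=RELEASE_POLICY):
    """Return only the attributes of entry that the policy grants to requester."""
    allowed = policy.get(requester, set())   # unknown requesters get nothing
    return {attr: value for attr, value in entry.items() if attr in allowed}


entry = {
    "cn": "Jane Doe",
    "mail": "jdoe@foo.edu",
    "homePostalAddress": "(withheld)",
    "researchInterests": "directory services",
}
print(release(entry, "agency.example.gov"))
```

Defaulting to an empty set for unknown requesters means that a missing agreement results in no data being released, which seems the safer failure mode for this scenario.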

Some of the affiliation scenarios described in this paper revolve around controlling access to content or procedures provided inter-institutionally through web-based services.  There are multiple methods either in use or in development that can be utilized in these situations.  However, some scenarios, such as third-party authorizations (3.5), may require as-yet undefined solutions.  All of the methods described in section 5 below rely on certain key technologies and standards, such as the Security Assertion Markup Language (SAML), but each has unique aspects.

4.6    Resource Discovery

Resource discovery refers to the ability, in some automated or algorithmic fashion, to determine where and how to access specific services or information.  In this case, the services that need to be accessed are those revolving around retrieving data updates.  For a situation such as the personal address book scenario (Section 3.4), the ability to "find" the LDAP server at an organization without necessarily knowing its IP address may be helpful.  Techniques such as DNS SRV records, which are described in section 5 of this document, could be used for this purpose.  However, other resource discovery techniques such as UDDI (http://www.uddi.org) are more general purpose.
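To make the DNS SRV approach a little more concrete, the sketch below builds the name a client would query for an organization's LDAP service and orders the answers it gets back.  It performs no network lookup, since the Python standard library has no SRV resolver; the records are supplied by hand and the host names are made up.  The ordering is a deliberate simplification, noted in the code.

```python
from typing import List, NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def srv_query_name(domain, service="ldap", proto="tcp"):
    """Build the owner name to query, e.g. _ldap._tcp.example.edu."""
    return f"_{service}._{proto}.{domain.rstrip('.')}"


def order_candidates(records: List[SrvRecord]) -> List[SrvRecord]:
    """Lowest priority first; within a priority, higher weight first.

    Simplification: RFC 2782 calls for weighted random selection within a
    priority level; a deterministic order is used here for clarity.
    """
    return sorted(records, key=lambda r: (r.priority, -r.weight))


answers = [
    SrvRecord(20, 0, 389, "ldap-backup.example.edu"),
    SrvRecord(10, 40, 389, "ldap2.example.edu"),
    SrvRecord(10, 60, 389, "ldap1.example.edu"),
]
print(srv_query_name("example.edu"))
print([r.target for r in order_candidates(answers)])
```

A PDA client in the personal address book scenario would only need the domain part of a contact's organization to begin this process.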

4.7    Data Integrity

The key to the usefulness of the data available in any directory is its correctness.  With that said, being able to provide the correct information in a directory that is a compendium of data from multiple sources can be quite a challenge.  Some of the issues that first come to mind are:

·        Keeping in mind the data modification policies of the data sources.

·        Having the most up-to-date copy of a source’s data.

·        For identity-mapped objects, what happens when attribute values conflict?

Received data is only as good as it is at its source.  When accepting or using data that is not under your control, you need to have some assurance that the data is correct.  This assurance could come by several means, the simplest of which is an out-of-band agreement with the custodians of that data as to the nature of the assurance they can make regarding its integrity.  It is important as well that the minimum level of integrity of the data available in the compiled directory be communicated to, and understood by, its consumers.  The level of assurance needed by a consumer may vary greatly. A directory that is controlling access to medical data will require a high level of assurance between all parties, as there may be legal and financial risk associated with data being wrong.  However, the level of assurance required to update an email address in someone’s personal address book may be minimal, as the risk associated may be very small.  It may also be advantageous to have a method of representing to a user the source of the data, as some applications may require it.

Another vexing task in directory management is keeping data up-to-date.  Depending on the techniques in use to populate the directory, this may or may not be a tough issue to tackle.  From an architectural standpoint, one of the most desirable techniques for populating and updating would be on a change-at-a-time basis.  As this can be implemented so that data is updated in a “real-time” fashion, assuming everything is working correctly the target directory should have the most up-to-date information at any given time.  However, when it does not work, changes stop coming or some changes are missed.  When this occurs, we need some mechanism in place either to note how old something is, so that an application may decide that the information provided to it may be stale and do “the right thing,” or to give the processes managing the directory a basis for requesting an update to, or potentially pruning, data that has reached the limit of freshness.  A method by which a data source can “recommend” an expiration time for a data element may also be of some help, as the originator of the data may have some knowledge (not known to the recipient) of how often this information changes.  Some of these techniques are in use as part of DNS, whereby zones on secondary name servers have refresh intervals governing when they are to be checked for updates, and expire values limiting how long a zone will be held if requests for updates fail.  Of course, robust inter-domain message queuing services would be an ideal fit for this problem.
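The two thresholds described here, one after which an update should be requested and one after which data should no longer be trusted, can be sketched as a small classification function.  The function name, the three labels, and the example intervals are assumptions for illustration; in practice the source might recommend the intervals itself.

```python
from datetime import datetime, timedelta, timezone


def freshness(received_at, now, refresh_after, expire_after):
    """Classify a data element as 'fresh', 'stale' (refresh due), or 'expired'."""
    age = now - received_at
    if age >= expire_after:
        return "expired"     # candidate for pruning
    if age >= refresh_after:
        return "stale"       # still usable, but an update should be requested
    return "fresh"


now = datetime(2002, 10, 15, tzinfo=timezone.utc)
received = now - timedelta(days=2)
print(freshness(received, now, timedelta(days=1), timedelta(days=7)))
```

An application could use the "stale" result to warn its user, while the directory's own management processes could use "expired" as the trigger for pruning.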

An important policy decision when creating a system that provides identity mapping is what will be done when two of the sources mapped to the same object supply conflicting values for an attribute.  In the intra-domain situation, these rules may have been very cut-and-dried, as institutional policies or other strong drivers may dictate what happens. For example, when dealing with conflicting information between a Human Resources system and a Student Information System, the data that is in Human Resources may automatically be chosen to win out, as it is a reasonable assumption that folks probably want to have their paycheck delivered to the correct address.  When bringing together data from multiple external sources, the choices may not be as obvious, since the data sources are basically peers, so some other method of resolving these discrepancies must be found.  One, which may not seem so obvious at first, is simply to present all of the options to the user of the directory and let them choose.  However, the usefulness of this may leave a little to be desired.  A second simple method is to choose the value that is newer. Of course, the problem here is that many data sources do not record the time an attribute was last changed.  There is no easy answer to this problem. What will be done in any situation will vary based on the nature of the relationships between the data providers, and the meaning and intended use of the attributes that are in question.
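The two simple methods mentioned above can be combined in a short sketch: prefer the newest value when every source records a modification time, and otherwise hand all of the options back for a person to choose.  The function name and the (value, source, modified) tuple layout are hypothetical.

```python
def resolve(candidates):
    """Pick a value from conflicting (value, source, modified) candidates.

    modified is a comparable timestamp, or None when the source does not
    record one. Returns (value, None) when a value could be chosen, or
    (None, candidates) when the conflict must be shown to the user.
    """
    if not candidates:
        return None, []
    distinct = {value for value, _, _ in candidates}
    if len(distinct) == 1:
        return distinct.pop(), None          # no real conflict
    if all(modified is not None for _, _, modified in candidates):
        newest = max(candidates, key=lambda c: c[2])
        return newest[0], None               # newer value wins
    return None, list(candidates)            # let the user choose


print(resolve([
    ("jdoe@foo.edu", "foo.edu", "2002-09-01"),
    ("jane@bar.org", "bar.org", "2002-10-01"),
]))
```

Falling back to the user only when timestamps are missing keeps the less useful "present everything" option as a last resort.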

Further discussion on this can be found in Section 5.5 (Representing Attribute Meta-Data), as how and what information can be asserted about an attribute is directly related to the integrity of the attribute itself.  Stitched Directories (Appendix B) also makes an attempt to tackle basic data integrity issues.

 

5        Techniques and Alternative Solutions

5.1    Batch Updates

For situations that require that data be transferred to a central directory, the simplest, and probably the most popular, method consists of large batch updates of the data.  This technique is probably in use, in some form, in everyone’s architectures, and may consist simply of FTPing a file in a specific format from the source system to the consumer.  While this technique doesn’t go far in keeping data as fresh as possible (your data is only as fresh as your most recent batch update), for many needs it is sufficient.

We recommend, however, that while keeping things simple you do not simplify them so much that data security is ignored.  FTP, unless used in a very controlled environment, such as a VPN, is not a secure method of transport.  Simply using scp (part of OpenSSH, or other Secure Shell implementations) instead of FTP is a substantial upgrade in the security of your transport.[4]  If scp is not available, encrypting the data with an application such as GnuPG or PGP is also an option.

5.2    Real-Time Joins

A “real-time join” refers to the coalescing of data from multiple sources at the time that it is to be used.  This covers an entire class of techniques, such as those that rely on an application to contact multiple data sources to produce the ‘full picture’ of an object, to those that may use a middleman, or a broker, to accomplish this task.


Assuming that identity mapping functions are provided in some manner, all of the appropriate information about an object MAY be gathered by executing LDAP queries across the services provided by the source institutions.  The application may even discover the locations of such servers through their DNS service records (cf. section 5.6).

However, putting this functionality in a user application may not be appropriate.  Placing these functions in a service that acts as a broker, and allowing a user application to access it via a protocol such as LDAP, or as a web service application, may also be done.
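
A minimal sketch of the join such a broker performs, assuming the identity mapping is already known: each source is modeled as a plain function from a local identifier to a dictionary of attributes. In a real deployment each lookup would be an LDAP query or a web service call; the names used here are hypothetical.

```python
from typing import Callable, Dict, List

# A lookup takes a source-local identifier and returns attribute -> values.
Lookup = Callable[[str], Dict[str, List[str]]]


def real_time_join(identity_map: Dict[str, str],
                   sources: Dict[str, Lookup]) -> Dict[str, List[str]]:
    """Merge the attributes every source holds for one mapped object.

    identity_map maps a source name to that source's identifier for the object.
    """
    merged: Dict[str, List[str]] = {}
    for source_name, local_id in identity_map.items():
        attributes = sources[source_name](local_id)
        for name, values in attributes.items():
            bucket = merged.setdefault(name, [])
            for value in values:
                if value not in bucket:   # keep each distinct value once
                    bucket.append(value)
    return merged
```

This sketch keeps every distinct value; a broker that must return a single value per attribute would need a conflict policy of the kind discussed under data integrity.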


 

5.3    Federated Access Control Methods

5.3.1   Shibboleth

Shibboleth is a software package and specification built upon SAML that enables a user who is authenticated by one site (foo.edu, the origin site) to have specific attributes asserted by a party at foo.edu, and to have that assertion understood and trusted by bar.edu (the target site), another site participating in a Shibboleth “club.”  More information regarding the Shibboleth project and its specifications can be found at http://middleware.internet2.edu/shibboleth/.  In its current version, Shibboleth is not able to aggregate assertions about a single user from multiple attribute authorities, so it is not a standalone solution to the third-party authorization scenario described above.

Shibboleth concepts could provide the basis for an access control mechanism for the PDA address book scenario.  A potential method would be to allow the issuing of entitlement attributes which grant access to specific directory objects and which could be revoked at any time by the subject.  An architectural element such as the Data Router (5.7) may be the appropriate mechanism to provide access to data through such means, as LDAP (without extension) would most likely not be sufficient.

5.3.2   Liberty Alliance

The Liberty Alliance is an industry consortium founded in part to develop methods for federated identity management.  More information, and specifications for the Liberty Alliance protocols, can be found at http://www.projectliberty.org.  Currently, the protocols revolve around procedures to manage bindings between multiple network identities existing on Liberty Alliance-enabled web sites, through trusted parties known as “identity providers”.  Authenticating to an identity provider gives you a single-sign-on environment across multiple sites.  Liberty does not currently address sharing attributes of an identity between sites; this functionality is slated to appear in a future release of the specifications.  What the specifications do describe well are methods by which affiliations between identities at multiple locations can be created, managed, and potentially broken, under the control of the person in question.

5.3.3   WS-Security

WS-Security (Web-Services Security) is a specification developed primarily by Microsoft and IBM that has now been turned over to OASIS for finalization.  Its focus is to specify a standard method of transport for security credentials in a web services environment.  Currently, the encapsulation and transport of Kerberos and SAML credentials are covered.  While it does not provide methods for establishing trust in a cross-realm environment, it is emerging as a method of transport for higher-level authN and authZ assertions.

5.4    X.501 Knowledge References

The term knowledge, as used in the X.501 documents, refers to “DSA [Directory Service Agent] operational information held by a DSA that it uses to locate remote entry or entry-copy information” [IT01].  A knowledge reference is “Knowledge which associates, either directly or indirectly, a DIT [Directory Information Tree] entry or entry-copy with the DSA in which it is located” [IT01].

In summary, a knowledge reference is a representation of where a piece of knowledge may be found.  Knowledge references make up part of, and rely on, a complete X.500 environment to function, as discussed in the specification.  However, the terminology used with regards to the types of references discussed in Section 10 of ITU-T Recommendation X.501 and the model on which they are based may suggest new approaches to a standards-based inter-domain data exchange model.  This is an area of active investigation by the MACE-Dir Working Group.

5.5    Representing Attribute Meta-Data

Many of the requirements derived from the scenarios above require that certain other information regarding an attribute, or collection of attributes, be made available to a data consumer, or the processes managing a database itself.  The information that may need to be represented includes, but is not limited to:

Currently, there is no one method for representing all of this data on an attribute-level basis.  An attributeIntegrityInfo attribute exists in the X.500 specification [IT01]; however, its use is minimal, if not non-existent.  It describes a method of storing a digital signature of an attribute, a collection of attributes, or all of the attributes that make up an entry.  As there is very little if any use of this, and no support in current client software, this may not be a realistic alternative.  The syntax of the attribute is described as an ASN.1-encoded string. Current tastes tend to favor encoding methods such as XML.

The Stitched Directories proposal (Appendix B) begins to tackle this problem by proposing using digital signatures of attributes such as those specified by attributeIntegrityInfo to augment the information available to an end-user application or consumer directory, giving it just a bit more information to make decisions as to what to do with the data it is accessing.  However, in discussions it became clear that digital signatures were not the only augmentation that was needed.

An XML string, or multiple strings, could be used to encapsulate all relevant metadata related to an attribute in an expandable, and easily parsed format.  In an LDAP environment, this data could be stored in a single attribute, and given its text-only format, could be easily used in non-LDAP environments as well.
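
A minimal sketch of such an XML encoding, using Python's standard library. The element names (`attributeMetadata`, `source`, `updated`) are assumptions chosen for illustration; no schema for this metadata is defined in this paper.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def encode_metadata(attribute: str, fields: Dict[str, str]) -> str:
    """Serialize metadata about one attribute into a single XML string."""
    root = ET.Element("attributeMetadata", {"name": attribute})
    for key, value in fields.items():
        ET.SubElement(root, key).text = value   # one child element per field
    return ET.tostring(root, encoding="unicode")


def decode_metadata(xml_text: str) -> Dict[str, str]:
    """Parse the XML string back into a field -> value dictionary."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text or "" for child in root}
```

The resulting string could be stored as the value of a single companion attribute in an LDAP entry, or carried alongside the data in a non-LDAP environment. New kinds of metadata only require new child elements.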

5.6    DNS Service Records

When programmatic location of LDAP services is required, in situations like those described in Section 4.6, DNS service records can be used.  The IETF has current work in dealing with the specifics of this procedure within the LDAPExt working group (the most recent Internet draft regarding this topic is available from the IETF).  These techniques can be used to find a server containing an LDAP DN, provided that domain component (DC) naming has been used.  Using DC-naming is recommended in A Recipe for Configuring and Operating LDAP Directories (http://www.georgetown.edu/giia/internet2/ldap-recipe/), and is currently considered best-practice.
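
As a minimal sketch of the first half of this procedure, the function below derives the DNS name to query from a DC-named DN, using the conventional `_ldap._tcp` service label for SRV records. It assumes a simple DN with no escaped commas, and it does not perform the DNS query itself, since that requires a resolver library outside the Python standard library. The function names are hypothetical.

```python
from typing import List


def dn_to_domain(dn: str) -> str:
    """Convert the dc= components of a DN into a DNS domain name."""
    labels: List[str] = []
    for rdn in dn.split(","):                 # assumes no escaped commas
        attr, _, value = rdn.strip().partition("=")
        if attr.strip().lower() == "dc":
            labels.append(value.strip())
    if not labels:
        raise ValueError("DN does not use domain component (DC) naming")
    return ".".join(labels)


def ldap_srv_name(dn: str) -> str:
    """Return the SRV record name to look up for the server holding this DN."""
    return f"_ldap._tcp.{dn_to_domain(dn)}"
```

For example, `ldap_srv_name("uid=banz,ou=People,dc=umbc,dc=edu")` yields `_ldap._tcp.umbc.edu`; the SRV records returned for that name would carry the host and port of the LDAP service.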

5.7    The Data Router

Yet another approach to many of these exchange problems would be to implement and deploy another piece of architectural software to handle many of the requirements that are common to these applications.  We explore this potential solution in Appendix A, presenting the functionality for such a software application, which provides a place to house data translation and policy enforcement,  allowing the “data router” to handle the transport issues and other maintenance tasks.

 

5.8    Techniques for “Zero Knowledge” Identity Mapping

As described in section 4.1, identifiers that can be conveniently used in an identity mapping process are also, by most data stewards, considered personal and confidential. This data may not be something that an organization, by policy or common sense, would want to make available to a third party, even in pursuit of such noble goals as identity mapping.  What is proposed here is a method that uses cryptographic hashes to assist in the mapping process, so that the parties involved do not produce, in the clear, any of the identifying data that is of concern.

Consider this as an example.  Databases A and B each contain an object that refers to a single person, and the objects, for better or worse, both contain the person’s Social Security Number.  Let us also say that a strong, one-way hash function, hash(), exists that returns a string that is unique, within reason, to the input value.[5]  A two-way communication channel between databases A and B also exists.

So, what happens?  Here’s a step by step process where the object affiliation between two repositories is established, without exposing the key.  “SSN” represents the value of the key.

1. A to B: “Hey B, I’ve got a new object, and it has value ssn = hash(SSN).”

2. B checks its objects, and sees that it has one whose hash(SSN) is equal to the data just provided.[6]

3. B to A: “A, I’ve got that one, here’s my response: SSN.”

4. If the value provided from B is correct, then both exchange the handles that they will use to affiliate these two objects (or whatever they decide to do).

 

This provides for a two-way proof that they both have objects that should be affiliated, and would allow for the enforcement of policies such as “A will only share data about objects that it knows B already has”.  However, if the desired effect is just to create a new object in B with the data provided from A, even if B did not already have a matching object, the process would simply stop at the second step with a request to exchange affiliation handles with A, and A could rest assured that its protected information had not been exposed.
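
The first two steps of this exchange can be sketched as follows. This is a minimal illustration, not a vetted protocol: SHA-256 stands in for the generic hash() of the text (the footnote mentions md5), B's proof back to A is not modeled, and the `Repository` class, its method names, and the sample key values are made up for the example.

```python
import hashlib
from typing import Dict, Optional


def key_hash(value: str) -> str:
    """One-way hash of a sensitive key; SHA-256 is this sketch's choice."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


class Repository:
    """Holds objects by local handle; the key itself is never sent in the clear."""

    def __init__(self, ssn_by_handle: Dict[str, str]) -> None:
        self._ssn_by_handle = dict(ssn_by_handle)
        # Hashes are pre-computed and indexed so that lookups are cheap.
        self._handle_by_hash = {key_hash(s): h for h, s in ssn_by_handle.items()}

    def announce(self, handle: str) -> str:
        """Step 1: the value A sends to B for one of its objects."""
        return key_hash(self._ssn_by_handle[handle])

    def find_match(self, announced_hash: str) -> Optional[str]:
        """Step 2: B looks the hash up and returns its own handle, or None."""
        return self._handle_by_hash.get(announced_hash)


a = Repository({"A-17": "078-05-1120"})
b = Repository({"B-3": "078-05-1120", "B-4": "219-09-9999"})
matched_handle = b.find_match(a.announce("A-17"))
```

Here `matched_handle` is B's handle for the object that corresponds to A's object A-17, and only the hash has crossed the channel between them.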

We know from our experiences, even within single administrative domains, that matching on one key, such as an SSN, can be problematic.[7]  So, when extending our models to span multiple administrative domains, we must create an environment where more complex mapping functions can be supported.  The model above can be extended to support multiple challenge-response requests when attempting to map a single object.  The values themselves can also be massaged via algorithms such as Soundex to make up for typical mistyping and misspelling errors that may have occurred at one end of the potential mapping or the other.  The data source, or the system managing it, should keep the hashes of the values used in this process pre-calculated and stored in an indexed manner so as to make the mapping processes efficient.
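
A minimal sketch of this extension, assuming the two sides have agreed to compare a Soundex-normalized surname and a birth date: the choice of those two keys, the SHA-256 hash, and the function names are all assumptions made for the example. The `soundex` function follows the common American Soundex rules.

```python
import hashlib
from typing import Dict

_CODES: Dict[str, str] = {}
for _digit, _letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                         "4": "L", "5": "MN", "6": "R"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                  # H and W do not separate equal codes
        code = _CODES.get(ch, "")     # vowels have no code and reset 'previous'
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]


def challenge_hashes(surname: str, birth_date: str) -> Dict[str, str]:
    """Hash each agreed key separately so each can be challenged on its own."""
    def h(value: str) -> str:
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    return {"surname_soundex": h(soundex(surname)), "birth_date": h(birth_date)}
```

With this normalization, "Smith" and "Smyth" produce the same surname hash, so a spelling difference at one end no longer blocks the mapping. How many matching keys are enough to affiliate two objects is a policy decision for the parties involved.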

 

6        Documents

In addition to the documents listed below in the References section, the authors also recommend:

A Recipe for Configuring and Operating LDAP Directories (http://www.georgetown.edu/giia/internet2/ldap-recipe/)

Shibboleth (http://middleware.internet2.edu/shibboleth/)

A Brief Guide to OIDs (Object Identifiers) (http://middleware.internet2.edu/docs/a-brief-guide-to-OIDs.doc)

 

7        Advice To Implementers

Internet2 Middleware Website (http://middleware.internet2.edu/)

NMI-Edit Website (http://www.nmi-edit.org/)

NSF Middleware Initiative (http://www.nsf-middleware.org)

 

8        Acknowledgments

Sections of this document have been authored, in whole or in part, by Robert Banz, Tom Barton, Keith Hazelton and Michael Gettes.  We would like to thank the membership of the MACE-Dir group for their valuable contributions of ideas and discussions during the drafting of this document.

 

9        References

[TB02] Barton, Tom, ed. Metadirectory Practices for Enterprise Directories in Higher Education, Internet2, October 2002

[DB02] Blum, Dan and Kobielus, James. Toward Federated Identity Management, The Burton Group, 23 Aug 2002

[IT01] ITU-T Recommendation X.501, ITU, Feb 2001

 

10   Contact Information

Robert Banz (Editor)
University of Maryland, Baltimore County
Email: Robert.Banz@umbc.edu


A      The Data Router

Building and managing multiple data exchange systems can be difficult, especially when each may vary in the data translations, policies, and techniques necessary to manage the exchange so as to meet its individual requirements.  An alternative solution is to introduce a new piece of infrastructure to handle the various exchange techniques, and provide a framework in which to manage various translation and policy decisions.  The data router occupies a position in a data distribution network similar to that of an internetwork router, providing format translation, access control and transport services to specific data elements.  Just as a router sits in front of a data network, routing traffic in and out of your institution’s internal network, the data router sits in front of a data source, providing much the same functionality.  An assumption made throughout this discussion is that we may be dealing with multiple types of data sources, from Oracle databases to LDAP directories, each with different access methods.


 


The simplest view of this would be a one-to-one correspondence between data routers and databases.  However, in some instances, it may be more efficient to place one data router in front of multiple databases, or potentially in front of none and simply have it talk to other data routers.  While all of these are possibilities, for the sake of keeping the discussion sane, we will assume a one-to-one relationship between a data router and a database (whatever it may be).

In order to get a better grasp on how it is envisioned that a data router be used, we’ll consider using it in a model that is better understood.  In this instance, Human Resources and Student Administration data are being merged into a “Person Registry”, which is then published in the enterprise LDAP directory.  In many implementations, the data feeds going to the “metadirectory” system are nightly text dumps of the databases, which are then processed by a bunch of scripts populating the registry and LDAP directory.

 


To get a better idea of how a data router would be used, the diagram below shows this common architecture implemented with data routers.

 


 

Figure 1: Architecture with Data Routers

As you can see, the central metadirectory has disappeared, being replaced by multiple pieces of architecture, which together implement the metadirectory functionality that our central metadirectory was responsible for.  The first reaction may be that this is a bad idea, replacing one thing with four.  However, there are other benefits to the architecture that will hopefully become clear.

Using the data routers, we could very well implement a logical architecture that is identical to the one displayed in Figure 2.  Our data routers at the Human Resources and Student Administration systems would have a scheduled event, creating a stream of information to the data router placed in front of our Person Registry.  That data router would, in turn, send data to the Enterprise LDAP directory to update the data there as necessary.  The Person Registry’s data router has three policies it is working on: two that receive data from the data routers in front of the legacy systems, and a third governing its conversations with the Enterprise Directory’s data router.  Corresponding policies exist on the data routers at the other end of these conversations.

While having this many points of control in a tight environment such as the one above may be more work (which we try to avoid), it shows the benefits of this architecture as it would apply to a more distributed data sharing environment.  One party decides how they are going to share their data and represent it while the other decides how they are going to use it.  The tough work (how to get it from point A to point B) should be considered taken care of by SOAP and its chosen transport.

A.1   Assumptions about our Environment

In order to get this data routing architecture off the ground, there are at least two assumptions we have to make about our environment.  The first is that we have an “Object-Relational” view of our world, and the second is that there is some transport available to bind our SOAP methods to that can offer security and mutual authentication.

A.1.1  An Object-Relational View

As we have been doing our work in the directory world, we have come to view things as “objects”.  An object represents something that we can clearly define: a person, an account, the CS101 class taught in fall 2002.  Each of these objects has attributes that further describe it. A person may be described as a student, and as taking the CS101 class.  The account may be described as having a username, a numeric UID, and an owner such as the person just referred to.  The CS101 class has an instructor, a course description and a list of students, one of which refers to the person above.  However, our back-end data stores are typically RDBMSs (Relational Database Management Systems), which do not quite have this same view of the world.  One of the primary purposes of our current metadirectory processes is to join data from the many tables and sources that are at our disposal and build an object view of this world.

Where does this fit in with the data router?  The key element that makes the data router useful is the set of policy and data translation rules it is given to process. Those policy rules stem from out-of-band discussions between the individuals and organizations to whom the data belongs and by whom it will be used, defining what data in their back-end system defines an object, and how that data may be mapped into attributes for that object.  Then there is the black hole of defining syntax for attributes -- which is another can of worms altogether.[8]  The end result is that the way data is represented at the interface of a data router should be independent of the underlying structure; that is, it should respond to the question “what is the value for ‘classlist’ for object ‘X’?” by doing the appropriate work on the back end and returning the appropriate value, in the agreed-upon syntax.  This could be a 5-table join against the back-end database, but it should be a simple question and answer to the user of the data router.
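
A minimal sketch of that question-and-answer interface, using an in-memory SQLite database as a stand-in for the back-end RDBMS: the table layout, the sample rows, and the `get_attribute` function are all hypothetical, and a three-table join stands in for the more complex joins a real system might need.

```python
import sqlite3
from typing import List


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny relational back end with made-up course data."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE course (id INTEGER PRIMARY KEY, code TEXT);
        CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE enrollment (course_id INTEGER, student_id INTEGER);
        INSERT INTO course VALUES (1, 'CS101');
        INSERT INTO student VALUES (10, 'Pat Smith');
        INSERT INTO student VALUES (11, 'Lee Jones');
        INSERT INTO enrollment VALUES (1, 10);
        INSERT INTO enrollment VALUES (1, 11);
    """)
    return db


def get_attribute(db: sqlite3.Connection, object_id: str, attribute: str) -> List[str]:
    """Answer "what is the value of <attribute> for object <object_id>?"."""
    if attribute != "classlist":
        raise KeyError(attribute)     # only one attribute is mapped in this sketch
    rows = db.execute(
        "SELECT s.name FROM course c "
        "JOIN enrollment e ON e.course_id = c.id "
        "JOIN student s ON s.id = e.student_id "
        "WHERE c.code = ? ORDER BY s.name",
        (object_id,),
    ).fetchall()
    return [name for (name,) in rows]
```

The caller asks `get_attribute(db, "CS101", "classlist")` and receives a list of names; nothing about the tables or the join is visible at the interface.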

A.1.2  Transport: Another Assumption

We’ll assume that the transport layer for our SOAP requests can provide strong mutual authentication between two data routers.  This is because there is a seemingly obvious requirement that the providing data router be able to verify the identity of the requestor (where the data is going), and that the recipient be able to verify where the data is coming from.  Such a transport could be SOAP over HTTPS, with client and server side certificates signed by a trusted authority.  In a message queuing architecture, it may be necessary to both sign the data with the source’s private key and encrypt it with the recipient’s public key.  For now, however, we’ll assume that such mechanisms are available, and that we can rely upon them to secure transport and to give assurances to either end of the conversation about who or what it has been talking to.

A.2   The Duties of a Data Router

The data router is responsible for executing code that implements policy agreements made by producers and consumers of the data.  It is responsible for communicating with the underlying data storage mechanism(s), and translating the information to and from objects, the contents of which are defined by the policy agreements.  It may also map these objects to “object handles”, which may be stored in a data store that is internal to the data router or, in some instances, in the existing data store.  The data store should support the reading, writing, modification, and deletion of attributes or entire objects, as policy dictates.  The data router shall expose its functionality to the outside world via SOAP methods, which may be bound to a variety of transports.  A subset of these methods is responsible for managing the data transfer to and from the data router.  Other methods may deal with the establishment of “object affiliations”: identity mappings between objects stored in different data stores which may correspond to the same thing, such as a person, a class, or a network node.  There should also be provisions for the data router to do rudimentary data maintenance tasks, such as periodically checking its providers for updates to data that it had previously been given.[9]

The data transfer methods should support those that may be required by a variety of architectures, conditions, and policies.  Particularly, it should support both consumer driven (pull), and producer driven (push) environments.  Support should exist for batch feeds, change based, and incremental updates.[10]
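
One way to see how batch, change-based, and incremental updates can share a single mechanism is a sequence-numbered change log. This is a minimal sketch under that assumption; the `ChangeLog` class and its methods are hypothetical and are not part of any data router specification.

```python
from typing import Dict, List, Tuple

Change = Tuple[int, str, str, str]   # (sequence number, object, attribute, value)


class ChangeLog:
    """Records every attribute change a producer makes, in order."""

    def __init__(self) -> None:
        self._entries: List[Change] = []

    def record(self, obj: str, attribute: str, value: str) -> int:
        """Append one change and return its sequence number."""
        seq = len(self._entries) + 1
        self._entries.append((seq, obj, attribute, value))
        return seq

    def since(self, last_seen: int) -> List[Change]:
        """Incremental update: everything after the consumer's last sequence."""
        return [entry for entry in self._entries if entry[0] > last_seen]

    def snapshot(self) -> Dict[Tuple[str, str], str]:
        """Batch feed: the current value of every (object, attribute) pair."""
        state: Dict[Tuple[str, str], str] = {}
        for _, obj, attribute, value in self._entries:
            state[(obj, attribute)] = value
        return state
```

A producer-driven (push) router would send the result of `record` as each change happens, while a consumer-driven (pull) router would call `since` with the last sequence number it processed; a single change event is then just the special case of `since` returning one entry.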

A.2.1  And other duties as assigned

Given a flexible policy engine, the data router could do almost anything we wished.  One of the duties we may wish to give it is handling metadata that may be associated with attributes.  In Sections 4.2 and 5.5, the concepts of “attribute signatures” and other attribute-specific metadata are introduced.  The data router is in a unique position to both manage and utilize much of this information.  One such example involves using digital signatures to verify the authenticity of a piece of information once it has traveled from its original domain.  The benefit is that, as data leaves foo.edu’s data store, it is signed with foo.edu’s signing key.  When the data ends up in bar.gov’s directory of academics, the user of that data can verify that the phone number listed for Professor Smith is, in fact, what foo.edu thinks it is by checking the digital signature against the data provided.

Some other examples of metadata, which would be appropriate for the data router to manage, are:

·        Update Timestamps

·        Expiration Timestamps

·        Access Control Recommendations

 

 


An attribute could be enriched with these data points on its way into or out of a data router, influenced by the policy agreement, of course.  The data router could also use some of this attribute “metadata” to influence its own decision-making.  Receiving an attribute with an expiration time that has already passed is one simple example.  The digital signature itself could be used to also influence policy.  Take this example, where blatz.edu rejects an attribute change that has percolated through a web of data routers.  One hypothetical reason for this could be that blatz.edu is not satisfied with foo.edu’s data update policies on phone numbers.

Of course, the opposite could have been true, and blatz.edu may trust data coming from foo.edu so much that it will then push the data out to its core business systems.  This can be done safely, as it knows that bar.edu has not altered the attribute, and, it can verify its source.
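
A minimal sketch of how a receiving data router might apply two of these checks, the expiration timestamp and a per-source trust policy: it assumes the source has already been established by verifying the attribute's signature, which is not shown. The function name and the idea of a simple set of trusted sources are assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import Optional, Set


def accept_update(source: str,
                  expires: Optional[datetime],
                  trusted_sources: Set[str],
                  now: Optional[datetime] = None) -> bool:
    """Decide whether an incoming attribute update should be applied."""
    if now is None:
        now = datetime.now(timezone.utc)
    if source not in trusted_sources:
        return False          # local policy does not accept this originator
    if expires is not None and expires <= now:
        return False          # the attribute expired before it arrived
    return True
```

In the terms of the example above, blatz.edu would reject a phone number originating at foo.edu simply by leaving foo.edu out of the trusted set for that attribute, regardless of how many data routers the update passed through on the way.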


B      Stitched Directories Proposal

Michael Gettes, author

This document proposes a way to maintain attributes and values in one LDAP directory that reflect their authoritative value as found in a second LDAP directory. The proposal offers ways to maintain currency, authenticity and integrity of the values of the attribute as well as representing the authoritative directory home for each "stitched" attribute.

Problem: Affiliated Directories, Application Specific Directories, Associated Directories -- they all go by many names but have similar properties. One directory wants to be either linked to, or have some of the same data as another directory. Some don't understand this need and vehemently disagree with the concept. So, by way of a simple example, let's try to demonstrate the utility of a solution to this problem.

Imagine that Peter Alterman is a member of the US Government, which he is, and he has an entry in the http://directory.gov white pages service. Imagine that he is also on the faculty at Georgetown University, which he is NOT, and he is also to be found in the Georgetown white pages. Let's also assume that both white pages services are based on LDAP. Would it not be useful to be able to search for Alterman at Georgetown University and, in displaying his entry, have the directory link to the government entry for Peter, display that information, maybe in summary form, and then provide an integrity check that can comfortably state that the data in both directories are, in fact, directly related? Then take things a step further and apply this to locating video capabilities for Peter Alterman, whereby the video capabilities themselves are stored in another directory maintained by ViDe. Is this not useful?

So, assuming this is a useful ability and the simple use case described, here are some initial thoughts on how to stitch directories together.

Definitions: dA and dB are 2 directories. dA(y) references a specific object in dA.

This discussion will NOT attempt to resolve the following issues:

(1) policy issues of what data elements are made available.

(2) how data is made available from dA to dB and vice-versa. This means that issues of replication and synchronization are not considered here.

dA contains a special object which describes dB as having some association with dA. This entry might be called an SDL, Stitch Directory Link. An SDL object dA(SDL=dB) would contain an X.509 certificate (or something similar) that belongs to dB. There would also be a list of the attributes that dB is supplying to dA, like CN, SN, eduPersonPrincipalName (EPPN), etc. (these could be any legitimate attribute available in a legitimate objectclass that is known by dA and present as an "objectclass" attribute in dA(x)).

Let's assume the EPPN will be made available from dB(mrg) to dA(gettes). This would be written as dB(mrg)EPPN => dA(gettes)EPPN. So, for dB(mrg)EPPN to be made available, dB computes dB(mrg)->SIGN(EPPN+VALUE(EPPN)), where SIGN digitally signs this material using the dB private key, and the dB[Certificate] is available in dA(dB)Certificate. The attributes and the signed material are sent to dA and stored in dA(gettes). The attributes are transmitted as is. The signed material is sent in a special, possibly multi-valued attribute called SDLstuff (bad name). SDLstuff is made up of the following:

Signature blob, (list of attributes involved in the signature, if present), DN(dA(SDL=dB))

From the above, one could determine which attributes are involved in the signature, the signature and which distinguished name is used to link to the SDL of dB in dA. Then one could verify that the attributes in the list, if present, were originally sent by, or contain the same data as originally sent by dB. There should probably be some concept of time either in the SDLstuff attribute or in the dA(SDL=dB) entry.



[1] Sadly, these are sometimes the same.

[2] In the case where these campuses decide not to share their data by exposing the overused SSN as the key.

[3] http://www.projectliberty.org/specs/liberty-architecture-overview-v1.0.pdf

[4] In addition, scripting transfers with ‘scp’ is usually much simpler than scripting with ‘ftp’ as anyone that has tried it will know.

[5] For those who insist on concrete examples in their theoretical models, "md5" may be a suitable drop-in.

[6] B probably keeps a copy of these hash values pre-computed and indexed for this purpose.

[7] Even though that’s what many of us do anyhow, and it’s "good enough for government work."

[8] We’ll get to this later; suffice it to say you should leverage pre-defined standards for attribute naming and syntax whenever you can.

[9] We never said that the transport had to be reliable!

[10] "Here’s a change that just happened" and "Here’s everything that’s happened since I last talked to you" are very similar, though.