NSF Middleware Initiative
Robert Banz
draft-internet2-mace-dir-inter-domain-data-exchange-00.html
University of Maryland, Baltimore County
Copyright © 2002 by UCAID and/or the respective authors
October 2002
Comments to: nmi-support@nsf-middleware.org
Development of this document was supported with funding from the
University of Maryland, Baltimore County, Internet2, and the NSF Middleware
Initiative (Cooperative Agreement No. ANI-0123937).
This paper is an attempt to record thoughts on a type of problem in which
data elements must be transported, stored, or used across independent administrative
domains. Numerous issues arose in discussions of this problem space within
the Internet2 MACE-Dir working group. We have tried to present those deliberations
in this paper.
For additional information and related topics and resources see the following
sites:
Internet2 Middleware Initiative
MACE
EDIT
NMI
3.1 Multi-Campus University System
3.3 Visiting Medical Professional Records Access
3.5 Third-Party Authorizations
4.1.1 Intra-Domain Identity Mapping
4.1.2 Inter-Domain Identity Mapping
4.2 Data Plus Associated Metadata
5 Techniques and Alternative Solutions
5.3 Federated Access Control Methods
5.4 X.501 Knowledge References
5.5 Representing Attribute Meta-Data
5.8 Techniques for "Zero Knowledge" Identity Mapping
A.1 Assumptions about our Environment
A.1.1 An Object-Relational View
A.1.2 Transport: Another Assumption
A.2 The Duties of a Data Router
A.2.1 "And other duties as assigned"
B Stitched Directories Proposal
In today’s interconnected world, the need to access data across multiple
administrative domains is increasing. Students and faculty alike require
the appearance of a seamless environment between local and remotely hosted
services. Grant funding organizations require updated contact information
for research faculty. The INS now requires universities to keep
it apprised of enrollment data regarding international students. Campuses
wish to pool resources, a step that requires federated identity management
that crosses existing administrative lines.
Many of these problems require techniques beyond the typical metadirectory
processes that are currently in use within campus infrastructures. While
time is taken to briefly discuss some of these, readers are encouraged to
become familiar with Metadirectory Practices for Enterprise Directories
in Higher Education (http://middleware.internet2.edu/dir/metadirectories/internet2-mace-dir-metadirectories-practices-200210.htm)
as many concepts discussed in depth in that document are built upon here.
The discussion within the MACE-Dir working group
from which this paper is abstracted was initiated by Michael Gettes’ “Stitched
Directories” proposal. The original “Stitched Directories” proposal is
included here as Appendix B.
There is also quite a bit of overlap between this problem space and what
is called Federated Identity Management. The Burton Group defines this
term as “the use of agreements, standards and technologies to make identity
and entitlements portable across autonomous identity domains” [DB02].
Several scenarios involving inter-domain data exchange are presented in section
3 below. Issues bearing on the requirements that a solution must satisfy are
collected in section 4. In section 5 some techniques are described that help
to further illuminate the problem space.
The MACE-Dir Working Group considers this a work in progress and warmly invites
readers to participate in its further development. The first step would be
to send an expression of interest to mace-dir-comments@internet2.edu.
This version of the document is published as an Internet2 Draft and will expire
in April 2003.
AuthN: Authentication
AuthZ: Authorization
Data Sink: An endpoint of data flow, opposite of a Data Source.
Data Source: The source of a data flow, opposite of a Data
Sink.
Local Identity: The attributes, entitlements, etc., which are
associated with an identity specific to a single administrative or functional
domain.
Metadirectory: An architectural element that executes Metadirectory
Processes.
Metadirectory Processes: The processes by which source data
is captured, transformed, and presented in an enterprise directory. [TB02]
This scenario describes a multi-campus situation where each campus is independent
insofar as its computing facilities and identity management procedures
are concerned. However, the campuses face growing demands to support integrated
services: they share some facilities, have a shared library system, and have
a high degree of multiple appointments among both faculty and students.
The shared library system directors, in particular, have expressed their
desire for a single interface for accessing and retrieving the various bits
of identity information for the members of this multi-institution community
regardless of their originating campus. Historically, the ‘branches’ of the
library system located at each campus have worked out interfaces to their
campus business systems. The form of these linkages varies from automated
data feeds to manual input.
At first glance, this is a classic metadirectory scenario. In fact,
most of the required procedures are those that you would use in a single institution
to map identifiers and ‘join’ identities (see the object identity mapping
requirement in Section 4 below). Two features of this scenario take
it beyond a classic metadirectory problem. First, existing campus-level
infrastructures remain complete and independent of each other. Second, there
is a need to capture metadata that reflects the originating sources of the data
being provided and associates each person and their roles with the appropriate
member institution(s) (see the data plus associated metadata requirement in
Section 4). As with all inter-domain data exchanges, there is a requirement
to specify a transport mechanism and a need for parties at the data sources
and the data sinks to have a shared language covering both information syntax
and semantics (see the transport and shared language requirements).
Many funding agencies, such as the National Institutes of Health, maintain
databases of current and potential Principal Investigators (PIs) who may be
applying for or currently working on projects funded by that agency.
Other organizations provide services to notify potential researchers of pending
solicitations. Both of these functions require a way of searching current
research faculty information at hundreds of universities. Of particular interest
are contact information, current projects, and research interests of potential
grant recipients. In addition to sharing all the requirements of the
multi-campus system scenario, this scenario requires that the agency’s access
to institutional data be appropriately controlled and limited (see the federated
access control requirement).
This scenario describes access to resources including protected health information
by visiting health care providers, where they would not have or need a local
network user ID at the visited facility. For example, a visiting physician
(or other provider such as a nurse or laboratory technician) is given access
to specific patient records at a health care facility.
The visiting care provider (visitor) needs temporary
network access at the visited facility to securely log into their home facility
for identification and authorization. The home facility consults the
business agreement with the visited facility and presents the visitor’s access
authorization based upon the rules contained in the agreement. The visited
facility grants appropriate access based upon the visitor’s presented credentials
and records details of all the visitor’s transactions and data access for
later audit if required.
Most people have some sort of "personal address book," either stored
in their desktop email client, on a PDA or scribbled on little scraps of paper
stuffed into their wallet. This is not a situation we commonly think
about in the directory space, nor would we go about solving this problem with
typical metadirectory functionality. However, the problem is an interesting
one, and it invites exploration of techniques to manage affiliations with
data sinks that spend much of their time in an off-line state.
The "holy grail" for personal address book management would be
to have the capability for on demand synchronization of personal information
that may have its authoritative source housed in various organizations’ directories.
The usage scenario would be similar to that of "syncing" a PDA with
a desktop system, except that instead of a single desktop machine, the PDA
syncs with various directories scattered across the Internet.
The requirement unique to this scenario is the need for resource discovery.
That is, there needs to be some mechanism by which the PDA client can discover
where to go to find authoritative, up-to-date contact information for the
individual entries in its address book.
Jane Doe is a faculty member at Foo.edu. She is also a member of a professional
organization, Bar.org. Both Foo and Bar have contracts with vendor X providing
their members with access to specific licensed resources. Foo.edu has licenses
for products A, B, and C for use by its faculty, staff and students. Bar.org
has licenses for products B, C, and D for its members. When Jane sits at her
desk and accesses vendor X, she would like to be given access to all of the
products to which she is entitled, A, B, C and D. The content provider
may wish to receive authZ information only (for example, patron anonymity
is important in library-like situations).
This scenario requires a method for federated access control, and a method
of associating attributes from multiple identity authorities. First,
a system needs to allow the content provider X to receive authentication and/or
authorization assertions from Foo.edu. The special challenge of this
scenario lies in how to get a trustworthy assertion of the person’s membership
in Bar.org to the content provider (see the data integrity requirement). Two
approaches can be taken here. In the first, the authorization assertion is made
by Foo.edu, provided there is a mechanism by which Bar.org supplies Foo.edu
with the information and trusts Foo.edu to make such assertions on its
behalf. The second approach relies on Foo.edu systems knowing that,
to access X, Jane’s authZ process needs to make a pass through Bar.org to
pick up additional authZ assertions before reaching site X.
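The outcome Jane expects can be illustrated with a small sketch. The license table and the function name are hypothetical, chosen only to mirror the scenario; how trustworthy assertions from both organizations actually reach vendor X is the open problem described above.

```python
# Hypothetical license data taken from the scenario text.
LICENSES = {
    "foo.edu": {"A", "B", "C"},
    "bar.org": {"B", "C", "D"},
}


def entitled_products(affiliations):
    """Return the union of products licensed by each asserted affiliation."""
    products = set()
    for org in affiliations:
        # Unknown organizations contribute nothing rather than raising.
        products |= LICENSES.get(org, set())
    return products


jane = entitled_products(["foo.edu", "bar.org"])
```

With only the Foo.edu assertion available, Jane would be limited to products A, B, and C, which is why a second source of assertions is needed.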
Object identity mapping, for the sake of this discussion, is the procedure
by which it is determined that two or more digital objects represent information
relating to the same real-world subject. In current implementations,
the subject being referred to is most often a person. However, there
will be situations where the subjects being mapped between directories are
not people, but may be computers or "data objects" such as those
stored in a digital library. For each type of object, there will be
identifier attributes used to uniquely indicate a particular instance.
For example, a person’s name can be used to assert who someone is. Of
course one name, such as "John Smith," will map to any number of
people.
From experience in dealing with identity mapping in reasonably manageable
populations, such as a college campus, we know that a name is not a good identifier.
A good identifier has the property of being globally unique across the population
and, even better, persistent through time. As discussed in Identifiers,
Authentication and Directories: Best Practices for Higher Education (http://middleware.internet2.edu/docs/internet2-mi-best-practices-00.html),
and Metadirectory Practices for Enterprise Directories in Higher Education
(http://middleware.internet2.edu/dir/metadirectories/internet2-mace-dir-metadirectories-practices-200210.htm),
it is clear that most of these "good" identifiers tend to exist
only in one database or another which makes them generally bad candidates
for any kind of inter-domain mapping. Some campuses have taken to the
practice of issuing a "globally unique identifier" for members of
their population generated by their enterprise directory. However, such an
identifier is only for use within the campus community, and holds no meaning
outside of it. Thus it cannot be used for mapping when moving outside of the
institution.
In practice, intra-campus mapping has traditionally relied on
identifiers such as a "Student ID" number or "Social Security
Number"[1],
and while the appropriateness of using such identifiers for mapping can be
argued, for most purposes it has been sufficient. Sufficient being defined
as: it works most of the time, and for the few times it doesn’t work the resulting
mis-mapping can either be easily fixed or completely ignored. However,
some institutions have gone further than using these identifiers to do mapping,
and check other attributes of an individual, such as their name, address,
date of birth, or phone number to verify the strength of their mapping decision,
or potentially reject it.
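A minimal sketch of this "match, then verify" practice follows. The record fields, the function name, and the threshold parameter are assumptions made for illustration; real campus implementations will differ in which attributes they compare and how strictly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonRecord:
    campus_id: str       # e.g. a Student ID number (hypothetical field)
    name: str
    date_of_birth: str   # ISO 8601 string, e.g. "1970-01-01"


def same_person(a, b, min_corroborating=1):
    """Match on the primary identifier, then check secondary attributes."""
    if a.campus_id != b.campus_id:
        return False
    corroborating = 0
    if a.name.strip().lower() == b.name.strip().lower():
        corroborating += 1
    if a.date_of_birth == b.date_of_birth:
        corroborating += 1
    # Reject the mapping when too few secondary attributes agree.
    return corroborating >= min_corroborating
```

Raising `min_corroborating` trades fewer false joins for more records that need manual review.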
Before discussing the intricacies of inter-domain identity mapping, there
are some basic questions that need to be asked:
· Does our directory need identity mapping at all?
· What is the scope requirement for identity mapping: all identities, or a subset?
· Does the identity mapping need to be automatic?
· Do data source and data sink share identifying information?
The first question simplifies everything: if it is unnecessary to provide
any identity mapping function on the target directory, all of these problems
can be ignored. Many situations may not need any mapping functions at
all. Scenarios exist for which it is acceptable, even potentially desirable,
that multiple entries be made for the same individual if they enter the directory
from different institutions. This is the tack that the Directory
of Directories for Higher Education (http://middleware.internet2.edu/dodhe/)
has taken. The benefits gained from not mapping include simplifying
the data processing needed and adding to the privacy of the individual by
not identifying their potential multiple affiliations. On the other
hand, providing a mapping function could be integral for assuring an individual
is given all of the resources that they are entitled to, if the directory
is to be used for access control purposes; or to provide a one-stop-shop for
finding all of an individual’s contact information.
The second question is a follow-up to the first. The "scope"
in which you are performing identity mapping may also vary depending on your
need. For example, you may only need to identity-map faculty and staff,
and students may not need federated identities. Additionally, the "scope"
could be defined as "Folks from these institutions get their identities
mapped, but if they are from any other, they don’t."
A potential solution to this mapping problem for the multi-campus system
scenario would be to create a "bridge" identity management system
that would map identities existing at the institutional level to a single
person object, creating a multi-campus person registry. This would provide
a single point for campuses to do lookups against when
‘campus A’ is receiving data from, or requesting data from, ‘campus B’ and the
primary identifiers at the campuses differ.[2] By also exposing some basic affiliation
data in this registry, the library system would have a single data source
to feed their identity management system (and also find the most current billing
information for someone with overdue fines!)
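The lookup role of such a bridge registry might be sketched as follows. The class, its method names, and the "person-N" key format are all hypothetical; the sketch assumes the decision that two campus identities belong to the same person has already been made elsewhere.

```python
class BridgeRegistry:
    """Maps (campus, local identifier) pairs to one registry-wide person key."""

    def __init__(self):
        self._by_local = {}   # (campus, local_id) -> person key
        self._next_key = 1

    def register(self, campus, local_id, same_as=None):
        """Record a campus identity; pass same_as to join an existing person."""
        if same_as is None:
            same_as = f"person-{self._next_key}"
            self._next_key += 1
        self._by_local[(campus, local_id)] = same_as
        return same_as

    def translate(self, from_campus, local_id, to_campus):
        """Return the person's identifier at to_campus, or None if unknown."""
        person = self._by_local.get((from_campus, local_id))
        if person is None:
            return None
        for (campus, other_id), key in self._by_local.items():
            if campus == to_campus and key == person:
                return other_id
        return None


registry = BridgeRegistry()
key = registry.register("campus-a", "A123")
registry.register("campus-b", "B987", same_as=key)
```

Campus A can then ask the registry which identifier campus B uses for the same person without either campus changing its own primary identifier.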
Once it is decided that some identity mapping needs to be done, the question
of whether to make the mapping an automatic process is on the table. Some
of the same advantages and disadvantages of doing mapping at all also
apply to this decision. One way to make the mapping manual is to assign
staff at an organization to apply appropriate mapping logic to each entry.
This is probably not a feasible option: the staff
costs could be overwhelming, and human error could make the
mapping decisions less reliable, not more. However, a potentially more appropriate use for "human
driven" mapping is one where the individual being "mapped"
would be involved in the decision. This is the road taken by the Liberty
Alliance in implementing their "reduced sign-on" system.[3]
Simplified, this approach requires the individual to authenticate themselves
to both data sources, "proving" that the identities listed there
represent him/her, allowing backend processes to "link" these identities.
The privacy benefits of this are obvious, as the decision to link identities
is completely in the individual's hands, and in the Liberty Alliance case,
there are procedures in the design for "breaking" the mapping at
the request of the individual. For applications such as reduced-sign-on
and the scenario where one may want to augment their institutional entitlements
to resources with those provided by other affiliations, the voluntary mapping
described here would fit well.
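The voluntary linking idea can be sketched in a few lines. This is not the Liberty Alliance protocol; it is a simplified illustration with hypothetical names, in which the two boolean flags stand in for successful authentication at each data source.

```python
class LinkStore:
    """Holds voluntary links between identities at two data sources."""

    def __init__(self):
        self._links = set()

    def link(self, id_a, id_b, authenticated_a, authenticated_b):
        """Create a link only if the person authenticated to both sources."""
        if not (authenticated_a and authenticated_b):
            return False
        self._links.add(frozenset((id_a, id_b)))
        return True

    def unlink(self, id_a, id_b):
        """Break the link at the individual's request."""
        self._links.discard(frozenset((id_a, id_b)))

    def linked(self, id_a, id_b):
        """Report whether the two identities are currently linked."""
        return frozenset((id_a, id_b)) in self._links
```

Because a link exists only after the individual has proven control of both identities, and can be removed on request, the mapping decision stays in that person's hands.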
When it comes to automating an identity mapping process, success depends heavily
on policy matters, which may also intersect with personal privacy
and governmental regulations. The key factors are the policies regarding
what identifying information is available in the data sources, and what restrictions
are in place on making this data available to the mapping process. This may
be problematic, because the attributes that are most useful for mapping, such as an individual’s SSN
or date of birth, are also typically the ones most closely guarded
by individuals and their affiliated institutions.
However, if the inter-institutional directory being constructed
is of a nature that lends itself to this information being available, the
best practices that are in place in the intra-institutional environment may
be applied successfully in this space. When this information
is not directly available, some alternative methods of mapping may need to
be considered. One alternative is described later, in section 5.8.
For data items to be successfully exchanged and used by multiple domains,
it is often insufficient to send a raw attribute-value pair. The data
item may need to be bundled with associated metadata that provides the context
for and characteristics of the information conveyed. In the multi-campus
scenario, to use a simple example, the affiliations of faculty and student
need to be associated with the campus that asserts them.
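One way to picture such a bundle is shown below. The class and field names are hypothetical, and eduPersonAffiliation is used only as a familiar example attribute; this is a sketch of the idea, not a proposed wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssertedAttribute:
    name: str          # attribute name, e.g. "eduPersonAffiliation"
    value: str
    asserted_by: str   # the campus (data source) making the assertion


def affiliations_by_campus(attributes):
    """Group affiliation values under the campus that asserted each one."""
    grouped = {}
    for attr in attributes:
        if attr.name == "eduPersonAffiliation":
            grouped.setdefault(attr.asserted_by, []).append(attr.value)
    return grouped
```

A person who is faculty at one campus and a student at another then shows up with both affiliations, each tied to the campus that vouches for it.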
It is difficult to decide which task is more daunting in exchanging data
between differing administrative domains: the policy aspects or the technical
hurdles. While the managers of the data can argue policy questions until
the end of the world, it is often assumed that the technical folks will be
able to hash out a solution in minutes. Of course this is rarely the
case. To make matters worse, the enterprises and systems that may be exchanging
data are quite disparate in nature, and a single site may need to take part
in a potentially unbounded number of combinations of these relationships.
All in all, you face a significant management challenge. Until
the day comes that there is a single standard for everything, all that can
be offered are recommendations as to how such problems may be made more manageable.
The methods of transport available to get data from point "A" to
point "B" are highly dependent upon the specific problem to be solved.
Solutions can range from simply FTP-ing a file from one site to another
on a daily basis, to complex message-oriented exchange techniques that implement
close-to-real-time data updates, to relying on the remote repository to
request updates as it needs them. To add one more decision into the
mix, the question arises of whether to "populate" the remote directory
at all or, alternatively, to use techniques that allow the user/application
to gather the information in "real time" from the various data sources.
The transport may also be responsible for providing security for the
data, including basic access control, transport level encryption, and mutual
authentication of the transport endpoints. There seems to be very little
that can be done to limit the methods of data transport that may be required.
However, we recommend leveraging existing methods described in the Metadirectories
document, as well as some techniques that are discussed in section 5.
By "shared language," we refer to both the syntax and meaning of
an attribute/value combination. This is a difficult but solvable problem.
Given the work done on LDAP, XML schemas, and other such information
interchange standards, there most likely exists work that spells out a way
to represent at least some of the data you wish to share with others.
It is our recommendation that you rely upon relevant standards when representing
your attributes to external organizations, and resist the temptation to make
data available directly in whatever format best suits each recipient’s needs.
Failing that, make it a community of interest project to define a shared
language, for example, by collaboratively defining a directory schema with
specifications for the semantics and intended uses of the attributes and values.
This is precisely what has been done within Internet2 with such object classes
as eduPerson and commObject.
By going this route, standards will be strengthened through use, and data
transformation code that may need to be written for a single relationship
may be reusable, instead of being a one-off effort. The receiving party
of the data should then be able to massage the data into any site/application
specific format that is suitable, or, hopefully, choose to use the data as
sent, in a standards-compliant form.
When exchanging person-related data, which is likely to be most of what is
to be exchanged, we suggest as a first step following the recommendations
conveyed in the eduPerson specification, which make suggestions as to the
use, syntax and meaning of common attributes including those from inetOrgPerson.
In the funding agency scenario detailed above, the home institutions of Principal
Investigators will not want to make all the data they hold about such people
available to funding agencies. There will be a select, negotiated set
of attributes to which the funding agency systems should be granted access.
The visiting medical professional scenario also requires that the visitor’s
home institution give only selective access to its records about the subject.
The federated access control solutions discussed in section 5 address this
requirement.
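The "select, negotiated set of attributes" can be sketched as a simple release filter. The policy table, requester name, and attribute names are hypothetical; the federated access control methods in section 5 address how the requester is identified and trusted, which this sketch takes as given.

```python
# Hypothetical release policy: requester -> attributes it may see.
RELEASE_POLICY = {
    "funding-agency.example": {"cn", "mail", "researchInterests"},
}


def release(record, requester):
    """Return only the attributes the requester has been granted."""
    allowed = RELEASE_POLICY.get(requester, set())
    return {name: value for name, value in record.items() if name in allowed}
```

A requester with no negotiated agreement receives an empty result, so the default is to release nothing.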
Some of the affiliation scenarios described in this paper revolve around
controlling access to content or procedures provided inter-institutionally
through web-based services. There are multiple methods either in use
or in development that can be utilized in these situations. However,
some scenarios, such as third-party authorizations (3.5), may require as-yet
undefined solutions. All of the methods described in section 5 below
rely on certain key technologies and standards, such as the Security Assertion
Markup Language (SAML), but each has unique aspects.
Resource discovery refers to the ability, in some automated or algorithmic
fashion, to determine where and how to access specific services or information.
In this case, the services that need to be accessed are those revolving around
retrieving data updates. For a situation such as the personal address
book scenario (Section 3.4), the ability to "find" the LDAP server
at an organization without necessarily knowing its IP address may be helpful.
Techniques such as DNS SRV records, which are described in section 5 of this
document, could be used for this purpose. However, other resource discovery
techniques such as UDDI (http://www.uddi.org) are more general purpose.
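As a rough illustration of the DNS SRV approach, the sketch below builds the name a client would query and orders the answers for connection attempts. It performs no DNS lookup itself (that would need a resolver library), the function names are our own, and the ordering within one priority is a simplification of the weighted random selection that RFC 2782 specifies.

```python
def ldap_srv_name(domain):
    """Build the SRV owner name for LDAP over TCP (RFC 2782 naming)."""
    return f"_ldap._tcp.{domain.strip('.')}"


def order_targets(records):
    """Order (priority, weight, port, target) tuples for connection attempts.

    Lower priority values are tried first. Within one priority this sketch
    simply prefers higher weights; RFC 2782 calls for a weighted random
    choice instead.
    """
    return sorted(records, key=lambda r: (r[0], -r[1]))


query_name = ldap_srv_name("foo.edu")
```

A PDA client that knows only the mail domain of an address book entry could use such a name to locate that organization's directory server.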
The key to the usefulness of the data available in any directory is its correctness.
With that said, being able to provide the correct information in a directory
that is a compendium of data from multiple sources can be quite a challenge.
Some of the issues that first come to mind are:
· Keeping in mind the data modification policies of the data sources.
· Having the most up-to-date copy of a source’s data.
· For identity-mapped objects, what happens when attribute values conflict?
Received data is only as good as it is at its source. When accepting
or using data that is not under your control, you need to have some assurance
that the data is correct. This assurance could come by several means,
the simplest of which is an out-of-band agreement with the custodians of that
data as to the nature of the assurance they can make regarding its integrity.
It is important as well that the minimum level of integrity of the data available
in the compiled directory be communicated to, and understood by, its consumers.
The level of assurance needed by a consumer may vary greatly. A directory
that is controlling access to medical data will require a high level of assurance
between all parties, as there may be legal and financial risk associated with
data being wrong. However, the level of assurance required to update
an email address in someone’s personal address book may be minimal, as the
risk associated may be very small. It may also be advantageous to have
a method of representing to a user the source of the data, as some applications
may require it.
Another vexing task in directory management is keeping data up-to-date.
Depending on the techniques in use to populate the directory, this may or
may not be as tough an issue to tackle. From an architectural standpoint,
one of the most desirable techniques for populating and updating would be
on a change-at-a-time basis. As this can be implemented so that data
is updated in a “real-time” fashion, assuming everything is working correctly
the target directory should have the most up-to-date information at any given
time. However, when it doesn’t work, changes stop coming or some changes
are missed. When this occurs, we need some mechanism in place to
note how old each item is. An application can then decide that
the information provided to it may be stale and do “the right thing,” and
the processes managing the directory have a basis for requesting an update to,
or potentially pruning, data that has reached the limit of freshness.
A method of a data source “recommending” an expiration time for a data element
may also be of some help, as the originator of the data may have some knowledge
(not known to the recipient) of how often this information changes.
Some of these techniques are in use as part of DNS, whereby zones on secondary
name servers have refresh intervals stating when they are to be checked for updates,
and expire values for how long a zone will be held if
requests for updates fail; individual records also carry time-to-live (TTL) values
that limit how long they may be cached. Of course, robust inter-domain message queuing
services would be an ideal fit for this problem.
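A freshness check of this kind might look like the following sketch. The function and parameter names are hypothetical; in particular, `source_hint` models the expiration interval "recommended" by the data source, and taking the shorter of the two limits is one possible policy among several.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated, now, max_age, source_hint=None):
    """Return True when an element is older than its allowed age.

    source_hint is an optional expiration interval recommended by the data
    source; when present, the shorter of the two limits applies.
    """
    limit = max_age if source_hint is None else min(max_age, source_hint)
    return now - last_updated > limit


received = datetime(2002, 10, 1, tzinfo=timezone.utc)
checked = received + timedelta(days=2)
needs_refresh = is_stale(received, checked, max_age=timedelta(days=1))
```

A directory process could use the result to request an update or prune the element, while an application could use it to warn that a value may be out of date.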
An important policy decision when creating a system that is to provide identity
mapping is what will be done when two of the sources that
have data mapped to the same object supply
conflicting values for the same attribute. In the intra-domain situation,
these rules may have been very cut-and-dried, as institutional policies or
other strong drivers may dictate what happens. For example, when dealing with
conflicting information between a Human Resources system and a Student Information
System, the data that is in Human Resources may automatically be chosen to
win out, as it is a reasonable assumption that folks probably want to have
their paycheck delivered to the correct address. When bringing together
data from multiple external sources, the choices may not be as obvious, since the
data sources are basically peers. Some other method of resolving
these discrepancies must be found. One, which may not seem so obvious
at first, is simply to present all of the options to the user of the directory
and let them choose. However, the usefulness of this may leave a little
to be desired. A second simple method is to choose the value that is
newer. Of course, the problem here is that many data sources do not record
the time an attribute was last changed. There is no easy answer to this
problem. What will be done in any situation will vary based on the nature
of the relationships between the data providers, and the meaning and intended
use of the attributes that are in question.
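The two simple methods mentioned above can be combined in a short sketch: take the newest value when every source records a change time, and otherwise hand all the options to the directory user. The tuple layout and function name are assumptions for illustration, and a real deployment would substitute whatever policy the data providers agree on.

```python
def resolve(candidates):
    """Pick one value from conflicting (value, source, changed_at) tuples.

    Returns (value, None) when a winner can be chosen, or (None, options)
    when the choice must be left to the directory user.
    """
    if not candidates:
        return None, []
    values = {value for value, _, _ in candidates}
    if len(values) == 1:
        # All sources agree, so there is no conflict to resolve.
        return values.pop(), None
    if all(changed_at is not None for _, _, changed_at in candidates):
        newest = max(candidates, key=lambda c: c[2])
        return newest[0], None
    # Some source did not record a change time: present every option.
    return None, sorted(values)
```

The fallback branch reflects the point made above that many data sources do not record when an attribute last changed.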
Further discussion on this can be found in Section 5.5 (Representing Attribute
Meta-Data), as how and what information can be asserted about an attribute
is directly related to the integrity of the attribute itself. Stitched
Directories (Appendix B), also makes an attempt to tackle basic data integrity
issues.
For situations that require that data be transferred to a central directory,
the simplest, and probably the most popular method, consists of large batch
updates of the data. This technique is probably in use, in some form,
in everyone’s architectures, and may consist simply of FTPing a file in a
specific format from the source system to the consumer. While this technique
doesn’t go far in keeping data as fresh as possible (your data is only as
good as the frequency of your last batch update), for many needs it is sufficient.
While it is important to keep things simple, we recommend that you
not simplify them so much that data security is ignored. FTP, unless
used in a very controlled environment, such as a VPN, is not a secure method
of transport. Simply using scp (part of OpenSSH, or other Secure Shell
implementations) instead of FTP is a substantial upgrade in the security of
your transport.[4]
If scp is not available, encrypting the data with an application such as GnuPG
or PGP is also an option.
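A batch job that pushes a file with scp could be wrapped as in the sketch below. The function names, file name, account, and host are hypothetical; `-B` is the OpenSSH scp batch-mode option, which prevents prompts for passwords, so the sketch assumes key-based authentication has already been set up.

```python
import subprocess


def build_scp_command(local_path, remote_user, remote_host, remote_path):
    """Build the argument list for copying one batch file with scp."""
    destination = f"{remote_user}@{remote_host}:{remote_path}"
    return ["scp", "-B", local_path, destination]


def push_batch(local_path, remote_user, remote_host, remote_path):
    """Run scp and raise CalledProcessError if the copy fails."""
    cmd = build_scp_command(local_path, remote_user, remote_host, remote_path)
    subprocess.run(cmd, check=True)


command = build_scp_command("feed.ldif", "feeder", "dir.example.edu", "/incoming/feed.ldif")
```

Passing the arguments as a list rather than a shell string avoids quoting problems with unusual file names.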
A “real-time join” refers to the coalescing of data from multiple sources
at the time that it is to be used. This covers an entire class of techniques,
from those that rely on an application to contact multiple data sources
to produce the ‘full picture’ of an object, to those that use a middleman,
or broker, to accomplish this task.

Assuming that identity mapping functions are provided in some manner, all
of the appropriate information about an object MAY be gathered by executing
LDAP queries across the services provided by the source institutions.
The application may even discover the locations of such servers through their
DNS service records (cf. section 5.6).
However, putting this functionality in a user application may not be appropriate.
Placing these functions in a service that acts as a broker, and allowing a
user application to access it via a protocol such as LDAP, or as a web service
application, may also be done.
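The broker variant can be sketched as a function that queries each source at request time and merges the answers. The source callables below stand in for LDAP queries against each institution; their names and the returned attribute layout are hypothetical, and the sketch assumes identity mapping has already produced a shared person key.

```python
def real_time_join(person_key, sources):
    """Query each source at request time and merge the answers.

    sources maps a source name to a callable that takes the person key and
    returns a dict of attributes (or None when the person is unknown there).
    Values are kept per source so that provenance is not lost.
    """
    merged = {}
    for source_name, lookup in sources.items():
        attributes = lookup(person_key) or {}
        for attr, value in attributes.items():
            merged.setdefault(attr, {})[source_name] = value
    return merged


example_sources = {
    "campus-a": lambda key: {"mail": "jd@campus-a.example", "cn": "Jane Doe"},
    "campus-b": lambda key: {"mail": "jane@campus-b.example"},
    "campus-c": lambda key: None,
}
joined = real_time_join("person-1", example_sources)
```

Keeping each value under the name of the source that supplied it leaves any conflict resolution to the caller.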

Shibboleth is a software package and specification built upon SAML that enables
a user who is authenticated by a site (foo.edu, the origin site) to have specific attributes
asserted by a party at foo.edu, and to have that assertion understood
and trusted by bar.edu (the target site), another site participating
in a Shibboleth “club.” More information regarding the Shibboleth project
and its specifications can be found at http://middleware.internet2.edu/shibboleth/.
In its current version, Shibboleth is not able to aggregate assertions about
a single user from multiple attribute authorities so it is not a standalone
solution to the third-party authorization scenario described above.
Shibboleth concepts could provide the basis for an access control mechanism
for the PDA address book scenario. A potential method would be to allow
the issuing of entitlement attributes which grant access to specific directory
objects and which could be revoked at any time by the subject. An architectural
element such as the Data Router (5.7) may be the appropriate mechanism to
provide access to data through such means, as LDAP (without extension) would
most likely not be sufficient.
The Liberty Alliance is an industry consortium founded in part to develop
methods for federated identity management. More information, and specifications
for the Liberty Alliance protocols can be found at http://www.projectliberty.org.
Currently, the protocols revolve around procedures to manage bindings between
multiple network identities existing on Liberty Alliance-enabled web sites,
through trusted parties known as “identity providers”. Authenticating
to an identity provider provides you a single-sign-on environment between
multiple sites. Liberty does not currently address sharing attributes
of an identity between sites. This functionality is currently slated to appear
in a future release of their specifications. However, what is well-described
are methods in which affiliations between identities at multiple locations
can be created, managed, and potentially broken, under the control of the
person in question.
WS-Security (Web-Services Security) is a specification developed primarily
by Microsoft and IBM that has now been turned over to OASIS for finalization.
Its focus is to specify a standard method of transport for security credentials
in a web services environment. Currently, the encapsulation and transport
of Kerberos and SAML credentials are covered. While it does not provide
methods for establishing trust in a cross-realm environment, it is emerging as a
method of transport for higher-level authentication (authN) and authorization (authZ) assertions.
The term knowledge, as used in the X.501 documents, refers to “DSA
[Directory Service Agent] operational information held by a DSA that it uses
to locate remote entry or entry-copy information” [IT01]. A knowledge
reference is “Knowledge which associates, either directly or indirectly,
a DIT [Directory Information Tree] entry or entry-copy with the DSA in which
it is located” [IT01].
In summary, a knowledge reference is a representation of where a piece
of knowledge may be found. Knowledge references make up part of, and
rely on, a complete X.500 environment to function, as discussed in the specification.
However, the terminology used with regard to the types of references discussed
in Section 10 of ITU-T Recommendation X.501 and the model on which
they are based may suggest new approaches to a standards-based inter-domain
data exchange model. This is an area of active investigation by the
MACE-Dir Working Group.
Many of the requirements derived from the scenarios above require that certain
other information regarding an attribute, or collection of attributes, be
made available to a data consumer, or the processes managing a database itself.
The information that may need to be represented includes, but is not limited
to:
Digital signatures, to assert authenticity
Access control policies
Expiration dating
Timestamp of when the data was last updated
Pointer to the authoritative source of the data
Currently, there is no one method for representing all of this data on an
attribute-level basis. An attributeIntegrityInfo attribute exists
in the X.500 specification [IT01]; however, its use is minimal, if not non-existent.
It describes a method of storing a digital signature of an attribute, a collection
of attributes, or all of the attributes that make up an entry. As there
is very little if any use of this, and no support in current client software,
this may not be a realistic alternative. The syntax of the attribute
is described as an ASN.1-encoded string. Current tastes tend to favor encoding
methods such as XML.
The Stitched Directories proposal (Appendix B) begins to tackle this
problem by proposing the use of digital signatures of attributes, such as those
specified by attributeIntegrityInfo, to augment the information available
to an end-user application or consumer directory, giving it a bit more
information with which to decide what to do with the data it is accessing.
However, in discussions it became clear that digital signatures were not the
only augmentation that was needed.
An XML string, or multiple strings, could be used to encapsulate all relevant
metadata related to an attribute in an extensible and easily parsed format.
In an LDAP environment, this data could be stored in a single attribute and,
given its text-only format, could easily be used in non-LDAP environments
as well.
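To make the XML idea concrete, the sketch below builds and parses one such metadata string. The element and attribute names (`attributeMetadata`, `source`, `updated`, `expires`, `signature`) are assumptions chosen for illustration; no schema is defined in this paper.

```python
import xml.etree.ElementTree as ET


def build_attribute_metadata(attribute, source, updated, expires, signature):
    """Serialize metadata about one attribute into a single XML string."""
    # Element names are hypothetical; they mirror the metadata items listed above.
    root = ET.Element("attributeMetadata", {"attribute": attribute})
    ET.SubElement(root, "source").text = source
    ET.SubElement(root, "updated").text = updated
    ET.SubElement(root, "expires").text = expires
    ET.SubElement(root, "signature").text = signature
    return ET.tostring(root, encoding="unicode")


def parse_attribute_metadata(xml_text):
    """Recover the metadata fields from the XML string as a dictionary."""
    root = ET.fromstring(xml_text)
    fields = {child.tag: child.text for child in root}
    fields["attribute"] = root.get("attribute")
    return fields


xml_text = build_attribute_metadata(
    attribute="telephoneNumber",
    source="ldap://ldap.foo.edu",          # pointer to the authoritative source
    updated="2002-10-01T12:00:00Z",        # timestamp of the last update
    expires="2003-10-01T12:00:00Z",        # expiration date
    signature="PLACEHOLDER-SIGNATURE",     # a real signature value would go here
)
fields = parse_attribute_metadata(xml_text)
```

Because the result is plain text, the same string could be stored as the value of a single LDAP attribute or passed along unchanged to a non-LDAP consumer.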
When programmatic location of LDAP services is required, in situations like
those described in Section 4.6, DNS service records can be used. The
IETF LDAPExt working group is currently working on the specifics of this procedure
(the most recent Internet draft regarding this topic
is available from the IETF). These techniques can be used to find a
server containing an LDAP DN, provided that domain component (DC) naming has
been used. Using DC-naming is recommended in A Recipe for Configuring
and Operating LDAP Directories (http://www.georgetown.edu/giia/internet2/ldap-recipe/),
and is currently considered best practice.
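The reason DC-naming matters can be shown with a short sketch that derives the DNS name under which an LDAP service record would be looked up. The function name `dn_to_srv_name` is hypothetical, the comma splitting is deliberately naive (it ignores escaped commas inside values), and the DNS query itself is left out because it would need a resolver library.

```python
def dn_to_srv_name(dn: str) -> str:
    """Derive the DNS SRV owner name for the LDAP service of a DC-named DN."""
    labels = []
    for rdn in dn.split(","):  # naive split: escaped commas are not handled
        attr, _, value = rdn.strip().partition("=")
        if attr.strip().lower() == "dc":
            labels.append(value.strip())
    if not labels:
        raise ValueError("DN does not use domain component (DC) naming")
    # SRV records for LDAP over TCP are published under _ldap._tcp.<domain>
    return "_ldap._tcp." + ".".join(labels)


srv_name = dn_to_srv_name("uid=jsmith,ou=People,dc=foo,dc=edu")
```

A DN without DC components, such as one rooted at `o=Foo University,c=US`, offers no such mapping, which is why the function raises an error in that case.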
Yet another approach to many of these exchange problems would be to implement
and deploy another piece of architectural software to handle many of the requirements
that are common to these applications. We explore this potential solution
in Appendix A, presenting the functionality for such a software application,
which provides a place to house data translation and policy enforcement,
allowing the “data router” to handle the transport issues and other
maintenance tasks.
As described in section 4.1, identifiers that can be conveniently used in
an identity mapping process are also, by most data stewards, considered personal
and confidential. This data may not be something that an organization, by
policy or common sense, would want to make available to a third party, even
in pursuit of such noble goals as identity mapping. What is being proposed
is a method to use cryptographic hashes to assist in the mapping process,
so that the parties involved do not produce, in the clear, any of the identifying
data that is of concern.
Consider this as an example. Databases A and B each contain an object
that refers to a single person, and the objects, for better or worse, both
contain the person’s Social Security Number. Let us also say that a
strong, one-way hash function exists, hash(), that returns a string
that is unique, within reason, to the input value.[5] A two-way communication channel between
databases A and B also exists.
So, what happens? Here’s a step by step process where the object affiliation
between two repositories is established, without exposing the key. “SSN”
represents the value of the key.
| Step 1 | A | “Hey B, I’ve got a new object, and it has value ssn = hash(SSN).” |
| Step 2 | B | B checks its objects, and sees that it has one whose hash(SSN) is equal to the data just provided.[6] “A, I’ve got that one, here’s my response: SSN.” |
| Step 3 | A and B | If the value provided from B is correct, then both exchange the handles that they will use to affiliate these two objects (or whatever they decide to do). |
This provides for a two-way proof that they both have objects that should
be affiliated, and would allow for the enforcement of policies such as “A
will only share data about objects that it knows B already has”. However,
if the desired effect is simply to create a new object in B with the data
provided from A, even if B does not already have a matching object, the
process would stop at the second step with a request to exchange affiliation
handles with A, and A could rest assured that its protected information had
not been exposed.
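The exchange above can be sketched in a few lines. This is an illustration under stated assumptions, not the protocol as specified: hash() is taken to be SHA-256, and B's response is assumed to be a keyed hash (HMAC) over a fresh nonce chosen by A, so that B proves possession of the key without sending it. The `Repository` class, its method names, and the sample key values are all hypothetical.

```python
import hashlib
import hmac
import secrets


def key_hash(value: str) -> str:
    """hash(): a one-way hash of the sensitive key (SHA-256 assumed here)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def respond(value: str, nonce: bytes) -> str:
    """Assumed response form: HMAC over A's nonce, keyed with the sensitive key."""
    return hmac.new(value.encode("utf-8"), nonce, hashlib.sha256).hexdigest()


class Repository:
    def __init__(self, name, objects):
        # objects maps a local handle to the sensitive key (e.g. an SSN)
        self.name = name
        self.objects = dict(objects)
        self.by_hash = {key_hash(v): h for h, v in self.objects.items()}

    def announce(self, handle):
        """Step 1 (A): send the hash of the key plus a fresh nonce."""
        return key_hash(self.objects[handle]), secrets.token_bytes(16)

    def answer(self, hashed_key, nonce):
        """Step 2 (B): look up the hash and prove possession of the key."""
        handle = self.by_hash.get(hashed_key)
        if handle is None:
            return None
        return handle, respond(self.objects[handle], nonce)

    def verify(self, handle, nonce, response):
        """Step 3 (A): check B's proof before exchanging handles."""
        expected = respond(self.objects[handle], nonce)
        return hmac.compare_digest(expected, response)


# Obviously fake key values, used only for the demonstration.
a = Repository("A", {"a-001": "000-00-0001"})
b = Repository("B", {"b-417": "000-00-0001", "b-418": "000-00-0002"})

hashed, nonce = a.announce("a-001")
b_handle, proof = b.answer(hashed, nonce)
affiliated = a.verify("a-001", nonce, proof)
```

If `affiliated` is true, A and B can record each other's handles (`a-001` and `b-417`) as the affiliation; at no point does the key itself cross the channel.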
As we know from our experiences even within single administrative domains,
matching on one key, such as an SSN, can be problematic.[7] So, when extending our models to span
multiple administrative domains, we must create an environment where more
complex mapping functions can be supported. The model above can be extended
to support multiple challenge-response requests when attempting to map a single
object. The values themselves can also be massaged via algorithms such
as Soundex to make up for typical mistyping and misspelling errors that may
have occurred at one end of the potential mapping or the other. The
data source, or the system managing it, should keep the hashes of the values
used in this process pre-calculated and stored in an indexed manner so as to
make the mapping processes efficient.
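As a rough sketch of the last two points, the code below normalizes surnames with the common American Soundex rules and keeps a pre-calculated index from the hash of each normalized value to the local object handles. The use of SHA-256, the index layout, and the names `soundex`, `hashed`, and `build_index` are assumptions made for illustration.

```python
import hashlib

# Soundex digit for each coded consonant; vowels, h, w and y carry no code.
_CODES = {
    ch: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for ch in letters
}


def soundex(name: str) -> str:
    """Return the four-character American Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


def hashed(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_index(surnames_by_handle):
    """Pre-calculate hash(soundex(surname)) -> handles for fast lookups."""
    index = {}
    for handle, surname in surnames_by_handle.items():
        index.setdefault(hashed(soundex(surname)), []).append(handle)
    return index


index = build_index({"b-417": "Smith", "b-418": "Jones"})
# A remote party that spelled the name "Smyth" still finds the same object.
candidates = index.get(hashed(soundex("Smyth")), [])
```

A fuzzy match like this widens the set of candidates, so it would normally be one of several challenge-response rounds rather than the only one.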
In addition to the documents listed below in
the References section, the authors also recommend:
A Recipe for Configuring and Operating LDAP Directories (http://www.georgetown.edu/giia/internet2/ldap-recipe/)
Shibboleth (http://middleware.internet2.edu/shibboleth/)
A Brief Guide to OIDs (Object Identifiers) (http://middleware.internet2.edu/docs/a-brief-guide-to-OIDs.doc)
Internet2 Middleware Website (http://middleware.internet2.edu/)
NMI-Edit Website (http://www.nmi-edit.org/)
NSF Middleware Initiative (http://www.nsf-middleware.org)
Sections of this document have been authored, in whole or in part, by Robert
Banz, Tom Barton, Keith Hazelton and Michael Gettes. We would like to
thank the membership of the MACE-Dir group for their valuable contributions
of ideas and discussions during the drafting of this document.
[TB02] Tom Barton, ed. Metadirectory Practices for Enterprise Directories
in Higher Education, Internet2, October 2002
[DB02] Blum, Dan and Kobielus, James. Toward Federated Identity
Management, The Burton Group, 23 Aug 2002
[IT01] ITU-T Recommendation X.501, ITU, Feb 2001
Robert Banz (Editor)
University of Maryland, Baltimore County
Email: Robert.Banz@umbc.edu
A The Data Router
Building and managing multiple data exchange systems can be difficult, especially
when each varies in the data translations, policies,
and techniques necessary to manage the exchange so as to meet its individual
requirements. An alternative solution is to introduce a new piece of
infrastructure to handle the various exchange techniques, and provide a framework
in which to manage various translation and policy decisions. The data
router occupies a position in a data distribution network similar to that
of an internetwork router, providing
format translation, access control and transport services to specific data
elements. Just as a router sits in front of a data network, routing
traffic in and out of your institution’s internal network, the data router
sits in front of a data source, providing much the same functionality.
An assumption made throughout this discussion is that we may be dealing with
multiple types of data sources, from Oracle databases to LDAP directories,
each with different access methods.

The simplest view is a one-to-one
correspondence between data routers and databases. However, in some
instances, it may be more efficient to place one data router in front of multiple
databases, or potentially in front of none and simply have it talk to other
data routers. While all of these are possibilities, for the sake of
keeping the discussion sane, we will assume a one-to-one relationship between
a data router and a database (whatever they are).
In order to get a better grasp on how it is envisioned that a data router
be used, we’ll consider using it in a model that is better understood.
In this instance, Human Resources and Student Administration data are being
merged into a “Person Registry”, which is then published in the enterprise
LDAP directory. In many implementations, the data feeds going to the
“metadirectory” system are nightly text dumps of the databases, which are
then processed by a bunch of scripts populating the registry and LDAP directory.

The diagram below shows this common architecture implemented with data
routers.

Figure 1: Architecture with Data Routers
As you can see, the central metadirectory has disappeared, replaced
by multiple pieces of architecture that together implement the
functionality for which the central metadirectory was responsible. The
first reaction may be that this is a bad idea, replacing one thing with four.
However, there are other benefits to the architecture that will hopefully
become clear.
Using the data routers, we could very well implement a logical architecture
that is identical to the one displayed in Figure 2. Our data routers
at the Human Resources and Student Administration systems would have a scheduled
event, creating a stream of information to the data router placed
in front of our Person Registry. That data router would, in turn, send
data to the Enterprise LDAP directory to update the data there as necessary.
The Person Registry’s data router has three policies it is working on, two
that receive data from the data routers in front of the legacy systems and
a third governing its conversations with the Enterprise Directory’s data router.
Corresponding policies exist on the data routers at the other end of these
conversations.
While having this many points of control in a tight environment such as the
one above may be more work (which we try to avoid), it shows the benefits
of this architecture as it would apply to a more distributed data sharing
environment. One party decides how they are going to share their data
and represent it while the other decides how they are going to use it.
The tough work (how to get it from point A to point B) should be considered
taken care of by SOAP and its chosen transport.
In order to get this data routing architecture off the ground,
there are at least two assumptions we have to make about our environment.
The first is that we have an “Object-Relational” view of our world, and the
second is that there is some transport available to which we can bind our SOAP methods
and that can offer security and mutual authentication.
As we have been doing our work in the directory world, we have a view of
life that looks at things as “objects”. An object represents something
that we can clearly define: a person, an account, the CS101 class taught in
fall 2002. Each of these objects has attributes that further describe
it. A person may be described as a student, and as taking the CS101 class.
The account may be described as having a username, a numeric UID, and an owner
such as the person just referred to. The CS101 class has an instructor,
a course description and a list of students, one of which refers to the person
above. However, our back end data stores are typical RDBMSs (Relational
Database Management Systems), which do not quite have this same view of the
world. One of the primary purposes of our current metadirectory processes
is to join data from the many tables and sources that are at our disposal
and build an object view of this world.
Where does this fit in with the data router? The key element that makes the
data router useful is the set of policy and data translation rules
it is given to process. Those policy rules stem from out-of-band discussions
between the individuals and organizations to whom the data belongs and by whom
it will be used, defining what data in their back-end systems defines an
object, and how that data may be mapped into attributes for that object.
Then there is the black hole of defining syntax for attributes, which is
another can of worms altogether.[8]
The end result is that the way data is represented at the interface of a data
router should be independent of the underlying structure. That is, the data router should
respond to the question “what is the value of ‘classlist’ for object ‘X’?”
by doing the appropriate work on the back end and returning the appropriate
value in the agreed-upon syntax. This could be a 5-table join in the
back-end database, but it should be a simple question and answer to the user of
the data router.
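The separation between the simple question and the back-end work can be illustrated with a small in-memory database. The table layout, the handle format `CS101-fall2002`, and the names `ATTRIBUTE_QUERIES` and `get_attribute` are assumptions made for the example; a real deployment would map attributes onto whatever schema the back-end system already has.

```python
import sqlite3

# A stand-in relational back end with a deliberately simple schema.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE course (course_id INTEGER PRIMARY KEY, code TEXT, term TEXT);
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE enrollment (course_id INTEGER, person_id INTEGER);
    INSERT INTO course VALUES (1, 'CS101', 'fall2002');
    INSERT INTO person VALUES (10, 'Alice Example');
    INSERT INTO person VALUES (11, 'Bob Example');
    INSERT INTO enrollment VALUES (1, 10);
    INSERT INTO enrollment VALUES (1, 11);
""")

# Translation rules: each exposed attribute maps to the back-end query
# that produces its values. The caller never sees the tables or the join.
ATTRIBUTE_QUERIES = {
    "classlist": """
        SELECT p.name
        FROM person p
        JOIN enrollment e ON e.person_id = p.person_id
        JOIN course c ON c.course_id = e.course_id
        WHERE c.code || '-' || c.term = ?
        ORDER BY p.name
    """,
}


def get_attribute(object_handle: str, attribute: str):
    """Answer 'what is the value of <attribute> for object <handle>?'"""
    query = ATTRIBUTE_QUERIES[attribute]
    return [row[0] for row in db.execute(query, (object_handle,))]


classlist = get_attribute("CS101-fall2002", "classlist")
```

If the back-end schema changes, only the query in `ATTRIBUTE_QUERIES` changes; consumers of the `classlist` attribute are unaffected.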
We’ll assume that the transport layer for our SOAP requests can provide strong
mutual authentication between two data routers. This is because there
is a seemingly obvious requirement that the providing data router be able
to verify the identity of the requestor (where the data is going), and that
the recipient be able to verify where the data is coming from. Such a transport could
be SOAP over HTTPS, with client and server side certificates signed by a
trusted authority. In a message queuing architecture, it may be necessary
to both sign the data with the source’s private key and encrypt it with the
recipient’s public key. For now, however, we’ll assume that such transports are
available, and that we can rely upon them to secure transport and give assurances
to either end of the conversation about who or what they have been talking
to.
The data router is responsible for executing code that implements policy
agreements made by producers and consumers of the data. It is responsible
for communicating with the underlying data storage mechanism(s), and translating
the information to and from objects, the contents of which are defined by
the policy agreements. It may also map these objects to “object handles”
which may be stored in a data store that is internal to the data router, or,
in some instances, it may be possible to store these handles in the existing
data store. The data store should support the reading, writing, modification,
and deletion of attributes or entire objects, as policy dictates. The data router shall
expose its functionality to the outside world via SOAP methods, which may
be bound to a variety of transports. A subset of these methods is
responsible for managing the data transfer to and from the data router.
Other methods may deal with the establishment of “object affiliations”: identity
mappings between objects stored in different data stores which may correspond
to the same thing (a person, a class, a network node, etc.). There should
also be provisions made for the data router to do rudimentary data
maintenance tasks, such as periodically checking its providers for updates
to data that it had previously been given.[9]
The data transfer methods should support those that may be required by a
variety of architectures, conditions, and policies. Particularly, it
should support both consumer driven (pull), and producer driven (push) environments.
Support should exist for batch feeds, change based, and incremental updates.[10]
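The difference between a batch feed and an incremental update can be sketched as follows. This is a minimal illustration assuming each side holds a snapshot that maps object handles to attribute dictionaries; the operation names (`add`, `modify`, `delete`) and the functions `changes_since` and `apply_changes` are hypothetical.

```python
def changes_since(previous, current):
    """Compute an incremental update between two snapshots (handle -> attributes)."""
    changes = []
    for handle, attrs in current.items():
        if handle not in previous:
            changes.append(("add", handle, dict(attrs)))
        elif attrs != previous[handle]:
            changes.append(("modify", handle, dict(attrs)))
    for handle in previous:
        if handle not in current:
            changes.append(("delete", handle, None))
    return changes


def apply_changes(store, changes):
    """Apply an incremental update to the consumer's copy of the data."""
    for operation, handle, attrs in changes:
        if operation == "delete":
            store.pop(handle, None)
        else:
            store[handle] = dict(attrs)
    return store


producer_data = {"p1": {"mail": "a@foo.edu"}, "p2": {"mail": "b@foo.edu"}}

# Batch feed: the consumer receives a full copy of the producer's data.
consumer_copy = {handle: dict(attrs) for handle, attrs in producer_data.items()}

# The producer's data then changes.
producer_data["p1"]["mail"] = "alice@foo.edu"
del producer_data["p2"]
producer_data["p3"] = {"mail": "c@foo.edu"}

# Incremental update: only the differences travel between the data routers.
changes = changes_since(consumer_copy, producer_data)
apply_changes(consumer_copy, changes)
```

Whether the consumer requests `changes` (pull) or the producer sends them unprompted (push) is a policy decision; the same two functions serve both cases.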
Given a flexible policy engine, the data router could do almost anything
we wished. One of the duties we may wish it to perform is to handle metadata
that may be associated with attributes. In sections 4.2 and 5.5, the
concepts of “attribute signatures” and other attribute-specific metadata are
introduced. The data router is in a unique position to both manage and
utilize much of this information. One such example involves using digital
signatures to verify the authenticity of a piece of information once it
has traveled from its original domain. The benefit is that, as data leaves
foo.edu’s data store, it is signed with foo.edu’s signing key.
When the data ends up in bar.gov’s directory of academics, the user
of that data can verify that the phone number listed for Professor Smith is,
in fact, what foo.edu thinks it is by checking the digital signature
against the data provided.
Some other examples of metadata, which would be appropriate for the data
router to manage, are:
· Update Timestamps
· Expiration Timestamps
· Access Control Recommendations
An attribute could be enriched with these data points on its way into or out
of a data router, influenced by the policy agreement, of course. The
data router could also use some of this attribute “metadata” to influence
its own decision-making. Receiving an attribute with an expiration time
that has already passed is one simple example. The digital signature
itself could also be used to influence policy. Consider an example in which
blatz.edu rejects an attribute change that has percolated through a
web of data routers. One hypothetical reason for this could be that
blatz.edu is not satisfied with foo.edu’s data update policies
on phone numbers.
Of course, the opposite could have been true,
and blatz.edu may trust data coming from foo.edu so much that
it will then push the data out to its core business systems. This can
be done safely, as it knows that bar.edu has not altered the attribute
and it can verify its source.
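A policy decision of this kind might be sketched as below. The metadata field names (`expires`, `signature_valid`, `source`) and the `TRUSTED_SOURCES` table are hypothetical; the signature check itself is assumed to have been done elsewhere and is represented only by a boolean.

```python
from datetime import datetime, timezone

# Hypothetical local policy at blatz.edu: which sources it trusts per attribute.
TRUSTED_SOURCES = {"telephoneNumber": {"foo.edu"}}


def accept_update(attribute, metadata, now=None):
    """Decide whether to accept an incoming attribute based on its metadata."""
    now = now or datetime.now(timezone.utc)
    if metadata["expires"] <= now:
        return False, "expired"
    if not metadata["signature_valid"]:
        return False, "signature did not verify"
    if metadata["source"] not in TRUSTED_SOURCES.get(attribute, set()):
        return False, "source not trusted for this attribute"
    return True, "accepted"


check_time = datetime(2002, 10, 1, tzinfo=timezone.utc)
incoming = {
    "source": "foo.edu",
    "expires": datetime(2003, 10, 1, tzinfo=timezone.utc),
    "signature_valid": True,
}
decision = accept_update("telephoneNumber", incoming, now=check_time)
```

Removing foo.edu from `TRUSTED_SOURCES` would reproduce the rejection case, while leaving it in place corresponds to the case where the data is trusted enough to be pushed onward.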
Michael Gettes, author
This document proposes a way to maintain attributes and values in one LDAP
directory that reflect their authoritative value as found in a second LDAP
directory. The proposal offers ways to maintain currency, authenticity and
integrity of the values of the attribute as well as representing the authoritative
directory home for each "stitched" attribute.
Problem: Affiliated Directories, Application Specific Directories, Associated
Directories -- they all go by many names but have similar properties. One
directory wants to be either linked to, or have some of the same data as another
directory. Some don't understand this need and vehemently disagree with the
concept. So, by way of a simple example, let's try to demonstrate the utility
of a solution to this problem.
Imagine that Peter Alterman is a member of the US Government, which he is,
and he has an entry in the http://directory.gov white pages service. Imagine
that he is also on the faculty at Georgetown University, which he is NOT,
and he is also to be found in the Georgetown white pages. Let's also assume
that both white pages services are based on LDAP. Would it not be useful to
be able to search for Alterman at Georgetown University and, in displaying
his entry, have the directory link to the government entry for Peter,
display that info, maybe in summary form, and then provide an integrity check
that can comfortably state that the data in both directories are, in fact,
directly related? Then take things a step further and apply this to locating
video capabilities for Peter Alterman whereby the video capabilities themselves
are stored in another directory maintained by ViDe. Is this not useful?
So, assuming this is a useful ability, and given the simple use case described,
here are some initial thoughts on how to stitch directories together.
Definitions: dA and dB are 2 directories. dA(y) references a specific object
in dA.
This discussion will NOT attempt to resolve the following issues:
(1) policy issues of what data elements are made available.
(2) how data is made available from dA to dB and vice-versa. This means that
issues of replication and synchronization are not considered here.
dA contains a special object which describes dB as having some association
with dA. This entry might be called an SDL, Stitch Directory Link. An SDL
object dA(SDL=dB) would contain an X.509 certificate (or something similar)
that belongs to dB. There would also be a list of the attributes that dB is
supplying to dA (these could be any legitimate attribute available in a legitimate
objectclass known by dA, which will be present as an "objectclass"
attribute in dA(x)), like CN, SN, eduPersonPrincipalName (EPPN), etc.
Let's assume the EPPN will be made available from dB(mrg) to dA(gettes).
This would be written as dB(mrg)EPPN => dA(gettes)EPPN. For dB(mrg)EPPN
to be made available, dB computes dB(mrg)->SIGN(EPPN+VALUE(EPPN)), where SIGN
digitally signs this material using the dB private key, and the dB[Certificate]
is available in dA(dB)Certificate. The attributes and the signed material
are sent to dA and stored in dA(gettes). The attributes are transmitted as
is. The signed material is sent in a special, possibly multi-valued attribute
called SDLstuff (bad name). SDLstuff is made up of the following:
Signature blob, (list of attributes involved in the signature, if present),
DN(dA(SDL=dB))
From the above, one could determine which attributes are involved in the
signature, the signature and which distinguished name is used to link to the
SDL of dB in dA. Then one could verify that the attributes in the list, if
present, were originally sent by, or contain the same data as originally sent
by dB. There should probably be some concept of time either in the SDLstuff
attribute or in the dA(SDL=dB) entry.
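The shape of SDLstuff and the verification step can be sketched as follows. Two loud assumptions apply: the way attribute names and values are concatenated into the signed material is invented for this example, and an HMAC with a shared demonstration key is used as a stand-in for the public-key signature and X.509 certificate the proposal describes, since the Python standard library has no public-key signing. The names `SDLStuff`, `signed_material`, `stitch`, and `verify` are hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SDLStuff:
    signature: str          # the signature blob
    attributes: List[str]   # attributes covered by the signature
    sdl_dn: str             # DN of the SDL entry for dB inside dA


def signed_material(entry: Dict[str, List[str]], attributes: List[str]) -> bytes:
    """Concatenate attribute names and values in a fixed order (assumed format)."""
    parts = []
    for name in sorted(attributes):
        for value in sorted(entry[name]):
            parts.append(f"{name}={value}")
    return "\n".join(parts).encode("utf-8")


def sign(material: bytes, key: bytes) -> str:
    # Stand-in only: a real deployment would sign with dB's private key.
    return hmac.new(key, material, hashlib.sha256).hexdigest()


def stitch(entry, attributes, sdl_dn, key) -> SDLStuff:
    """Performed by dB: sign the listed attributes and package the result."""
    return SDLStuff(sign(signed_material(entry, attributes), key),
                    list(attributes), sdl_dn)


def verify(entry, stuff: SDLStuff, key) -> bool:
    """Performed by a reader of dA: check the stored values against the signature."""
    expected = sign(signed_material(entry, stuff.attributes), key)
    return hmac.compare_digest(expected, stuff.signature)


demo_key = b"demonstration key, not a real credential"
db_entry = {"eduPersonPrincipalName": ["mrg@example.edu"]}
stuff = stitch(db_entry, ["eduPersonPrincipalName"],
               "SDL=dB,dc=example,dc=edu", demo_key)

# dA stores the attribute values as transmitted, alongside SDLstuff.
da_entry = {"eduPersonPrincipalName": ["mrg@example.edu"], "cn": ["Gettes"]}
still_matches = verify(da_entry, stuff, demo_key)
```

Adding a timestamp to the signed material would be one way to carry the concept of time mentioned above; where that timestamp should live is left open, as in the text.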
[1]
Sadly, these are sometimes the same.