Thoughts and Status of the Directory of Directories for Higher Education
11-23-2000
Michael R. Gettes, Georgetown University
Project to date
By no means did this project start on this date. In February,
2000, related to work happening at Georgetown regarding the sale of the
Georgetown University Hospital, it was believed that we would need the
ability to distribute clients to the University community configured for
the Enterprise Directory, but we did not wish to have to reconfigure all
these clients, a major support issue, should the buyers of the hospital
wish to deploy their own LDAP enterprise directory and want to integrate
it with ours. Some investigation demonstrated that "smart referrals"
in the Netscape Directory Server (v. 4.1) would achieve the goal: the
University community for whitepages searches (not just from a web page,
but from email clients as well) could be defined at the server
while the client configurations remained the same. Since all this
is based on referrals, the clients become very sensitive to the reliability
of all the directories referenced by the enterprise directory.
In essence, all the "community" directories, in this case, the hospital,
would become as critical as the enterprise directory itself since the client
would be referred to each of the community directories. The clients,
web servers and email clients, all tend to search the referrals serially
based on the response from the initial directory query. Should any
one directory be ill, that would appear to affect the overall performance
of the service. This is not optimal. So, work proceeded to
see how hard it would be to develop a web page employing a different search
technique which might influence future development of the email clients.
All the above work was undertaken by Michael R Gettes, Lead Application
Systems Programmer at Georgetown University from February to April of 2000.
The MACE group was kept informed about all this work and its developments.
A Perl script was developed to potentially act as a search mechanism
for a web service. Perl was chosen because I was lazy and didn't
have the time to write the string manipulation facilities
that Perl already has, so Perl seemed the right tool for a
proof of concept. The PerLDAP module was employed to link the Perl
language with the LDAP manipulation necessary. Most of the standard
PerLDAP interfaces are based on synchronous calls to the directory.
The Perl script awaits a reply (if there is to be one) and is blocked.
It was deemed necessary to use the asynchronous LDAP API calls that are
exposed by PerLDAP (these are the same API calls used by the OpenLDAP and
Netscape SDK distributions). By making the async calls, we can now
perform parallel searches. It was realized that the LDAP libraries
were really designed for making multiple async requests to the same directory
server and not so much to many different LDAP servers. While there
exists some code in the guts of the LDAP libraries to handle overlapping
requests of multiple servers, this code has clearly not been well exercised
or, maybe, even tested, since this feature just doesn't work. On the
other hand, overlapping requests to the same directory server are a heavily
used technique that many vendors employ to great success. A small amount
of effort was invested to see if the LDAP libraries could be fixed, and
after a few days I decided to work around this level of the problem.
To avoid handling multiple threads in Perl (remember, this was just supposed
to be a proof of concept), busy waits were used around the async LDAP calls
and the necessary work was performed to initiate N LDAP connections for
N LDAP servers. I then put a request out to the Common
Solutions Group membership for the names of the institutional LDAP
servers and their respective search roots. I got responses from 9
schools. One of the first things I noticed was the variation in
the search roots. As we discussed in MACE, there was a desire to
get people to use standardized Distinguished Names and there was no operational
standard at this point. DomainComponent naming was the preferred
naming scheme by MACE and we realized we needed to get the word out.
Additionally, as I began some individual searches of the 9 schools, it
was realized there was also no standard regarding the use of the "standard"
LDAP schema in the person, organizationalPerson and inetOrgPerson
objectclasses. A couple of schools simply selected the attributes
they liked (like CN, SN, MAIL and so on) and created a new local objectclass.
This made it harder to understand the intended use of the attributes and
we realized, again, that we needed to get the word out. University
of Colorado, Boulder was reported to have said "Well, just tell us how
to configure the directory and we will do it". Based on that, Ken
Klingenstein suggested that I write a kind of cookbook regarding the directory
deployment at Georgetown and Princeton that I had completed. So,
like a fool, I said "yes". The LDAP-Recipe
is still an active document and attempts to stimulate discussion, ideas
and methodologies for configuring and operating LDAP directories.
While this recipe is intended for use by academia, it could also be reasonably
employed by any corporate enterprise service deployment.
Initial tests had shown that, by multiplying the 9 schools sufficiently
and using that as a testbed, we could search a few hundred schools and
get back several thousand responses in under 30 seconds. As I began
to view this more and more as an interesting challenge, I began to learn
how to handle threading in Perl and I was able to increase the performance.
All this work was done on my personal workstation, a Sun Ultra-10 (single
processor). I had also changed from using referrals from the directory
to simply performing one initial search on the directory to get back the
list of schools to search. Should any of those schools return
referrals, the LDAP libraries would automatically chase them down.
Without this change, parallel searches on the initial set would not be
possible since the LDAP libs would chase the referrals and my code would
never regain control until the referrals were processed sequentially.
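
The approach described above (one initial lookup for the list of schools, then one search per school run in parallel) can be sketched roughly as follows. This is only an illustration in Python, not the Perl and PerLDAP code used in the project: the hostnames, search roots, entries and the functions `list_schools`, `search_school` and `parallel_search` are all invented stand-ins, and no real LDAP calls are made.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Invented stand-in data: host -> (search root, simulated delay in seconds, entries).
FAKE_DIRECTORIES = {
    "ldap.school-a.example": ("dc=school-a,dc=example", 0.0, ["Pat Smith", "Lee Jones"]),
    "ldap.school-b.example": ("dc=school-b,dc=example", 0.0, ["Pat Smithson"]),
    # An "ill" directory that answers too slowly.
    "ldap.school-c.example": ("dc=school-c,dc=example", 1.0, ["Pat Smith"]),
}


def list_schools():
    """Stand-in for the single initial search that returns the schools to query."""
    return [(host, root) for host, (root, _, _) in FAKE_DIRECTORIES.items()]


def search_school(host, root, name):
    """Stand-in for one LDAP search; a real version would query `host` under `root`."""
    _, delay, entries = FAKE_DIRECTORIES[host]
    time.sleep(delay)
    return [f"cn={e},{root}" for e in entries if name.lower() in e.lower()]


def parallel_search(name, timeout=0.3):
    """Search every school at once and give up on any that miss the deadline."""
    schools = list_schools()
    pool = ThreadPoolExecutor(max_workers=len(schools))
    futures = {pool.submit(search_school, host, root, name): host
               for host, root in schools}
    done, not_done = wait(futures, timeout=timeout)
    pool.shutdown(wait=False)  # do not block on the slow directories
    hits = sorted(entry for f in done for entry in f.result())
    slow = sorted(futures[f] for f in not_done)
    return hits, slow


if __name__ == "__main__":
    hits, slow = parallel_search("smith")
    print(hits)
    print("no answer in time:", slow)
```

Because each school is searched on its own worker, one slow directory delays only its own results; a serial walk through the same list would stall everyone behind it.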
This work was presented at the Spring 2000 Internet2 Members meeting during
the Middleware 201 workshop. At the Spring meeting I was able to
speak with Mark Smith of iPlanet (Netscape Servers). Mark is one
of the original LDAP developers from the University of Michigan along with
Tim Howes and crew. I asked Mark about making some small modifications
to the Netscape Directory Server Gateway, which is just a web interface
to the directory that comes with the Netscape DS product. The unmodified
DSGW web interface is in active use at Georgetown for both the whitepages
service and for handling priv'd access and modifications. The changes
we discussed would be little tweaks that would allow DSGW to call an external
program to handle searching. Mark agreed to do this work and it
was implemented a month or so later. This allowed me to widen the
audience by presenting this as a service that others could see and not
just some output from a unix program. As more people saw this prototype
service, some would get really excited at the prospect and others would
get "freaky". I believe the "freakiness" came from the X.500 deployments
back in the early 1990s when X.500 was trying to achieve the same goal
as the DoDHE. But, back then, computers were far slower,
networks were far slower and X.500 was considered a bit of a pig process.
What the prototype seemed to show was that the world has significantly changed
and that revisiting the work from several years ago with a lighter
protocol, LDAP, is a reasonable investigation. Some also believed
that performing parallel searches against institutional directories is
a waste of resources at the institutional level. Why should there
be a search against University X for someone who may not be at University
X? Bob Morgan, University of Washington, and Paul Hill, MIT, believed
that a central deposit was a better way to go. After quite a bit
of discussion, the current plan is to do both: handle parallel searches
against institutional directories and a central deposit, and let each school
decide how it wants to handle its data and how it should be searched.
I believe that parallel searches of multiple central deposits will prove
necessary as well as a link to other communities, like the international
sector and other communities of interest to higher education.
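
One way to picture the "do both" plan is sketched below, under loudly stated assumptions: the school names, the `SCHOOL_MODE` table and every function here are hypothetical, and plain dictionaries stand in for both the central deposit and the institutional directories. A school that chose the deposit is answered from the central copy; a school that chose live searching is queried directly.

```python
# Invented example data. "deposit" schools push a copy of their entries to the
# central deposit; "live" schools are searched directly at query time.
SCHOOL_MODE = {"school-a": "deposit", "school-b": "live", "school-c": "deposit"}

CENTRAL_DEPOSIT = {
    "school-a": ["Pat Smith"],
    "school-c": ["Sam Smithers"],
}

LIVE_DIRECTORIES = {
    "school-b": ["Alex Smith", "Lee Jones"],
}


def matches(entries, name):
    """Case-insensitive substring match, standing in for a real search filter."""
    return [e for e in entries if name.lower() in e.lower()]


def search_deposit(name):
    """One lookup in the central deposit covers every school that chose it."""
    return {school: matches(entries, name)
            for school, entries in CENTRAL_DEPOSIT.items()
            if SCHOOL_MODE.get(school) == "deposit"}


def search_live(name):
    """Stand-in for the per-school searches, which a real service would run in parallel."""
    return {school: matches(entries, name)
            for school, entries in LIVE_DIRECTORIES.items()
            if SCHOOL_MODE.get(school) == "live"}


def search(name):
    """Combine both sources and drop schools with no hits."""
    results = {}
    results.update(search_deposit(name))
    results.update(search_live(name))
    return {school: hits for school, hits in results.items() if hits}


if __name__ == "__main__":
    print(search("smith"))
```

In this sketch the choice is a per-school setting, so a school could change how its data is searched without any client being reconfigured.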
12-7-2000
Michael R Gettes, Georgetown University
European perspectives
Bob Morgan, University of Washington, pointed out some time ago (I think
it was around April, 2000), that the DoDHE should consider using alternate
(or LDAP independent) indices for the data to be searched in the central
deposit. I believe the reasoning was/is:
- Don't re-invent the wheel. Previous work was developed by Roland Hedberg,
  and we should seek his expertise and understand its applicability.
- Include the European community in our efforts.
- Architecturally speaking, the central deposit should be directory independent,
  not built against a particular technology like LDAP.
The following text is an email from Roland Hedberg giving his thoughts on
this project.
Let's start with some math. Internet2 consists of ~100 schools;
assuming 20.000 staff+students per school, that's 2.000.000 persons.
Further assume that these persons will use the directory for white pages
lookups twice a week. That would give a total of 4 M queries per week,
or ~7 queries/second. The distribution of queries over the day is
probably not evenly distributed so I'd guess that 90 % of the queries
will appear during the normal working hours, hence during those hours
you will get a mean of 12-13 queries/second. Surely there will be
peaks that are a lot bigger.
Now in your testbed it took 30 seconds to get back the answers for
some query. 10 queries/second x 30 seconds/query, that amounts to
300 simultaneous queries x 100 schools = 30.000 simultaneous
open outgoing connections from one machine unless you do something
very sophisticated like keeping a couple of connections open all the
time to all the LDAP servers and just distribute the queries over the
pipes.
Still I think you would have to find a decent machine to be able to
cope with that. And then you still have the machines on the other
end, which have to deal with a continuous load of 10 q/s and peaks
up to 100 queries/second.
This is the background for my thinking, I don't believe in letting
every server having to deal with every query. It simply doesn't scale.
It might just work for the present size of I2, but if I2 increases
with a factor of ten ...
Another belief I have is that users want to use the directory
for whitepages queries when they are using a certain application, like
a mail client. Since a lot of mail clients today include an LDAP client
they will probably find it reasonable that they should be able to use
it for doing the query. This is a major pain because the client
state-of-the-art is so bad, still you/we will have to deal with it.
There is no way they will be satisfied with a web interface. So whatever
solution I2 chooses it has to work with common LDAP clients.
So the system I'm imagining is a system based on distributed LDAP servers,
loosely held together by the use of referrals.
It's built on a server hierarchy. At the top you have a set of
index servers that cooperate to guide the LDAP client to the
right LDAP server to query. And below the index servers you have all
the schools' LDAP servers, all of them containing superior references to
one or more of the index servers.
Whether some schools have more than one LDAP server (masters and slaves), or
whether they use a central LDAP server as the slave or even as their master, is
irrelevant for the design as such. That it has a great impact on
accessibility and reliability is a completely different matter.
This way an LDAP client connecting to one school's LDAP server can find
all the other LDAP servers and, using the indexes, it will also
find the subset that might have the information it is looking for.
If alongside this one would like to use SRV records, SLP
or other means of finding an LDAP server, that is absolutely OK.
Granted, this is rather sketchy and needs to be worked out in detail.
This does not preclude the usage of a web interface like the one
you have done; I'd only like to see it use the index servers
before going to the schools' LDAP servers.
A guy here in Norway has done a first cut at such a web interface:
http://www.katalog.uninett.no/ldap/finn4/
You won't be able to read the text as it is in Norwegian :-)
but if you type "leif johansson" and do a search (hit the
button after the input field) you will see from the email addresses
that the information comes from 6 different universities.
What this interface does is that it first queries the index server
ldap://gids.catalogix.se:3891
and then follows the referrals.
I'm not sure if he is serializing or doing it in parallel.
We have an added complication here in Europe, which is that we still
have some old LDAPv2 servers in use and they use character sets other
than UTF-8, most commonly ISO-8859-1 and T.61. So we have to use
proxies that do character set translations.
In I2 you might have to use proxies to do chaining on behalf of
LDAPv2 clients that don't know how to handle references. For instance
I think Eudora still contains an LDAPv2 client, and Netscape has an
LDAPv2.5 client. If you have some clout at Netscape please get them
to do a proper LDAPv3 client implementation.
-- end of Roland's email
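
The load estimates in the email can be reproduced with a few lines of arithmetic. The sketch below is only a check of those figures; the function names are invented here, and the 80-hour weekly "working hours" window is an assumption chosen because it reproduces the 12-13 queries/second quoted above (the email does not say how long the window is).

```python
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604,800 seconds


def weekly_queries(schools, people_per_school, lookups_per_week):
    """Total whitepages lookups per week across all schools."""
    return schools * people_per_school * lookups_per_week


def mean_rate(queries_per_week):
    """Average queries per second if the load were spread evenly over the week."""
    return queries_per_week / SECONDS_PER_WEEK


def busy_rate(queries_per_week, busy_share, busy_hours_per_week):
    """Queries per second when `busy_share` of the load falls in the busy hours."""
    return queries_per_week * busy_share / (busy_hours_per_week * 3600)


def in_flight(queries_per_second, seconds_per_query):
    """Queries outstanding at once: arrival rate times time per query."""
    return queries_per_second * seconds_per_query


if __name__ == "__main__":
    total = weekly_queries(100, 20_000, 2)
    print("queries per week:", total)
    print("mean q/s:", round(mean_rate(total), 1))
    # Assumption: an 80-hour busy window per week.
    print("busy-hour q/s:", round(busy_rate(total, 0.9, 80), 1))
    concurrent = in_flight(10, 30)
    print("simultaneous queries:", concurrent)
    print("outgoing connections to 100 schools:", concurrent * 100)
```

With these inputs the weekly total is 4,000,000 queries, or about 6.6 per second on average, and 10 queries/second at 30 seconds each gives 300 queries in flight and 30,000 open connections. A stricter 40-hour window would double the busy-hour rate to 25 queries/second.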
1-11-2000
Michael R Gettes, Georgetown University
A PowerPoint presentation showing current status
and an architectural view of the DoDHE was posted to the http://middleware.internet2.edu/dodhe
web site.