Minutes From The 10/9/03 And 10/13/03 BoF
Date: 10/13/03
Agenda
- define a regular conference call schedule
- review first draft initial charter
- review first draft of year one expectations
- discuss if the core members are a correct fit, and if we should augment
- review year one activity roadmap
- compile an agenda for the October BoF, and who should we solicit to attend?
Presented At: Combined Conference call and Internet2 Fall 2003 Member Conference - MW-E2ED BoF
Given By: Chas DiFatta
Participants on 10/13/03 Conference Call
- Chas DiFatta - CMU(chair)
- Steven Carmody - Brown
- Renee Frost - Michigan/Internet2
- Scott Cantor - OSU
- Mark Poepping - CMU
- Von Welch - NCSA
- Nate Klingenstein - Internet2(scribe)
Discussion
The scribe apologizes for missing the first thirty minutes of the
member meeting session due to scheduling conflicts.
Architectures & Bounds
The group first tried to establish bounds on the project by
understanding which models are in use on campuses, so that the
diagnostics are well designed for standard architectures and so that
there is some scope for which components will need to be measured. More
importantly, the group evaluated what would be appropriate to measure.
The project aims eventually to create a set of diagnostic tools that
can function across realms and administrative domains. This would allow
for testing of new inter-domain applications, such as Shibboleth.
Mark is concerned that trying to measure anything in terms of
performance details, such as response speed or server load, will
immediately open a huge set of ratholes, both in terms of technical
difficulties and in terms of other projects trying to do similar
things. Performance analysis is a different game; it implies a
continuing effort to tune things, which is outside the scope of
monitoring and diagnosis.
Steven shared the same concerns, preferring to build a set of
baselines "and a crude alarm system." Dealing with anomalies is a more
critical and applicable goal in the short term. The group warned that
the term "performance" is overly broad and can be misinterpreted, and
Chas offered to refine the charter and other documents based on this.
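As a rough illustration of what a baseline with "a crude alarm system"
could look like (this was not presented at the meeting), the sketch
below summarizes past observations and flags values far outside them.
The function names, the sample counts, and the three-deviation
threshold are all assumptions made for the example.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize historical observations as (mean, standard deviation)."""
    return mean(samples), stdev(samples)

def is_anomalous(value, baseline, threshold=3.0):
    """Crude alarm: flag values more than `threshold` deviations from the mean."""
    center, spread = baseline
    if spread == 0:
        # A perfectly flat history: anything different is unusual.
        return value != center
    return abs(value - center) > threshold * spread

# Hypothetical hourly counts of failed logins during normal operation.
history = [4, 6, 5, 7, 5, 6, 4, 5]
baseline = build_baseline(history)

print(is_anomalous(6, baseline))   # within the usual range
print(is_anomalous(40, baseline))  # far outside the usual range
```

The example counts anomalies in an event rate rather than measuring
response speed or server load, in keeping with the concern about
performance analysis raised above.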
At the member meeting, Michael Gettes of Duke asked which campus
components would be monitored, suggesting he'd like information on
individual applications, how DNS is integrated, the network itself, and
several other pieces of information. In response, Russ Hobby and Matt
Zekauskas of Internet2 will be joining these calls as representatives
of a similar measurement effort in the Internet2 End-to-End initiative.
Federated Diagnostics
The DoDHE, or Directory of Directories for Higher Education, is a
project facilitated by Internet2 that is now on indefinite hold. This
project aimed to build a centralized repository for LDAP queries about
individuals at institutions, leveraging data that was already public
and widely available. This data could be centrally stored or pulled
dynamically in a distributed fashion. However, the project faced
extreme hurdles in getting institutions involved, as people asked how
public this information really should be.
There may be lessons to learn here when it comes to the storage and
accessibility of log files, which may sometimes contain sensitive
information. It's possible to sanitize these files, query them
dynamically with access controls, or use various other techniques to
protect data, but this will be a fundamental hurdle the project will
face as it moves forward into the federated world.
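One of the options mentioned above, sanitizing log files before they
are shared, might be sketched as follows. This is only an assumption
about what sanitizing could mean here: the patterns, the placeholder
strings, and the sample log line are invented for the example, and real
logs would likely need more than two rules.

```python
import re

# Patterns for two kinds of identifying data (illustrative, not exhaustive).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def sanitize(line):
    """Replace user identifiers and addresses with neutral placeholders."""
    line = EMAIL.sub("<user>", line)
    return IPV4.sub("<addr>", line)

# Hypothetical log line; the format is made up for this sketch.
raw = "2003-10-13 23:41:07 auth failure for jdoe@example.edu from 192.0.2.17"
print(sanitize(raw))
# prints: 2003-10-13 23:41:07 auth failure for <user> from <addr>
```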
An approach to this that Chas suggested during the member meeting was
to have events and data move along different streams of logging. One
set of information would be used for generally accessible event
information and federated diagnostics, possibly stored centrally or
included as part of a web of diagnostic information, while another,
more detailed set would be used for internal diagnostic work. Whether
to store log files in a centralized or distributed fashion is one of
the central questions for this project, given that there will be
distributed systems and applications to query even in an intra-realm
scenario.
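A minimal sketch of the two-stream idea, under the assumption that each
event is a record with named fields: a short list of fields is copied
to the shareable stream while the full record stays internal. The field
names and the choice of which fields count as public are hypothetical.

```python
# Fields assumed safe to share across realms; everything else stays internal.
PUBLIC_FIELDS = ("time", "service", "event", "status")

def split_event(event):
    """Return (public_record, internal_record) for one logged event."""
    public = {key: event[key] for key in PUBLIC_FIELDS if key in event}
    return public, dict(event)

public_stream, internal_stream = [], []

# Hypothetical event; the field names are invented for this sketch.
event = {
    "time": "2003-10-13T23:41:07",
    "service": "sso",
    "event": "login",
    "status": "failed",
    "user": "jdoe",
    "client_addr": "192.0.2.17",
}

pub, internal = split_event(event)
public_stream.append(pub)
internal_stream.append(internal)
print(pub)
```

Choosing the public fields by an explicit allow-list means that a new,
possibly sensitive field stays internal unless someone decides to share
it.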
This drew some comments from the crowd about the need for separate
streams of information. The response was that the separation is
partially necessary due to the nature of the data itself; for help-desk
purposes, and for allowing application users to look at the logs most
relevant to them, this distinction seemed useful. While the tools
produced don't necessarily have to be used in an inter-realm fashion,
this is one of the primary reasons for the work.
Active, Passive & Event-Driven
The biggest discussion at the member meeting was whether the
monitoring tools should make use of active, passive, or event-driven
techniques, listed here in order of increasing complexity. Active
monitoring would initiate its own actions and measure the performance
dynamics of the service responding to those actions; passive monitoring
would instead look at the logs or other information of these services
to watch for errors or similar problems.
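The difference between the first two techniques can be sketched as
below. This is an illustration only: `active_check` and `passive_check`
are hypothetical names, the probe is a stand-in for a real request to a
service, and the error markers are assumptions about what a log might
contain.

```python
import time

def active_check(probe):
    """Active: initiate an action ourselves and time the service's response."""
    start = time.perf_counter()
    try:
        ok = bool(probe())
    except Exception:
        ok = False
    return ok, time.perf_counter() - start

def passive_check(log_lines, markers=("ERROR", "FAIL")):
    """Passive: scan what the service already logged for signs of trouble."""
    return [line for line in log_lines if any(m in line for m in markers)]

# Stand-in for a real request (for example, a test login against a service).
ok, elapsed = active_check(lambda: True)
print(ok)

# Hypothetical log lines in a made-up format.
logs = [
    "23:41:07 INFO session created",
    "23:41:09 ERROR attribute query timed out",
]
print(passive_check(logs))
```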
An event-driven system would be difficult: asking for all the
information related to an event that happened at some point requires
many shims on the back end. Chas said this isn't a "large hammer you
need to throw into your infrastructure," but the back-end threading
needs to be investigated to determine whether this is a feasible
approach. Filtering events and tagging them somehow would allow for
more sophisticated forensics, reporting, and analysis, and would
potentially better support the needs of multiple applications without
presenting an overload of information.
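One way to picture filtering and tagging, offered only as a sketch and
not as a design discussed by the group: each event is given tags by a
set of simple rules, and a consumer then selects only the tags it cares
about. The rule names, service names, and event fields are all
hypothetical.

```python
def tag_event(event, rules):
    """Attach every tag whose rule matches the event."""
    tags = sorted(name for name, matches in rules.items() if matches(event))
    return {**event, "tags": tags}

def select(events, tag):
    """Keep only the events carrying the given tag."""
    return [e for e in events if tag in e["tags"]]

# Hypothetical tagging rules; each maps a tag name to a predicate.
RULES = {
    "auth": lambda e: e["service"] in ("kerberos", "ldap"),
    "failure": lambda e: e["status"] != "ok",
}

events = [
    {"service": "ldap", "status": "timeout"},
    {"service": "web", "status": "ok"},
    {"service": "kerberos", "status": "ok"},
]

tagged = [tag_event(e, RULES) for e in events]
auth_failures = [e for e in select(tagged, "auth") if "failure" in e["tags"]]
print(auth_failures)
```

A help desk could then ask only for events tagged as failures, while an
application owner asks only for events from its own service, without
either seeing the full stream.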
The decisions made here will hinge on central questions about how much
information should be logged and whether most errors will be considered
reproducible or not. On the first question, Chas's categorical answer
was that the goal is not to gather the largest possible amount of
information, but instead to gather the specific information that will
likely be most relevant to the diagnosis at hand.
Steven cited an anecdote in which a Shibboleth site once asked him,
"What went wrong last night between 11:30 PM and 2:00 AM?" That sort
of problem can't feasibly be replicated due to the sheer number of
variables involved. Given the sort of uptime sought by these relatively
critical services, it may be important to be able to answer that sort
of question in order to diagnose and patch systems.
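If the relevant events have been logged with timestamps, a question
like the one in the anecdote becomes a query over a time window that
crosses midnight. The sketch below assumes each log line begins with a
"YYYY-MM-DD HH:MM:SS" timestamp; that format and the sample lines are
invented for the example.

```python
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
STAMP_LENGTH = 19  # length of a "YYYY-MM-DD HH:MM:SS" prefix

def events_between(lines, start, end):
    """Return the log lines whose leading timestamp falls within [start, end]."""
    selected = []
    for line in lines:
        stamp = datetime.strptime(line[:STAMP_LENGTH], STAMP_FORMAT)
        if start <= stamp <= end:
            selected.append(line)
    return selected

# Hypothetical log lines in a made-up format.
log = [
    "2003-10-12 22:15:02 sso session created",
    "2003-10-12 23:47:31 sso attribute query failed",
    "2003-10-13 01:05:44 sso attribute query failed",
    "2003-10-13 03:20:10 sso session created",
]

# "Last night between 11:30 PM and 2:00 AM"
window = events_between(
    log,
    datetime(2003, 10, 12, 23, 30),
    datetime(2003, 10, 13, 2, 0),
)
for line in window:
    print(line)
```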