Minutes From The 6/3/04 Bimonthly Meeting

Agenda
- Survey - comments from group
  - Russ Hobby - network area comments
  - George Brett - comments on other tool pointer efforts
- Scenarios from network area and Shibboleth status
  - Steven Carmody - report from last Shib meeting
  - Russ Hobby - finding a candidate for the network scenario
- Report from "diagnostics big and small" session
- Pilot Progress

Participants
- George Brett - Internet2 (scribe)
- Steven Carmody - Brown
- Chas DiFatta - CMU (chair)
- Russ Hobby - Internet2
- Steve Olshansky - Internet2 (flywheel)
- Mark Poepping - CMU

Discussion
Chas began the call with information about a conversation he had earlier
with Eric Boyd of the E2E Performance Initiative (E2Epi) about how to map
end-to-end performance into the events arena. It was a good session, and
it was helpful to better understand each other's architecture. E2E
basically has two camps: active measurements and passive measurements.
Eric's work is more in the area of active measurement. The work in the
performance area is challenge-response (stimulus-response), and it will be
generating lots of interesting data. To get the data into the e2ed
domain, work must be done to design a process to analyze the performance data,
make some decisions, and then inject an event into the MW-Diagnostic Backplane. Russ
agreed with this observation. Chas pointed out that there also needs to
be input from passive monitoring tools such as NetFlow. Russ
said another reason for passive monitoring is to keep the network
from being consumed by active tests.
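The analyze, decide, inject sequence described above could look roughly like the following. This is only a minimal sketch: the `PerformanceEvent` fields, the `analyze_rtt` and `inject` names, the use of median round-trip time, and the threshold values are all assumptions made for illustration. None of them come from E2Epi or from the backplane design, and the backplane is stood in for by a plain list.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class PerformanceEvent:
    """Hypothetical event record; the field names are illustrative only."""
    source: str    # which kind of measurement produced the data
    path: str      # the end-to-end path that was tested
    severity: str  # "warning" or "critical"
    detail: str    # human-readable summary


def analyze_rtt(path: str, rtt_ms: List[float],
                warn_ms: float = 100.0,
                crit_ms: float = 250.0) -> Optional[PerformanceEvent]:
    """Turn a batch of round-trip-time samples into at most one event.

    The thresholds are made-up defaults; a real analysis step would use
    whatever decision rule the groups agree on.
    """
    if not rtt_ms:
        return None
    typical = median(rtt_ms)
    if typical >= crit_ms:
        severity = "critical"
    elif typical >= warn_ms:
        severity = "warning"
    else:
        return None  # performance looks normal, so nothing is injected
    return PerformanceEvent(
        source="active-measurement",
        path=path,
        severity=severity,
        detail=f"median RTT {typical:.1f} ms over {len(rtt_ms)} samples",
    )


def inject(event: PerformanceEvent, backplane: list) -> None:
    """Stand-in for handing the event to the backplane (here, a list)."""
    backplane.append(event)


backplane: list = []
event = analyze_rtt("campus-a -> campus-b", [310.0, 295.5, 280.2])
if event is not None:
    inject(event, backplane)
```

The point of the sketch is that raw measurement data stays on the performance side and only the outcome of a decision crosses over as an event.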
Survey - comments from the group
Russ Hobby -- network area comments
Russ said he had looked over the network area of the survey and that he had some
difficulty understanding how the tools listed there differ from the
tools listed on the E2Epi page.
Chas replied that they had started out looking at tools specifically for
middleware and found that many middleware tools branch out and include
the network. So they added them to the list,
ending up with tools that were more network specific. Russ suggested having
a pointer on the e2ed page to the E2Epi tools.
Chas said that they had discovered in doing the survey that many network
tools are now included in application tools as well, even low level stuff like
DNS. Microsoft has released MOM (Microsoft
Operations Manager), and over time it will integrate with network based events.
CMU is considering deploying MOM on all its desktops. There is a new movement
from Microsoft to include processes that collect events based on SNMP and other
analysis protocols like Cisco NetFlow.
George Brett - comments on other tool pointer efforts
George described recent activity with a group of Internet2 staff to identify
troubleshooting documents (e.g., FAQs, how-tos, troubleshooting guides, etc.).
This activity is one that should work with Middleware e2e Diagnostics. He pointed
out that there is a wiki being used
to collect and discuss this information. It is a living document; wikis
are designed to be edited by multiple folks. There was a brief discussion about
figuring out what people will use as a resource.
George described the issue of how to maintain such a resource once it has started.
Chas agreed that such an activity takes resources and human effort. He suggested that
Internet2 have an area where Big Tools that are supported by Internet2 and other groups
are listed, but also have a more public area where people can freely add information
about new tools that might be added to the Big Tools list. A further suggestion was
to incorporate a ratings process similar to Amazon or CNet that would help
identify the better tools. This led to a discussion about the value or lack of value
of anonymous ratings; most agreed that only people committed to the tools would
be likely to rate or comment on them. Chas said he could see Internet2 becoming the UL
(Underwriters Laboratories) of application / network based tools. It's possible that doing
such reviews would add stature to the reviewers.
George asked for feedback and suggestions from the group.
In closing it was suggested that web stats from the troubleshooting wiki would be helpful.
George will see how to get this information.
Scenarios from network area and Shibboleth status
Steven Carmody - report from last Shib meeting
Steve reported that Shibboleth has just released a new version and now they're
talking about its features and functionality. Shib is at the point of starting to see
production level deployments on campuses and at vendor sites. The Shib team has enough
experience with big complex apps that they know they'll need to provide tools
to the campuses and vendors to support the applications -- helpdesk and back room.
OCLC is already saying they'll need tools before moving past level 1. It's time to
start looking at how to modify or enhance aspects of the Shib code (logging) to record
useful, helpful information for the people doing diagnosis. Shib is set up to log
up to nine levels deep. It can be set to record lots of info, but the problem right
now is that the information being logged was chosen to help developers. Now we
need to figure out what help desks and 2nd & 3rd level people need in order to figure
out issues. We don't have much experience with a distributed system like this.
There will be issues about access to certain materials in the logs.
We have a lot of experience helping people install Shib, but little experience
in production environments, of which there are not many. Those folks learned enough about Shib
from doing the install that they don't have needs in production. But once there is
widespread deployment, new problems will no doubt emerge.
At this week's Shib call, Chas and the MW E2ED team at CMU began to explore with the Shib
development team what can be done over the next 6 months to reach a stage 1. Action items include:
- Shib folks developing Shib focused MW-E2ED based scenarios
- Talking with people at sites running Shib production networks for a perspective from the operation of the service
- Incorporating new error messages that will be more helpful in the logs
He went on to say that one problem is that (like other projects) Shib depends on other people's code.
The down side of this is that there are log files all over the place. There are files from
Apache, ModSSL, and other libraries. Each of these components writes to its own log file,
with logging decisions based on how the library was expected to be used. ModSSL logs based on its original use
cases - people with browsers that access a web server with SSL.
Shib uses it very differently, and therefore the error messages are not helpful. One suggestion
was that someone go and fix the error messages in ModSSL. He said that one Shib programmer is
going to spend significant time improving the information in the logs and finding ways to thread the log
messages to better identify a Shib transaction.
That's where Shib stands. The effort is just being kicked off, and he'll be reporting back on a regular
basis.
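As an illustration of what threading log messages across components could mean in practice, here is a minimal sketch. It assumes, hypothetically, that each component can be made to write a shared transaction id in a `txn=<id>` field after an ISO 8601 timestamp. That line layout, the `thread_logs` name, and the sample lines are invented for this example and do not describe how Shib, Apache, or ModSSL actually log.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed line layout: "<ISO timestamp> txn=<id> <message>"
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+txn=(?P<txn>\S+)\s+(?P<msg>.*)$")


def thread_logs(
    logs: Dict[str, Iterable[str]]
) -> Dict[str, List[Tuple[str, str, str]]]:
    """Group lines from several component logs by transaction id.

    `logs` maps a component name (e.g. "apache") to its log lines.
    Returns txn id -> list of (timestamp, component, message), time-ordered.
    Lines that do not carry a txn id are skipped.
    """
    threads = defaultdict(list)
    for component, lines in logs.items():
        for line in lines:
            match = LINE_RE.match(line.strip())
            if match:
                threads[match["txn"]].append(
                    (match["ts"], component, match["msg"])
                )
    # ISO 8601 timestamps in the same zone sort correctly as plain strings.
    return {txn: sorted(entries) for txn, entries in threads.items()}


sample = {
    "apache": ["2004-06-03T14:02:11 txn=abc123 GET /secure/ 302"],
    "modssl": ["2004-06-03T14:02:12 txn=abc123 handshake failed"],
    "shib": [
        "2004-06-03T14:02:10 txn=abc123 session lookup started",
        "2004-06-03T14:05:00 txn=zzz999 attribute query sent",
    ],
}
for ts, component, msg in thread_logs(sample)["abc123"]:
    print(ts, component, msg)
```

The hard part in a real deployment is getting a common identifier into logs written by code the Shib team does not control, which is the dependency problem described above.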
Chas commented on the actions that came out of the conversation. He said that Steve will get
a developer to write a scenario. A very good result was confirmation of the four levels of users we talk
about (Developer, Operator, Help Desk, User); the Shib folks are definitely in this camp. Renee Shuey of
Penn State Univ., which has one of the first production deployments, has already identified people for Chas to talk to
in order to develop a scenario for this aspect of the community and help raise some of the questions they'll
need answered.
Russ Hobby - finding a candidate for the network scenario
Russ had talked with Brent Sweeny at the Abilene NOC, who felt this was a good idea. He volunteered
either himself or one of his staff to write a scenario. They have good operational experience,
so they should produce a good document. Russ noted that they may need some persuasion to complete it.
The next step is getting it pulled out of them.
Chas commented that we're seeing two dimensions in the network space, Active vs Passive and
Long-haul WAN vs Short-haul LAN, and that it will be helpful to have people write scenarios that
fit each of the four camps. He pointed out that the Abilene folks have a very different view than people
at local campuses. We need information from the Passive measurement (NetFlow)
side as well as the Active measurement side (E2Epi). Russ agreed and asked if there was a template
available to give to people to fill in with their scenarios. Chas said he has one and will send
it to Russ.
[AI] Chas and Russ will email each other about how to 1) follow up with the folks identified and
2) engage new folks in other camps.
Report from "diagnostics big and small" session
Chas said it's been really hard to get people to participate; Matt, Russ, and Eric have been really busy.
But we need to get back to Cheryl & Ken very soon to come up with a road map to see if diagnostics
are outside middleware. If this is so, the question is how to coordinate with specific groups to
keep from reinventing the wheel and to leverage as much value as possible.
In discussion with Eric today - 3 points to make with respect to E2Epi:
- Every 3-6 months, have a meeting of the two groups to discuss current activities and how they
might fit.
- Agreed that the efforts are loosely coupled - but there are touch points.
- There is a whole other camp in networking diagnostics - passive. Since Eric and E2Epi are in
the Active measurement camp, the question is how to engage people on the passive side as well.
George mentioned that E2Epi has worked with the NLANR Measurement and
Network Analysis Group and they might be a good contact.
Pilot Progress
Chas updated the group about progress on the pilot. The work study developer is now full time, and
the team now includes a seasoned developer. They had a meeting
last week in Pittsburgh and are now studying two tools to use as a foundation for development, so
they can concentrate on the goals of the pilot and not reinvent the wheel where others have already done the work. The
candidates are NetLogger
from LBNL (Brian Tierney) and AirCert from the CERT; both
are very interesting. AirCert log files have provisions for NetFlow (Cisco's)
and are tied to a database. There should be a decision by next week on whether to go with one of them without much
modification, augment it, or just take small high value pieces of them and mostly roll our own.
He said they are coming to grips with defining the event record further. It is still the same concept
of a metatag that holds the correlation data, but the raw event on the other end will have 5 schemas -
- Applications (Shibboleth, DNS)
- Network based flow events or passive measurement (NetFlow)
- System oriented events (re-boot, memory error, userD messages)
- Security (intrusion detection systems, needed access)
In order to kick off the pilot, scaling issues will not be addressed at this point. To keep things simple,
the event information from log files will be kept in XML form so it can be operated upon easily and quickly.
Once within the backplane, one requirement is that the event information has to have the ability to be
de-compiled from XML and returned to its raw form. Scott Cantor said the
same thing on the Shib call this week, i.e. "Have to keep raw data the same as it came in."
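One way to meet the "keep raw data the same as it came in" requirement is sketched below, using Python's standard `xml.etree.ElementTree` and `base64` modules. The element and attribute names (`event`, `raw`, `schema`, `correlation`) and the function names are assumptions made for illustration; the actual pilot event record was still being defined. Base64 is used here because XML text cannot carry every byte value, and parsers normalize line endings, so storing the raw line directly would not guarantee an exact round trip.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_event(raw: bytes, schema: str, correlation_id: str) -> str:
    """Wrap one raw log record in an XML event, preserving the bytes exactly.

    The tag and attribute names are hypothetical, not the pilot's schema.
    """
    event = ET.Element(
        "event", {"schema": schema, "correlation": correlation_id}
    )
    raw_el = ET.SubElement(event, "raw", {"encoding": "base64"})
    raw_el.text = base64.b64encode(raw).decode("ascii")
    return ET.tostring(event, encoding="unicode")


def unwrap_event(xml_text: str) -> bytes:
    """Recover the original bytes from an event produced by wrap_event."""
    event = ET.fromstring(xml_text)
    raw_el = event.find("raw")
    if raw_el is None or raw_el.get("encoding") != "base64":
        raise ValueError("event has no base64-encoded <raw> element")
    # An empty record serializes to an empty element, whose text is None.
    return base64.b64decode(raw_el.text or "")


line = b'10.0.0.5 - - [03/Jun/2004:14:02:11] "GET /a?b=1&c=<2>" 500\r\n'
assert unwrap_event(wrap_event(line, "application", "abc123")) == line
```

The trade-off is that the raw record is no longer searchable inside the XML; fields that need to be queried would have to be copied into separate elements alongside the encoded original.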
The next checkpoint is in a couple of weeks. Chas will update us on development milestones at that time.
There was a brief discussion about the details of the schema and how the backplane would log system
configuration changes such as on a router. Chas said that would be a system event, where a router can
be looked at as a host with a specialized application running on it. Mark said we'd have
such events separate from errors. Chas said that this is a first exercise: run log files through to see
how it works, and then search against them to find what was difficult.
In other business, Russ talked about a proposed session for the Fall Internet2 Member Meeting. There are
aspects that pertain to the MW e2e Diagnostics group. He said that he had no one specifically in mind yet,
but this came out of the Applications Strategy Council. The topic is what characteristics are important and
how to design for them. He will be sending the proposal to the MW E2E Diagnostics list for discussion.
[AI] Chas and George will work out the list of action items.