Internet2

Middleware



Minutes from the 12/16/04 Bimonthly Meeting

 


Agenda
  • Interview with Dan Pritts
Participants
  • George Brett - Internet2 (scribe)
  • Chas DiFatta - CMU (chair)
  • Mark Poepping - CMU
  • Russ Hobby - Internet2
  • Dan Pritts - Internet2

Chas began by introducing Dan Pritts and thanking him for participating in the call and interview. Chas referred the call participants to Dan's written responses to the top ten problems. The interview began by covering the top three problems: 1) end-to-end network problems, 2) the eDial telephone conference appliance, and 3) spam.

Chas asked whether video was the main end-to-end network problem. Dan replied that video is the most typical case when there are end-to-end issues. Typically there will be a video conference between two points, say Ann Arbor, MI and Monterey, CA. He will be told that the video is horrible on the other end and that he has to fix it. This is hard to do given the current tools. Among networked applications, video seems to be the only one people really complain about.

The second part of the first problem is local area networks and switches. Dan said that when you look at a router you get all sorts of counters and error rates for bad frames. The HP switches, by contrast, typically do not give visibility into what is happening inside the switch. One time they didn't have spanning tree enabled, and when they plugged in two switches it took down the network. They finally discovered the problem by looking at the wall jacks. There was no other way to figure it out.

Another local area network example involved two switches: after a video event, Iperf indicated that something was wrong on the network between the 1st and 2nd floors of the building. They tried to break the network into smaller pieces based on the floors. Iperf reported loss, but the switch didn't report any errors on the link. What was the problem? It could have been a kernel not passing information. They work with both Linux and Windows, and information may not pass well between the two. The application is intolerant of errors, and here one side says there are errors while the other does not.

Moving on to eDial, what sort of problems does it present? Dan said that eDial runs on an SSL web server, Apache on Linux. It's well known that they recently switched SSL certificate vendors. This changed the eDial installation from a single cert to a requirement for an intermediate cert as well. Both had to be installed properly. Now eDial is unwilling to fix it. They do not respond to requests for root authorization to fix it.

Another eDial example was when he had to do an upgrade and was trying to do backups of the systems. He had tens of gigabytes of system data copied off 2 or 3 times. This caused the upgrade to expand from a 4 hour process to one of 6 or 8 hours. The work required was really only 30-45 minutes; the rest of the time was spent waiting for file copies.

Then there was the eDial example related to phone lines. The appliance has a pair of T1 lines coming in. The dial-in PIN codes were not being recognized. There is no way to diagnose what is happening on the T1 cards. eDial finally figured out that the problem was a bad port on the card. Evidently the tones were not being read correctly, which meant the PIN entry failed. Dan had no such information on which to base any conclusions.

The third major problem area is spam. Dan said that last year he spent at least one or two man-months dealing with spam and virus control for email. He said he was not sure this is a diagnostics issue, but it is a big problem. Chas asked if this was part of security management and response. Dan said it is part of it. He said that he spends time dealing with spam when he should be doing traditional security activities. But he said that so far we've been lucky and there have not been many big problems. He knows some of what to do from what he reads in trade magazines: for example, if he sees "failed connect" notices from particular addresses, those addresses should be blocked. Similarly, if we get 1,000 email viruses from one address, it should be blocked.
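The blocking rule Dan describes (block an address once it has sent 1,000 email viruses) can be sketched as a simple count over log events. This is only an illustrative sketch, not a tool discussed on the call: the event format, the function name `addresses_to_block`, and the sample addresses are all assumptions.

```python
from collections import Counter

def addresses_to_block(events, threshold=1000):
    """Return source addresses seen at least `threshold` times, sorted.

    `events` is an iterable of (source_address, event_kind) pairs; this
    format is hypothetical and would need to be parsed from real mail logs.
    """
    counts = Counter(source for source, _kind in events)
    return sorted(addr for addr, n in counts.items() if n >= threshold)

# Hypothetical sample data using documentation-only IP addresses.
sample = [("192.0.2.10", "virus")] * 1000 + [("198.51.100.7", "virus")] * 3
print(addresses_to_block(sample))
```

The same counting approach would apply to the "failed connect" notices, with a lower threshold chosen by the administrator.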

Chas then asked about database access: is the issue discovery, or knowing who is using what? Dan agreed and said it is a combination of enterprise databases being used along with multiple little databases on various web servers. The challenge comes when people try to combine data from different databases. This is mostly a problem for other staff when they try to use multiple SQL queries. This tends to be both an organizational and a data management problem.

The next area of discussion was backup and archiving. Dan said that hand-holding is often required when there are problems. He said there is pretty good software to do the Windows backups. But if a file is open when the system crashes, the user gets a report that the backup wasn't done right. This causes most of the calls.

The server backups are done using NetVeritas. Recently they had problems when the network data backup failed. They learned that there are two tapes that need to be manually configured; you cannot just add tapes to the pool. If you do, it keeps trying to write to the one with media errors. Chas asked if there is automated error detection. Dan said there is, by email, if the failure is due to a media write error. You'd think you could just add another tape and the software would figure that out, but it doesn't do that. It had to be configured manually with the tape number to be used.

Chas asked how you know you have a problem with backup: is it reactive or proactive? Dan said the software logs to a file and then sends email. Every backup sends an email log saying whether it was successful or not. He said he gets a lot of email, one message for each server each night plus a summary. In the particular case of a failure he will get an email that the backup failed. Chas asked if this was too much information, too noisy. Dan said no, the information is necessary and moderately useful.

Chas asked about the "email left on the server" issue: is this because users are not able to get email from the server? Dan said that they instruct users to set their email client's option to leave email on the server. This way, if the laptop email application fails, they can still get access to their email. We don't use IMAP since it does not work well in disconnected mode with the Eudora email client. Every once in a while the Eudora client decides to download all mail from the server and mark it as new, essentially a big dump of email. Chas asked if this was logged to show when this happens on the client or the server. Dan replied that the server logs a successful connection with X number of downloads. He doesn't usually deal with this, but fairly smart people have no idea why this happens. It's random, but the general consensus is that it's a POP mail client problem.

Dan said that better diagnostics would help with Unix performance and troubleshooting. An example is when some process goes crazy, begins sucking up virtual memory, and causes the server to run out of virtual memory. Under Solaris this would happen and there would be no way to fix it other than to reboot the system. The understanding was that if they sent a kernel dump report to Sun, Sun could tell what was running and what the system looked like. But what was missing, even with system accounting, was a record that such-and-such a command was running and then exited. Accounting is fine if the process runs and exits normally, but nothing is logged if a crash happens.

There was a recent problem with RT, which we use for trouble ticketing. RT is a resource hog; it's open source and uses lots of Perl with a MySQL back-end. When we first brought it up we thought it was a simple application that didn't need much. It was running on a Pentium. A single ticket with its email diary took 5-10 seconds. This gets old real fast. He went to the RT support mailing list with the problem and found out that it was the database.

Chas asked Dan which of the listed activities he thought were the most expensive to have fail, and which take up most of his time. Dan said spam takes the most time, then network problems, and then eDial. RT performance tuning took a lot of time initially, but it's more a matter of web application troubleshooting with Linux, MySQL and Perl. The rest of the problems are about equal.

Are the network problems the result of hardware, software or network behavior? Dan said all of the above. It begins with an application saying the network is down, like streaming video that hiccups in a visible way or a person on the other end saying the video looks bad. Where's the problem? Is it in the building, with Merit (the regional network), or somewhere else? On one occasion we decided it was in the building, after doing lots of local testing with routes, Iperf, etc.

Of the problems listed, which is the most expensive to the organization when it fails? Dan said spam is number 1. eDial is moderately expensive when the appliance fails because then they have to use a more expensive telephone conferencing system. He pointed out that the commercial system previously cost the organization $100,000 per year, and eDial paid for itself within 8 months of operation. So a failure can add up to a big expense quickly. With eDial it's the little problems that quickly become big problems. Financial expenses build quickly, and there's the difficulty of getting the right information about changes out to people who are used to using eDial.

End-to-end network problems are expensive too, and if we cannot get the network running, who can? We are Internet2 after all.

So far the ranking is #1 spam, #2 network issues, and #3 eDial. What proactive or reactive tools do you have to solve problems related to these?

Dan said that spam and viruses are two sides of the same problem. We deal with spam in a reactive way by using a rule-based filter application, SpamAssassin, to keep it out. There is constant tuning to keep the rules current. Proactively, one can use spam "black lists" or get involved with the industry process to do something about it. On the policy side our CEO is working with other leaders on this.

Is this happening on the client or the server side? Dan replied this is mostly happening on the server side. The point is to make sure the end user is less impacted by spam. A Gartner Group report says that everyone spends 10-15 minutes on spam, which leads to millions of dollars spent or wasted. Any cleaning up beforehand is a benefit.

Are you looking for outgoing spam? Dan said not really. We have a reverse spam problem: mail is sent to a non-existent account, we accept the mail and process it later, and then it bounces back to the sender. There are similar problems with the Sympa list server software sending auto-respond messages. We try to mitigate this with SpamAssassin. We don't want to drop the mail or send it on. Chas asked if the problem is that the SpamAssassin server is not keeping up. Dan said no, it is more that the spammers change and you have to try to keep up with them. How do you know if you are being effective? Dan replied that it is a manual process. He knows the current status is OK because he keeps an eye on the mail server to see if it slows down or mail stops going. As for anti-virus applications, there are two checkers running. If one checker goes down, the other virus checker will begin to time out. The long timeouts begin to slow down the mail servers. About the only way to deal with this is to restart the process.

Chas asked Dan, if development resources were unlimited, what would be the best solution to deal with spam. Dan said he'd like a better filter application that would give statistics such as X messages containing Y viruses. He'd like to know what the current rate of processing is and what the backlog is. He'd like this easily presented and quickly produced. He would like to be notified, for example by a console message, if the backlog stays above a certain threshold over a period of time, say 10 minutes.
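The notification Dan asks for (alert when the backlog has been over a threshold for some period, say 10 minutes) can be sketched as follows. This is a minimal sketch under stated assumptions: the sample format of (time in seconds, backlog size) pairs and the function names are hypothetical, and how the backlog is actually measured on the mail server is left out.

```python
def seconds_over_threshold(samples, threshold):
    """Return how long the backlog has stayed above `threshold`,
    measured back from the most recent sample.

    `samples` is a list of (time_in_seconds, backlog_size) pairs sorted
    by time; this format is an assumption made for illustration.
    """
    if not samples:
        return 0
    latest_time = samples[-1][0]
    start = None
    # Walk backwards until the backlog drops to the threshold or below.
    for when, backlog in reversed(samples):
        if backlog <= threshold:
            break
        start = when
    return 0 if start is None else latest_time - start

def should_alert(samples, threshold, window_seconds=600):
    """True when the backlog has exceeded the threshold for the whole window."""
    return seconds_over_threshold(samples, threshold) >= window_seconds

# Hypothetical queue-size samples taken every five minutes.
queue = [(0, 50), (300, 600), (600, 700), (900, 800)]
print(should_alert(queue, threshold=500))
```

Requiring the condition to hold for the whole window, rather than alerting on a single high reading, keeps a brief burst of mail from triggering a console message.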

Is there any information that your current tools give you now, but that is buried too deep to see? Dan said yes, the processing rate. He doesn't have scripts to figure out how many messages per minute are being processed or how large the backlog of waiting messages is. But he does have some reporting cobbled together to report how many viruses were caught each day per user.

Do you use logs a lot? Dan replied that he only uses them when things break. He has seen more spam and virus problems in the past year than before.

What about how to tell who is getting more spam than others are? Dan was not really sure what works best for this. There really are not any statistics. What if you had a "This is spam" button which reports back to you? Dan said yes, that would help. It's being done for web based mail services, but not on clients like Eudora.

Chas asked Dan to describe the reactive and proactive tools that he uses for networks. Dan said the only proactive tool he knows of is to run MRTG and NTop to give different graphs. He can see the router interfaces of Merit and sometimes Abilene, but cannot see what's happening on the other side. He just does packet counts. He tried other counts with Juniper equipment, but couldn't get it working well.

Dan said currently he's not using much information other than raw usage of bits in and out. It would be helpful to have more than that. He said it would be nice to have a mapping of the switch ports to hosts. MRTG can generate configuration across 48-port switches but can't tell which hosts are hooked up to each port.
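One way to picture the port-to-host mapping Dan wishes for is a join between two tables: which MAC addresses a switch has seen on each port, and which IP address belongs to each MAC. The sketch below assumes both tables have already been collected somehow (for example from a switch's forwarding table and a router's ARP cache); the data, port names, and function name are hypothetical and not something described on the call.

```python
def hosts_by_port(mac_table, arp_table):
    """Map each switch port to the IP addresses of the hosts seen on it.

    mac_table: {port_name: [mac_address, ...]}  (assumed input format)
    arp_table: {mac_address: ip_address}        (assumed input format)
    MAC addresses with no matching ARP entry are reported as "unknown".
    """
    # Normalize case so "00:AA:..." and "00:aa:..." compare equal.
    arp = {mac.lower(): ip for mac, ip in arp_table.items()}
    return {
        port: [arp.get(mac.lower(), "unknown") for mac in macs]
        for port, macs in mac_table.items()
    }

# Hypothetical sample tables.
macs_seen = {
    "port-1": ["00:AA:BB:CC:DD:01"],
    "port-2": ["00:aa:bb:cc:dd:02", "00:aa:bb:cc:dd:03"],
}
arp_cache = {
    "00:aa:bb:cc:dd:01": "192.0.2.11",
    "00:AA:BB:CC:DD:02": "192.0.2.12",
}
print(hosts_by_port(macs_seen, arp_cache))
```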

Chas asked about reactive solutions. Dan said he's using port sniffers and packet sniffers, so when he is getting multicast flooding he can check it from his desk. He has used Ethereal, telling it to dump all IDs, and NTop. He does have Snort running, but it produces so much data that it is useless. He said the security person is looking at commercial products to make that better. He doesn't want to know every detail.

Arbor Networks has a good product, but it is expensive. They do netflow analysis and keep an eye on traffic patterns. They note the level of port 25 traffic and what is normal. If port 25 traffic from a host spikes, it indicates you have a local person sending spam. Pings would indicate a denial of service (DoS) attack that needs to be blocked at the edge or upstream. The Arbor Networks product looks wonderful.
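The idea of noting what is normal for port 25 traffic and flagging a spike can be sketched with a simple per-host baseline. This is not how the commercial product works (that is not described here); it is a generic mean-plus-standard-deviations check, and the data layout, function name, and default thresholds are assumptions for illustration.

```python
from statistics import mean, pstdev

def spiking_hosts(history, current, sigmas=3.0, min_count=10):
    """Return hosts whose current port 25 connection count is well above
    their own historical baseline.

    history: {host: [count_per_interval, ...]}  (assumed input format)
    current: {host: count_in_latest_interval}   (assumed input format)
    A host is flagged when its count is at least `min_count` and exceeds
    its mean by more than `sigmas` population standard deviations.
    Hosts with no history are treated as having a baseline of zero.
    """
    flagged = []
    for host, count in sorted(current.items()):
        past = history.get(host, [])
        baseline = mean(past) if past else 0.0
        spread = pstdev(past) if past else 0.0
        if count >= min_count and count > baseline + sigmas * spread:
            flagged.append(host)
    return flagged

# Hypothetical per-interval counts of outbound port 25 connections.
past_counts = {"10.0.0.5": [4, 6, 5, 5], "10.0.0.9": [100, 120, 110, 90]}
latest = {"10.0.0.5": 400, "10.0.0.9": 115, "10.0.0.77": 3}
print(spiking_hosts(past_counts, latest))
```

Comparing each host against its own history means a busy mail server with steadily high traffic is not flagged, while a desktop that suddenly opens hundreds of connections is.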

What Dan would like is, when a video stream loses 3 packets, to have the software tell him where they were dropped. But then again, if the packets traveled outside the building and the problem is somewhere else, instrumenting this might be very difficult.

Have you tried baseline tests? Dan replied that he had, but they really had not been helpful in the long run.

To recap, you'd like to receive data on your network about flows, but want it filtered? Dan agreed, but he'd like to have just the bad anomalies reported. He would like an easier way to identify negative anomalies.

How about a tool that uses all the flow data but raises the anomalies to the top, which are then sent to a reporting tool? Dan said yes, that would be a good thing to have.

Do you find yourself doing forensics, drilling down into problems? Yes, Dan said he does that when he sees a problem such as someone sourcing a large multicast. If there's no local listener on the net, then IGMP snooping on the net will flood the network. All of a sudden all devices are having 30 MB streams thrown at them. The answer is to fire up Ethereal and identify where the flood is coming from. Another case is figuring out unknown traffic with Ethereal: he looks at the traffic between hosts or ports and typically begins adding exclusions for whatever he doesn't want to see in the packet trace.

Do you keep network logs or records for historical purposes? Dan said he'd like to but doesn't have the time. Chas asked what he would do with them if he could. Dan said he'd watch the level of traffic. For example, a remote office wanted to increase its bandwidth. It would be helpful to be able to check records to see if the increase would be warranted.

What tools would be helpful with eDial? Dan said anything to look inside the system would be very helpful. Even documentation would be a great improvement. eDial doesn't do either.

Chas thanked Dan again for his participation in the interview. He suggested that he and Dan could have some further discussion after the call about some potential tools that might help Dan.

The call was ended.

 


Top Ten Problems -- Responses

As discussed in the prior mail, these are the interviewee's top 10 problems that he spends time diagnosing. We will be using these for the call.
... cd

1.0 - end-to-end network problems. in perhaps a classic case of not eating our own dog food, i often don't have good visibility into network problems that we encounter. a recent example: a demo of streaming vbrick video from Monterey, CA, to our DC office had troubles, and i got bad diagnostic information from the users.
I had no visibility into the network at the source, which is where i believe the problem was.

1.1 - local network concerns. diagnostic capability of ethernet switches leaves something to be desired.

1.2 - don't trust the tools i *do* have. iperf, for instance, doesn't necessarily give consistent results.

2 - eDial. Our eDial appliance is a linux-based server supporting conference calling with a web UI. When it has troubles the administrative UI is very limited. the vendor sells and supports it as a black box, so even though i know linux/apache better than their support folks do, i cannot fix my problems. If they fixed my problems reliably this would not be as big a problem.

3 - spam,spam,spam. It's never-ending. spam assassin, sympa (mailing list manager) and its auto-replies generating reverse spam, viruses, server load problems due to all of the above.

4 - security management & response. There are a million things i know we could do better. Never enough time. an example: Recently our system log-file analyzers are pointing to ssh brute-force attempts coming in - sure would be nice if events like these could be automatically shut down across my entire network. Similarly, blacklisting IPs that send email viruses.

5 - database assets. our organization has a thousand little nooks and crannies holding one form of database or another. What's in each one? how do we coordinate the data across them? nobody knows.

6 - backup, restore, archival. We support disaster recovery backup which works well (fingers crossed, everything i have wanted to restore has always been there) but requires a moderate amount of hand-holding. Concurrently, there is some sense that the organization should be archiving information and hope that somehow my backup tapes will fix that problem with no effort on anyone's part (except mine).

7 - printing. Not really a big problem today, but periodically we have had issues with users accidentally reconfiguring the ip addresses of printers, breaking printing for everyone else. bad software design allows this. In a former life i've had an interesting time trying to handle printer accounting.

8 - a simple problem with no easy fix. Eudora, POP, and "leave mail on server." we tell people to leave mail on the server so that they're protected against data loss on their PCs. IMAP's not a good solution for disconnected operation, which is important (at least not with Eudora, which is entrenched at our site). unfortunately eudora occasionally decides it hasn't seen the mailbox and re-downloads hundreds or thousands of messages.

9 - unix performance tuning & troubleshooting. Not a problem for me in my current situation but in the past i've had to deal with performance problems that were not well understood. this is perhaps the classic example of a case where better understood diagnostics and instrumentation would make a huge difference.

 
© 1996 - 2008 Internet2 - All rights reserved | Terms of Use | Privacy | Contact Us
1000 Oakbrook Drive, Suite 300, Ann Arbor MI 48104 | Phone: +1-734-913-4250