MW-E2ED Conference Call
June 6, 2007

*Attendees*

Chas DiFatta, Carnegie Mellon University (chair)
Mark Poepping, Carnegie Mellon University
Matt Zekauskas, Internet2
Michael Gettes, Internet2
Renee Frost, Internet2
Steve Olshansky, Internet2
Dean Woodbeck, Internet2 (scribe)

**Action Items**
[AI]{ALL} For the next call’s agenda, outline potential use-cases considering:
1) developing/changing policies on the use of data and anonymization requirements
2) in light of the policies, outlining how researchers can use this information
3) in light of the policies, determining how Internet2 can use the information (could Internet2 use non-anonymized information? If so, with what restrictions, if any?)

[AI]{Chas} Add to next call’s agenda: a discussion about how people are using the email diagnostics in EDDY.

**Anonymization Discussion**
Chas opened the call with a discussion about sharing diagnostic data among educational institutions, Internet2, and the Internet observatory. During a session at the spring Member Meeting, concern was expressed about the use of perfSONAR and about some data not being anonymized. The concern centered on the data that Internet2 captures and retains. Specific concerns include the perception that people can see what is happening on a network and, for example, determine whether a particular attack is successful.

For research purposes, Internet2 collects data and will release anything that does not appear to raise privacy concerns. This contrasts with the European approach, which applies more restrictive policies to releasing data. The Abilene/Internet2 approach is to collect and store as much data as possible to accommodate future research. As data sets become more sensitive, it will become more necessary to know the potential uses of the data in order to create effective policies.

The data flows captured by Internet2 are anonymized by zeroing the last 11 bits of IP addresses. A researcher wishing to use the data contacts Internet2 describing his or her project and specifying the needed information. After approval, the researcher then receives an rsync password and can access the 11-bit anonymized flow data.
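
As a rough illustration of the 11-bit policy described above, the following Python sketch zeroes the low-order bits of an IPv4 address. The function name `zero_low_bits` and the sample addresses (taken from the documentation ranges) are assumptions for illustration and do not represent Internet2's actual tooling.

```python
import ipaddress


def zero_low_bits(addr: str, bits: int = 11) -> str:
    """Return the IPv4 address with its lowest `bits` bits set to zero."""
    value = int(ipaddress.IPv4Address(addr))
    # Build a 32-bit mask with the low `bits` bits cleared.
    mask = (0xFFFFFFFF << bits) & 0xFFFFFFFF
    return str(ipaddress.IPv4Address(value & mask))


if __name__ == "__main__":
    for sample in ("192.0.2.77", "198.51.100.200"):
        print(sample, "->", zero_low_bits(sample))
```

With 11 bits zeroed, every address in the same block of 2,048 addresses maps to the same value, so individual hosts cannot be told apart. Passing `bits=8` shows the less aggressive variant raised in the discussion.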

Internally, Internet2 would like to keep more information and revisit the 11-bit anonymization policy. There are questions, such as how long flows run, that can’t be answered now. For most research it isn’t necessary to know the sources and destinations of the flows; what matters is whether certain types of flows exist and what their characteristics are.

From a research point of view, it would be advantageous to anonymize only the last 8 bits, but that raises privacy and security issues. So far, the community seems satisfied with the 11-bit policy.

The next point of discussion was whether it would be advantageous to anonymize all bits while preserving assigned prefixes. This would provide the ability to correlate flows, which would be a benefit to researchers. A researcher could tell whether flows are coming from the same location without knowing what that location is. The downside is that an unscrupulous researcher could potentially re-map the data and discover the location.
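
The notes do not name a specific scheme for this. One well-known family is prefix-preserving anonymization in the style of Crypto-PAn, where each output bit is the input bit XORed with a keyed pseudorandom function of the preceding bits. The Python sketch below follows that idea but substitutes HMAC-SHA256 for the block cipher; the function names and the key are hypothetical, and the code is meant as an illustration only, not a vetted implementation.

```python
import hashlib
import hmac
import ipaddress


def anonymize_prefix_preserving(addr: str, key: bytes) -> str:
    """Anonymize all 32 bits so that shared prefixes stay shared."""
    value = int(ipaddress.IPv4Address(addr))
    result = 0
    for i in range(32):
        # The i most significant bits of the ORIGINAL address (0 when i == 0).
        prefix = value >> (32 - i)
        # Include the prefix length so prefixes of different lengths never collide.
        msg = bytes([i]) + prefix.to_bytes(4, "big")
        flip = hmac.new(key, msg, hashlib.sha256).digest()[0] & 1
        bit = (value >> (31 - i)) & 1
        result = (result << 1) | (bit ^ flip)
    return str(ipaddress.IPv4Address(result))


def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits two IPv4 addresses have in common."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


if __name__ == "__main__":
    key = b"example-key-not-for-production"  # hypothetical secret
    a, b = "192.0.2.10", "192.0.2.200"
    xa, xb = anonymize_prefix_preserving(a, key), anonymize_prefix_preserving(b, key)
    print(common_prefix_len(a, b), common_prefix_len(xa, xb))
```

Two addresses that share a k-bit prefix before anonymization share exactly a k-bit prefix afterward, which is what lets a researcher correlate flows. It is also what makes the re-mapping risk noted above possible: anyone who learns the mapping for a few known addresses learns something about their neighbors.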

Internet2 has considered storing raw flows but doing anonymization before releasing data. Some university security professionals may not be comfortable with this, given that they would lose direct control of the data.

One technique may be to provide this week’s data, for example, with assigned prefixes that are persistent, but change from week to week. This would provide the consistency that researchers need, but still keep a large measure of privacy and security. The question comes down to the risk vs. the value of the analysis and research that could be accomplished.
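
A minimal sketch of this weekly rotation, assuming the pseudonyms are produced with a keyed hash: a per-week key is derived from a master secret and the ISO week label, so a prefix keeps the same pseudonym within a week but gets a different one the next week. The names `weekly_key` and `pseudonym`, the master secret, and the 12-character label length are all hypothetical choices for illustration.

```python
import datetime
import hashlib
import hmac


def weekly_key(master_secret: bytes, day: datetime.date) -> bytes:
    """Derive a key that is constant within one ISO week."""
    iso_year, iso_week, _ = day.isocalendar()
    label = f"{iso_year}-W{iso_week:02d}".encode()
    return hmac.new(master_secret, label, hashlib.sha256).digest()


def pseudonym(prefix: str, master_secret: bytes, day: datetime.date) -> str:
    """Map a prefix to a short label that is stable only for that week."""
    key = weekly_key(master_secret, day)
    return hmac.new(key, prefix.encode(), hashlib.sha256).hexdigest()[:12]


if __name__ == "__main__":
    secret = b"example-master-secret"  # hypothetical; would be kept private
    for day in (datetime.date(2007, 6, 4), datetime.date(2007, 6, 8),
                datetime.date(2007, 6, 11)):
        print(day, pseudonym("192.0.2.0/24", secret, day))
```

The first two dates fall in the same ISO week and print the same label; the third falls in the following week and prints a different one. A researcher can therefore correlate flows within a week's data set but cannot carry a discovered mapping over to later weeks.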

Internet2 has conducted some interviews, and security researchers are interested in having more information available to them in the flow data. The richer data would allow for the testing of new algorithms, for example. One possibility might be to allow researchers to use, for example, a content analysis tool, while allowing institutions to opt out (so that their data would not be included). Another possibility may be to provide access to richer data, but embed the anonymization in the data so it cannot be used again.

There are a number of usage scenarios for researchers, depending on the specificity of the data needed (flows between two specific points or looking for a specific port or some specific piece of the flow). In another case, a researcher may insert a probe and see data from his or her own institution. This could be useful for looking at such data as time stamps, jitter, losses or paths, or for determining whether someone is spoofing an address from that researcher’s institution.

Aggregated data can also be useful if a researcher is interested in the flows, but needs to see just a slice of the traffic.

EDDY may be useful for storing these flows and information. An agent could grab all of the netflows and feed them to EDDY, where they could be anonymized, stored, and made available to researchers.

Some researchers are also interested in getting an EDDY feed from Abilene. The benefit would be the ability to see flows in real time. If a researcher or analyst sees an anomaly, he or she can see what is happening elsewhere on the network.

**Next Steps**

A suggested next step is to outline some use-cases and sketch out some potential system architecture (related to the use-cases). Considerations after outlining the use-cases include:

1) developing/changing policies on the use of data and anonymization
2) in light of the policies, outlining how researchers can use this information
3) in light of the policies, determining how Internet2 can use the information (could Internet2 use non-anonymized information?)

Other possibilities:
1) consider how perfSONAR outputs can be brought into EDDY
2) consider grabbing SNMP stats from routers and using “top-talkers” to determine the top-talking links or the top-talking interfaces
3) add a way for EDDY to support different aggregation methods (for example, providing the ability to determine the maximum utilization over the lifetime of the network on a particular link)
4) provide/use the ability to link network data to applications (email, for example)
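
As a rough sketch of items 2 and 3, the Python below takes periodic octet-counter samples per interface (the kind of data an SNMP poll of interface counters would yield), ranks interfaces by total traffic, and records the peak rate seen on each. The sample layout, the function names, and the interface names are assumptions for illustration; the sketch does not use an SNMP library or any EDDY API, and it simply skips intervals where the counter goes backward.

```python
def interface_stats(samples):
    """samples: dict of interface -> list of (time_s, octet_counter), sorted by time."""
    stats = {}
    for name, points in samples.items():
        total_octets = 0
        peak_bps = 0.0
        for (t0, c0), (t1, c1) in zip(points, points[1:]):
            delta = c1 - c0
            if delta < 0 or t1 <= t0:
                continue  # counter wrap/reset or bad timestamps: skip this interval
            total_octets += delta
            peak_bps = max(peak_bps, delta * 8 / (t1 - t0))
        stats[name] = {"octets": total_octets, "peak_bps": peak_bps}
    return stats


def top_talkers(stats, n=3):
    """Return the n interface names that carried the most octets."""
    return sorted(stats, key=lambda name: stats[name]["octets"], reverse=True)[:n]


if __name__ == "__main__":
    samples = {  # hypothetical 5-minute polls
        "if-a": [(0, 0), (300, 3_000_000), (600, 4_500_000)],
        "if-b": [(0, 100), (300, 600_100), (600, 9_600_100)],
    }
    stats = interface_stats(samples)
    print(top_talkers(stats, n=1), stats)
```

Dividing `peak_bps` by the link speed would give the maximum utilization mentioned in item 3; the link speed is left out here because the notes do not say where it would come from.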

**Next EDDY Release**
This month’s EDDY release includes a mechanism for adding arbitrary events without having to do any Java programming. As an example, a user can enter a general event (“It is 2 am and we are being spammed”) that may provide a starting point for analysis or make the data more meaningful. The mechanism is a template that generates an EDDY CER. The user doesn’t need to fill in the details in the rest of the CER and doesn’t need extensive knowledge of Java.

In addition, CMU has a new project, called Sensor Andrew, that will sense the entire campus. Sensor Andrew has many embedded systems, but no way to track logs or data. The project is going to use EDDY as a place to store this information and create CERs.

Agenda items for the next call include outlining the potential use cases and a discussion about how people are using the email diagnostics in EDDY.