Collecting Data: How Much do We Really Need?

Many of us have had conversations in the past few weeks about data collection due to the reports about the NSA’s PRISM program, but ever since April and the bombings at the Boston Marathon, there has been an increased awareness of how much data is being collected about people in an attempt to track down suspects–or, increasingly, stop potential terrorist events before they happen. A recent Nova episode about the manhunt for the Boston bombers showed one such example of this at the New York Police Department. This program is called the Domain Awareness System at the New York Police Department, and consists of live footage from almost every surveillance camera in the New York City playing in one room, with the ability to search for features of individuals and even the ability to detect people acting suspiciously. Added to that a demonstration of cutting edge facial recognition software development at Carnegie Mellon University, and reality seems to be moving ever closer to science fiction movies.

Librarians focused on technical projects love to collect data and make decisions based on that data. We try hard to get data collection systems as close to real-time as possible, and work hard to make sure we are collecting as much data as possible and analyzing it as much as possible. The idea of a series of cameras to track in real-time exactly what our patrons are doing in the library in real-time might seem very tempting. But as librarians, we value the ability of our patrons to access information with as much privacy as possible–like all professions, we treat the interactions we have with our patrons (just as we would clients, patients, congregants, or sources) with care and discretion (See Item 3 of the Code of Ethics of the American Library Association). I will not address the national conversation about privacy versus security in this post–I want to address the issue of data collection right where most of us live on a daily basis inside analytics programs, spreadsheets, and server logs.

What kind of data do you collect?

Let’s start with an exercise. Write a list of all the statistical reports you are expected to provide your library–for most of us, it’s probably a very long list. Now, make a list of all the tools you use to collect the data for those statistics.

Here are a few potential examples:

Website visitors and user experience

  • Google Analytics or some other web analytics tool
  • Heat map tool
  • Server logs
  • Surveys

Electronic resource access reports

  • Electronic resources management application
  • Vendor reports (COUNTER and other)
  • Link resolver click-through report
  • Proxy server logs

The next step may require a little digging. For library created tools, do you have a privacy policy for this data? Has it gone through the Institutional Review Board? For third-party tools, is there a privacy policy? What are the terms or use or user license? (And how many people have ever read the entire terms of service?). We will return to this exercise in a moment.

How much is enough?

Think about with these tools what type of data you are collecting about your users. Some of it may be very private indeed. For instance, the heat map tool I’ve recently started using (Inspectlet) not only tracks clicks, but actually records sessions as patrons use the website. This is fascinating information–we had, for instance, one session that was a patron opening the library website, clicking the Facebook icon on the page, and coming back to the website nearly 7 hours later. It was fun to see that people really do visit the library’s Facebook page, but the question was immediately raised whether it was a visit from on campus. (It was–and wouldn’t have taken long to figure out if it was a staff machine and who was working that day and time). IP addresses from off campus are very easy to track, sometimes down to the block–again, easy enough to tie to an individual. We like to collect IP addresses for abusive or spamming behavior and block users based on IP address all the time. But what about in this case? During the screen recordings I can see exactly what the user types in the search boxes for the catalog and discovery system. Luckily, Inspectlet allows you to obscure the last two octets (which is legally required some places) of the IP address, so you can have less information collected. All similar tools should allow you the same ability.

Consider another case: proxy server logs. In the past when I did a lot of EZProxy troubleshooting, I found the logs extremely helpful in figuring out what went wrong when I got a report of trouble, particularly when it had occurred a day or two before. I could see the username, what time the user attempted to log in or succeeded in logging in, and which resources they accessed. Let’s say someone reported not being able to log in at midnight– I could check to see the failed logins at midnight, and then that username successfully logging in at 1:30 AM. That was a not infrequent occurrence, as usually people don’t think to write back and say they figured out what they did wrong! But I could also see everyone else’s logins and which articles they were reading, so I could tell (if I wanted) which grad students were keeping up with their readings or who was probably sharing their login with their friend or entire company. Where I currently work, we don’t keep the logs for more than a day, but I know a lot of people are out there holding on to EZProxy logs with the idea of doing “something” with them someday. Are you holding on to more than you really want to?

Let’s continue our exercise. Go through your list of tools, and make a list of all the potentially personally identifying information the tool collects, whether or not you use them. Are you surprised by anything? Make a plan to obscure unused pieces of data on a regular basis if it can’t be done automatically. Consider also what you can reasonably do with the data in your current job requirements, rather than future study possibilities. If you do think the data will be useful for a future study, make sure you are saving anonymized data sets unless it is absolutely necessary to have personally identifying information. In the latter case, you should clear your study in advance with your Institutional Review Board and follow a data management plan.

A privacy and data management policy should include at least these items:

  • A statement about what data you are collecting and why.
  • Where the data is stored and who has access to it.
  • A retention timeline.

F0r example, in the past I collected all virtual reference transaction logs for studying the effectiveness of a new set of virtual reference services. I knew I wanted at least a year’s worth of logs, and ideally three years to track changes over time. I was able to save the logs with anonymized IP addresses and once I had the data I needed I was able to delete the actual transcripts. The privacy policy described the process and where the data would be stored to ensure it was secure. In this case, I used the RUSA Guidelines for Implementing and Maintaining Virtual Reference Services as a guide to creating this policy. Read through the ALA Guidelines to Drafting a Library Privacy Policy for additional specific language and items you should include.

What we can do with data

In all this I don’t at all mean to imply that we shouldn’t be collecting this data. In both the examples I gave above, the data is extremely useful in improving the patron experience even while giving identifying details away. Not collecting data has trade-offs. For years, libraries have not retained a patron’s borrowing record to protect his or her privacy. But now patrons who want to have an online record of what they’ve borrowed from the library must use third-party services with (most likely) much less stringent privacy policies than libraries. By not keeping records of what users have checked out or read through databases, we are unable to provide them personalized automated suggestions about what to read next. Anyone who uses Amazon regularly knows that they will try to tempt you into purchases based on your past purchases or books you were reading the preview of–even if you would rather no one know that you were reading that book and certainly don’t want suggestions based on it popping up when you are doing a collection development project at work and are logged in on your personal account. In all the decisions we make about collecting or not collecting data, we have to consider trade-offs like these. Is the service so important that the benefits of collecting the data outweigh the risks? Or, is there another way to provide the service?

We can see some examples of this trade-off in two similar projects coming out of Harvard Library Labs. One, Library Hose, was a Twitter stream with the name of every book being checked out. The service ran for part of 2010, and has been suspended since September of 2010. In addition to daily tweet limits, this also was a potential privacy violation–even if it was a fun idea (this blog post has some discussion about it). A newer project takes the opposite approach–books that a patron thinks are “awesome” can be returned to the Awesome Box at the circulation desk and the information about the book is collected on the Awesome Box website. This is a great tweak to the earlier project, since this advertises material that’s now available rather than checked out, and people have to opt in by putting the item in the box.

In terms of personal recommendations, librarians have the advantage of being able to form close working relationships with faculty and students so they can make personal recommendations based on their knowledge of the person’s work and interests. But how to automate this without borrowing records? One example is a project that Ian Chan at California State University San Marcos has done to use student enrollment data to personalize the website based on a student’s field of study. (Slides). This provides a great deal of value for the students, who need to log in to check their course reserves and access articles from off campus anyway. This adds on top of that basic need a list of recommended resources for students, which they can choose to star as favorites.

Conclusion

In thinking about what type of data you collect, whether on purpose or accidentally, spend some time thinking about what is strictly necessary to accomplish the work that you need to do. If you don’t need a piece of data but can’t avoid collecting it (such as full IP addresses or usernames), make sure you have a privacy policy and retention schedule, and ensure that it is not accessible to more people than absolutely necessary.

Work to educate your patrons about privacy, particularly online privacy. ALA has a Choose Privacy Week, which is always the first week in May. The site for that has a number of resources you might want to consult in planning programming. Academic librarians may find it easiest to address college students in terms of their presence on social media when it comes to future job hunting, but this is just an opening to larger conversations about data. Make sure that when you ask patrons to use a third party service (such as a social network) or recommend a service (such as a book recommending site) that you make sure they are aware of what information they are sharing.

We all know that Google’s slogan is “Don’t be evil”, but it’s not always clear if they are sticking to that. Make sure that you are not being evil in your own data collection.



Leave a Reply