Collecting Data: How Much do We Really Need?

Many of us have had conversations in the past few weeks about data collection due to the reports about the NSA’s PRISM program, but ever since April and the bombings at the Boston Marathon, there has been an increased awareness of how much data is being collected about people in an attempt to track down suspects–or, increasingly, stop potential terrorist events before they happen. A recent Nova episode about the manhunt for the Boston bombers showed one such example of this at the New York Police Department. This program is called the Domain Awareness System at the New York Police Department, and consists of live footage from almost every surveillance camera in the New York City playing in one room, with the ability to search for features of individuals and even the ability to detect people acting suspiciously. Added to that a demonstration of cutting edge facial recognition software development at Carnegie Mellon University, and reality seems to be moving ever closer to science fiction movies.

Librarians focused on technical projects love to collect data and make decisions based on that data. We try hard to get data collection systems as close to real-time as possible, and work hard to make sure we are collecting as much data as possible and analyzing it as much as possible. The idea of a series of cameras to track in real-time exactly what our patrons are doing in the library in real-time might seem very tempting. But as librarians, we value the ability of our patrons to access information with as much privacy as possible–like all professions, we treat the interactions we have with our patrons (just as we would clients, patients, congregants, or sources) with care and discretion (See Item 3 of the Code of Ethics of the American Library Association). I will not address the national conversation about privacy versus security in this post–I want to address the issue of data collection right where most of us live on a daily basis inside analytics programs, spreadsheets, and server logs.

What kind of data do you collect?

Let’s start with an exercise. Write a list of all the statistical reports you are expected to provide your library–for most of us, it’s probably a very long list. Now, make a list of all the tools you use to collect the data for those statistics.

Here are a few potential examples:

Website visitors and user experience

  • Google Analytics or some other web analytics tool
  • Heat map tool
  • Server logs
  • Surveys

Electronic resource access reports

  • Electronic resources management application
  • Vendor reports (COUNTER and other)
  • Link resolver click-through report
  • Proxy server logs

The next step may require a little digging. For library created tools, do you have a privacy policy for this data? Has it gone through the Institutional Review Board? For third-party tools, is there a privacy policy? What are the terms or use or user license? (And how many people have ever read the entire terms of service?). We will return to this exercise in a moment.

How much is enough?

Think about with these tools what type of data you are collecting about your users. Some of it may be very private indeed. For instance, the heat map tool I’ve recently started using (Inspectlet) not only tracks clicks, but actually records sessions as patrons use the website. This is fascinating information–we had, for instance, one session that was a patron opening the library website, clicking the Facebook icon on the page, and coming back to the website nearly 7 hours later. It was fun to see that people really do visit the library’s Facebook page, but the question was immediately raised whether it was a visit from on campus. (It was–and wouldn’t have taken long to figure out if it was a staff machine and who was working that day and time). IP addresses from off campus are very easy to track, sometimes down to the block–again, easy enough to tie to an individual. We like to collect IP addresses for abusive or spamming behavior and block users based on IP address all the time. But what about in this case? During the screen recordings I can see exactly what the user types in the search boxes for the catalog and discovery system. Luckily, Inspectlet allows you to obscure the last two octets (which is legally required some places) of the IP address, so you can have less information collected. All similar tools should allow you the same ability.

Consider another case: proxy server logs. In the past when I did a lot of EZProxy troubleshooting, I found the logs extremely helpful in figuring out what went wrong when I got a report of trouble, particularly when it had occurred a day or two before. I could see the username, what time the user attempted to log in or succeeded in logging in, and which resources they accessed. Let’s say someone reported not being able to log in at midnight– I could check to see the failed logins at midnight, and then that username successfully logging in at 1:30 AM. That was a not infrequent occurrence, as usually people don’t think to write back and say they figured out what they did wrong! But I could also see everyone else’s logins and which articles they were reading, so I could tell (if I wanted) which grad students were keeping up with their readings or who was probably sharing their login with their friend or entire company. Where I currently work, we don’t keep the logs for more than a day, but I know a lot of people are out there holding on to EZProxy logs with the idea of doing “something” with them someday. Are you holding on to more than you really want to?

Let’s continue our exercise. Go through your list of tools, and make a list of all the potentially personally identifying information the tool collects, whether or not you use them. Are you surprised by anything? Make a plan to obscure unused pieces of data on a regular basis if it can’t be done automatically. Consider also what you can reasonably do with the data in your current job requirements, rather than future study possibilities. If you do think the data will be useful for a future study, make sure you are saving anonymized data sets unless it is absolutely necessary to have personally identifying information. In the latter case, you should clear your study in advance with your Institutional Review Board and follow a data management plan.

A privacy and data management policy should include at least these items:

  • A statement about what data you are collecting and why.
  • Where the data is stored and who has access to it.
  • A retention timeline.

F0r example, in the past I collected all virtual reference transaction logs for studying the effectiveness of a new set of virtual reference services. I knew I wanted at least a year’s worth of logs, and ideally three years to track changes over time. I was able to save the logs with anonymized IP addresses and once I had the data I needed I was able to delete the actual transcripts. The privacy policy described the process and where the data would be stored to ensure it was secure. In this case, I used the RUSA Guidelines for Implementing and Maintaining Virtual Reference Services as a guide to creating this policy. Read through the ALA Guidelines to Drafting a Library Privacy Policy for additional specific language and items you should include.

What we can do with data

In all this I don’t at all mean to imply that we shouldn’t be collecting this data. In both the examples I gave above, the data is extremely useful in improving the patron experience even while giving identifying details away. Not collecting data has trade-offs. For years, libraries have not retained a patron’s borrowing record to protect his or her privacy. But now patrons who want to have an online record of what they’ve borrowed from the library must use third-party services with (most likely) much less stringent privacy policies than libraries. By not keeping records of what users have checked out or read through databases, we are unable to provide them personalized automated suggestions about what to read next. Anyone who uses Amazon regularly knows that they will try to tempt you into purchases based on your past purchases or books you were reading the preview of–even if you would rather no one know that you were reading that book and certainly don’t want suggestions based on it popping up when you are doing a collection development project at work and are logged in on your personal account. In all the decisions we make about collecting or not collecting data, we have to consider trade-offs like these. Is the service so important that the benefits of collecting the data outweigh the risks? Or, is there another way to provide the service?

We can see some examples of this trade-off in two similar projects coming out of Harvard Library Labs. One, Library Hose, was a Twitter stream with the name of every book being checked out. The service ran for part of 2010, and has been suspended since September of 2010. In addition to daily tweet limits, this also was a potential privacy violation–even if it was a fun idea (this blog post has some discussion about it). A newer project takes the opposite approach–books that a patron thinks are “awesome” can be returned to the Awesome Box at the circulation desk and the information about the book is collected on the Awesome Box website. This is a great tweak to the earlier project, since this advertises material that’s now available rather than checked out, and people have to opt in by putting the item in the box.

In terms of personal recommendations, librarians have the advantage of being able to form close working relationships with faculty and students so they can make personal recommendations based on their knowledge of the person’s work and interests. But how to automate this without borrowing records? One example is a project that Ian Chan at California State University San Marcos has done to use student enrollment data to personalize the website based on a student’s field of study. (Slides). This provides a great deal of value for the students, who need to log in to check their course reserves and access articles from off campus anyway. This adds on top of that basic need a list of recommended resources for students, which they can choose to star as favorites.

Conclusion

In thinking about what type of data you collect, whether on purpose or accidentally, spend some time thinking about what is strictly necessary to accomplish the work that you need to do. If you don’t need a piece of data but can’t avoid collecting it (such as full IP addresses or usernames), make sure you have a privacy policy and retention schedule, and ensure that it is not accessible to more people than absolutely necessary.

Work to educate your patrons about privacy, particularly online privacy. ALA has a Choose Privacy Week, which is always the first week in May. The site for that has a number of resources you might want to consult in planning programming. Academic librarians may find it easiest to address college students in terms of their presence on social media when it comes to future job hunting, but this is just an opening to larger conversations about data. Make sure that when you ask patrons to use a third party service (such as a social network) or recommend a service (such as a book recommending site) that you make sure they are aware of what information they are sharing.

We all know that Google’s slogan is “Don’t be evil”, but it’s not always clear if they are sticking to that. Make sure that you are not being evil in your own data collection.


A Librarian’s Guide to OpenRefine

Academic librarians working in technical roles may rarely see stacks of books, but they doubtless see messy digital data on a daily basis. OpenRefine is an extremely useful tool for dealing with this data without sophisticated scripting skills and with a very low learning curve. Once you learn a few tricks with it, you may never need to force a student worker to copy and paste items onto Excel spreadsheets.

As this comparison by the creator of OpenRefine shows, the best use for the tool is to explore and transform data, and it allows you to make edits to many cells and rows at once while still seeing your data. This allows you to experiment and undo mistakes easily, which is a great advantage over databases or scripting where you can’t always see what’s happening or undo the typo you made. It’s also a lot faster than editing cell by cell like you would do with a spreadsheet.

Here’s an example of a project that I did in a spreadsheet and took hours, but then I redid in Google Refine and took a lot less time. One of the quickest things to do with OpenRefine is spot words or phrases that are almost the same, and possibly are the same thing. Recently I needed to turn a large export of data from the catalog into data that I could load into my institutional repository. There were only certain allowed values that could be used in the controlled vocabulary in the repository, so I had to modify the bibliographic data from the catalog (which was of course in more or less proper AACR2 style) to match the vocabularies available in the repository. The problem was that the data I had wasn’t consistent–there were multiple types of abbreviations, extra spaces, extra punctuation, and outright misspellings. An example is the History Department. I can look at “Department of History”, “Dep. of History”, “Dep of Hist.” and tell these are probably all referring to the same thing, but it’s difficult to predict those potential spellings. While I could deal with much of this with regular expressions in a text editor and find and replace in Excel, I kept running into additional problems that I couldn’t spot until I got an error. It took several attempts of loading the data until I cleared out all the errors.

In OpenRefine this is a much simpler task, since you can use it to find everything that probably is the same thing despite the slight differences in spelling, punctuation and spelling. So rather than trying to write a regular expression that accounts for all the differences between “Department of History”, “Dep. of History”, “Dep of Hist.”, you can find all the clusters of text that include those elements and change them all in one shot to “History”. I will have more detailed instructions on how to do this below.

Installation and Basics

OpenRefine was called, until last October, Google Refine, and while the content from the Google Refine page is being moved to the Open Refine page you should plan to look at both sites. Documentation and video tutorials refer interchangeably to Google Refine and OpenRefine. The official and current documentation is on the OpenRefine GitHub wiki. For specific questions you will probably want to use the OpenRefine Custom Search Engine, which brings together all the mix of documentation and tutorials on the web. OpenRefine is a web app that runs on your computer, so you don’t need an internet connection to run it. You can get the installation instructions on this page.

While you can jump in right away and get started playing around, it is well worth your time to watch the tutorial videos, which will cover the basic actions you need to take to start working with data. As I said, the learning curve is low, but not all of the commands will make sense until you see them in action. These videos will also give you an idea of what you might be able to do with a data set you have lying around. You may also want to browse the “recipes” on the OpenRefine site, as well search online for additional interesting things people have done. You will probably think of more ideas about what to try. The most important thing to know about OpenRefine is that you can undo anything, and go back to the beginning of the project before you messed up.

A basic understanding of the Google Refine Expression Language, or GREL will improve your ability to work with data. There isn’t a whole lot of detailed documentation, so you should feel free to experiment and see what happens when you try different functions. You will see from the tutorial videos the basics you need to know. Another essential tool is regular expressions. So much of the data you will be starting with is structured data (even if it’s not perfectly structured) that you will need to turn into something else. Regular expressions help you find patterns which you can use to break apart strings into something else. Spending a few minutes understanding regular expression syntax will save hours of inefficient find and replace. There are many tutorials–my go-to source is this one. The good news for librarians is that if you can construct a Dewey Decimal call number, you can construct a regular expression!

Some ideas for librarians

 

(A) Typos

Above I described how you would use OpenRefine to clean up messy and inconsistent catalog data. Here’s how to do it. Load in the data, and select “Text Facet” on the column in question. OpenRefine will show clusters of text that is similar and probably the same thing.

AcademicDept Text Facet

AcademicDept Text Facet

 

Click on Cluster to get a menu for working with multiple values. You can click on the “Merge” check box and then edit the text to whatever you need it to be. You can also edit each text cluster to be the correct text.

Cluster and Edit

Cluster and Edit

You can merge and re-cluster until you have fixed all the typos. Back on the first Text Facet, you can hover over any value to edit it. That way even if the automatic clustering misses some you can edit the errors, or change anything that is the same but you need to look different–for instance, change “Dept. of English” to just “English”.

(B) Bibliographies

The main thing that I have used OpenRefine for in my daily work is to change a bibliography in plain text into columns in a spreadsheet that I can run against an API. This was inspired by this article in the Code4Lib Journal: “Using XSLT and Google Scripts to Streamline Populating an Institutional Repository” by Stephen X. Flynn, Catalina Oyler, and Marsha Miles. I wanted to find a way to turn a text CV into something that would work with the SHERPA/RoMEO API, so that I could find out which past faculty publications could be posted in the institutional repository. Since CVs are lists of data presented in a structured format but with some inconsistencies, OpenRefine makes it very easy to present the data in a certain way as well as remove the inconsistencies, and then to extend the data with a web service. This is a very basic set of instructions for how to accomplish this.

The main thing to accomplish is to put the journal title in its own column. Here’s an example citation in APA format, in which I’ve colored all the “separator” punctuation in red:

Heller, M. (2011). A Review of “Strategic Planning for Social Media in Libraries”. Journal of Electronic Resources Librarianship, 24 (4), 339-240)

From the drop-down menu at the top of the column click on “Split into several columns…” from the “Edit Column” menu. You will get a menu like the one below. This example finds the opening parenthesis and removes that in creating a new column. The author’s name is its own column, and the rest of the text is in another column.

Spit into columns

 

The rest of the column works the same way–find the next text, punctuation, or spacing that indicates a separation. You can then rename the column to be something that makes sense. In the end, you will end up with something like this:

Split columns

When you have the journal titles separate, you may want to cluster the text and make sure that the journals have consistent titles or anything else to clean up the titles. Now you are a ready to build on this data with fetching data from a web service. The third video tutorial posted above will explain the basic idea, and this tutorial is also helpful. Use the pull-down menu at the top of the journal column to select “Edit column” and then “Add column by fetching URLs…”. You will get a box that will help you construct the right URL. You need to format your URL in the way required by SHERPA/RoMEO, and will need a free API key. For the purposes of this example, you can use 'http://www.sherpa.ac.uk/romeo/api29.php?ak=[YOUR API KEY HERE]&qtype=starts&jtitle=' + escape(value,'url'). Note that it will give you a preview to see if the URL is formatted in the way you expect. Give your column a name, and set the Throttle delay, which will keep the service from rejecting too many requests in a short time. I found 1000 worked fine.

refine7

After this runs, you will get a new column with the XML returned by SHERPA/RoMEO. You can use this to pull out anything you need, but for this example I want to get pre-archiving and post-archiving policies, as well as the conditions. A quick way to to this is to use the Googe Refine Expression Language parseHtml function. To use this, click on “Add column based on this column” from the “Edit Column” menu, and you will get a menu to fill in an expression.

refine91

In this example I use the code value.parseHtml().select("prearchiving")[0].htmlText(), which selects just the text from within the prearchving element. Conditions are a little different, since there are multiple conditions for each journal. In that case, you would use the following syntax (after join you can put whatever separator you want): forEach(value.parseHtml().select("condition"),v,v.htmlText()).join(". ")"

So in the end, you will end up with a neatly structured spreadsheet from your original CV with all the bibliographic information in its own column and the publisher conditions listed. You can imagine the possibilities for additional APIs to use–for instance, the WorldCat API could help you determine which faculty published books the library owns.

Once you find a set of actions that gets your desired result, you can save them for the future or to share with others. Click on Undo/Redo and then the Extract option. You will get a description of the actions you took, plus those actions represented in JSON.

refine13

Unselect the checkboxes next to any mistakes you made, and then copy and paste the text somewhere you can find it again. I have the full JSON for the example above in a Gist here. Make sure that if you save your JSON publicly you remove your personal API key! When you want to run the same recipe in the future, click on the Undo/Redo tab and then choose Apply. It will run through the steps for you. Note that if you have a mistake in your data you won’t catch it until it’s all finished, so make sure that you check the formatting of the data before running this script.

Learning More and Giving Back

Hopefully this quick tutorial got you excited about OpenRefine and thinking about what you can do. I encourage you to read through the list of External Resources to get additional ideas, some of which are library related. There is lots more to learn and lots of recipes you can create to share with the library community.

Have you used OpenRefine? Share how you’ve used it, and post your recipes.

 


Event Tracking with Google Analytics

In a previous post by Kelly Sattler and Joel Richard, we explored using web analytics to measure a website’s success. That post provides a clear high-level picture of how to create an analytics strategy by evaluating our users, web content, and goals. This post will explore a single topic in-depth; how to set up event tracking in Google Analytics.

Why Do We Need Event Tracking?

Finding solid figures to demonstrate a library’s value and make strategic decisions is a topic of increasing importance. It can be tough to stitch together the right information from a hodgepodge of third-party services; we rely on our ILSs to report circulation totals, our databases to report usage like full-text downloads, and our web analytics software to show visitor totals. But are pageviews and bounce rates the only meaningful measure of website success? Luckily, Google Analytics provides a way to track arbitrary events which occur on web pages. Event tracking lets us define what is important. Do we want to monitor how many people hover over a carousel of book covers, but only in the first second after the page has loaded? How about how many people first hover over the carousel, then the search box, but end up clicking a link in the footer? As long as we can imagine it and JavaScript has an event for it, we can track it.

How It Works

Many people are probably familiar with Google Analytics as a snippet of JavaScript pasted into their web pages. But Analytics also exposes some of its inner workings to manipulation. We can use the _gaq.push method to execute a “_trackEvent” method which sends information about our event back to Analytics. The basic structure of a call to _trackEvent is:

_gaq.push( [ '_trackEvent', 'the category of the event', 'the action performed', 'an optional label for the event', 'an optional integer value that quantifies something about the event' ] );

Looking at the array parameter of _gaq.push is telling: we should have an idea of what our event categories, actions, labels, and quantitative details will be before we go crazy adding tracking code to all our web pages. Once events are recorded, they cannot be deleted from Analytics. Developing a firm plan helps us to avoid the danger of switching the definition of our fields after we start collecting data.

We can be a bit creative with these fields. “Action” and “label” are just Google’s way of describing them; in reality, we can set up anything we like, using category->action->label as a triple-tiered hierarchy or as three independent variables.

Example: A List of Databases

Almost every library has a web page listing third-party databases, be they subscription or open access. This is a prime opportunity for event tracking because of the numerous external links. Default metrics can be misleading on this type of page. Bounce rate—the proportion of visitors who start on one of our pages and then immediately leave without viewing another page—is typically considered a negative metric; if a page has a high bounce rate, then visitors are not engaged with its content. But the purpose of a databases page is to get visitors to their research destinations as quickly as possible; bounce rate is a positive figure. Similarly, time spent on page is typically considered a positive sign of engagement, but on a databases page it’s more likely to indicate confusion or difficulty browsing. With event tracking, we can not only track which links were clicked but we can make it so database links don’t count towards bounce rate, giving us a more realistic picture of the page’s success.

One way of structuring “database” events is:

  • The top-level Category is “database”
  • The Action is the topical category, e.g. “Social Sciences”
  • The Label is the name of the database itself, e.g. “Academic Search Premier”

The final, quantitative piece could be the position of the database in the list or the number of seconds after page load it took the user to click its link. We could report some boolean value, such as whether the database is open access or highlighted in some way.

To implement this, we set up a JavaScript function which will be called every time one of our events occur. We will store some contextual information in variables, push that information to Google Analytics, and then delay the page’s navigation so the event has a chance to be recorded. Let’s walk through the code piece by piece:

function databaseTracking  ( event ) {
    var destination = $( this )[ 0 ].href,
        resource = $( this ).text(),
        // move up from <a> to parent element, then find the nearest preceding <h2> section header
        section = $( this ).parent().prevAll( 'h2' )[ 0 ].innerText,
        highlighted = $( this ).hasClass( 'highlighted' ) ? 1 : 0;

_gaq.push( [ '_trackEvent', 'database', resource, section, highlighted ] );

The top of our function just grabs information from the page. We’re using jQuery to make our lives easier, so all the $( this ) pieces of our code refer to the element that initiated the event. In our case, that’s the link pointing to an external database which the user just clicked. So we set destination to the link’s href attribute, resource to its text (e.g. the database’s name), section to the text inside the h2 element that labels a topical set of databases, and highlighted is a boolean value equal to 1 if the element has a class of “highlighted.” Next, this data is pushed into the _gaq array which is a queue of functions and their parameters that Analytics fires asynchronously. In this instance, we’re telling Analytics to run the _trackEvent function with the parameters that follow. Analytics will then record an event of type “database” with an action of [database name], a label of [section header], and a boolean representing whether it was highlighted or not.

setTimeout( function () {
    window.location = destination;
}, 200 );
event.preventDefault();
}

Next comes perhaps the least obvious piece: we prevent the default browser behavior from occurring, which in the case of a link is navigating away from our page, but then send the user to destination 200 milliseconds later anyways. The _trackEvent function now has a chance to fire; if we let the user follow the link right away it might not complete and our event would not be recorded.1

$( document ).ready(
    // target all anchors in list of databases
    $( '#databases-list a' ).on( 'click', databaseTracking )
);

There’s one last step; merely defining the databaseTracking function won’t cause it to execute when we want it to. JavaScript uses event handlers to execute certain functions based on various user actions, such as mousing over or clicking an element. Here, we add click event handlers to all <a> elements in the list of databases. Now whenever a user clicks a link in the databases list (which has a container with id “databases-list”), databaseTracking will run and send data to Google Analytics.

There is a demo on JSFiddle which uses the code above with some sample HTML. Every time you click a link, a pop-up shows you what the _gaq.push array looks like.

Though we used jQuery in our example, any JavaScript library can be used with event tracking.2 The procedure is always the same: write a function that gathers data to send back to Google Analytics and then add that function as a handler to an appropriate event, such as click or mouseover, on an element.

For another example, complete with code samples, see the article “Discovering Digital Library User Behavior with Google Analytics” in Code4Lib Journal. In it, Kirk Hess of the University of Illinois Urbana-Champaign details how to use event tracking to see how often external links are clicked or files are downloaded. While these events are particularly meaningful to digital libraries, most libraries offer PDFs or other documents online.

Some Ideas

The true power of Event Tracking is that it does not have to be limited to the mere clicking of hyperlinks; any interaction which JavaScript knows about can be recorded and categorized. Google’s own Event Tracking Guide uses the example of a video player, recording when control buttons like play, pause, and fast forward are activated. Here are some more obvious use cases for event tracking:

  • Track video plays on particular pages; we may already know how many views a video gets, but how many come from particular embedded instances of the video?
  • Clicking to external content, such as a vendor’s database or another library’s study materials.
  • If there is a print or “download to PDF” button on our site, we can track each time it’s clicked. Unfortunately, only Internet Explorer and Firefox (versions >= 6.0) have an onbeforeprint event in JavaScript which could be used to detect when a user hits the browser’s native print command.
  • Web applications are particularly suited to event tracking. Many modern web apps have a single page architecture, so while the user is constantly clicking and interacting within the app they rarely generate typical interaction statistics like pageviews or exits.

 

Notes
  1. There is a discussion on the best way to delay outbound links enough to record them as events. A Google Analytics support page condones the setTimeout approach. For other methods, there are threads on StackOverflow and various blog posts around the web. Alternatively, we could use the onmousedown event which fires slightly earlier than onclick but also might record false positives due to click-and-drag scrolling.
  2. Below is an attempt at rewriting the jQuery tracking code in pure JavaScript. It will only work in modern browsers because of use of querySelectorAll, parentElement, and previousElementSibling. Versions of Internet Explorer prior to 9 also use a unique attachEvent syntax for event handlers. Yes, there’s a reason people use libraries to do anything the least bit sophisticated with JavaScript.
function databaseTracking  ( event ) {
        var destination = event.target.href,
            resource = event.target.innerHTML,
            section = "none",
            highlighted = event.target.className.match( /highlighted/ ) ? 1: 0;

        // getting a parent element's nearest <h2> sibling is non-trivial without a library
        var currentSibling = event.target.parentElement;
        while ( currentSibling !== null ) {
            if ( currentSibling.tagName !== "H2" ) {
                currentSibling = currentSibling.previousElementSibling;
            }
            else {
                section = currentSibling.innerHTML;
                currentSibling = null;
            }
        }

        _gaq.push( [ '_trackEvent', 'database', resource, section, highlighted ] );

        // delay navigation to ensure event is recorded
        setTimeout( function () {
            window.location = destination;
        }, 200 );
        event.preventDefault();
    }

document.addEventListener( 'DOMContentLoaded', function () {
        var dbLinks = document.querySelectorAll( '#databases-list a' ),
            len = dbLinks.length;
        for ( i = 0; i < len; i++ ) {
            dbLinks[ i ].addEventListener( 'click', databaseTracking, false );
        }
    }, false );
Association of College & Research Libraries. (n.d.). ACRL Value of Academic Libraries. Retrieved January 12, 2013, from http://www.acrl.ala.org/value/
Event Tracking – Web Tracking (ga.js) – Google Analytics — Google Developers. (n.d.). Retrieved January 12, 2013, from https://developers.google.com/analytics/devguides/collection/gajs/eventTrackerGuide
Hess, K. (2012). Discovering Digital Library User Behavior with Google Analytics. The Code4Lib Journal, (17). Retrieved from http://journal.code4lib.org/articles/6942
Marek, K. (2011). Using Web Analytics in the Library a Library Technology Report. Chicago, IL: ALA Editions. Retrieved from http://public.eblib.com/EBLPublic/PublicView.do?ptiID=820360
Sattler, K., & Richard, J. (2012, October 30). Learning Web Analytics from the LITA 2012 National Forum Pre-conference. ACRL TechConnect Blog. Blog. Retrieved January 18, 2013, from http://acrl.ala.org/techconnect/?p=2133
Tracking Code: Event Tracking – Google Analytics — Google Developers. (n.d.). Retrieved January 12, 2013, from https://developers.google.com/analytics/devguides/collection/gajs/methods/gaJSApiEventTracking
window.onbeforeprint – Document Object Model (DOM) | MDN. (n.d.). Mozilla Developer Network. Retrieved January 12, 2013, from https://developer.mozilla.org/en-US/docs/DOM/window.onbeforeprint

An Elevator Pitch for File Naming Conventions

As a curator and a coder, I know it is essential to use naming conventions.  It is important to employ a consistent approach when naming digital files or software components such as modules or variables. However, when a student assistant asked me recently why it was important not to use spaces in our image file names, I struggled to come up with an answer.  “Because I said so,” while tempting, is not really an acceptable response.  Why, in fact, is this important?  For this blog entry, I set out to answer this question and to see if, along the way, I could develop an “elevator pitch” – a short spiel on the reasoning behind file naming conventions.

The Conventions

As a habit, I implore my assistants and anyone I work with on digital collections to adhere to the following when naming files:

  • Do not use spaces or special characters (other than “-” and “_”)
  • Use descriptive file names.  Descriptive file names include date information and keywords regarding the content of the file, within a reasonable length.
  • Date information is the following format: YYYY-MM-DD.

So, 2013-01-03-SmithSculptureOSU.jpg would be an appropriate file name, whereas Smith Jan 13.jpg would not.  But, are these modern practices?  Current versions of Windows, for example, will accept a wide variety of special characters and spaces in naming files, so why is it important to restrict the use of these characters in our work?

The Search Results

A quick Google search finds support for my assertions, though often for very specific cases of file management.  For example, the University of Oregon publishes recommendations on file naming for managing research data.  A similar guide is available from the University of Illinois library, but takes a much more technical, detailed stance on the format of file names for the purposes of the library’s digital content.

The Bentley Historical Library at University of Michigan, however, provides a general guide to digital file management very much in line with my practices: use descriptive directory and file names, avoid special characters and spaces.  In addition, this page discusses avoiding personal names in the directory structure and using consistent conventions to indicate the version of a file.

The Why – Dates

The Bentley page also provides links to a couple of sources which help answer the “why” question.  First, there is the ISO date standard (or, officially, “ISO 8601:2004: Data elements and interchange formats — Information interchange — Representation of dates and times”).  This standard dictates that dates be ordered from largest term to smallest term, so instead of the month-day-year we all wrote on our grade school papers, dates should take the form year-month-day.  Further, since we have passed into a new millennium, a four digit year is necessary.  This provides a consistent format to eliminate confusion, but also allows for file systems to sort the files appropriately.  For example, let’s look at the following three files:

1960-11-01_libraryPhoto.jpg
1977-01-05_libraryPhoto.jpg
2000-05-01_libraryPhoto.jpg

If we expressed those dates in another format, say, month-day-year, they would not be listed in chronological order in a file system sorting alphabetically.  Instead, we would see:

01-05-1977_libraryPhoto.jpg
05-01-2000_libraryPhoto.jpg
11-01-1960_libraryPhoto.jpg

This may not be a problem if you are visually searching through three files, but what if there were 100?  Now, if we only used a two digit year, we would see:

00-05-01_libraryPhoto.jpg
60-11-01_libraryPhoto.jpg
77-01-05_libraryPhoto.jpg

If we did not standardize the number of digits, we might see:

00-5-1_libraryPhoto.jpg
77-1-5_libraryPhoto.jpg
60-11-1_libraryPhoto.jpg

You can try this pretty easily on your own system.  Create three text files with the names above, sort the files by name and check the order.  Imagine the problems this might create for someone trying to quickly locate a file.

You might ask, why include the date at all, when dates are also maintained by the operating system?  There are many situations where the operating system dates are unreliable.  In cases where a file moves to a new drive or computer, for example, the Date Created may reflect the date the file moved to the new system, instead of the initial creation date.  Or, consider the case where a user opens a file to view it and the application changes the Date Modified, even though the file content was not modified.  Lastly, consider our earlier example of a photograph from 1960; the Date Created is likely to reflect the date of digitization.  In each of these examples, it would be helpful to include an additional date in the file name.

The Why – Descriptive File Names

So far we have digressed down a date-specific path.  What about our other conventions?  Why are those important?  Also linked from the Bentley Library and in the Google search results are YouTube videos created by the State Library of North Carolina which answer some of these questions.  The Inform U series on file naming has four parts, and is intended to help users manage personal files.  However, the rationale described in Part 1 for descriptive file names in personal file management also applies in our libraries.

First, we want to avoid the accidental overwriting of files.  Image files can provide a good example here: many cameras use the file naming convention of IMG_1234.jpg.  If this name is unique to the system, that works ok, but in a situation where multiple cameras or scanners are generating files for a digital collection, there is potential for problems.  It is better to batch re-name image files with a more descriptive name. (Tutorials on this can be found all over the web, such as the first item in this list on using Mac’s Automator program to re-name a batch of photos).

Second, we want to avoid the loss of files due to non-descriptive names.  While many operating systems will search the text content of files, naming files appropriately makes for more efficient access.  For example, consider mynotes.docx and 2012-01-05WebMeetingNotes.docx – which file’s contents are easier to ascertain?

I should note, however, that there are cases where non-descriptive file names are appropriate.  The use of a unique identifier as a filename is sometimes a necessary approach.  However, in those cases where you must use a non-descriptive filename, be sure that the file names are unique and in a descriptive directory structure.  Overall, it is important that others in the organization have the same ability to find and use the files we currently manage, long after we have moved on to another institution.

The Why – Special Characters & Spaces

We have now covered descriptive names and reasons for including dates, which leaves us with spaces and special characters to address.  Part 3 of the Inform U video series addresses this as well.  Special characters can designate special meaning to programming languages and operating systems, and might be misinterpreted when included in file names.  For instance, the $ character designates the beginning of variable names in the php programming language and the \ character designates file path locations in the Windows operating system.

Spaces may make things easier for humans to read, but systems generally do better without the spaces.  While operating systems attempt to handle spaces gracefully and generally do so, browsers and software programs are not consistent in how they handle spaces.  For example, consider a file stored in a digital repository system with a space in the file name.  The user downloads the file and their browser truncates the file name after the first space.  This equates to the loss of any descriptive information after the first space.  Plus, the file extension is also removed, which may make it harder for less tech savvy users to use a file.

The Pitch

That example leads us to the heart of the issue: we never know where our files are going to end up, especially files disseminated to the public.  Our content is useless if our users cannot open it due to a poorly formatted file name.  And, in the case of non-public files or our personal archives, it is essential to facilitate the discovery of items in the piles and piles of digital material accumulated every day.

So, do I have my elevator pitch?  I think so.  When asked about file naming standards in the future, I think I can safely reply with the following:  “It is impossible to accurately predict all of the situations in which a file might be used.  Therefore, in the interest of preserving access to digital files, we choose file name components that are least likely to cause a problem in any environment.  File names should provide context and be easily understood by humans and computers, now and in the future.”

And, like a good file name, that is much more effective and descriptive than, “Because I said so.”


What exactly does a fiber-based gigabit speed library app look like?

Mozilla and the National Science Foundation are sponsoring an open round of submissions for developers/app designers to create fiber-based gigabit apps. The detailed contest information is available over at Mozilla Ignite (https://mozillaignite.org/about/). Cash prizes to fund promising start up ideas are being award for a total of $500,000 over three rounds of submissions. Note: this is just the start, and these are seed projects to garner interest and momentum in the area. A recent hackathon in Chattanooga lists out what some coders are envisioning for this space:  http://colab.is/2012/hackers-think-big-at-hackanooga/

The video for the Mozilla Ignite Challenge 2012 on Vimeo is slightly helpful for examples of gigabit speed affordances.

If you’re still puzzled after the video, you are not alone. One of the reasons for the contest is that network designers are not quite sure what immense levels of processing in the network and next generation transfer speeds will really mean.

Consider that best case transfer speeds on a network are somewhere along the lines of 10 megabits per second. There are of course variances of this speed across your home line (it may hover closer to 5 mb/s), but this is pretty much the standard that average subscribers can expect. A gigabit speed rate transfers data at 100 times that speed, 1,000 megabits per second. When a whole community is able to achieve 1,000 megabits upstream and downstream, you basically have no need for things like “streaming” video – the data pipes are that massive.

One theory is that gigabit apps could provide public benefit, solve societal issues and usher in the next generation of the Internet. Think of gigabit speed as the difference between getting water (Internet) through a straw, and getting water (Internet) through a fire-hose. The practicality of this contest is to seed startups with ideas that will in some way impact healthcare (realtime health monitoring), the environment and energy challenges. The local Champaign Urbana municipal gigabit speed fiber cause is noble, as it will provide those in areas without access to broadband an awesome pipeline to the Internet. It is a intergovernmental partnership that aims to serve municipal needs, as well as pave the way for research and industry start-ups.

Here are some attributes that Mozilla Ignite Challenge lists as the possible affordances of fiber based gigabit speed apps:

  • Speed
  • Video
  • Big Data
  • Programmable networks

As I read about the Mozilla Ignite open challenge, I wondered about the possibilities for libraries and as a thought experiment I list out here some ideas for library services that live on gigabit speed networks:

* Consider the video data you could provide access to – in libraries that are stewarding any kind of video gigabit speeds would allow you to provide in-library viewing that has few bottlenecks. A fiber-based gigabit speed video viewing app of all library video content available at once. Think about viewing every video in your collection simultaneously. You could have them playing to multiple clusters (grid videos) in the library at multiple stations. Without streaming.

* Consider Sensors and sensor arrays and fiber. One idea that is promulgated for the use of fiber-based gigabit speed networks are the affordances to monitor large amounts of data in real time. The sensor networks that could be installed around the library facility could help to throttle energy consumption in real time, making the building more energy efficient and less costly to maintain. Such a facilities based app would impact savings on the facilities budgets.

* Consider collaborations among libraries with fiber affordances. Libraries that are linked by fiber-based gigabit speeds would be able to transfer large amounts of data in fractions of what it takes now. There are implications here for data curation infrastructure (http://smartech.gatech.edu/handle/1853/28513).

Another way to approach this problem is by asking: “What problem does gigabit speed networking solve?” One of the problems with the current web is the de facto  latency. Your browser needs to request a page from a server which is then sent back to your client. We’ve gotten so accustomed to this latency that we expect it, but what if pages didn’t have to be requested? What if servers didn’t have to send pages? What if a zero latency web meant we needed a new architecture to take advantage of data possibilities?

Is your library poised to take advantage of increased data transfer? What apps do you want to get funding for?

 


Rapid Prototyping Mobile Augmented Reality Applications

This Fall semester the Undergraduate Library at the University of Illinois at Urbana Champaign along with partners from the Graduate School of Library and Information Science and Computer Science graduate students with experience in programming OpenCV, will begin coding an open source mobile Augmented Reality (AR) app for deeper in-library engagement with both print and physical resources. The funding comes from a recently awarded IMLS Sparks! Grant. Our objectives include the following:

  • Create shelf recognition software for mobile devices that integrate print and digital resources into the on-site library experience and experiment with location based recommendation services.
  • Investigate the potential of creating a system that shows users how they are physically navigating an “idea space.”
  • Complete iterative rapid use studies of mobile software with library patrons and communicate results back to programming staff for incremental app design.
  • Work with our Library IT staff to identify skills and technical infrastructure needed in order to make AR an ongoing part of technology in libraries.
  • Make available the AR apps through the Library’s mobile labs experimental apps area (http://m.library.illinois.edu/labs.asp).

There are multiple problems with access to the variety of collections in our networked era (Lee, 2000) including their highly disparate nature (many vended platforms serving licensed library content) and their increasing intangibility (the move to massively electronic, or e-only access in libraries and information centers). Moreover, library collection developers are faced with the challenge of providing increased access to digital while still maintaining print. Lee (2000) argues for library research redefining library collections as information contexts.

This work will address the contextual information needs of library users while leveraging recent advances in mobile-networked technologies, experimenting with a way to increase access to collections of all types. The research team will deploy, test, and evaluate mobile applications that create novel “augmented book stacks.”

 

(a.) Subject of book stack set is identified by app index, and displayed on interface. (b.) Recommendations (e-book, digital items, or databases) brought onto interface in real-time. (c.) Popular books are indicated on title using circulation data from integrated library system historical circulation count (this can be a Z39.50 call or a pre-loaded circulation report database).

 

To create such applications, researchers will make use of video functionality that augment shelves of interest to a user in the library stacks inserting interactive graphics through the video feed of a phone onto the physical book stack environment in real-time. As a comparison to current state of the art mobile AR apps, like the ShelvAR app in development at Miami University, the proposed system does not require 2D tags as targets on books, but rather uses a combination of computer vision software code for feature detection and optical character recognition (OCR) software to parse the text of titles, call numbers, and subjects on the book stacks. A prototype project for OCR running in Android can be implemented following this tutorial. Our research group does not propose a replication of the state of the art, but will implement a system that pushes forward the state of the art in innovation for research and learning with AR in library stacks.

The project team will experiment with overlaying relevant resources from other parts of the library’s collection such as the library’s licensed set of databases, other Internet based resources, or books that are relevant but not shelved nearby. This augmentation will enhance the serendipitous discovery of books so that items relevant to a user’s location, but not shelved near her can be brought into the browsing experience; with this technology books that are checked out, or otherwise unavailable can still be made useful to a users information search. Our staff will experiment with system features that create “idea spaces” for the user, which will serve to help students and library users exploit previous discovery routes and spaces in the book stacks. The premise of “idea spaces” comes from an unspoken assumption among librarians: the intellectual organization of items in library collections are valuable constructs. By presenting graphical overlays of the subject areas of the collection, we make this assumption explicit and assert that as a user navigates the geographic spaces of a library collection, they are actually navigating intellectual spaces. With a user location is paired an idea (or set of related ideas), delivered in our proposed system with a graphical overly in the video feed. The user’s location, her context in the collection, is the query point for the idea spaces system.

This experiment will be valuable for all libraries that support print and digital resources. Underscoring this work is the overarching concern with making all library collections more accessible. Researchers will undertake rapid prototyping (as a test case for the chosen method see: Jones & Richey, 2000) of the augmented reality feature set in order to understand user preferences of mobile interfaces that best support location-based recommendations, and make all results of this experimentation including software code and computing workflows freely available. Such experimentation could lead to profound changes in the way people research and learn in library spaces.

Grant activities will begin in October 2012 and conclude September 2013. The evaluation plan for the grant is a systematic measurement of project outputs against the stated goals with the resulting evaluative outputs communicating what worked and was useful for library patrons in AR apps. By operationalizing a rapid evaluation of augmented reality services the research team hopes to identify the fail points for mobile services in this domain in addition to the most desired and useful feature set for all augmented reality systems in library book stacks.

Cited

Jones, T. & Richey, R.  (2000) “Rapid Prototyping methodology in action: a developmental study,” Educational Technology Research and Development 48, 63-80.

Lee, H. (2000), “What is a collection?” Journal of the American Society for Information Science, 51 (12) 1106-1113.

Suggested Reading

Regarding collocation objectives in library science see: Svenonius, E. (2000), The Intellectual Foundation of Information Organization, MIT Press, Cambridge, MA, pp.21-22

See also

Additional sample code for image processing with Android devices available here, courtesy of openly available lecture notes from Stanford’s Digital Image Processing Course EE368.

Forthcoming this October, a paper detailing additional AR use cases in library services: Hahn, J. (2012). Mobile augmented reality applications for library services. New Library World 113 (9/10)


2012 Eyeo Festival

Tuesday June 5th through Friday June 8th 2012, 500 creatives from numerous fields such as, computer science, art, design, data visualization, gathered together to listen, converse, and participate in the second Eyeo Festival. Held in Minneapolis, MN at the Walker Art Center, the event organizers created an environment of learning, exchange, exploration, and fun. There were various workshops with some top names leading the way. Thoughtfully curated presentations throughout the day complemented keynotes held nightly in party-like atmospheres: Eyeo was an event not to be missed. Ranging from independent artists to the highest levels of innovative companies, Eyeo offered inspiration on many levels.

Why the Eyeo Festival?
As I began to think about what I experienced at the Eyeo Festival, I struggled to express exactly how impactful this event was for me and those I connected with. In a way, Eyeo is like TED and in fact, many presenters have given TED talks. Eyeo has a more targeted focus on art, design, data, and creative code but it is also so much more than that. With an interactive art and sound installation, Zygotes, by Tangible Interaction kicking off the festival, though the video is a poor substitute to actually being there, it still evokes a sense of wonder and possibility. I strongly encourage anyone who is drawn to design, data, art, interaction or to express their creativity through code to attend this outstanding creative event and follow the incredible people that make up the impressive speaker list.

I went to the Eyeo Festival because I like to seek out what professionals in other fields are doing. I like staying curious and stretching outside my comfort zone in big ways, surrounding myself with people doing things I don’t understand, and then trying to understand them. Over the years I’ve been to many library conferences and there are some amazing events with excellent programming but they are, understandably, very library-centric. So, to challenge myself, I decided to go to a conference where there would be some content related to libraries but that was not a library conference. There are many individuals and professions outside of libraries that care about many of the same values and initiatives we do, that work on similar kinds of problems, and have the same drive to make the world a better place. So why not talk to them, ask questions, learn, and see what their perspective is? How do they approach and solve problems? What is their process in creating? What is their perspective and attitude? What kind of communities are they part of and work with?

I was greatly inspired by the group of librarians who have attended the SXSWi Festival which has grown further over the years. There are a now a rather large number of librarians speaking about and advocating for libraries in such an innovative and elevated platform. There is even a Facebook Group where professionals working in libraries, archives, and museums can connect with each other for encouragement, support, and collaborations in relation to SXSWi. Andrea Davis, Reference & Instruction Librarian at the Dudley Knox Library, Naval Postgraduate School in Monterey, CA, has been heavily involved in offering leadership in getting librarians to collaborate at SXSW. She states, “I’ve found it absolutely invigorating to get outside of library circles to learn from others, and to test the waters on what changes and effects are having on those not so intimately involved in libraries. Getting outside of library conferences keeps the blood flowing across tech, publishing, education. Insularity doesn’t do much for growth and learning.”

I’ve also been inspired by librarians who have been involved in the TED community, such as Janie Herman and her leadership with Princeton Public Library’s local TEDx in addition to her participation in the TEDxSummit in Doha, Qatar. Additionally, Chrystie Hill, the Community Relations Director at OCLC, has given more than one TedX talk about libraries. Seeing our library colleagues represent our profession in arenas broader than libraries is energizing and infectious.

Librarians having a seat at the table and a voice at two of the premier innovative gatherings in the world is powerful. This concept of librarians embedding themselves in communities outside of librarianship has been discussed in a number of articles including The Undergraduate Science Librarian and In the Library With the Lead Pipe.

Highlights
Rather than giving detailed comprehensive coverage of Eyeo, you’ll see a glimpse of a few presentations plus a number of resources so that you can see for yourself some of the amazing, collaborative work being done. Presenter’s names link to the full talk that you can watch for yourself. Because a lot of the work being done is interactive and participatory in some way, I encourage you to seek these projects out and interact with them. The organizers are in the midst of processing a lot of videos and putting them up on the Eyeo Festival Vimeo channel; I highly recommend watching them and checking back for more.

Ben Fry
Principal of Fathom, a Boston based design and data visualization firm, and co-initiator of the programming language Processing, Ben Fry’s work in data visualization and design is worth delving into. In his Eyeo presentation, 3 Things, the project that most stood out was the digitization project Fathom produced for GE: http://fathom.info/latest/category/ge. Years of annual reports were beautifully digitized and incorporated into an interactive web application they built from scratch. When faced with scanning issues, they built a tool that improved the scanned results.

Jer Thorp
Data artist in residence for the New York Times, and former geneticist, Jer Thorp’s range in working with data, art, and design is far and wide. Thorp is one of the few founders of the Eyeo Festival and in his presentation Near/Far he discussed several data visualization projects with the focus on storytelling. The two main pieces that stood out from Jer’s talk was his encouragement to dive into data visualization. He even included 10 year old, Theodore Zaballos’ handmade visualization of The Illiad which was rather impressive. The other piece that stood out was his focus on data visualization in context to location and people owning their own data versus a third party. This lead into the Open Paths project he showcased. He has also presented to librarians at the Canadian library conference, Access 2011.

Jen Lowe
Jen Lowe was by far the standout from all of the amazing Ignite Eyeo talks. She spoke about how people are intrinsically inspired by storytelling and the need for those working with data to focus on storytelling through the use of visualizing data and the story it tells. She works for the Open Knowledge Foundation in addition to running Datatelling and she has her library degree (she’s one of us!).

Jonathan Harris
Jonathan Harris gave one of the most personal and poignant presentations at Eyeo. In a retrospective of his work, Jonathan covered years of work interwoven with personal stories from his life. Jonathan is an artist and designer and his work life and personal life are rarely separated. Each project began with the initial intention and ended with a more critical inward examination from the artist. The presentation led to his most recent endeavor, the Cowbird project, where storytelling once again emerges strongly. In describing this project he focused on the idea that technology and software could be used for good, in a more human way, created by “social engineers” to build a community of storytellers. He describes Cowbird as “a community of storytellers working to build a public library of human experience.”

Additional people + projects to delve into:

Fernanda Viegas and Martin Wattenberg of the Google Big Picture data visualization group. Wind Map: http://hint.fm/wind/

Kyle McDonald: http://kylemcdonald.net/

Tahir Temphill: http://tahirhemphill.com/ and his latest work, Hip Hop Word Count: http://staplecrops.com/index.php/hiphop_wordcount/

Julian Oliver: http://julianoliver.com/

Nicholas Felton of Facebook: http://feltron.com/

Aaron Koblin of the Google Data Arts Group: http://www.aaronkoblin.com/ and their latest project with the Tate Modern: http://www.exquisiteforest.com/

Local Projects: http://localprojects.net/

Oblong Industries: http://oblong.com/

Eyebeam Art + Technology Center: http://eyebeam.org/

What can libraries get from the Eyeo Festival?

Libraries and library work are everywhere at this conference. That this eclectic group of creative people were often thinking about and producing work similar to librarians is thrilling. There is incredible potential for libraries to embrace some of the concepts and problems in many of the presentations I saw and conversations I was part of. There are multiple ways that libraries could learn from and perhaps participate in this broader community and work across fields.

People love libraries and these attendees were no exception. There were attendees from numerous private/corporate companies, newspapers, museums, government, libraries, and more. I was not the only library professional in attendance so I suspect those individuals might see the potential I see, which I also find really exciting. The drive behind every presenter and attendee was by far creativity in some form, the desire to make something, and communicate. The breadth of creativity and imagination that I saw reminded me of a quote from David Lankes in his keynote from the New England Library Association Annual Conference:

“What might kill our profession is not ebooks, Amazon or Google, but a lack of imagination. We must envision a bright future for librarians and the communities they serve, then fight to make that vision a reality. We need a new activist librarianship focused on solving the grand challenges of our communities. Without action we will kill librarianship.”

If librarianship is in need of more imagination and perhaps creativity too, there is a world of wonder out there in terms of resources to help us achieve this vision.

The Eyeo Festival is but one place where we can become inspired, learn, and dream and then bring that experience back to our libraries and inject our own imagination, ideas, experimentation, and creativity into the work we do. By doing the most creative, imaginative library work we can do will inspire our communities; I have seen it first hand. Eyeo personally taught me that I need to fail more, focus more, make more, and have more fun doing it all.


Personal Data Monitoring: Gamifying Yourself

The academic world has been talking about gamification of learning for some time now. The 2012 Horizon Report says gamification of learning will become mainstream in 2-3 years. Gamification taps into the innate human love of narrative and displaying accomplishments.  Anyone working through Code Year is personally familiar with the lure of the green bar that tells you how far you are to your next badge. In this post I want to address a related but slightly different topic: personal data capture and analytics.

Where does the library fit into this? One of the roles of the academic library is to help educate and facilitate the work of researchers. Effective research requires collecting a wide variety of relevant sources, reading them, and saving the relevant information for the future. The 2010 book Too Much to Know by Ann Blair describes the note taking and indexing habits taught to scholars in early modern Europe. Keeping a list of topics and sources was a major focus of scholars, and the resulting notes and indexes were published in their own right. Nowadays maintaining a list of sources is easier than ever with the many tools to collect and store references–but challenges remain due to the abundance of sources and pressure to publish, among others.

New Approaches and Tools in Personal Data Monitoring

Tracking one’s daily habits, reading lists and any other personal information is a very old human habit. Understanding what you are currently doing is the first step in creating better habits, and technology makes it easier to collect this data. Stephen Wolfram has been using technology to collect data about himself for nearly 25 years, and he posted some visual examples of this a few weeks ago. This includes items such as how many emails he’s sent and received, keystrokes made, and file types created. The Felton report, produced by Nick Felton, is a gorgeously designed book with personal data about himself and his family. But you don’t have to be a data or design whiz to collect and display personal information. For instance, to display your data in a visually compelling way you can use a service such as Daytum to create a personal data dashboard.

Hours of Activity recorded by Fitbit

In the realm of fitness and health, there are many products that will help capture, store, and analyze personal data.  Devices like the Fitbit now clip or strap to your body and count steps taken, floors climbed, and hours slept. Pedometers and GPS enabled sport watches help those trying to get in shape, but the new field of personal genetic monitoring and behavior analytics promise to make it possible to know very specific information about your health and understand potential future choices to make. 23andMe will map your personal genome and provide a portal for analyzing and understanding your genetic profile, allowing unprecedented ability to understand health. (Though there is doubt about whether this can accurately predict disease). For the behavioral and lifestyle aspects of health a new service called Ginger.io will help collect daily data for health professionals.

Number of readers recorded by Mendeley

Visual cues of graphs of accomplishments and green progress bars can be as helpful in keeping up research and monitoring one’s personal research habits just as much as they help in learning to code or training for a marathon. One such feature is the personal reading challenge on Goodreads,which lets you set a goal of how many books to read in the year, tracks what you’ve read, and lets you know how far behind or ahead you are at your current reading pace. Each book listed as in progress has a progress bar indicating how far along in the book you are. This is a simple but effective visual cue. Another popular tool, Mendeley, provides a convenient way to store PDFs and track references of all kinds. Built into this is a small green icon that indicates a reference is unread. You can sort references by read/unread–by marking a reference as “read”, the article appears as read in the Mendeley research database. Academia.eduprovides another way for scholars to share research papers and see how many readers they have.

Libraries and Personal Data

How can libraries facilitate this type of personal data monitoring and make it easy for researchers to keep track of what they have done and help them set goals for the future? Last November the Academic Book Writing Month (#acbowrimo) Twitter hashtag community spun off of National Novel Writing Month and challenged participants to complete the first draft of an academic book or other lengthy work. Participants tracked daily word counts and research goals and encouraged each other to complete the work. Librarians could work with researchers at their institutions, both faculty and students, on this type of peer encouragement. We already do this type of activity, but tools like Twitter make it easier to share with a community who might not come to the library often.

The recent furor over the change in Google’s privacy settings prompted many people to delete their Google search histories. Considered another way, this is a treasure trove of past interests to mine for a researcher trying to remember a book he or she was searching for some years ago—information that may not be available anywhere else. Librarians have certain professional ethics that make collecting and analyzing that type of personal data extremely complex. While we collect all types of data and avidly analyze it, we are careful to not keep track of what individuals read, borrowed, or asked of a librarian. This keeps individual researchers’ privacy safe; the major disadvantage is that it puts the onus on the individual to collect his own data. For people who might read hundreds or thousands of books and articles it can be a challenge to track all those individual items. Library catalogs are not great at facilitating this type of recordkeeping. Some next generation catalogs provide better listing and sharing features, but the user has to know how to add each item. Even if we can’t provide users a historical list of all items they’ve ever borrowed, we can help to educate them on how to create such lists. And in fact, unless we do help researchers create lists like this we lose out on an important piece of the historical record, such as the library borrowing history in Dissenting Academies Online.

Conclusion

What are some types of data we can ethically and legally share to help our researchers track personal data? We could share statistics on the average numbers of books checked out by students and faculty, articles downloaded, articles ordered, and other numbers that will help people understand where they fall along a continuum of research. Of course all libraries already collect this information–it’s just a matter of sharing it in a way that makes it easy to use. People want to collect and analyze data about what they do to help them reach their goals. Now that this is so easy we must consider how we can help them.

 

Works Cited
Blair, Ann. Too Much to Know : Managing Scholarly Information Before the Modern Age. New Haven: Yale University Press, 2010.

The Internet of Things meets the Library of Things

If the New York Times article The Internet Gets Physical is any indication, a sea change is approaching in just how smart everyday appliances are going to become. In theory, smart infrastructure will connect you and any appliance with an IP address to everything else.

For example: your car will talk to your phone. Appliances like your computer, and chair, and desk, interact over the web. Data will be passed via standard web technologies from every Internet-capable appliance. Everyday consumer electronics will be de facto networked to the Internet. The overall effect of these smart objects means the possibility of new library services and research environments.

According to the New Media Consortium’s 2012 Horizon Reportthe Internet of things is made possible by the IPV6 initiative, which essentially allows for the explosion of IP addresses across the globe and in your everyday life:

“with the advent of the New Internet Protocol, version six, those objects can now have an IP address, enabling their information store to be accessed in the same way a webcam might be, allowing real-time access to that information from anywhere… the implications are not yet clear, but it is evident that hundreds of billions of devices — from delicate lab equipment to refrigerators to next-generation home security systems — will soon be designed to take advantage of such connections…” (p.8)

What are the implications of the physical Internet in library settings?

Your smart phone interacts with the library building

The ways in which mobile apps can interact with the library building is not yet fully realized; for example, should your phone and the building be able to tell you things such as the interrelations among your physical presence and searches you’ve done on your home or office computer – or places you’ve driven past in your commute; or where you spend you leisure hours? Who makes the choices of suggesting resources to you based on the information in all of your life-sensors? Surely libraries will need filtering algorithms to control for allowable data referencing but where and how will we implement such recommender services?

Smart digital shelving units

What if a future digital shelf arrangement could be responsive to your personal preferences? For example – the library building’s digital smart infrastructure could respond to your circulation history or Internet searches in a way that shelves could promote content to you in real time. What would this recommendation look like for individual research, study, or browsing? And how would libraries be able to leverage such a service?

Digital library integration with physical objects

Smart objects allow libraries to consider how to make the virtual presence (databases, e-books, ILS data) physical. Many libraries would welcome a more physical instantiation of vended software products, since to a certain extent, users believe the library’s collection consists of only the things that they can see in the library.

The 2012 NMC Horizon Report indicates that smart objects are on the far term horizon. So it may be four to five years before they affect higher eduction — what is your plan for smart objects in the library environment?

Additional Resources

 

 


Getting Creative with Data Visualization: A Case Study

The Problem

At the NCSU Libraries, my colleagues and I in the Research and Information Services department do a fair bit of instruction, especially to classes from the university’s First Year Writing Program. Some new initiatives and outreach have significantly increased our instruction load, to the point where it was getting more difficult for us to effectively cover all the sessions that were requested due to practical limits of our schedules. By way of a solution, we wanted to train some of our grad assistants, who (at the time of this writing) are all library/information science students from that school down the road, in the dark arts of basic library instruction, to help spread the instruction burden out a little.

This would work great, but there’s a secondary problem: since UNC is a good 40 minute drive away, our grad assistants tend to have very rigid schedules, which are fixed well in advance — so we can’t just alter our grad assistants’ schedules on short notice to have them cover a class. Meanwhile, instruction scheduling is very haphazard, due to wide variation in how course slots are configured in the weekly calendar, so it can be hard to predict when instruction requests are likely to be scheduled. What we need is a technique to maximize the likelihood that a grad student’s standing schedule will overlap with the timing of instruction requests that we do get — before the requests come in.

Searching for a Solution – Bar graph-based analysis

The obvious solution was to try to figure out when during the day and week we provided library instruction most frequently. If we could figure this out, we could work with our grad students to get their schedules to coincide with these busy periods.

Luckily, we had some accrued data on our instructional activity from previous semesters. This seemed like the obvious starting point: look at when we taught previously and see what days and times of day were most popular. The data consisted of about 80 instruction sessions given over the course of the prior two semesters; data included date, day of week, session start time, and a few other tidbits. The data was basically scraped by hand from the instruction records we maintain for annual reports; my colleague Anne Burke did the dirty work of collecting and cleaning the data, as well as the initial analysis.

Anne’s first pass at analyzing the data was to look each day of the week in terms of courses taught in the morning, afternoon, and evening. A bit of hand-counting and spreadsheet magic produced this:

Instruction session count by day of week and time of day, Spring 2010-Fall 2011

This chart was somewhat helpful — certainly it’s clear that Monday, Tuesday and Thursday are our busiest days — but but it doesn’t provide a lot of clarity regarding times of day that are hot for instruction.  Other than noting that Friday evening is a dead time (hardly a huge insight), we don’t really get a lot of new information on how the instruction sessions shake out throughout the week.

Let’s Get Visual – Heatmap-based visualization

The chart above gets the fundamentals right — since we’re designing weekly schedules for our grad assistants, it’s clear that the relevant dimensions are days of week and times of day. However, there are basically two problems with the stacked bar chart approach: (1) The resolution of the stacked bars — morning, afternoon and evening — is too coarse. We need to get more granular if we’re really going to see the times that are popular for instruction; (2) The stacked bar chart slices just don’t fit our mental model of a week. If we’re going to solve a calendaring problem, doesn’t it make a lot of sense to create a visualization that looks like a calendar?

What we need is a matrix — something where one dimension is the day of the week and the other dimension is the hour of the day (with proportional spacing) — just like a weekly planner. Then for any given hour, we need something to represent how “popular” that time slot is for instruction. It’d be great if we had some way for closely clustered but non-overlapping sessions to contribute “weight” to each other, since it’s not guaranteed that instruction session timing will coincide precisely.

When I thought about analyzing the data in these terms, the concept of a heatmap immediately came to mind. A heatmap is a tool commonly used to look for areas of density in spatial data. It’s often used for mapping click or eye-tracking data on websites, to develop an understanding of the areas of interest on the website. A heatmap’s density modeling works like this: each data point is mapped in two dimensions and displayed graphically as a circular “blob” with a small halo effect; in closely-packed data, the blobs overlap. Areas of overlap are drawn with more intense color, and the intensity effect is cumulative, so the regions with the most intense color correspond to the areas of highest density of points.

I had heatmaps on the brain since I had just used them extensively to analyze user interaction patterns with a touchscreen application that I had recently developed.

Heatmap example from my previous work, tracking touches on a touchscreen interface. The heatmap is overlaid onto an image of the interface.

Part of my motivation for using heatmaps to solve our scheduling problem was simply to use the tools I had at hand: it seemed that it would be a simple matter to convert the instruction data into a form that would be amenable to modeling with the heatmap software I had access to. But in a lot of ways, a heatmap was a perfect tool: with a proper arrangement of the data, the heatmap’s ability to model intensity would highlight the parts of each day where the most instruction occurred, without having to worry too much about the precise timing of instruction sessions.

The heatmap generation tool that I had was a slightly modified version of the Heatmap PHP class from LabsMedia’s ClickHeat, an open-source tool for website click tracking. My modified version of the heatmap package takes in an array of (x,y) ordered pairs, corresponding to the locations of the data points to be mapped, and outputs a PNG file of the generated heatmap.

So here was the plan: I would convert each instruction session in the data to a set of (x,y) coordinates, with one coordinate representing day of week and the other representing time of day. Feeding these coordinates into the heatmap software would, I hoped, create five colorful swatches, one for each day of the week. The brightest regions in the swatches would represent the busiest times of the corresponding days.

Arbitrarily, I selected the y-coordinate to represent the day of the week. So I decided that any Monday slot, for instance, would be represented by some small (but nonzero) y-coordinate, with Tuesday represented by some greater y-coordinate, etc., with the intervals between consecutive days of the week equal. The main concern in assigning these y-coordinates was for the generated swatches to be far enough apart so that the heatmap “halo” around one day of the week would not interfere with its neighbors — we’re treating the days of the week independently. Then it was a simple matter of mapping time of day to the x-coordinate in a proportional manner. The graphic below shows the output from this process.

Raw heatmap of instruction data as generated by heatmap software

In this graphic, days of the week are represented by the horizontal rows of blobs, with Monday as the first row and Friday as the last. The leftmost extent of each row corresponds to approximately 8am, while the rightmost extent is about 7:30pm. The key in the upper left indicates (more or less) the number of overlapping data points in a given location. A bit of labeling helps to clarify things:

Heatmap of instruction data, labeled with days of week and approximate indications of time of day.

Right away, we get a good sense of the shape of the instruction week. This presentation reinforces the findings of the earlier chart: that Monday, Tuesday, and Thursday are busiest, and that Friday afternoon is basically dead. But we do see a few other interesting tidbits, which are visible to us specifically through the use of the heatmap:

  • Monday, Tuesday and Thursday aren’t just busy, they’re consistently well-trafficked throughout the day.
  • Friday is really quite slow throughout.
  • There are a few interesting hotspots scattered here and there, notably first thing in the morning on Tuesday.
  • Wednesday is quite sparse overall, except for two or three prominent afternoon/evening times.
  • There is a block of late afternoon-early evening time-slots that are consistently busy in the first half of the week.

Using this information, we can take a much more informed approach to scheduling our graduate students, and hopefully be able to maximize their availability for instruction sessions.

“Better than it was before. Better. Stronger. Faster.” – Open questions and areas for improvement

As a proof of concept, this approach to analyzing our instruction data for the purposes of setting student schedules seems quite promising. We used our findings to inform our scheduling of graduate students this semester, but it’s hard to know whether our findings can even be validated: since this is the first semester where we’re actively assigning instruction to our graduate students, there’s no data available to compare this semester against, with respect to amount of grad student instruction performed. Nevertheless, it seems clear that knowledge of popular instruction times is a good guideline for grad student scheduling for this purpose.

There’s also plenty of work to be done as far as data collection and analysis is concerned. In particular:

  • Data curation by hand is burdensome and inefficient. If we can automate the data collection process at all, we’ll be in a much better position to repeat this type of analysis in future semesters.
  • The current data analysis completely ignores class session length, which is an important factor for scheduling (class times vary between 50 and 100 minutes). This data is recorded in our instruction spreadsheet, but there aren’t any set guidelines on how it’s entered — librarians entering their instruction data tend to round to the nearest quarter- or half-hour increment at their own preference, so a 50-minute class is sometimes listed as “.75 hours” and other times as “1 hour”. More accurate and consistent session time recording would allow us to reliably use session length in our analysis.
  • To make the best use of session length in the analysis, I’ll have to learn a little bit more about PHP’s image generation libraries. The current approach is basically a plug-in adaptation of ClickHeat’s existing Heatmap class, which is only designed to handle “point” data. To modify the code to treat sessions as little line segments corresponding to their duration (rather than points that correspond to their start times) would require using image processing methods that are currently beyond my ken.
  • A bit better knowledge of the image libraries would also allow me to add automatic labeling to the output file. You’ll notice the prominent use of “ish” to describe the hours dimension of the labeled heatmap above: this is because I had neither the inclination nor the patience to count pixels to determine where exactly the labels should go. With better knowledge of the image libraries I would be able to add graphical text labels directly to the generated heatmap, at precisely the correct location.

There are other fundamental questions that may be worth answering — or at least experimenting against — as well. For instance, in this analysis I used data about actual instruction sessions performed. But when lecturers request library sessions, they include two or three “preferred” dates, of which we pick the one that fits our librarian and room schedules best. For the purposes of analysis, it’s not entirely clear whether we should use the actual instruction data, which takes into account real space limitations but is also skewed by librarian availability; or whether we should look strictly at what lecturers are requesting, which might allow us to schedule our grad students in a way that could accommodate lecturers’ first choices better, but which might run us up against the library’s space limitations. In previous semesters, we didn’t store the data on the requests we received; this semester we’re doing that, so I’ll likely perform two analyses, one based on our actual instruction and one based on requests. Some insight might be gained by comparing the results of the two analyses, but it’s unclear what exactly the outcome will be.

Finally, it’s hard to predict how long-term trends in the data will affect our ability to plan for future semesters. It’s unclear whether prior semesters are a good indicator of future semesters, especially as lecturers move into and out of the First Year Writing Program, the source of the vast majority of our requests. We’ll get a better sense of this, presumably, as we perform more frequent analyses — it would also make sense to examine each semester separately to look for trends in instruction scheduling from semester to semester.

In any case, there’s plenty of experimenting left to do and plenty of improvements that we could make.

Reflections and Lessons Learned

There’s a few big points that I took away from this experience. A big one is simply that sometimes the right approach is a totally unexpected one. You can gain some interesting insights if you don’t limit yourself to the tools that are most familiar for a particular problem. Don’t be afraid to throw data at the wall and see what sticks.

Really, what we did in this case is not so different from creating separate histograms of instruction times for each day of the week, and comparing the histograms to each other. But using heatmaps gave us a couple of advantages over traditional histograms: first, our bin size is essentially infinitely narrow; because of the proximity effects of the heatmap calculation, nearby but non-overlapping data points still contribute weight to each other without us having to define bins as in a regular histogram. Second, histograms are typically drawn in two dimensions, which would make comparing them against each other rather a nuisance. In this case, our separate heatmap graphics for each day of the week are basically one-dimensional, which allows us to compare them side by side with little fuss. This technique could be used for side-by-side examinations of multiple sets of any histogram-like data for quick and intuitive at-a-glance comparison.

In particular, it’s important to remember — especially if your familiarity with heatmaps is already firmly entrenched in a spatial mapping context — that data doesn’t have to be spatial in order to be analyzed with heatmaps. This is really just an extension of the idea of graphical data analysis: A heatmap is just another way to look at arbitrary data represented graphically, not so different from a bar graph, pie chart, or scatter plot. Anything that you can express in two dimensions (or even just one), and where questions of frequency, density, proximity, etc., are relevant, can be analyzed using the heatmap approach.

A final point: as an analysis tool, the heatmap is really about getting a feel for how the data lies in aggregate, rather than getting a precise sense of where each point falls. Since the halo effect of a data point extends some distance away from the point, the limits of influence of that point on the final image are a bit fuzzy. If precision analysis is necessary, then heatmaps are not the right tool.

About our guest author: Andreas Orphanides is Librarian for Digital Technologies and Learning in the Research and Information Services department at NCSU Libraries. He holds an MSLS from UNC-Chapel Hill and a BA in mathematics from Oberlin College. His interests include instructional design, user interface development, devising technological solutions to problems in library instruction and public services, long walks on the beach, and kittens.