Academic librarians working in technical roles may rarely see stacks of books, but they doubtless see messy digital data on a daily basis. OpenRefine is an extremely useful tool for dealing with this data without sophisticated scripting skills and with a very low learning curve. Once you learn a few tricks with it, you may never need to force a student worker to copy and paste items onto Excel spreadsheets.
As this comparison by the creator of OpenRefine shows, the best use for the tool is to explore and transform data, and it allows you to make edits to many cells and rows at once while still seeing your data. This allows you to experiment and undo mistakes easily, which is a great advantage over databases or scripting where you can’t always see what’s happening or undo the typo you made. It’s also a lot faster than editing cell by cell like you would do with a spreadsheet.
Here’s an example of a project that took me hours in a spreadsheet, and far less time when I redid it in Google Refine. One of the quickest things to do with OpenRefine is to spot words or phrases that are almost the same and possibly refer to the same thing. Recently I needed to turn a large export of data from the catalog into data that I could load into my institutional repository. Only certain values were allowed in the repository’s controlled vocabulary, so I had to modify the bibliographic data from the catalog (which was of course in more or less proper AACR2 style) to match the vocabularies available in the repository. The problem was that the data I had wasn’t consistent–there were multiple types of abbreviations, extra spaces, extra punctuation, and outright misspellings. An example is the History Department. I can look at “Department of History”, “Dep. of History”, “Dep of Hist.” and tell these are probably all referring to the same thing, but it’s difficult to predict those potential spellings. While I could deal with much of this with regular expressions in a text editor and find and replace in Excel, I kept running into additional problems that I couldn’t spot until I got an error. It took several attempts at loading the data before I cleared out all the errors.
In OpenRefine this is a much simpler task, since you can use it to find everything that probably is the same thing despite the slight differences in spelling, punctuation, and spacing. So rather than trying to write a regular expression that accounts for all the differences between “Department of History”, “Dep. of History”, “Dep of Hist.”, you can find all the clusters of text that include those elements and change them all in one shot to “History”. I will have more detailed instructions on how to do this below.
Installation and Basics
OpenRefine was called, until last October, Google Refine, and while the content from the Google Refine page is being moved to the OpenRefine page you should plan to look at both sites. Documentation and video tutorials refer interchangeably to Google Refine and OpenRefine. The official and current documentation is on the OpenRefine GitHub wiki. For specific questions you will probably want to use the OpenRefine Custom Search Engine, which brings together the mix of documentation and tutorials on the web. OpenRefine is a web app that runs on your computer, so you don’t need an internet connection to run it. You can get the installation instructions on this page.
While you can jump in right away and get started playing around, it is well worth your time to watch the tutorial videos, which will cover the basic actions you need to take to start working with data. As I said, the learning curve is low, but not all of the commands will make sense until you see them in action. These videos will also give you an idea of what you might be able to do with a data set you have lying around. You may also want to browse the “recipes” on the OpenRefine site, as well as search online for additional interesting things people have done. You will probably think of more ideas about what to try. The most important thing to know about OpenRefine is that you can undo anything, and go back to the beginning of the project before you messed up.
A basic understanding of the Google Refine Expression Language, or GREL, will improve your ability to work with data. There isn’t a whole lot of detailed documentation, so you should feel free to experiment and see what happens when you try different functions. You will see from the tutorial videos the basics you need to know. Another essential tool is regular expressions. So much of the data you will be starting with is structured data (even if it’s not perfectly structured) that you will need to turn into something else. Regular expressions help you find patterns which you can use to break apart strings into something else. Spending a few minutes understanding regular expression syntax will save hours of inefficient find and replace. There are many tutorials–my go-to source is this one. The good news for librarians is that if you can construct a Dewey Decimal call number, you can construct a regular expression!
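To make the call number comparison concrete, here is a minimal Python sketch of a regular expression that breaks a Dewey-style call number into parts. The pattern assumes only that a Dewey class is three digits with an optional decimal extension. The group names and the sample call numbers are made up for illustration, and whatever follows the class number is captured loosely, not parsed.

```python
import re

# Three digits, then an optional decimal extension. Anything after that
# (for example a Cutter number) is captured as-is in "rest".
DEWEY = re.compile(r"^(?P<cls>\d{3})(?:\.(?P<decimal>\d+))?\s*(?P<rest>.*)$")


def split_call_number(call_number):
    """Return the parts of a Dewey-style call number, or None if it does not match."""
    match = DEWEY.match(call_number.strip())
    if match is None:
        return None
    return match.groupdict()


# Hypothetical call numbers, used only to show the pattern at work.
for example in ["025.04 H477", "813 M527", "QA76.9"]:
    print(example, "->", split_call_number(example))
```

The same pattern syntax carries over to the regular expression options in OpenRefine, although the surrounding function calls are different there.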
Some ideas for librarians
Above I described how you would use OpenRefine to clean up messy and inconsistent catalog data. Here’s how to do it. Load in the data, and select “Text Facet” on the column in question. OpenRefine will show clusters of text that are similar and probably the same thing.
Click on Cluster to get a menu for working with multiple values. You can click on the “Merge” check box and then edit the text to whatever you need it to be. You can also edit each text cluster to be the correct text.
You can merge and re-cluster until you have fixed all the typos. Back on the first Text Facet, you can hover over any value to edit it. That way even if the automatic clustering misses some you can edit the errors, or change anything that is the same but you need to look different–for instance, change “Dept. of English” to just “English”.
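To give a feel for what key-based clustering does, here is a minimal Python sketch of a "fingerprint" comparison: trim, lowercase, strip punctuation, then sort and de-duplicate the words. It is a simplified illustration and not OpenRefine's own code, and the sample values are made up. Note that a fingerprint alone will not merge "Dept." with "Department"; abbreviations like that are where the hand edits in the facet described above come in.

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key: trimmed, lowercased, punctuation-free, words sorted."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a fingerprint; keep only groups with real variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical department names with inconsistent punctuation, case, and order.
names = ["Dept. of History", "dept of history ", "History, Dept. of", "Department of History"]
for group in cluster(names):
    print(group)
```

The first three values collapse to the same key and form one cluster, while "Department of History" stays on its own until someone edits it by hand.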
The main thing that I have used OpenRefine for in my daily work is to change a bibliography in plain text into columns in a spreadsheet that I can run against an API. This was inspired by this article in the Code4Lib Journal: “Using XSLT and Google Scripts to Streamline Populating an Institutional Repository” by Stephen X. Flynn, Catalina Oyler, and Marsha Miles. I wanted to find a way to turn a text CV into something that would work with the SHERPA/RoMEO API, so that I could find out which past faculty publications could be posted in the institutional repository. Since CVs are lists of data presented in a structured format but with some inconsistencies, OpenRefine makes it very easy to present the data in a certain way as well as remove the inconsistencies, and then to extend the data with a web service. This is a very basic set of instructions for how to accomplish this.
The main thing to accomplish is to put the journal title in its own column. Here’s an example citation in APA format, in which I’ve colored all the “separator” punctuation in red:
Heller, M. (2011). A Review of “Strategic Planning for Social Media in Libraries”. Journal of Electronic Resources Librarianship, 24 (4), 339-240)
From the drop-down menu at the top of the column, open the “Edit Column” menu and click on “Split into several columns…”. You will get a menu like the one below. This example finds the opening parenthesis and removes it while creating a new column. The author’s name is in its own column, and the rest of the text is in another column.
The rest of the column works the same way–find the next text, punctuation, or spacing that indicates a separation. You can then rename the column to be something that makes sense. In the end, you will end up with something like this:
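The same separator-by-separator splitting can also be sketched as a single regular expression in Python. This is only an illustration of the idea, tried against the one example citation above. The group names are my own, and real CVs will have citations that do not fit the pattern, so treat non-matching rows as items to check by hand.

```python
import re

# Each group ends at one of the "separator" punctuation marks in the citation:
# the parentheses around the year, the period after the title, and the commas
# around the volume and issue.
CITATION_PATTERN = re.compile(
    r"^(?P<author>.+?)\s*\((?P<year>\d{4})\)\.\s*"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>[^,]+),\s*"
    r"(?P<volume>\d+)\s*\((?P<issue>\d+)\),\s*"
    r"(?P<pages>[\d-]+)"
)


def parse_citation(citation):
    """Split an APA-style journal citation into named columns, or return None."""
    match = CITATION_PATTERN.match(citation.strip())
    return match.groupdict() if match else None


example = (
    "Heller, M. (2011). A Review of “Strategic Planning for Social Media in "
    "Libraries”. Journal of Electronic Resources Librarianship, 24 (4), 339-240)"
)
columns = parse_citation(example)
print(columns["journal"])
```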
When you have the journal titles in their own column, you may want to cluster the text to make sure that the journal titles are consistent, and clean up anything else in them. Now you are ready to build on this data by fetching data from a web service. The third video tutorial posted above will explain the basic idea, and this tutorial is also helpful. Use the pull-down menu at the top of the journal column to select “Edit column” and then “Add column by fetching URLs…”. You will get a box that will help you construct the right URL. You need to format your URL in the way required by SHERPA/RoMEO, and will need a free API key. For the purposes of this example, you can use
'http://www.sherpa.ac.uk/romeo/api29.php?ak=[YOUR API KEY HERE]&qtype=starts&jtitle=' + escape(value,'url'). Note that it will give you a preview to see if the URL is formatted in the way you expect. Give your column a name, and set the Throttle delay, which will keep the service from rejecting too many requests in a short time. I found 1000 worked fine.
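For readers who want to see the same request pattern outside OpenRefine, here is a rough Python sketch. The base URL and query parameters are copied from the expression above. The function names are hypothetical, `quote_plus` only approximates what `escape(value,'url')` does (the exact encoding of spaces may differ), and the delay assumes the throttle value of 1000 is in milliseconds. Nothing here checks whether the service still answers at that address.

```python
import time
import urllib.request
from urllib.parse import quote_plus

BASE_URL = "http://www.sherpa.ac.uk/romeo/api29.php"


def build_url(journal_title, api_key):
    """Build the query URL for one journal title, URL-encoding the title."""
    return BASE_URL + "?ak=" + api_key + "&qtype=starts&jtitle=" + quote_plus(journal_title)


def fetch_all(journal_titles, api_key, delay_ms=1000):
    """Fetch one response per title, pausing between requests (the throttle)."""
    results = {}
    for title in journal_titles:
        with urllib.request.urlopen(build_url(title, api_key)) as response:
            results[title] = response.read().decode("utf-8", errors="replace")
        time.sleep(delay_ms / 1000)  # wait before sending the next request
    return results


# Preview the URL before sending anything, much like the preview in OpenRefine.
print(build_url("Journal of Electronic Resources Librarianship", "[YOUR API KEY HERE]"))
```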
After this runs, you will get a new column with the XML returned by SHERPA/RoMEO. You can use this to pull out anything you need, but for this example I want to get pre-archiving and post-archiving policies, as well as the conditions. A quick way to do this is to use the Google Refine Expression Language parseHtml function. To use this, click on “Add column based on this column” from the “Edit Column” menu, and you will get a menu to fill in an expression.
In this example I use the code
value.parseHtml().select("prearchiving").htmlText(), which selects just the text from within the prearchiving element. Conditions are a little different, since there are multiple conditions for each journal. In that case, you would use the following syntax (after join you can put whatever separator you want):
So in the end, you will end up with a neatly structured spreadsheet from your original CV with all the bibliographic information in its own column and the publisher conditions listed. You can imagine the possibilities for additional APIs to use–for instance, the WorldCat API could help you determine which faculty published books the library owns.
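The two kinds of extraction described above (one element's text, and several condition elements joined with a separator) can be sketched in Python with the standard library XML parser. The sample XML is a simplified, made-up stand-in for a real response. Only the prearchiving element name comes from the text above; `postarchiving` and `condition` are assumed names, so check them against the XML you actually receive.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified response used only for illustration.
SAMPLE_XML = """<romeoapi>
  <publishers>
    <publisher>
      <preprints><prearchiving>can</prearchiving></preprints>
      <postprints><postarchiving>restricted</postarchiving></postprints>
      <conditions>
        <condition>Publisher copyright must be acknowledged</condition>
        <condition>Must link to publisher version</condition>
      </conditions>
    </publisher>
  </publishers>
</romeoapi>"""


def extract_policy(xml_text, separator="; "):
    """Pull the archiving policies and the joined conditions out of one response."""
    root = ET.fromstring(xml_text)
    pre = root.findtext(".//prearchiving", default="")
    post = root.findtext(".//postarchiving", default="")
    # There can be several conditions, so collect them all and join them.
    conditions = [c.text.strip() for c in root.iter("condition") if c.text]
    return {
        "prearchiving": pre.strip(),
        "postarchiving": post.strip(),
        "conditions": separator.join(conditions),
    }


print(extract_policy(SAMPLE_XML))
```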
Once you find a set of actions that gets your desired result, you can save them for the future or to share with others. Click on Undo/Redo and then the Extract option. You will get a description of the actions you took, plus those actions represented in JSON.
Unselect the checkboxes next to any mistakes you made, and then copy and paste the text somewhere you can find it again. I have the full JSON for the example above in a Gist here. Make sure that if you save your JSON publicly you remove your personal API key! When you want to run the same recipe in the future, click on the Undo/Redo tab and then choose Apply. It will run through the steps for you. Note that if you have a mistake in your data you won’t catch it until it’s all finished, so make sure that you check the formatting of the data before running this script.
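One way to avoid publishing your key by accident is to scrub the extracted JSON before saving it anywhere public. The Python sketch below is a minimal illustration: it replaces whatever follows `ak=` with a placeholder. The sample history and its field names are made up for the example and are not a guaranteed match for what OpenRefine exports, so look over the result before sharing it.

```python
import json
import re


def scrub_api_key(history_json, placeholder="[YOUR API KEY HERE]"):
    """Replace the value after 'ak=' (up to the next &, quote, or backslash)."""
    return re.sub(
        r"(ak=)[^&'\"\\]+",
        lambda match: match.group(1) + placeholder,
        history_json,
    )


# Hypothetical one-step history; the field names are illustrative only.
sample_history = json.dumps(
    [
        {
            "description": "Create column romeo by fetching URLs",
            "urlExpression": "grel:'http://www.sherpa.ac.uk/romeo/api29.php"
            "?ak=SECRET123&qtype=starts&jtitle=' + escape(value,'url')",
        }
    ],
    indent=2,
)

cleaned = scrub_api_key(sample_history)
print(cleaned)
```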
Learning More and Giving Back
Hopefully this quick tutorial got you excited about OpenRefine and thinking about what you can do. I encourage you to read through the list of External Resources to get additional ideas, some of which are library related. There is lots more to learn and lots of recipes you can create to share with the library community.
Have you used OpenRefine? Share how you’ve used it, and post your recipes.
- OpenRefine on Twitter. If you post something about OpenRefine on Twitter, they will usually retweet you as a way to showcase what people are doing.
- OpenRefine Google Group
The academic world has been talking about gamification of learning for some time now. The 2012 Horizon Report says gamification of learning will become mainstream in 2-3 years. Gamification taps into the innate human love of narrative and displaying accomplishments. Anyone working through Code Year is personally familiar with the lure of the green bar that tells you how far you are from your next badge. In this post I want to address a related but slightly different topic: personal data capture and analytics.
Where does the library fit into this? One of the roles of the academic library is to help educate and facilitate the work of researchers. Effective research requires collecting a wide variety of relevant sources, reading them, and saving the relevant information for the future. The 2010 book Too Much to Know by Ann Blair describes the note taking and indexing habits taught to scholars in early modern Europe. Keeping a list of topics and sources was a major focus of scholars, and the resulting notes and indexes were published in their own right. Nowadays maintaining a list of sources is easier than ever with the many tools to collect and store references–but challenges remain due to the abundance of sources and pressure to publish, among others.
New Approaches and Tools in Personal Data Monitoring
Tracking one’s daily habits, reading lists and any other personal information is a very old human habit. Understanding what you are currently doing is the first step in creating better habits, and technology makes it easier to collect this data. Stephen Wolfram has been using technology to collect data about himself for nearly 25 years, and he posted some visual examples of this a few weeks ago. This includes items such as how many emails he’s sent and received, keystrokes made, and file types created. The Felton report, produced by Nick Felton, is a gorgeously designed book with personal data about himself and his family. But you don’t have to be a data or design whiz to collect and display personal information. For instance, to display your data in a visually compelling way you can use a service such as Daytum to create a personal data dashboard.
In the realm of fitness and health, there are many products that will help capture, store, and analyze personal data. Devices like the Fitbit now clip or strap to your body and count steps taken, floors climbed, and hours slept. Pedometers and GPS-enabled sport watches help those trying to get in shape, but the new fields of personal genetic monitoring and behavior analytics promise to make it possible to know very specific information about your health and understand potential future choices to make. 23andMe will map your personal genome and provide a portal for analyzing and understanding your genetic profile, allowing unprecedented ability to understand health (though there is doubt about whether this can accurately predict disease). For the behavioral and lifestyle aspects of health, a new service called Ginger.io will help collect daily data for health professionals.
Visual cues such as graphs of accomplishments and green progress bars can be just as helpful in keeping up research and monitoring one’s personal research habits as they are in learning to code or training for a marathon. One such feature is the personal reading challenge on Goodreads, which lets you set a goal of how many books to read in the year, tracks what you’ve read, and lets you know how far behind or ahead you are at your current reading pace. Each book listed as in progress has a progress bar indicating how far along in the book you are. This is a simple but effective visual cue. Another popular tool, Mendeley, provides a convenient way to store PDFs and track references of all kinds. Built into this is a small green icon that indicates a reference is unread. You can sort references by read/unread–by marking a reference as “read”, the article appears as read in the Mendeley research database. Academia.edu provides another way for scholars to share research papers and see how many readers they have.
Libraries and Personal Data
How can libraries facilitate this type of personal data monitoring and make it easy for researchers to keep track of what they have done and help them set goals for the future? Last November the Academic Book Writing Month (#acbowrimo) Twitter hashtag community spun off of National Novel Writing Month and challenged participants to complete the first draft of an academic book or other lengthy work. Participants tracked daily word counts and research goals and encouraged each other to complete the work. Librarians could work with researchers at their institutions, both faculty and students, on this type of peer encouragement. We already do this type of activity, but tools like Twitter make it easier to share with a community who might not come to the library often.
The recent furor over the change in Google’s privacy settings prompted many people to delete their Google search histories. Considered another way, this is a treasure trove of past interests to mine for a researcher trying to remember a book he or she was searching for some years ago—information that may not be available anywhere else. Librarians have certain professional ethics that make collecting and analyzing that type of personal data extremely complex. While we collect all types of data and avidly analyze it, we are careful to not keep track of what individuals read, borrowed, or asked of a librarian. This keeps individual researchers’ privacy safe; the major disadvantage is that it puts the onus on the individual to collect his own data. For people who might read hundreds or thousands of books and articles it can be a challenge to track all those individual items. Library catalogs are not great at facilitating this type of recordkeeping. Some next generation catalogs provide better listing and sharing features, but the user has to know how to add each item. Even if we can’t provide users a historical list of all items they’ve ever borrowed, we can help to educate them on how to create such lists. And in fact, unless we do help researchers create lists like this we lose out on an important piece of the historical record, such as the library borrowing history in Dissenting Academies Online.
What are some types of data we can ethically and legally share to help our researchers track personal data? We could share statistics on the average numbers of books checked out by students and faculty, articles downloaded, articles ordered, and other numbers that will help people understand where they fall along a continuum of research. Of course all libraries already collect this information–it’s just a matter of sharing it in a way that makes it easy to use. People want to collect and analyze data about what they do to help them reach their goals. Now that this is so easy we must consider how we can help them.
Between static analytics and a usability lab
Would you like an even more intimate glimpse into what users are actually doing on your site, instead of what you (or the library web committee) think they are doing? There are several easy-to-use web-based analytics services like ClickTale, userfly, Loop11, Crazy Egg, Inspectlet, or Optimalworkshop. These online usability services offer various ways to track what users are doing as they actually navigate your pages, all without setting up a usability lab, recruiting participants, or introducing the artificiality and anxiety of an observed user session. ClickTale and userfly record user actions that you can view later as a video; most services offer heatmaps of where users actually click on your site; some offer “eye tracking” maps based on mouse movement.
- Most services allow you to sign up for one free account for a limited amount of data or time.
- Most allow you to specify which pages or sections of your site you want to test at a time.
- Many have monthly pricing plans that would allow for snapshots of user activity in various months of the year without having to pay for an entire year’s service.
We’re testing Inspectlet at the moment. I like it because the free account offers the two services I’m most interested in: periodic video captures of the designated site and heat maps of actual clicks. The code is a snippet added to the web pages of interest. The screen captures are fascinating — watch below as an off-campus user searches the library home page for the correct place to do an author search in the library catalog. I view it as a bit of a cautionary illustration about providing a lot of options. Follow the yellow “spotlight” to track the user’s mouse movements. As a contrast, I watched video after video of clearly experienced users taking less than two seconds to hit the “Ebsco Academic Search” link. Be prepared; watching a series of videos of unassisted users can dismantle your or your web committee’s cherished notions about how users navigate your site.
This is a Jing video of a screen capture — the actual screen captures are much sharper, and I have zoomed out for illustrative purposes. The free Inspectlet account does not support downloads of capture videos, but Rachit Gupta, the founder, wrote me that in the coming few weeks, Inspectlet is releasing a feature to allow downloads for paid accounts. Paid accounts also have access to real time analytics, so libraries would be able to get a montage of what’s happening in the lobby as it is happening. Imagine being able to walk out and announce a “pop-up library workshop” on using the library catalog effectively after seeing the twentieth person fumble through the OPAC.
Another thing I like about Inspectlet is the ability to anonymize the IP addresses in the individual screen captures to protect an individual patron’s privacy.
The chart below compares the features of a few of the most widely used web-based analytics tools.
education discounts available (call)
|Terms (with a brief privacy explanation)||Basic $10/month; free 10 captures a month. Can cancel subscription at any time.|
If you are using one of these services, or a similar service, what have you learned about your users?
Testing new designs or alternative designs – widely used web-based usability tools
After you’ve watched your users and determined where there are problems or where you would like to try an alternative design, these services offer easy ways to test new designs and gather feedback from users without setting up a local usability lab.
|Service||Features||Terms||Pricing|
|OpenHallway||Create test scenarios and analyze results||Terms of Service||Basic: $49/month; limited free account. Can cancel subscription at any time.|
|Usabilla||Create test scenarios and analyze results; mobile UX testing||Terms of Service||Starter: $19/month. Can cancel subscription at any time.|
At the NCSU Libraries, my colleagues and I in the Research and Information Services department do a fair bit of instruction, especially to classes from the university’s First Year Writing Program. Some new initiatives and outreach have significantly increased our instruction load, to the point where it was getting more difficult for us to effectively cover all the sessions that were requested due to practical limits of our schedules. By way of a solution, we wanted to train some of our grad assistants, who (at the time of this writing) are all library/information science students from that school down the road, in the dark arts of basic library instruction, to help spread the instruction burden out a little.
This would work great, but there’s a secondary problem: since UNC is a good 40 minute drive away, our grad assistants tend to have very rigid schedules, which are fixed well in advance — so we can’t just alter our grad assistants’ schedules on short notice to have them cover a class. Meanwhile, instruction scheduling is very haphazard, due to wide variation in how course slots are configured in the weekly calendar, so it can be hard to predict when instruction requests are likely to be scheduled. What we need is a technique to maximize the likelihood that a grad student’s standing schedule will overlap with the timing of instruction requests that we do get — before the requests come in.
Searching for a Solution – Bar graph-based analysis
The obvious solution was to try to figure out when during the day and week we provided library instruction most frequently. If we could figure this out, we could work with our grad students to get their schedules to coincide with these busy periods.
Luckily, we had some accrued data on our instructional activity from previous semesters. This seemed like the obvious starting point: look at when we taught previously and see what days and times of day were most popular. The data consisted of about 80 instruction sessions given over the course of the prior two semesters; data included date, day of week, session start time, and a few other tidbits. The data was basically scraped by hand from the instruction records we maintain for annual reports; my colleague Anne Burke did the dirty work of collecting and cleaning the data, as well as the initial analysis.
Anne’s first pass at analyzing the data was to look at each day of the week in terms of courses taught in the morning, afternoon, and evening. A bit of hand-counting and spreadsheet magic produced this:
This chart was somewhat helpful (certainly it’s clear that Monday, Tuesday and Thursday are our busiest days), but it doesn’t provide a lot of clarity regarding times of day that are hot for instruction. Other than noting that Friday evening is a dead time (hardly a huge insight), we don’t really get a lot of new information on how the instruction sessions shake out throughout the week.
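The counting behind a chart like this can be sketched in a few lines of Python. This is an illustration, not the spreadsheet work described above: the cutoffs for morning, afternoon, and evening (noon and 5pm) are assumptions, and the sessions listed are invented sample data.

```python
from collections import Counter
from datetime import time


def part_of_day(start):
    """Bucket a start time; the noon and 5pm cutoffs are assumed, not from our records."""
    if start < time(12, 0):
        return "morning"
    if start < time(17, 0):
        return "afternoon"
    return "evening"


def tally(sessions):
    """Count sessions per (day of week, part of day) pair."""
    return Counter((day, part_of_day(start)) for day, start in sessions)


# Invented sessions: (day of week, start time).
sample_sessions = [
    ("Monday", time(9, 35)),
    ("Monday", time(10, 15)),
    ("Tuesday", time(13, 30)),
    ("Thursday", time(18, 0)),
]
for (day, part), count in sorted(tally(sample_sessions).items()):
    print(day, part, count)
```

Three buckets per day is what makes the result coarse: a 9:35 class and a 10:15 class land in the same "morning" count, with no sense of how close together they are.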
Let’s Get Visual – Heatmap-based visualization
The chart above gets the fundamentals right — since we’re designing weekly schedules for our grad assistants, it’s clear that the relevant dimensions are days of week and times of day. However, there are basically two problems with the stacked bar chart approach: (1) The resolution of the stacked bars — morning, afternoon and evening — is too coarse. We need to get more granular if we’re really going to see the times that are popular for instruction; (2) The stacked bar chart slices just don’t fit our mental model of a week. If we’re going to solve a calendaring problem, doesn’t it make a lot of sense to create a visualization that looks like a calendar?
What we need is a matrix — something where one dimension is the day of the week and the other dimension is the hour of the day (with proportional spacing) — just like a weekly planner. Then for any given hour, we need something to represent how “popular” that time slot is for instruction. It’d be great if we had some way for closely clustered but non-overlapping sessions to contribute “weight” to each other, since it’s not guaranteed that instruction session timing will coincide precisely.
When I thought about analyzing the data in these terms, the concept of a heatmap immediately came to mind. A heatmap is a tool commonly used to look for areas of density in spatial data. It’s often used for mapping click or eye-tracking data on websites, to develop an understanding of the areas of interest on the website. A heatmap’s density modeling works like this: each data point is mapped in two dimensions and displayed graphically as a circular “blob” with a small halo effect; in closely-packed data, the blobs overlap. Areas of overlap are drawn with more intense color, and the intensity effect is cumulative, so the regions with the most intense color correspond to the areas of highest density of points.
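The cumulative "blob" idea can be illustrated with a small Python sketch. This is not the algorithm ClickHeat uses. It is a simplified stand-in in which every point adds a value that fades linearly with distance, so overlapping points add up. The radius and falloff are arbitrary choices made for the example.

```python
import math


def density_grid(points, width, height, radius=3):
    """Accumulate a fading contribution from every point onto a 2D grid."""
    grid = [[0.0] * width for _ in range(height)]
    for px, py in points:
        for y in range(max(0, py - radius), min(height, py + radius + 1)):
            for x in range(max(0, px - radius), min(width, px + radius + 1)):
                distance = math.hypot(x - px, y - py)
                if distance <= radius:
                    # Full weight at the center, fading toward the edge (the "halo").
                    grid[y][x] += 1.0 - distance / (radius + 1)
    return grid


# Two neighboring points reinforce each other; a lone point does not.
crowded = density_grid([(5, 5), (6, 5)], width=12, height=10)
lonely = density_grid([(5, 5)], width=12, height=10)
print(crowded[5][5], lonely[5][5])
```

Mapping the accumulated values to colors is then a separate step: the higher the number in a cell, the more intense the color drawn there.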
I had heatmaps on the brain since I had just used them extensively to analyze user interaction patterns with a touchscreen application that I had recently developed.
Part of my motivation for using heatmaps to solve our scheduling problem was simply to use the tools I had at hand: it seemed that it would be a simple matter to convert the instruction data into a form that would be amenable to modeling with the heatmap software I had access to. But in a lot of ways, a heatmap was a perfect tool: with a proper arrangement of the data, the heatmap’s ability to model intensity would highlight the parts of each day where the most instruction occurred, without having to worry too much about the precise timing of instruction sessions.
The heatmap generation tool that I had was a slightly modified version of the Heatmap PHP class from LabsMedia’s ClickHeat, an open-source tool for website click tracking. My modified version of the heatmap package takes in an array of (x,y) ordered pairs, corresponding to the locations of the data points to be mapped, and outputs a PNG file of the generated heatmap.
So here was the plan: I would convert each instruction session in the data to a set of (x,y) coordinates, with one coordinate representing day of week and the other representing time of day. Feeding these coordinates into the heatmap software would, I hoped, create five colorful swatches, one for each day of the week. The brightest regions in the swatches would represent the busiest times of the corresponding days.
Arbitrarily, I selected the y-coordinate to represent the day of the week. So I decided that any Monday slot, for instance, would be represented by some small (but nonzero) y-coordinate, with Tuesday represented by some greater y-coordinate, etc., with the intervals between consecutive days of the week equal. The main concern in assigning these y-coordinates was for the generated swatches to be far enough apart so that the heatmap “halo” around one day of the week would not interfere with its neighbors — we’re treating the days of the week independently. Then it was a simple matter of mapping time of day to the x-coordinate in a proportional manner. The graphic below shows the output from this process.
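As a rough sketch of this mapping (in Python for brevity, although the heatmap class itself is PHP), the conversion from a session to an (x,y) pair might look like the following. The row spacing, image width, and exact 8:00am to 7:30pm window are assumed values chosen for illustration, not the numbers used to produce the actual output.

```python
from datetime import time

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
ROW_SPACING = 100        # assumed gap between day rows, wide enough to keep halos apart
DAY_START = time(8, 0)   # assumed left edge of each row
DAY_END = time(19, 30)   # assumed right edge of each row
WIDTH = 690              # assumed width in pixels (one pixel per minute here)


def minutes_since_midnight(moment):
    return moment.hour * 60 + moment.minute


def session_to_xy(day, start):
    """Map a session to (x, y): x is proportional to time of day, y is the day's row."""
    y = (DAYS.index(day) + 1) * ROW_SPACING  # Monday gets a small but nonzero y
    span = minutes_since_midnight(DAY_END) - minutes_since_midnight(DAY_START)
    elapsed = minutes_since_midnight(start) - minutes_since_midnight(DAY_START)
    x = elapsed * WIDTH // span
    return x, y


print(session_to_xy("Tuesday", time(9, 30)))
```

A list of pairs produced this way is the kind of input the heatmap package expects: one point per session, with equal spacing between consecutive days.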
In this graphic, days of the week are represented by the horizontal rows of blobs, with Monday as the first row and Friday as the last. The leftmost extent of each row corresponds to approximately 8am, while the rightmost extent is about 7:30pm. The key in the upper left indicates (more or less) the number of overlapping data points in a given location. A bit of labeling helps to clarify things:
Right away, we get a good sense of the shape of the instruction week. This presentation reinforces the findings of the earlier chart: that Monday, Tuesday, and Thursday are busiest, and that Friday afternoon is basically dead. But we do see a few other interesting tidbits, which are visible to us specifically through the use of the heatmap:
- Monday, Tuesday and Thursday aren’t just busy, they’re consistently well-trafficked throughout the day.
- Friday is really quite slow throughout.
- There are a few interesting hotspots scattered here and there, notably first thing in the morning on Tuesday.
- Wednesday is quite sparse overall, except for two or three prominent afternoon/evening times.
- There is a block of late afternoon-early evening time-slots that are consistently busy in the first half of the week.
Using this information, we can take a much more informed approach to scheduling our graduate students, and hopefully be able to maximize their availability for instruction sessions.
“Better than it was before. Better. Stronger. Faster.” – Open questions and areas for improvement
As a proof of concept, this approach to analyzing our instruction data for the purposes of setting student schedules seems quite promising. We used our findings to inform our scheduling of graduate students this semester, but it’s hard to know whether our findings can even be validated: since this is the first semester where we’re actively assigning instruction to our graduate students, there’s no data available to compare this semester against, with respect to amount of grad student instruction performed. Nevertheless, it seems clear that knowledge of popular instruction times is a good guideline for grad student scheduling for this purpose.
There’s also plenty of work to be done as far as data collection and analysis is concerned. In particular:
- Data curation by hand is burdensome and inefficient. If we can automate the data collection process at all, we’ll be in a much better position to repeat this type of analysis in future semesters.
- The current data analysis completely ignores class session length, which is an important factor for scheduling (class times vary between 50 and 100 minutes). This data is recorded in our instruction spreadsheet, but there aren’t any set guidelines on how it’s entered — librarians entering their instruction data tend to round to the nearest quarter- or half-hour increment at their own preference, so a 50-minute class is sometimes listed as “.75 hours” and other times as “1 hour”. More accurate and consistent session time recording would allow us to reliably use session length in our analysis.
- To make the best use of session length in the analysis, I’ll have to learn a little bit more about PHP’s image generation libraries. The current approach is basically a plug-in adaptation of ClickHeat’s existing Heatmap class, which is only designed to handle “point” data. To modify the code to treat sessions as little line segments corresponding to their duration (rather than points that correspond to their start times) would require using image processing methods that are currently beyond my ken.
- A bit better knowledge of the image libraries would also allow me to add automatic labeling to the output file. You’ll notice the prominent use of “ish” to describe the hours dimension of the labeled heatmap above: this is because I had neither the inclination nor the patience to count pixels to determine where exactly the labels should go. With better knowledge of the image libraries I would be able to add graphical text labels directly to the generated heatmap, at precisely the correct location.
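The idea of treating sessions as line segments rather than points can be explored without any image libraries, simply by counting how many sessions overlap each time slot. What follows is a minimal sketch in Python, not an adaptation of ClickHeat's Heatmap class. The function names, the slot size, and the sample sessions are all assumptions made up for illustration.

```python
from collections import Counter


def minutes_since_midnight(hhmm: str) -> int:
    """Convert a 24-hour 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def occupancy_by_slot(sessions, slot_minutes=10):
    """Count how many sessions overlap each time slot.

    `sessions` is an iterable of (start_time, length_in_minutes) pairs.
    Keys of the result are slot start times in minutes since midnight.
    """
    counts = Counter()
    for start, length in sessions:
        begin = minutes_since_midnight(start)
        end = begin + length  # first minute after the session ends
        first_slot = begin // slot_minutes
        last_slot = (end - 1) // slot_minutes
        for slot in range(first_slot, last_slot + 1):
            counts[slot * slot_minutes] += 1
    return counts


# Hypothetical sessions for one weekday: (start time, length in minutes)
sessions = [("13:00", 50), ("13:30", 100), ("15:00", 50)]
busy = occupancy_by_slot(sessions, slot_minutes=30)
```

With these invented sessions, the 13:30 and 15:00 slots each come out with two overlapping sessions, which a start-time-only analysis would not show. This only works as well as the recorded session lengths allow, so the rounding problem described above still applies.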
There are other fundamental questions that may be worth answering — or at least experimenting against — as well. For instance, in this analysis I used data about actual instruction sessions performed. But when lecturers request library sessions, they include two or three “preferred” dates, of which we pick the one that fits our librarian and room schedules best. For the purposes of analysis, it’s not entirely clear whether we should use the actual instruction data, which takes into account real space limitations but is also skewed by librarian availability; or whether we should look strictly at what lecturers are requesting, which might allow us to schedule our grad students in a way that could accommodate lecturers’ first choices better, but which might run us up against the library’s space limitations. In previous semesters, we didn’t store the data on the requests we received; this semester we’re doing that, so I’ll likely perform two analyses, one based on our actual instruction and one based on requests. Some insight might be gained by comparing the results of the two analyses, but it’s unclear what exactly the outcome will be.
Finally, it’s hard to predict how long-term trends in the data will affect our ability to plan for future semesters. It’s unclear whether prior semesters are a good indicator of future semesters, especially as lecturers move into and out of the First Year Writing Program, the source of the vast majority of our requests. We’ll get a better sense of this, presumably, as we perform more frequent analyses — it would also make sense to examine each semester separately to look for trends in instruction scheduling from semester to semester.
In any case, there’s plenty of experimenting left to do and plenty of improvements that we could make.
Reflections and Lessons Learned
There are a few big points that I took away from this experience. A big one is simply that sometimes the right approach is a totally unexpected one. You can gain some interesting insights if you don’t limit yourself to the tools that are most familiar for a particular problem. Don’t be afraid to throw data at the wall and see what sticks.
Really, what we did in this case is not so different from creating separate histograms of instruction times for each day of the week, and comparing the histograms to each other. But using heatmaps gave us a couple of advantages over traditional histograms: first, our bin size is essentially infinitely narrow; because of the proximity effects of the heatmap calculation, nearby but non-overlapping data points still contribute weight to each other without us having to define bins as in a regular histogram. Second, histograms are typically drawn in two dimensions, which would make comparing them against each other rather a nuisance. In this case, our separate heatmap graphics for each day of the week are basically one-dimensional, which allows us to compare them side by side with little fuss. This technique could be used for side-by-side examinations of multiple sets of any histogram-like data for quick and intuitive at-a-glance comparison.
In particular, it’s important to remember — especially if your familiarity with heatmaps is already firmly entrenched in a spatial mapping context — that data doesn’t have to be spatial in order to be analyzed with heatmaps. This is really just an extension of the idea of graphical data analysis: A heatmap is just another way to look at arbitrary data represented graphically, not so different from a bar graph, pie chart, or scatter plot. Anything that you can express in two dimensions (or even just one), and where questions of frequency, density, proximity, etc., are relevant, can be analyzed using the heatmap approach.
A final point: as an analysis tool, the heatmap is really about getting a feel for how the data lies in aggregate, rather than getting a precise sense of where each point falls. Since the halo effect of a data point extends some distance away from the point, the limits of influence of that point on the final image are a bit fuzzy. If precision analysis is necessary, then heatmaps are not the right tool.
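The proximity effect that sets a heatmap apart from a binned histogram can be sketched in one dimension as a sum of smooth bumps, one per data point. The snippet below is a rough illustration in Python and makes no claim to match how ClickHeat computes its images. The Gaussian bump shape, the 30-minute bandwidth, and the sample start times are all assumptions.

```python
import math


def heat(points, grid, bandwidth=30.0):
    """Sum a Gaussian bump centred on each data point at every grid position."""
    return [
        sum(math.exp(-0.5 * ((g - p) / bandwidth) ** 2) for p in points)
        for g in grid
    ]


# Hypothetical session start times, in minutes since midnight
starts = [780, 790, 800, 1020]

# Evaluate the "heat" every 10 minutes from 8:00 to 21:00
grid = list(range(480, 1261, 10))
values = heat(starts, grid)

# The hottest grid position, where nearby points reinforce each other
hottest = grid[values.index(max(values))]
```

The three start times clustered around 13:00 never coincide, yet they reinforce one another and produce a hotter spot than the isolated 17:00 session. The same fuzziness is why the result should be read as an overall impression rather than a precise count.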
About our guest author: Andreas Orphanides is Librarian for Digital Technologies and Learning in the Research and Information Services department at NCSU Libraries. He holds an MSLS from UNC-Chapel Hill and a BA in mathematics from Oberlin College. His interests include instructional design, user interface development, devising technological solutions to problems in library instruction and public services, long walks on the beach, and kittens.
What is Action Analytics?
If you say “analytics” to most technology-savvy librarians, they think of Google Analytics or similar web analytics services. Many libraries are using such sophisticated data collection and analyses to improve the user experience on library-controlled sites. But the standard library analytics are retrospective: what have users done in the past? Have we designed our web platforms and pages successfully, and where do we need to change them?
Technology is enabling a different kind of future-oriented analytics. Action Analytics is evidence-based, combines data sets from different silos, and uses actions, performance, and data from the past to provide recommendations and actionable intelligence meant to influence future actions at both the institutional and the individual level. We’re familiar with these services in library-like contexts such as Amazon’s “customers who bought this item also bought” book recommendations and Netflix’s “other movies you might enjoy”.
Action Analytics in the Academic Library Landscape
It was a presentation by Mark David Milliron at Educause 2011 on “Analytics Today: Getting Smarter About Emerging Technology, Diverse Students, and the Completion Challenge” that made me think about the possibilities of the interventionist aspect of analytics for libraries. He described the complex dependencies between inter-generational poverty transmission, education as a disrupter, drop-out rates for first generation college students, and other factors such as international competition and the job market. Then he moved on to the role of sophisticated analytics and data platforms and spoke about how they can help individual students succeed by using technology to deliver the right resource at the right time to the right student. Where do these sorts of analytics fit into the academic library landscape?
If your library is like my library, the pressure to prove your value to strategic campus initiatives such as student success and retention is increasing. But assessing services with most analytics is past-oriented; how do we add the kind of library analytics that provide a useful intervention or recommendation? These analytics could be designed to help an individual student choose a database, or trigger a recommendation to dive deeper into reference services like chat reference or individual appointments. We need to design platforms and technology that can integrate data from various campus sources, do some predictive modeling, and deliver a timely text message to an English 101 student that recommends using these databases for the first writing assignment, or suggests an individual research appointment with the appropriate subject specialist (and a link to the appointment scheduler) to every honors student a month into their thesis year.
But should we? Are these sorts of interventions creepy and stalker-ish?* Would this be seen as an invasion of privacy? Does the use of data in this way collide with the profession’s ethical obligation and historical commitment to keep individual patrons’ reading, browsing, or viewing habits private?
Every librarian I’ve discussed this with felt the same unease. I’m left with a series of questions: Have technology and online data gathering changed the context and meaning of privacy in such fundamental ways that we need to take a long hard look at our assumptions, especially in the academic environment? (Short answer — yes.) Are there ways to manage opt-in and opt-out preferences for these sorts of services so these services are only offered to those who want them? And does that miss the point? Aren’t we trying to influence the students who are unaware of library services and how the library could help them succeed?
Furthermore, are we modeling our ideas of “creepiness” and our adamant rejection of any “intervention” on the face-to-face model of the past that involved a feeling of personal surveillance and possible social judgment by live flesh persons? The phone app Mobilyze helps those with clinical depression avoid known triggers by suggesting preventative measures. The software is highly personalized and combines all kinds of data collected by the phone with self-reported mood diaries. Researcher Colin Depp observes that participants felt that the impersonal advice delivered via technology was easier to act on than “say, getting advice from their mother.”**
While I am not suggesting in any way that libraries move away from face-to-face, personalized encounters at public service desks, is there room for another model for delivering assistance? A model that some students might find less intrusive, less invasive, and more effective — precisely because it is technological and impersonal? And given the struggle that some students have to succeed in school, and the staggering debt that most of them incur, where exactly are our moral imperatives in delivering academic services in an increasingly personalized, technology-infused, data-dependent environment?
Increasingly, health services, commercial entities, and technologies such as browsers and social networking environments that are deeply embedded in most people’s lives use these sorts of action analytics to allow the remote monitoring of our aging parents, sell us things, and match us with potential dates. Some of these uses are for the benefit of the user; some are for the benefit of the data gatherer. The moment from the Milliron presentation that really stayed with me was the poignant question that a student in a focus group asked him: “Can you use information about me…to help me?”
Can we? What do you think?
* For a recent article on academic libraries and Facebook that addresses some of these issues, see Nancy Kim Phillips, Academic Library Use of Facebook: Building Relationships with Students, The Journal of Academic Librarianship, Volume 37, Issue 6, December 2011, Pages 512-522, ISSN 0099-1333, 10.1016/j.acalib.2011.07.008. See also a very recent New York Times article on use of analytics by companies which discusses the creepiness factor.
Do you know what an infographic is? Infographics are visual representations of facts, tutorials, or other data-centered information that inform while staying focused on great design.
Here’s an example of one about the history of the iPad:
This infographic takes a whole mess of data and makes it visually interesting and easy to digest.
So, what do infographics have to do with libraries? Libraries have tons of data, both informational and instructional, on topics ranging from local history facts to how to do research. Take a look at this Google infographic recently posted on the HackCollege site: http://www.hackcollege.com/blog/2011/11/23/infographic-get-more-out-of-google.html
This image highlights several complex research skills, explaining the thought process behind each one in a single easy-to-understand sentence, all while being attractive and compelling to look at. What’s better than that?
Great examples of infographics can be found across the web. Wired magazine, for one, often uses them and Holy Kaw!, Guy Kawasaki’s website (http://holykaw.alltop.com) also highlights great infographics from other sites. Another great site to see examples of different types of infographics is http://killerinfographics.submitinfographics.com/.
The importance of infographics and other great visualizations of data (see Warby Parker’s 2011 Annual Report for the best annual report ever: http://www.warbyparker.com/annual-report-2011) to libraries is obvious. People respond to great design, and great design makes information accessible and inviting. It is in our best interests to strive for great design in all that we do, to make libraries accessible and inviting.
Recently, Sophie Brookover, a New Jersey librarian, posted on the ALA Think Tank Facebook page (http://www.facebook.com/groups/ALAthinkTANK/) about starting a group of librarians learning to create infographics, much like the Code Year project. This idea is very much in the early stages, but keep an eye on it or get involved; good things are sure to come.
What Library Circulation Data Shows
Unless current patterns change, by 2020 university libraries will no longer have circulation desks. This claim may seem hyperbolic if you’ve been observing your library, or even if you’ve been glancing over ACRL or National Center for Education Statistics data. If you have been looking at the data, you might be familiar with a pattern that looks like this:
This chart shows total circulation for academic libraries, and while there’s a decline, it certainly doesn’t look like it will hit zero anytime soon, definitely not in just 8 years. But there is a problem with this data and this perspective on library statistics. When we talk about “total circulation” we’re talking about a property of the library; we’re not really thinking about users.
Here’s another set of data that you need to look at to really understand circulation:
Academic enrollment has been rising rapidly. This means more students, which in turn should mean greater circulation. So if total circulation has been dropping despite an increase in users, then something else must be going on. Rather than asking “How many items does my library circulate?” we need to ask “How many items does the average student check out?”
Here is that data:
This chart shows the upper/lower quartiles and median for circulation per FTE student. As you can see this data shows a much more dramatic drop in the circulation of library materials. Rising student populations hide this fact.
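A little arithmetic shows how rising enrollment can hide a per-student drop. The Python sketch below uses invented figures for a single hypothetical library, not NCES data, and the function and variable names are made up for illustration.

```python
def circulation_per_fte(total_circulation: float, fte_students: float) -> float:
    """Items circulated per full-time-equivalent student."""
    return total_circulation / fte_students


# Invented figures for one hypothetical library (not NCES data)
years = {
    2000: {"circulation": 100_000, "fte": 10_000},
    2010: {"circulation": 90_000, "fte": 15_000},
}

per_student = {
    year: circulation_per_fte(d["circulation"], d["fte"])
    for year, d in years.items()
}

# Relative change over the decade, as fractions (-0.1 means a 10% drop)
total_change = years[2010]["circulation"] / years[2000]["circulation"] - 1
per_student_change = per_student[2010] / per_student[2000] - 1
```

In this made-up case total circulation falls by only 10 percent, while circulation per student falls from 10 items to 6, a 40 percent drop. The same total-circulation chart would look mild even though individual behavior changed a great deal.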
But 2020? Can I be serious? The simple linear regression model in the charts is probably a good predictor of 2012, but not necessarily 2020. Hitting zero without flattening out seems pretty unlikely. However, it is worth noting that circulation per user in the lower quartile for less-than-4-year colleges reached 1.1 in 2010. If you’re averaging around 1 item per user, every user that takes out 2 items means there’s another who has checked out 0.
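For readers who want to try this kind of extrapolation on their own numbers, here is a minimal Python sketch of fitting a straight line by ordinary least squares and finding the year where it crosses zero. The data points are invented so the arithmetic is easy to follow. They are not the NCES figures behind the charts, and the helper name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented circulation-per-FTE values (not NCES data)
years = [1998, 2002, 2006, 2010]
per_fte = [12.0, 10.0, 8.0, 6.0]

slope, intercept = fit_line(years, per_fte)

# Year at which the fitted line reaches zero circulation per student
zero_year = -intercept / slope
```

A straight line will always reach zero eventually if its slope is negative, which is exactly why the extrapolation deserves skepticism: real usage is more likely to flatten out than to follow the line all the way down.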
What’s Happening Here?
Rather than waste too much time trying to predict a future we’ll live in in less than a decade, let’s explore the more interesting question: “What’s happening here?”
By far the number one hypothesis I get when I show people this data is “Clearly this is just because of the rise of e-journals and e-books”. This hypothesis is reasonable: What has happened is simply that users have switched from print to electronic. This data represents a shift in media, nothing more.
But there are 2 very large problems with this hypothesis.
First, print journal circulation is not universal among academic libraries. In the cases where there is no print journal circulation, the effect of e-journals would not be present in circulation data. However, I don’t have information to point out exactly how many academic libraries did circulate print journals. Maybe the effect of e-journals on just the libraries that do circulate serials could affect the data for everyone. The data we have already shown resolves this issue. Libraries that did circulate serials would have higher circulation per user than those that did not. By showing different quartiles we can address this discrepancy in the data between libraries that did and did not circulate journals. If you look at the data you’ll see that indeed the upper quartile does seem to have a higher rate of decline, but not enough to validate this hypothesis. The median and lower quartiles also experience this shift, so something else must be at work.
Second, e-books were not largely adopted until the mid 2000s, yet the decline preceding 2000 is at least as steep as the decline after it. If you look at the chart below you’ll notice that e-book acquisition rates did not exceed print until 2010:
E-books, of course, do have an effect on usage, but they’re not the primary factor in this change.
So clearly we must reject the hypothesis that this is merely a media shift. Certainly the shift from print to electronic has had some effect, but it is not the sole cause. If it’s not a shift in media, the most reasonable explanation is that it’s a shift in user behavior. Students are simply not using books (in any format) as much as they used to.
What is Causing this Shift in User Behavior?
The next question is what is the cause of this shift.
I think the simplest answer is the web. 1996 is the first data point showing a drop in circulation. Of course the web was quite small then, but AOL and Yahoo! were already around, and the Internet Archive had been founded. If you think back to a pre-web time, pretty much anything you needed to know more about required a trip to the library and checking out a book.
The most important thing to take away is that, regardless of cause, user behavior has changed and by all data points is still changing. In the end, the greatest question is how academic libraries will adapt. It is clear that the answer is not as simple as a transition to a new medium. To survive, librarians must find the answer before we have enough data to prove these predictions.
- All library data referenced in this post comes from the Library Statistics Program (National Center for Education Statistics) nces.ed.gov/pubsearch/getpubcats.asp?sid=041#
- Data regarding fall enrollments is from “Fast Facts” (National Center for Education Statistics) http://nces.ed.gov/fastfacts/display.asp?id=98
About our guest author: Will Kurt is a software engineer at Articulate Global and a former librarian, and he is pursuing his master’s in computer science at the University of Nevada, Reno. He holds an MLIS from Simmons College and has worked in various roles in public, private and special libraries at organizations such as MIT, BBN Technologies and the University of Nevada, Reno. He has written and presented on a range of topics including: play, user interfaces, functional programming and data