Building a Data Dashboard

Dashboard web pages that display, at a glance, a wide range of information about a library’s operations are becoming common. These dashboards are made possible by the ubiquity of web-based information visualization tools combined with the ever increasing availability of data sources. Naturally, libraries want to take advantage of these tools to provide insight alongside some “wow” factor.

Ball State Libraries' dashboard

Ball State University Libraries’ dashboard shows many high-level figures about a variety of services. One can learn more about a particular figure by clicking the “+” or “i” icons beside it.

TADL dashboard

The Traverse Area District Library dashboard has a few sections with different types of charts. Shown here are four line graphs displaying changes in various usage statistics over time.

I’ve assembled a list of several data dashboards; if you know of one that isn’t on the list, let me know in the comments so I can add it.

It’s difficult to set about piecing together a display from the information immediately available. The design and data collection process largely determines the success of the dashboard. I don’t pretend to be an expert on these processes; rather, my own attempt at building a dashboard was a productive failure that helped to highlight where I went wrong.

Why do we want to build a dashboard?

As with any project, the starting point for building a dashboard is to identify what we’re trying to accomplish. Having a guiding idea will determine everything that follows, from the data we collect to how we choose to display them.

There are two primary goals for dashboards: marketing and success. One seeks to advertise the excellence of the library—perhaps to secure further funding, perhaps to raise its profile on campus—while the other aims at improved daily operations, however that may be defined. These are two terrifically broad categories but they create a useful distinction when building a dashboard.

Perhaps our goal is “to make a flashy dashboard, one that’ll make spectators rubberneck as they navigate the information superhighway.”

I’m not even being sarcastic. It’s easy to dismiss designs that are flashy to the detriment of their content. The web is rife with visualizations that leave their audiences with no improved understanding of the issue at hand. But there is also value to shiny sites that show the library can create an intriguing product. Furthermore, experimental visualizations that explore the possibilities of design can be a wonderful starting point. What appeared to be superficial glitz may unearth an important pattern. Several failed, bombastic charts may ultimately lead to a balanced display.

In general, flashy displays fall into the marketing category. They’re eye candy for the library, whether through the enormity of the figures themselves (“look how many books we circulated last year! articles downloaded from our databases! reference queries we answered!”) or loud design.

Increasing the success of the library is a very different goal. What defines “success” is more subjective than marketing and as such the types of data included will vary greatly. Assessing services calls for timelines, cross-sections of data, and charts that render divergent data representing different aspects of a library comparable. Unlike marketing figures, the charts need not be easily interpreted. Simplification may unintentionally distort what’s going on, obscuring important but subtle trends. To quote from an In the Library with the Lead Pipe post from 2009:

Libraries collect a lot of data that encompass complex networks about how users navigate through online resources, which subjects circulate the most or the least, which resources are requested via interlibrary loan, visitation patterns over periods of time, reference queries, and usage statistics of online journals and databases. Making sense of these complex networks of use and need isn’t easy. But the relationships between use and need patterns can help libraries make hard decisions (say, about which journals to cut) and creative decisions to improve user experiences, outreach, achieve efficiencies, and enhance alignment with organizational goals.
— Hilary Davis, “Not Just Another Pretty Picture” 1

If an institution or department has a more narrow goal in mind than marketing or assessment, chances are it still relates to one of the two. It’s also possible to build a dashboard that accomplishes both goals, providing insight for staff while thrilling external stakeholders. However, that’s no easy feat to accomplish. What external parties find surprising may be pedestrian to staff, and vice versa. Libraries provide simple, Google-esque search fields alongside triplicate advanced search forms for the same reason. Users shouldn’t need to know SQL to find a title; staff shouldn’t be limited to keyword searches for their administrative tasks. Appealing to such varied sensibilities at once is an onerous task best avoided if possible.

Choosing Our Data Sources

The ultimate goal of a data dashboard will greatly affect what data sources are sensible. Reviewing the dashboards of other libraries, a few data sources are repeated across them:

  • Checkouts and renewals
  • Print volumes and other material holdings
  • Inter-library loan
  • Gate counts
  • Computer use
  • Reference questions

These data are commonly available, but don’t capture the unique value that libraries provide. We can use them as a starting point but should keep in mind that our dashboard’s purpose determines what information is pertinent. Simply because a figure is readily available within our ILS isn’t a sufficient reason to include it. On the other hand, we may find our display would benefit greatly from a figure which isn’t available, thus influencing how we collect and analyze data going forward.

Marketing displays must be persuasive. The information in them should be positive, surprising, and understandable to a broad audience.

Total circulation, article/ebook downloads, and new items digitized all might be starting points. The breadth of a library’s offerings is often surprising to external parties, too. Show circulation across many subject areas, attempt to outline all the impact of the library’s many services. For instance, if our library has an active social media presence we could show various accounts and the measures of popularity for each (followers and favorites, translated across the schemas of the specific networks).

Simply pilfering all the dashboard’s data from typical year-end reports submitted to organizations like the Nation Center for Educational Statistics, ALA, or our accrediting bodies is a terrible idea for marketing displays. These numbers are less important outside of the long-term, cross-institutional trends they make visible. Our students, faculty, or funders probably do not care about the exact quantity of our microfilm holdings. Since our dashboard isn’t held to any reporting standards, this is our chance to show what’s new, what those statistical warehouse organizations do not yet collect.

Marketing displays should focus on aggregate numbers that look impressive in isolate and do not require much context to apprehend. Different categories need not be comparable or normalized; each figure can have its own type of chart and style as long as the design is visually appealing. Marketing isn’t meant to generate insight so highlighting patterns across different data sources isn’t necessary. All of the screenshot examples at the beginning of this post represent marketing-oriented dashboards because of the impressive nature of the figures presented as well as their relative lack of informative depth.

I’ll be honest, presenting data for marketing purposes makes me feel conflicted. It’s easy to skirt the truth since one’s incentives lie in producing ever more stunning figures, not in representing reality. If there is serious dysfunction within a library, marketing displays won’t disclose them. Nonetheless, there’s value to these efforts. I wouldn’t be in this profession if I didn’t believe that libraries produce an astonishing amount of value for their host institutions, through a wide range of services. Demonstrating that value is vital.

A dashboard meant to improve library success should answer pressing questions about collections and services. A library might want to investigate different areas each year, and so data sources could change periodically. Investigations might be specific and change each year; one year we aim at improving information literacy instruction, another year at increasing access to electronic resources. The dashboard can differ from year to year if it’s meant to aid a focused study. Trying to represent everything a library does at once leads to clutter, information overload, and time spent culling statistics no one cares about.

For a nice end-to-end discussion of building a data warehouse and hooking it up to dashboard charts, see “Trends at a Glance: A Management Dashboard of Library Statistics” (open access via ITAL) by Emily G. Morton-Owens and Karen L. Hanson. They cover in-depth creating a dashboard meant to give an overview of library operations from a broad variety of data sources, including external ones like Google Analytics as well as a local inter-library loan database.

GVSU Libraries status

A dashboard shows that a few minor issues affect various systems at GVSU.

While it’s not displaying the sorts of data this posts discusses, the Grand Valley State University Libraries’ Status page is an example of a good dashboard that aims at improving service. Issues affecting several areas are brought together in on one page, making it possible to gain a quick overview of the library’s operations in seconds. Just because GVSU Libraries Status is not a marketing display doesn’t imply design doesn’t matter; bright, easily scannable colors make information more digestible. The familiar red-for-error, green-for-success color scheme is well applied.

There are some attributes that influence data choices regardless of the goal of our dashboard. Data sources that are easy to access and reuse will naturally be better selections than cumbersome ones. For instance, while COUNTER usage statistics for vendor databases can be both informative and persuasive, they’re often painful to collect. One must log into a number of separate administrative sites, download a CSV (possibly multiple CSVs if we need information from separate reports), trim useless rows, and combine into one master file. This process is difficult to automate and the SUSHI standard meant to address this labor has no viable software implementations. 2

On the other hand, many modern third-party services offer robust APIs that make retrieving data simple. For the dashboard I built, some of the easiest pieces came from LibGuides, YouTube, and WordPress because all three surfaced usage information. Since I wanted to show the breadth of web services we were using and their popularity, it was easy to set up some simple API calls that for our most popular guides, videos, and blog posts. 3

Building the Dashboard Pt. I: Data Stores

Once we’ve chosen our data sources—ignoring, for the sake of brevity, the important part where we get staff buy-in and ensure our data quality is sufficient—we’ll need a central data store to power our dashboard. This can be simpler than it sounds, for instance a single relational database with tables for various data sources which are imported on a periodic basis. I’ve even used Google Spreadsheets as a data store and since the spreadsheet is already online, its data can be accessed through an API or even downloaded as a CSV.

Even if we rely on third-party APIs for much of our data, we may want to download the data into a data store. Having a full copy of our library’s information has many benefits: we don’t need to worry about service outages or sudden discontinuation, we can add summative values which many not be present in the original API, and we can reduce the total number of databases the dashboard relies upon by keeping everything centralized.

Once we have a data store, we can use it to power an API and a site that consumes the API, known as a client. At a high level, our client-API interaction looks like:

  • the client passes parameters in the URL letting the API know what kind of data it wants
  • the API receives the request, parses the parameters (in PHP this is done with the superglobal $_GET associative array, for instance)
  • the API queries a database with those parameters
  • the API takes the query response and formats it as JSON, sending it along to our client (formatting as JSON is common enough that most languages have a built-in function for it) 4
  • the client consumes the JSON and outputs charts, figures, etc.

A more code-like example of this process:

a GET HTTP request for example.com/api.php?month=02&dataset=reference asks for reference data from the month of February

the API performs query like SELECT * FROM reference WHERE month LIKE 'february' on the database

the API takes the query results and formats them into a JSON response like:

{
 "reference questions": 404,
 "directional questions": 808,
 "computer questions": 10101,
 "ready reference": 42,
 "research consultations": 42
}

This is a basic example of library-specific data that probably lives in a database we control. However, depending on how many third party APIs we choose to rely upon, the necessary flexibility of the client varies. If we’re normalizing and storing everything in our data store, the front end can be simple since the data is in a pre-processed, predictable format. For instance, aggregation operations like sums, maximums, and averages can be baked into a well-designed, comprehensive dashboard API such that the client only needs to know what URL to request. On the other end, a sophisticated client might consume multiple sources of data, combine them into a similar profile, and then run summative calculations itself. Both approaches have their relative advantages depending upon the situation.

Building the Dashboard Pt. II: Charting

There are numerous choices for building graphs and charts on the web. This post will not attempt to list them. Instead, I have two recommendations. One, our front-end shouldn’t be Flash. Adobe Flash doesn’t work on the mobile web and, at this point, JavaScript has a tremendous number of data visualization libraries. Secondly, we should use D3 or something that builds on top of D3.

While there are a lot of approaches to data visualization in JavaScript, D3 has a large user community, tremendous flexibility (it can be used for visualizations as varied as timelines, bar charts, and word clouds), and a nice design based on chained function calls that will be familiar to jQuery users.

While D3 is a knockout choice, we can’t just plug data in, pick a chart type, and watch the output instantly appear as in Excel or Numbers. Instead, D3 is more of a framework for working with data on the web. It takes a lot of learning and code to produce high quality charts. Instead, it’s best to use a library that provides an easy-to-use wrapper around D3. Here are several choices that abstract away common difficulties:

  • C3 makes creating D3 charts full of nice features and best practices as simple as passing data to c3.generate()
  • reD3 is a set of reusable D3 charts
  • The D3 Gallery shows what the framework can do with imitable example code

Raw is another interesting option. It’s browser-based charting software that works more similarly to spreadsheet software: plug in our data, choose a chart type, and then take the output that’s produced (using D3). If our charts don’t change often then we don’t need to build a sophisticated client-server setup; we can take our data and then use Raw to produce a web-friendly chart. Repeat every year or as often as the dashboard needs be updated.

When I made a dashboard, I didn’t use a wrapper around D3 and instead ended up building my own, a creaky abstraction for making pie charts with a few built-in options. While it was fun, I spent time making something not on par with what was already available.

Conclusion

Figure out why you want a dashboard, choose data sources appropriate to your purposes, build a data store with an API, use D3 to display web-based charts. Easy, right?

That sounds like numerous increasingly sophisticated steps, but I think each one also poses value in its own right. If you can’t define the goals of the data you’re collecting, why exactly are you gathering them? If you’re not aggregating your data in one place, aren’t they just a messy hodgepodge that everyone dreads negotiating come annual report time? Dashboards are shiny. Shiny things are nice. But the reflective processes behind building a data dashboard are perhaps more valuable than their final product.

Notes

  1. This post also recommends the book Information Dashboard Design by Stephen Few which is probably an excellent resource for those who want to learn more about the topic.
  2. I’d love to hear from anyone who has found a good open source SUSHI client. I’ve been looking for one for years now and found a number of broken options but nothing I was capable of setting up. There are a few commercial SUSHI products that ease this pain significantly for those libraries that can afford them.
  3. I haven’t seen many good explanations of what an API is; the acronym stands for “Application Programming Interface” which I find utterly nondescript. APIs expose information about web services such that they can be easily processed by programs. They fall somewhere in between having direct access to query a database and only being able to see an HTML page intended for human consumption.
  4. There are non-JSON APIs—XML is common—but JSON is becoming a de facto standard due to its light weight and the ease with which JavaScript and other languages can parse it.

Analyzing EZProxy Logs

Analyzing EZProxy logs may not be the most glamorous task in the world, but it can be illuminating. Depending on your EZProxy configuration, log analysis can allow you to see the top databases your users are visiting, the busiest days of the week, the number of connections to your resources occurring on or off-campus, what kinds of users (e.g., staff or faculty) are accessing proxied resources, and more.

What’s an EZProxy Log?

EZProxy logs are not significantly different from regular server logs.  Server logs are generally just plain text files that record activity that happens on the server.  Logs that are frequently analyzed to provide insight into how the server is doing include error logs (which can be used to help diagnose problems the server is having) and access logs (which can be used to identify usage activity).

EZProxy logs are a kind of modified access log, which record activities (page loads, http requests, etc.) your users undertake while connected in an EZProxy session. This article will briefly outline five potential methods for analyzing EZProxy logs:  AWStats, Piwik, EZPaarse, a custom Python script for parsing starting-point URLS (SPU) logs, and a paid option called Splunk.

The ability of  any log analyzer will of course depend upon how your EZProxy log directives are configured.  You will need to know your LogFormat and/or LogSPU directives in order to configure most log file analyzing solutions.  In EZProxy, you can see how your logs are formatted in config.txt/ezproxy.cfg by looking for the LogFormat directive, 1  e.g.,

LogFormat %h %l %u %t “%r” %s %b “%{user-agent}i”

and / or, to log Starting Point URLs (SPUs):

LogSPU -strftime log/spu/spu%Y%m.log %h %l %u %t “%r” %s %b “%{ezproxy-groups}i”

Logging Starting Point URLs can be useful because those tend to be users either clicking into a database or the full-text of an article, but no activity after that initial contact is logged.  This type of logging does not log extraneous resource loading, such as loading scripts and images – which often clutter up your traditional LogFormat logs and lead to misleadingly high hits.  LogSPU directives can be defined in addition to traditional LogFormat to provide two different possible views of your users’ data.  SPULogs can be easier to analyze and give more interesting data, because they can give a clearer picture of which links and databases are most popular  among your EZProxy users.  If you haven’t already set it up, SPULogs can be a very useful way to observe general usage trends by database.

You can find some very brief anonymized EZProxy log sample files on Gist:

On a typical EZProxy installation, historical monthly logs can be found inside the ezproxy/log directory.  By default they will rotate out every 12 months, so you may only find the past year of data stored on your server.

AWStats

Get It:  http://www.awstats.org/#DOWNLOAD

Best Used With:  Full Logs or SPU Logs

Code / Framework:  Perl

    An example AWStats monthly history report. Can you tell when our summer break begins?

An example AWStats monthly history report. Can you tell when our summer break begins?

AWStats Pros:

  • Easy installation, including on localhost
  • You can define your unique LogFormat easily in AWStats’ .conf file.
  • Friendly, albeit a little bit dated looking, charts show overall usage trends.
  • Extensive (but sometimes tricky) customization options can be used to more accurately represent sometimes unusual EZProxy log data.
Hourly traffic distribution in AWStats.  While our traffic peaks during normal working hours, we have steady usage going on until about 1 AM, after which point it crashes pretty hard.  We could use this data to determine  how much virtual reference staffing we should have available during these hours.

Hourly traffic distribution in AWStats. While our traffic peaks during normal working hours, we have steady usage going on until about Midnight, after which point it crashes pretty hard. We could use this data to determine how much virtual reference staffing we should have available during these hours.

 

AWStats Cons:

  • If you make a change to .conf files after you’ve ingested logs, the changes do not take effect on already ingested data.  You’ll have to re-ingest your logs.
  • Charts and graphs are not particularly (at least easily) customizable, and are not very modern-looking.
  • Charts are static and not interactive; you cannot easily cross-section the data to make custom charts.

Piwik

Get It:  http://piwik.org/download/

Best Used With:  SPULogs, or embedded on web pages web traffic analytic tool

Code / Framework:  Python

piwik visitor dashboard

The Piwik visitor dashboard showing visits over time. Each point on the graph is interactive. The report shown actually is only displaying stats for a single day. The graphs are friendly and modern-looking, but can be slow to load.

Piwik Pros:

  • The charts and graphs generated by Piwik are much more attractive and interactive than those produced by AWStats, with report customizations very similar to what’s available in Google Analytics.
  • If you are comfortable with Python, you can do additional customizations to get more details out of your logs.
Piwik file ingestion in PowerShell

To ingest a single monthly log took several hours. On the plus side, with this running on one of Lauren’s monitors, anytime someone walked into her office they thought she was doing something *really* technical.

Piwik Cons:

  • By default, parsing of large log files seems to be pretty slow, but performance may depend on your environment, the size of your log files and how often you rotate your logs.
  • In order to fully take advantage of the library-specific information your logs might contain and your LogFormat setup, you might have to do some pretty significant modification of Piwik’s import_logs.py script.
When looking at popular pages in Piwik you’re somewhat at the mercy that the subdirectories of databases have meaningful labels; luckily EBSCO does, as shown here.  We have a lot of users looking at EBSCO Ebooks, apparently.

When looking at popular pages in Piwik you’re somewhat at the mercy that the subdirectories of database URLs have meaningful labels; luckily EBSCO does, as shown here. We have a lot of users looking at EBSCO Ebooks, apparently.

EZPaarse

Get Ithttp://analogist.couperin.org/ezpaarse/download

Best Used With:  Full Logs or SPULogs

Code / Framework:  Node.js

ezPaarse’s friendly drag and drop interface.  You can also copy/paste lines for your logs to try out the functionality by creating an account at http://ezpaarse.couperin.org.

ezPaarse’s friendly drag and drop interface. You can also copy/paste lines for your logs to try out the functionality by creating an account at http://ezpaarse.couperin.org.

EZPaarse Pros:

  • Has a lot of potential to be used to analyze existing log data to better understand e-resource usage.
  • Drag-and-drop interface, as well as copy/paste log analysis
  • No command-line needed
  • Its goal is to be able to associate meaningful metadata (domains, ISSNs) to provide better electronic resource usage statistics.
ezPaarse Excel output generated from a sample log file, showing type of resource (article, book, etc.) ISSN, publisher, domain, filesize, and more.

ezPaarse Excel output generated from a sample log file, showing type of resource (article, book, etc.) ISSN, publisher, domain, filesize, and more.

EZPaarse Cons:

  • In Lauren’s testing, we couldn’t get of the logs to ingest correctly (perhaps due to a somewhat non-standard EZProxy logformat) but the samples files provided worked well. UPDATE 11/26:  With some gracious assistance from EZPaarse’s developers, we got EZPaarse to work!  It took about 10 minutes to process 2.5 million log lines, which is pretty awesome. Lesson learned – if you get stuck, reach out to ezpaarse [at] couperin.org or tweet for help @ezpaarse.  Also be sure to try out some of the pre-defined parameters set up by other institutions under Parameters. Check out the comments below for some more detail from ezpaarse’s developers.
  • Output is in Excel Sheets rather than a dashboard-style format – but as pointed out in the comments below, you can optionally output the results in JSON.

Write Your Own with Python

Get Started With:  https://github.com/robincamille/ezproxy-analysis/blob/master/ezp-analysis.py

Best used with: SPU logs

Code / Framework:  Python

code

Screenshot of a Python script, available at Robin Davis’ Github

 

Custom Script Pros:

  • You will have total control over what data you care about. DIY analyzers are usually written up because you’re looking to answer a specific question, such as “How many connections come from within the Library?”
  • You will become very familiar with the data! As librarians in an age of user tracking, we need to have a very good grasp of the kinds of data that our various services collect from our patrons, like IP addresses.
  • If your script is fairly simple, it should run quickly. Robin’s script took 5 minutes to analyze almost 6 years of SPU logs.
  • Your output will probably be a CSV, a flexible and useful data format, but could be any format your heart desires. You could even integrate Python libraries like Plotly to generate beautiful charts in addition to tabular data.
  • If you use Python for other things in your day-to-day, analyzing structured data is a fun challenge. And you can impress your colleagues with your Pythonic abilities!

 

Action shot: running the script from the command line. (Source)

Action shot: running the script from the command line.

Custom Script Cons:

  • If you have not used Python to input/output files or analyze tables before, this could be challenging.
  • The easiest way to run the script is within an IDE or from the command line; if this is the case, it will likely only be used by you.
  • You will need to spend time ascertaining what’s what in the logs.
  • If you choose to output data in a CSV file, you’ll need more elbow grease to turn the data into a beautiful collection of charts and graphs.
output

Output of the sample script is a labeled CSV that divides connections by locations and user type (student or faculty). (Source)

Splunk (Paid Option)

Best Used with:  Full Logs and SPU Logs

Get It (as a free trial):  http://www.splunk.com/download

Code / Framework:  Various, including Python

A Splunk distribution showing traffic by days of the week.  You can choose to visualize this data in several formats, such as a bar chart or scatter plot.  Notice that this chart was generated by a syntactical query in the upper left corner:  host=lmagnuson| top limit=20 date_wday

A Splunk distribution showing traffic by days of the week. You can choose to visualize this data in several formats, such as a bar chart or scatter plot. Notice that this chart was generated by a syntactical query in the upper left corner: host=lmagnuson| top limit=20 date_wday

Splunk Pros:  

  • Easy to use interface, no scripting/command line required (although command line interfacing (CLI) is available)
  • Incredibly fast processing.  As soon as you import a file, splunk begins ingesting the file and indexing it for searching
  • It’s really strong in interactive searching.  Rather than relying on canned reports, you can dynamically and quickly search by keywords or structured queries to generate data and visualizations on the fly.
Here's a search for log entries containing a URL (digital.films.com), which Splunk uses to create a chart showing the hours of the day that this URL is being accessed.  This particular database is most popular around 4 PM.

Here’s a search for log entries containing a URL (digital.films.com), which Splunk uses to display a chart showing the hours of the day that this URL is being accessed. This particular database is most popular around 4 PM.

Splunk Cons:

    • It has a little bit of a learning curve, but it’s worth it for the kind of features and intelligence you can get from Splunk.
    • It’s the only paid option on this list.  You can try it out for 60 days with up to 500MB/day a day, and certain non-profits can apply to continue using Splunk under the 500MB/day limit.  Splunk can be used with any server access or error log, so a library might consider partnering with other departments on campus to purchase a license.2

What should you choose?

It depends on your needs, but AWStats is always a tried and true easy to install and maintain solution.  If you have the knowledge, a custom Python script is definitely better, but obviously takes time to test and develop.  If you have money and could partner with others on your campus (or just need a one-time report generated through a free trial), Splunk is very powerful, generates some slick-looking charts, and is definitely work looking into.  If there are other options not covered here, please let us know in the comments!

About our guest author: Robin Camille Davis is the Emerging Technologies & Distance Services Librarian at John Jay College of Criminal Justice (CUNY) in New York City. She received her MLIS from the University of Illinois Urbana-Champaign in 2012 with a focus in data curation. She is currently pursuing an MA in Computational Linguistics from the CUNY Graduate Center.

Notes
  1. Details about LogFormat and what each %/lettter value means can be found at http://www.oclc.org/support/services/ezproxy/documentation/cfg/logformat.en.html; LogSPU details can be found http://oclc.org/support/services/ezproxy/documentation/cfg/logspu.en.html
  2. Another paid option that offers a free trial, and comes with extensions made for parsing EZProxy logs, is Sawmill: https://www.sawmill.net/downloads.html

Two Free Methods for Sharing Google Analytics Data Visualizations on a Public Dashboard

UPDATE (October 15th, 2014):

OOCharts as an API and service, described below, will be shut down November 15th, 2014.  More information about the decision by the developers to shut down the service is here. It’s not entirely surprising that the service going away, considering the Google SuperProxy option described in this post.  I’m leaving all the instructions here for OOCharts for posterity, but as of November 15th you can only use the SuperProxy or build your own with the Core Reporting API.

At this point, Google Analytics is arguably the standard way to track website usage data.  It’s easy to implement, but powerful enough to capture both wide general usage trends and very granular patterns.  However, it is not immediately obvious how to share Google Analytics data either with internal staff or external stakeholders – both of whom often have widespread demand for up-to-date metrics about library web property performance.  While access to Google Analytics can be granted by Google Analytics administrators to individual Google account-holders through the Google Analytics Admin screen, publishing data without requiring authentication requires some intermediary steps.

There are two main free methods of publishing Google Analytics visualizations to the web, and both involve using the Google Analytics API: OOCharts and the Google Analytics superProxy.  Both methods rely upon an intermediary service to retrieve data from the API and cache it, both to improve retrieval time and to avoid exceeding limits in API requests.1  The first method – OOCharts – requires much less time to get running initially. However, OOCharts’ long-standing beta status, and its status as a stand-alone service has less potential for long-term support than the second method, the Google Analytics superProxy.  For that reason, while OOCharts is certainly easier to set up, the superProxy method is definitely worth the investment in time (depending on what your needs).  I’ll cover both methods.

OOCharts Beta

OOCharts’ service facilitates the creation and storage of Google Analytics API Keys, which are required for sending secure requests to the API in order to retrieve data.

When setting up an OOCharts account, create your account utilizing the email address you use to access Google Analytics.  For example, if you log into Google Analytics using admin@mylibrary.org, I suggest you use this email address to sign up for OOCharts.  After creating an account with OOCharts, you will be directed to Google Analytics to authorize the OOCharts service to access your Google Analytics data on your behalf.

Let OOCharts gather Google Analytics Data on your Behalf.

Let OOCharts gather Google Analytics Data on your Behalf.

After authorizing the service, you will be able to generate an API key for any of the properties your to which your Google Analytics account has access.

In the OOCharts interface, click your name in the upper right corner, and then click inside the Add API Key field to see a list of available Analytics Properties from your account.

 

Once you’ve select a property, OOCharts will generate a key for your OOCharts application.  When you go back to the list of API Keys, you’ll see your keys along with property IDs (shown in brackets after the URL of your properties, e.g., [9999999].

APIkeys

Your API Keys now show your key values and your Google Analytics property IDs, both of which you’ll need to create visualizations with the OOCharts library.

 

Creating Visualizations with the OOCharts JavaScript Library

OOCharts appears to have started as a simple chart library for visualization data returned from the Google Analytics API. After you have set up your OOCharts account, download the front-end JavaScript library (available at http://docs.oocharts.com/) and upload it to your server or localhost.   Navigate to the /examples directory and locate the ‘timeline.html’ file.  This file contains a simplified example that displays web visits over time in Google Analytics’ familiar timeline format.

The code for this page is very simple, and contains two methods – JavaScript-only and with HTML Attributes – for creating OOCharts visualizations.  Below, I’ve separated out the required elements for both methods.  While either methods will work on their own, using HTML attributes allows for additional customizations and styling:

JavaScript-Only
		<h3>With JS</h3>
		<div id='chart'></div>
		
		<script src='../oocharts.js'></script>
		<script type="text/javascript">

			window.onload = function(){

				oo.setAPIKey("{{ YOUR API KEY }}");

				oo.load(function(){

					var timeline = new oo.Timeline("{{ YOUR PROFILE ID }}", "180d");

					timeline.addMetric("ga:visits", "Visits");

					timeline.addMetric("ga:newVisits", "New Visits");

					timeline.draw('chart');

				});
			};

		</script>
With HTML Attributes:
<h3>With HTML Attributes</h3>
	<div data-oochart='timeline' 
        data-oochart-metrics='ga:visits,Visits,ga:newVisits,New Visits' 
        data-oochart-start-date='180d' 
        data-oochart-profile='{{ YOUR PROFILE ID }}'></div>
		
			<script src='../oocharts.js'></script>
			<script type="text/javascript">

			window.onload = function(){

				oo.setAPIKey("{{ YOUR API KEY }}");

				oo.load(function(){
				});
			};

		</script>

 

For either method, plugin {{ YOUR API KEY }} where indicated with the API Key generated in OOCharts and replace {{ YOUR PROFILE ID }} with the associated eight-digit profile ID.  Load the page in your browser, and you get this:

With the API Key and Profile ID in place, the timeline.html example looks like this.  In this example I also adjusted the date paramter (30d by default) to 180d for more data.

With the API Key and Profile ID in place, the timeline.html example looks like this. In this example I also adjusted the date parameter (30d by default) to 180d for more data.

This example shows you two formats for the chart – one that is driven solely by JavaScript, and another that can be customized using HTML attributes.  For example, you could modify the <div> tag to include a style attribute or CSS class to change the width of the chart, e.g.:

<h3>With HTML Attributes</h3>

<div style=”width:400px” data-oochart='timeline' 
data-oochart-start-date='180d' 
data-oochart-metrics='ga:visits,Visits,ga:newVisits,New Visits' 
data-oochart-profile='{{ YOUR PROFILE ID }}'></div>

Here’s the same example.html file showing both the JavaScript-only format and the HTML-attributes format, now with a bit of styling on the HTML attributes chart to make it smaller:

You can use styling to adjust the HTML attributes example.

You can use styling to adjust the HTML attributes example.

Easy, right?  So what’s the catch?

OOCharts only allows 10,000 requests a month – which is even easier to exceed than the 50,000 limit on the Google Analytics API.  Each time your page loads, you use another request.  Perhaps more importantly, your Analytics API key and profile ID are pretty much ‘out there’ for the world to see if they view your page source, because those values are stored in your client-side JavaScript2.  If you’re making a private intranet for your library staff, that’s probably not a big deal; but if you want to publish your dashboard fully to the public, you’ll want to make sure those values are secure.  You can do this with the Google Analytics superProxy.

Google Analytics superProxy

In 2013, Google Analytics released a method of accessing Google Analytics API data that doesn’t require end-users to authenticate in order to view data, known as the Google Analytics superProxy.  Much like OOCharts, the superProxy facilitates the creation of a query engine that retrieves Google Analytics statistics through the Google Analytics Core Reporting API, caches the statistics in a separate web application service, and enables the display of Google Analytics data to end users without requiring individual authentication. Caching the data has the additional benefit of ensuring that your application will not exceed the Google Core Reporting API request limit of 50,000 requests each day. The superProxy can be set up to refresh a limited number of times per day, and most dashboard applications only need a daily refresh of data to stay current.
The required elements of this method are available on the superProxy Github page (Google Analytics, “Google Analytics superProxy”). There are four major parts to the setup of the superProxy:

  1. Setting up Google App Engine hosting,
  2. Preparing the development environment,
  3. Configuring and deploying the superProxy to Google’s App Engine Appspot host; and
  4. Writing and scheduling queries that will be used to populate your dashboard.
Set up Google App Engine hosting

First, your Google Analytics account credentials to access the Google App Engine at https://appengine.google.com/start. The superProxy application you will be creating will be freely hosted by the Google App Engine. Create your application and designate an Application Identifier that will serve as the endpoint domain for queries to your Google Analytics data (e.g., mylibrarycharts.appspot.com).

Create your App Engine Application

Create your App Engine Application

You can leave the default authentication option, Open to All Google Users, selected.  This setting only reflects access to your App Engine administrative screen and does not affect the ability for end-users to view the dashboard charts you create.  Only those Google users who have been authorized to access Google Analytics data will be able to access any Google Analytics information through the Google App Engine.

Ensure that API access to Google Analytics is turned on under the Services pane of the Google Developer’s Console. Under APIs and Auth for your project, visit the APIs menu and ensure that the Analytics API is turned on.

Turn on the Google Analytics API.

Turn on the Google Analytics API.  Make sure the name under the Projects label in the upper left corner is the same as your newly created Google App Engine project (e.g., librarycharts).

 

Then visit the Credentials menu to set up an OAuth 2.0 Client ID. Set the Authorized JavaScript Origins value to your appspot domain (e.g., http://mylibrarycharts.appspot.com). Use the same value for the Authorized Redirect URI, but add /admin/auth to the end (e.g., http://mylibrarycharts.appspot.com/admin/auth). Note the OAuth Client ID, OAuth Client Secret, and OAuth Redirect URI that are stored here, as you will need to reference them later before you deploy your superProxy application to the Google App Engine.

Finally, visit the Consent Screen menu and choose an email address (such as your Google account email address), fill in the product name field with your Application Identifier (e.g., mylibrarycharts) and save your settings. If you do not include these settings, you may experience errors when accessing your superProxy application admin menu.

Prepare the Development Environment

In order to configure superProxy and deploy it to Google App Engine you will need Python 2.7 installed and the Google App Engine Launcher (AKA the Google App Engine SDK).  Python just needs to be installed for the App Engine Launcher to run; don’t worry, no Python coding is required.

Configure and Deploy the superProxy

The superProxy application is available from the superProxy Github page. Download the .zip files and extract them onto your computer into a location you can easily reference (e.g., C:/Users/yourname/Desktop/superproxy or /Applications/superproxy). Use a text editor such as Notepad or Notepad++ to edit the src/app.yaml to include your Application ID (e.g., mylibrarycharts). Then use Notepad to edit src/config.py to include the OAuth Client ID, OAuth Client Secret, and the OAuth Redirect URI that were generated when you created the Client ID in the Google Developer’s Console under the Credentials menu. Detailed instructions for editing these files are available on the superProxy Github page.

After you have edited and saved src/app.yaml and src/config.py, open the Google App Engine Launcher previously downloaded. Go to File > Add Existing Application. In the dialogue box that appears, browse to the location of your superProxy app’s /src directory.

To upload your superProxy application, use the Google App Engine Launcher and browse to the /src directory where you saved and configured your superProxy application.

To upload your superProxy application, use the Google App Engine Launcher and browse to the /src directory where you saved and configured your superProxy application.

Click Add, then click the Deploy button in the upper right corner of the App Engine Launcher. You may be asked to log into your Google account, and a log console may appear informing you of the deployment process. When deployment has finished, you should be able to access your superProxy application’s Admin screen at http://[yourapplicationID].appspot.com/admin, replacing [yourapplicationID] with your Application Identifier.

Creating superProxy Queries

SuperProxy queries request data from your Google Analytics account and return that data to the superProxy application. When the data is returned, it is made available to an end-point that can be used to populate charts, graphs, or other data visualizations. Most data available to you through the Google Analytics native interface is available through superProxy queries.

An easy way to get started with building a query is to visit the Google Analytics Query Explorer. You will need to login with your Google Analytics account to use the Query Explorer. This tool allows you to build an example query for the Core Reporting API, which is the same API service that your superProxy application will be using.

Google Analytics API Query Explorer

Running example queries through the Google Analytics Query Explorer can help you to identify the metrics and dimensions you would like to use in superProxy queries. Be sure to note the metrics and dimensions you use, and also be sure to note the ids value that is populated for you when using the API Explorer.

 

When experimenting with the Google Analytics Query explorer, make note of all the elements you use in your query. For example, to create a query that retrieves the number of users that visited your site between July 4th and July 18th 2014, you will need to select your Google Account, Property and View from the drop-down menus, and then build a query with the following parameters:

  • ids = this is a number (usually 8 digits) that will be automatically populated for you when you choose your Google Analytics Account, Property and View. The ids value is your property ID, and you will need this value later when building your superProxy query.
  • dimensions = ga:browser
  • metrics = ga:users
  • start-date = 07-04-2014
  • end-date = 07-18-2014

You can set the max-results value to limit the number of results returned. For queries that could potentially have thousands of results (such as individual search terms entered by users), limiting to the top 10 or 50 results will retrieve data more quickly. Clicking on any of the fields will generate a menu from which you can select available options. Click Get Data to retrieve Google Analytics data and verify that your query works

Successful Google Analytics Query Explorer query result showing visits by browser.

Successful Google Analytics Query Explorer query result showing visits by browser.

After building a successful query, you can replicate the query in your superProxy application. Return to your superProxy application’s admin page (e.g., http://[yourapplicationid].appspot.com/admin) and select Create Query. Name your query something to make it easy to identify later (e.g., Users by Browser). The Refresh Interval refers to how often you want the superProxy to retrieve fresh data from Google Analytics. For most queries, a daily refresh of the data will be sufficient, so if you are unsure, set the refresh interval to 86400. This will refresh your data every 86400 seconds, or once per day.

Create a superProxy Query

Create a superProxy Query

We can reuse all of the elements of queries built using the Google Analytics API Explorer to build the superProxy query encoded URI.  Here is an example of an Encoded URI that queries the number of users (organized by browser) that have visited a web property in the last 30 days (you’ll need to enter your own profile ID in the ids value for this to work):

https://www.googleapis.com/analytics/v3/data/ga?ids=ga:99999991
&metrics=ga:users&dimensions=ga:browser&max-results=10
&start-date={30daysago}&end-date={today}

Before saving, be sure to run Test Query to see a preview of the kind of data that is returned by your query. A successful query will return a json response, e.g.:

{"kind": "analytics#gaData", "rows": 
[["Amazon Silk", "8"], 
["Android Browser", "36"], 
["Chrome", "1456"], 
["Firefox", "1018"], 
["IE with Chrome Frame", "1"], 
["Internet Explorer", "899"], 
["Maxthon", "2"], 
["Opera", "7"], 
["Opera Mini", "2"], 
["Safari", "940"]], 
"containsSampledData": false, 
"totalsForAllResults": {"ga:users": "4398"}, 
"id": "https://www.googleapis.com/analytics/v3/data/ga?ids=ga:84099180&dimensions=ga:browser&metrics=ga:users&start-date=2014-06-29&end-date=2014-07-29&max-results=10", 
"itemsPerPage": 10, "nextLink": "https://www.googleapis.com/analytics/v3/data/ga?ids=ga:84099180&dimensions=ga:browser&metrics=ga:users&start-date=2014-07-04&end-date=2014-07-18&start-index=11&max-results=10", 
"totalResults": 13, "query": {"max-results": 10, "dimensions": "ga:browser", "start-date": "2014-06-29", "start-index": 1, "ids": "ga:84099180", "metrics": ["ga:users"], "end-date": "2014-07-18"}, 
"profileInfo": {"webPropertyId": "UA-9999999-9", "internalWebPropertyId": "99999999", "tableId": "ga:84099180", "profileId": "9999999", "profileName": "Library Chart", "accountId": "99999999"}, 
"columnHeaders": [{"dataType": "STRING", "columnType": "DIMENSION", "name": "ga:browser"}, {"dataType": "INTEGER", "columnType": "METRIC", "name": "ga:users"}], 
"selfLink": "https://www.googleapis.com/analytics/v3/data/ga?ids=ga:84099180&dimensions=ga:browser&metrics=ga:users&start-date=2014-06-29&end-date=2014-07-29&max-results=10"}

Once you’ve tested a successful query, save it, which will allow the json string to become accessible to an application that can help to visualize this data. After saving, you will be directed to the management screen for your API, where you will need to click Activate Endpoint to begin publishing the results of the query in a way that is retrievable. Then click Start Scheduling so that the query data is refreshed on the schedule you determined when you built the query (e.g., once a day). Finally, click Refresh Data to return data for the first time so that you can start interacting with the data returned from your query. Return to your superProxy application’s Admin page, where you will be able to manage your query and locate the public end-point needed to create a chart visualization.

Using the Google Visualization API to visualize Google Analytics data

Included with the superProxy .zip file downloaded to your computer from the Github repository is a sample .html page located under /samples/superproxy-demo.html. This file uses the Google Visualization API to generate two pie charts from data returned from superProxy queries. The Google Visualization API is a service that can ingest raw data (such as json arrays that are returned by the superProxy) and generate visual charts and graphs. Save superproxy-demo.html onto a web server or onto your computer’s localhost.  We’ll set up the first pie chart to use the data from the Users by Browser query saved in your superProxy app.

Open superproxy-demo.html and locate this section:

var browserWrapper = new google.visualization.ChartWrapper({
// Example Browser Share Query
"containerId": "browser",
// Example URL: http://your-application-id.appspot.com/query?id=QUERY_ID&format=data-table-response
"dataSourceUrl": "REPLACE WITH Google Analytics superProxy PUBLIC URL, DATA TABLE RESPONSE FORMAT",
"refreshInterval": REPLACE_WITH_A_TIME_INTERVAL,
"chartType": "PieChart",
"options": {
"showRowNumber" : true,
"width": 630,
"height": 440,
"is3D": true,
"title": "REPLACE WITH TITLE"
}
});

Three values need to be modified to create a pie chart visualization:

  • dataSourceUrl: This value is the public end-point of the superProxy query you have created. To get this value, navigate to your superProxy admin page and click Manage Query on the Users by Browser query you have created. On this page, right click the DataTable (JSON Response) link and copy the URL (Figure 8). Paste the copied URL into superproxy-demo.html, replacing the text REPLACE WITH Google Analytics superProxy PUBLIC URL, DATA TABLE FORMAT. Leave quotes around the pasted URL.
Right-click the DataTable (JSON Response) link and copy the URL to your clipboard. The copied link will serve as the dataSouceUrl value in superproxy-demo.html.

Right-click the DataTable (JSON Response) link and copy the URL to your clipboard. The copied link will serve as the dataSouceUrl value in superproxy-demo.html.

  • refreshInterval – you can leave this value the same as the refresh Interval of your superProxy query (in seconds) – e.g., 86400.
  • title – this is the title that will appear above your pie chart, and should describe the data your users are looking at – e.g., Users by Browser.

Save the modified file to your server or local development environment, and load the saved page in a browser.  You should see a rather lovely pie chart:

Your pie chart's data will refresh automatically on the refresh schedule you set in your query.

Your pie chart’s data will refresh automatically based upon the Refresh Interval you specify in your superProxy query and your page’s JavaScript parameters.

That probably seemed like a lot of work just to make a pie chart.  But now that your app is set up, making new charts from your Google Analytics data just involves visiting your App Engine site, scheduling a new query, and referencing that with the Google Visualization API.  To me, the Google superProxy method has three distinct advantages over the simpler OOCharts method:

  • Security – Users won’t be able to view your API Keys by viewing the source of your dashboard’s web page
  • Stability – OOCharts might not be around forever.  For that matter, Google’s free App Engine service might not be around forever, but betting on Google is [mostly] a safe bet
  • Flexibility – You can create a huge range of queries, and test them out easily using the API explorer, and the Google Visualization API has extensive documentation and a fairly active user group from whom to gather advice and examples.

 

Notes

  1. There is a 50,000 request per day limit on the Analytics API.  That sounds like a lot, but it’s surprisingly easy to exceed. Consider creating a dashboard with 10 charts, each making a call to the Analytics API.  Without a service that caches the data, the data is refreshed every time a user loads a page.  After just 5,000 visits to the page (which makes 10 API calls – one for each chart – each time the page is loaded), the API limit is exceeded:  5,000 page loads x 10 calls per page = 50,000 API requests.
  2. You can use pre-built OOCharts Queries – https://app.oocharts.com/mc/query/list – to hide your profile ID (but not your API Key). There are many ways to minify and obfuscate client-side JavaScript to make it harder to read, but it’s still pretty much accessible to someone who wants to get it

Collecting Data: How Much do We Really Need?

Many of us have had conversations in the past few weeks about data collection due to the reports about the NSA’s PRISM program, but ever since April and the bombings at the Boston Marathon, there has been an increased awareness of how much data is being collected about people in an attempt to track down suspects–or, increasingly, stop potential terrorist events before they happen. A recent Nova episode about the manhunt for the Boston bombers showed one such example of this at the New York Police Department. This program is called the Domain Awareness System at the New York Police Department, and consists of live footage from almost every surveillance camera in the New York City playing in one room, with the ability to search for features of individuals and even the ability to detect people acting suspiciously. Added to that a demonstration of cutting edge facial recognition software development at Carnegie Mellon University, and reality seems to be moving ever closer to science fiction movies.

Librarians focused on technical projects love to collect data and make decisions based on that data. We try hard to get data collection systems as close to real-time as possible, and work hard to make sure we are collecting as much data as possible and analyzing it as much as possible. The idea of a series of cameras to track in real-time exactly what our patrons are doing in the library in real-time might seem very tempting. But as librarians, we value the ability of our patrons to access information with as much privacy as possible–like all professions, we treat the interactions we have with our patrons (just as we would clients, patients, congregants, or sources) with care and discretion (See Item 3 of the Code of Ethics of the American Library Association). I will not address the national conversation about privacy versus security in this post–I want to address the issue of data collection right where most of us live on a daily basis inside analytics programs, spreadsheets, and server logs.

What kind of data do you collect?

Let’s start with an exercise. Write a list of all the statistical reports you are expected to provide your library–for most of us, it’s probably a very long list. Now, make a list of all the tools you use to collect the data for those statistics.

Here are a few potential examples:

Website visitors and user experience

  • Google Analytics or some other web analytics tool
  • Heat map tool
  • Server logs
  • Surveys

Electronic resource access reports

  • Electronic resources management application
  • Vendor reports (COUNTER and other)
  • Link resolver click-through report
  • Proxy server logs

The next step may require a little digging. For library created tools, do you have a privacy policy for this data? Has it gone through the Institutional Review Board? For third-party tools, is there a privacy policy? What are the terms or use or user license? (And how many people have ever read the entire terms of service?). We will return to this exercise in a moment.

How much is enough?

Think about with these tools what type of data you are collecting about your users. Some of it may be very private indeed. For instance, the heat map tool I’ve recently started using (Inspectlet) not only tracks clicks, but actually records sessions as patrons use the website. This is fascinating information–we had, for instance, one session that was a patron opening the library website, clicking the Facebook icon on the page, and coming back to the website nearly 7 hours later. It was fun to see that people really do visit the library’s Facebook page, but the question was immediately raised whether it was a visit from on campus. (It was–and wouldn’t have taken long to figure out if it was a staff machine and who was working that day and time). IP addresses from off campus are very easy to track, sometimes down to the block–again, easy enough to tie to an individual. We like to collect IP addresses for abusive or spamming behavior and block users based on IP address all the time. But what about in this case? During the screen recordings I can see exactly what the user types in the search boxes for the catalog and discovery system. Luckily, Inspectlet allows you to obscure the last two octets (which is legally required some places) of the IP address, so you can have less information collected. All similar tools should allow you the same ability.

Consider another case: proxy server logs. In the past when I did a lot of EZProxy troubleshooting, I found the logs extremely helpful in figuring out what went wrong when I got a report of trouble, particularly when it had occurred a day or two before. I could see the username, what time the user attempted to log in or succeeded in logging in, and which resources they accessed. Let’s say someone reported not being able to log in at midnight– I could check to see the failed logins at midnight, and then that username successfully logging in at 1:30 AM. That was a not infrequent occurrence, as usually people don’t think to write back and say they figured out what they did wrong! But I could also see everyone else’s logins and which articles they were reading, so I could tell (if I wanted) which grad students were keeping up with their readings or who was probably sharing their login with their friend or entire company. Where I currently work, we don’t keep the logs for more than a day, but I know a lot of people are out there holding on to EZProxy logs with the idea of doing “something” with them someday. Are you holding on to more than you really want to?

Let’s continue our exercise. Go through your list of tools, and make a list of all the potentially personally identifying information the tool collects, whether or not you use them. Are you surprised by anything? Make a plan to obscure unused pieces of data on a regular basis if it can’t be done automatically. Consider also what you can reasonably do with the data in your current job requirements, rather than future study possibilities. If you do think the data will be useful for a future study, make sure you are saving anonymized data sets unless it is absolutely necessary to have personally identifying information. In the latter case, you should clear your study in advance with your Institutional Review Board and follow a data management plan.

A privacy and data management policy should include at least these items:

  • A statement about what data you are collecting and why.
  • Where the data is stored and who has access to it.
  • A retention timeline.

F0r example, in the past I collected all virtual reference transaction logs for studying the effectiveness of a new set of virtual reference services. I knew I wanted at least a year’s worth of logs, and ideally three years to track changes over time. I was able to save the logs with anonymized IP addresses and once I had the data I needed I was able to delete the actual transcripts. The privacy policy described the process and where the data would be stored to ensure it was secure. In this case, I used the RUSA Guidelines for Implementing and Maintaining Virtual Reference Services as a guide to creating this policy. Read through the ALA Guidelines to Drafting a Library Privacy Policy for additional specific language and items you should include.

What we can do with data

In all this I don’t at all mean to imply that we shouldn’t be collecting this data. In both the examples I gave above, the data is extremely useful in improving the patron experience even while giving identifying details away. Not collecting data has trade-offs. For years, libraries have not retained a patron’s borrowing record to protect his or her privacy. But now patrons who want to have an online record of what they’ve borrowed from the library must use third-party services with (most likely) much less stringent privacy policies than libraries. By not keeping records of what users have checked out or read through databases, we are unable to provide them personalized automated suggestions about what to read next. Anyone who uses Amazon regularly knows that they will try to tempt you into purchases based on your past purchases or books you were reading the preview of–even if you would rather no one know that you were reading that book and certainly don’t want suggestions based on it popping up when you are doing a collection development project at work and are logged in on your personal account. In all the decisions we make about collecting or not collecting data, we have to consider trade-offs like these. Is the service so important that the benefits of collecting the data outweigh the risks? Or, is there another way to provide the service?

We can see some examples of this trade-off in two similar projects coming out of Harvard Library Labs. One, Library Hose, was a Twitter stream with the name of every book being checked out. The service ran for part of 2010, and has been suspended since September of 2010. In addition to daily tweet limits, this also was a potential privacy violation–even if it was a fun idea (this blog post has some discussion about it). A newer project takes the opposite approach–books that a patron thinks are “awesome” can be returned to the Awesome Box at the circulation desk and the information about the book is collected on the Awesome Box website. This is a great tweak to the earlier project, since this advertises material that’s now available rather than checked out, and people have to opt in by putting the item in the box.

In terms of personal recommendations, librarians have the advantage of being able to form close working relationships with faculty and students so they can make personal recommendations based on their knowledge of the person’s work and interests. But how to automate this without borrowing records? One example is a project that Ian Chan at California State University San Marcos has done to use student enrollment data to personalize the website based on a student’s field of study. (Slides). This provides a great deal of value for the students, who need to log in to check their course reserves and access articles from off campus anyway. This adds on top of that basic need a list of recommended resources for students, which they can choose to star as favorites.

Conclusion

In thinking about what type of data you collect, whether on purpose or accidentally, spend some time thinking about what is strictly necessary to accomplish the work that you need to do. If you don’t need a piece of data but can’t avoid collecting it (such as full IP addresses or usernames), make sure you have a privacy policy and retention schedule, and ensure that it is not accessible to more people than absolutely necessary.

Work to educate your patrons about privacy, particularly online privacy. ALA has a Choose Privacy Week, which is always the first week in May. The site for that has a number of resources you might want to consult in planning programming. Academic librarians may find it easiest to address college students in terms of their presence on social media when it comes to future job hunting, but this is just an opening to larger conversations about data. Make sure that when you ask patrons to use a third party service (such as a social network) or recommend a service (such as a book recommending site) that you make sure they are aware of what information they are sharing.

We all know that Google’s slogan is “Don’t be evil”, but it’s not always clear if they are sticking to that. Make sure that you are not being evil in your own data collection.


A Librarian’s Guide to OpenRefine

Academic librarians working in technical roles may rarely see stacks of books, but they doubtless see messy digital data on a daily basis. OpenRefine is an extremely useful tool for dealing with this data without sophisticated scripting skills and with a very low learning curve. Once you learn a few tricks with it, you may never need to force a student worker to copy and paste items onto Excel spreadsheets.

As this comparison by the creator of OpenRefine shows, the best use for the tool is to explore and transform data, and it allows you to make edits to many cells and rows at once while still seeing your data. This allows you to experiment and undo mistakes easily, which is a great advantage over databases or scripting where you can’t always see what’s happening or undo the typo you made. It’s also a lot faster than editing cell by cell like you would do with a spreadsheet.

Here’s an example of a project that I did in a spreadsheet and took hours, but then I redid in Google Refine and took a lot less time. One of the quickest things to do with OpenRefine is spot words or phrases that are almost the same, and possibly are the same thing. Recently I needed to turn a large export of data from the catalog into data that I could load into my institutional repository. There were only certain allowed values that could be used in the controlled vocabulary in the repository, so I had to modify the bibliographic data from the catalog (which was of course in more or less proper AACR2 style) to match the vocabularies available in the repository. The problem was that the data I had wasn’t consistent–there were multiple types of abbreviations, extra spaces, extra punctuation, and outright misspellings. An example is the History Department. I can look at “Department of History”, “Dep. of History”, “Dep of Hist.” and tell these are probably all referring to the same thing, but it’s difficult to predict those potential spellings. While I could deal with much of this with regular expressions in a text editor and find and replace in Excel, I kept running into additional problems that I couldn’t spot until I got an error. It took several attempts of loading the data until I cleared out all the errors.

In OpenRefine this is a much simpler task, since you can use it to find everything that probably is the same thing despite the slight differences in spelling, punctuation and spelling. So rather than trying to write a regular expression that accounts for all the differences between “Department of History”, “Dep. of History”, “Dep of Hist.”, you can find all the clusters of text that include those elements and change them all in one shot to “History”. I will have more detailed instructions on how to do this below.

Installation and Basics

OpenRefine was called, until last October, Google Refine, and while the content from the Google Refine page is being moved to the Open Refine page you should plan to look at both sites. Documentation and video tutorials refer interchangeably to Google Refine and OpenRefine. The official and current documentation is on the OpenRefine GitHub wiki. For specific questions you will probably want to use the OpenRefine Custom Search Engine, which brings together all the mix of documentation and tutorials on the web. OpenRefine is a web app that runs on your computer, so you don’t need an internet connection to run it. You can get the installation instructions on this page.

While you can jump in right away and get started playing around, it is well worth your time to watch the tutorial videos, which will cover the basic actions you need to take to start working with data. As I said, the learning curve is low, but not all of the commands will make sense until you see them in action. These videos will also give you an idea of what you might be able to do with a data set you have lying around. You may also want to browse the “recipes” on the OpenRefine site, as well search online for additional interesting things people have done. You will probably think of more ideas about what to try. The most important thing to know about OpenRefine is that you can undo anything, and go back to the beginning of the project before you messed up.

A basic understanding of the Google Refine Expression Language, or GREL will improve your ability to work with data. There isn’t a whole lot of detailed documentation, so you should feel free to experiment and see what happens when you try different functions. You will see from the tutorial videos the basics you need to know. Another essential tool is regular expressions. So much of the data you will be starting with is structured data (even if it’s not perfectly structured) that you will need to turn into something else. Regular expressions help you find patterns which you can use to break apart strings into something else. Spending a few minutes understanding regular expression syntax will save hours of inefficient find and replace. There are many tutorials–my go-to source is this one. The good news for librarians is that if you can construct a Dewey Decimal call number, you can construct a regular expression!

Some ideas for librarians

 

(A) Typos

Above I described how you would use OpenRefine to clean up messy and inconsistent catalog data. Here’s how to do it. Load in the data, and select “Text Facet” on the column in question. OpenRefine will show clusters of text that is similar and probably the same thing.

AcademicDept Text Facet

AcademicDept Text Facet

 

Click on Cluster to get a menu for working with multiple values. You can click on the “Merge” check box and then edit the text to whatever you need it to be. You can also edit each text cluster to be the correct text.

Cluster and Edit

Cluster and Edit

You can merge and re-cluster until you have fixed all the typos. Back on the first Text Facet, you can hover over any value to edit it. That way even if the automatic clustering misses some you can edit the errors, or change anything that is the same but you need to look different–for instance, change “Dept. of English” to just “English”.

(B) Bibliographies

The main thing that I have used OpenRefine for in my daily work is to change a bibliography in plain text into columns in a spreadsheet that I can run against an API. This was inspired by this article in the Code4Lib Journal: “Using XSLT and Google Scripts to Streamline Populating an Institutional Repository” by Stephen X. Flynn, Catalina Oyler, and Marsha Miles. I wanted to find a way to turn a text CV into something that would work with the SHERPA/RoMEO API, so that I could find out which past faculty publications could be posted in the institutional repository. Since CVs are lists of data presented in a structured format but with some inconsistencies, OpenRefine makes it very easy to present the data in a certain way as well as remove the inconsistencies, and then to extend the data with a web service. This is a very basic set of instructions for how to accomplish this.

The main thing to accomplish is to put the journal title in its own column. Here’s an example citation in APA format, in which I’ve colored all the “separator” punctuation in red:

Heller, M. (2011). A Review of “Strategic Planning for Social Media in Libraries”. Journal of Electronic Resources Librarianship, 24 (4), 339-240)

From the drop-down menu at the top of the column click on “Split into several columns…” from the “Edit Column” menu. You will get a menu like the one below. This example finds the opening parenthesis and removes that in creating a new column. The author’s name is its own column, and the rest of the text is in another column.

Spit into columns

 

The rest of the column works the same way–find the next text, punctuation, or spacing that indicates a separation. You can then rename the column to be something that makes sense. In the end, you will end up with something like this:

Split columns

When you have the journal titles separate, you may want to cluster the text and make sure that the journals have consistent titles or anything else to clean up the titles. Now you are a ready to build on this data with fetching data from a web service. The third video tutorial posted above will explain the basic idea, and this tutorial is also helpful. Use the pull-down menu at the top of the journal column to select “Edit column” and then “Add column by fetching URLs…”. You will get a box that will help you construct the right URL. You need to format your URL in the way required by SHERPA/RoMEO, and will need a free API key. For the purposes of this example, you can use 'http://www.sherpa.ac.uk/romeo/api29.php?ak=[YOUR API KEY HERE]&qtype=starts&jtitle=' + escape(value,'url'). Note that it will give you a preview to see if the URL is formatted in the way you expect. Give your column a name, and set the Throttle delay, which will keep the service from rejecting too many requests in a short time. I found 1000 worked fine.

refine7

After this runs, you will get a new column with the XML returned by SHERPA/RoMEO. You can use this to pull out anything you need, but for this example I want to get pre-archiving and post-archiving policies, as well as the conditions. A quick way to to this is to use the Googe Refine Expression Language parseHtml function. To use this, click on “Add column based on this column” from the “Edit Column” menu, and you will get a menu to fill in an expression.

refine91

In this example I use the code value.parseHtml().select("prearchiving")[0].htmlText(), which selects just the text from within the prearchving element. Conditions are a little different, since there are multiple conditions for each journal. In that case, you would use the following syntax (after join you can put whatever separator you want): forEach(value.parseHtml().select("condition"),v,v.htmlText()).join(". ")"

So in the end, you will end up with a neatly structured spreadsheet from your original CV with all the bibliographic information in its own column and the publisher conditions listed. You can imagine the possibilities for additional APIs to use–for instance, the WorldCat API could help you determine which faculty published books the library owns.

Once you find a set of actions that gets your desired result, you can save them for the future or to share with others. Click on Undo/Redo and then the Extract option. You will get a description of the actions you took, plus those actions represented in JSON.

refine13

Unselect the checkboxes next to any mistakes you made, and then copy and paste the text somewhere you can find it again. I have the full JSON for the example above in a Gist here. Make sure that if you save your JSON publicly you remove your personal API key! When you want to run the same recipe in the future, click on the Undo/Redo tab and then choose Apply. It will run through the steps for you. Note that if you have a mistake in your data you won’t catch it until it’s all finished, so make sure that you check the formatting of the data before running this script.

Learning More and Giving Back

Hopefully this quick tutorial got you excited about OpenRefine and thinking about what you can do. I encourage you to read through the list of External Resources to get additional ideas, some of which are library related. There is lots more to learn and lots of recipes you can create to share with the library community.

Have you used OpenRefine? Share how you’ve used it, and post your recipes.

 


Event Tracking with Google Analytics

In a previous post by Kelly Sattler and Joel Richard, we explored using web analytics to measure a website’s success. That post provides a clear high-level picture of how to create an analytics strategy by evaluating our users, web content, and goals. This post will explore a single topic in-depth; how to set up event tracking in Google Analytics.

Why Do We Need Event Tracking?

Finding solid figures to demonstrate a library’s value and make strategic decisions is a topic of increasing importance. It can be tough to stitch together the right information from a hodgepodge of third-party services; we rely on our ILSs to report circulation totals, our databases to report usage like full-text downloads, and our web analytics software to show visitor totals. But are pageviews and bounce rates the only meaningful measure of website success? Luckily, Google Analytics provides a way to track arbitrary events which occur on web pages. Event tracking lets us define what is important. Do we want to monitor how many people hover over a carousel of book covers, but only in the first second after the page has loaded? How about how many people first hover over the carousel, then the search box, but end up clicking a link in the footer? As long as we can imagine it and JavaScript has an event for it, we can track it.

How It Works

Many people are probably familiar with Google Analytics as a snippet of JavaScript pasted into their web pages. But Analytics also exposes some of its inner workings to manipulation. We can use the _gaq.push method to execute a “_trackEvent” method which sends information about our event back to Analytics. The basic structure of a call to _trackEvent is:

_gaq.push( [ '_trackEvent', 'the category of the event', 'the action performed', 'an optional label for the event', 'an optional integer value that quantifies something about the event' ] );

Looking at the array parameter of _gaq.push is telling: we should have an idea of what our event categories, actions, labels, and quantitative details will be before we go crazy adding tracking code to all our web pages. Once events are recorded, they cannot be deleted from Analytics. Developing a firm plan helps us to avoid the danger of switching the definition of our fields after we start collecting data.

We can be a bit creative with these fields. “Action” and “label” are just Google’s way of describing them; in reality, we can set up anything we like, using category->action->label as a triple-tiered hierarchy or as three independent variables.

Example: A List of Databases

Almost every library has a web page listing third-party databases, be they subscription or open access. This is a prime opportunity for event tracking because of the numerous external links. Default metrics can be misleading on this type of page. Bounce rate—the proportion of visitors who start on one of our pages and then immediately leave without viewing another page—is typically considered a negative metric; if a page has a high bounce rate, then visitors are not engaged with its content. But the purpose of a databases page is to get visitors to their research destinations as quickly as possible; bounce rate is a positive figure. Similarly, time spent on page is typically considered a positive sign of engagement, but on a databases page it’s more likely to indicate confusion or difficulty browsing. With event tracking, we can not only track which links were clicked but we can make it so database links don’t count towards bounce rate, giving us a more realistic picture of the page’s success.

One way of structuring “database” events is:

  • The top-level Category is “database”
  • The Action is the topical category, e.g. “Social Sciences”
  • The Label is the name of the database itself, e.g. “Academic Search Premier”

The final, quantitative piece could be the position of the database in the list or the number of seconds after page load it took the user to click its link. We could report some boolean value, such as whether the database is open access or highlighted in some way.

To implement this, we set up a JavaScript function which will be called every time one of our events occur. We will store some contextual information in variables, push that information to Google Analytics, and then delay the page’s navigation so the event has a chance to be recorded. Let’s walk through the code piece by piece:

function databaseTracking  ( event ) {
    var destination = $( this )[ 0 ].href,
        resource = $( this ).text(),
        // move up from <a> to parent element, then find the nearest preceding <h2> section header
        section = $( this ).parent().prevAll( 'h2' )[ 0 ].innerText,
        highlighted = $( this ).hasClass( 'highlighted' ) ? 1 : 0;

_gaq.push( [ '_trackEvent', 'database', resource, section, highlighted ] );

The top of our function just grabs information from the page. We’re using jQuery to make our lives easier, so all the $( this ) pieces of our code refer to the element that initiated the event. In our case, that’s the link pointing to an external database which the user just clicked. So we set destination to the link’s href attribute, resource to its text (e.g. the database’s name), section to the text inside the h2 element that labels a topical set of databases, and highlighted is a boolean value equal to 1 if the element has a class of “highlighted.” Next, this data is pushed into the _gaq array which is a queue of functions and their parameters that Analytics fires asynchronously. In this instance, we’re telling Analytics to run the _trackEvent function with the parameters that follow. Analytics will then record an event of type “database” with an action of [database name], a label of [section header], and a boolean representing whether it was highlighted or not.

setTimeout( function () {
    window.location = destination;
}, 200 );
event.preventDefault();
}

Next comes perhaps the least obvious piece: we prevent the default browser behavior from occurring, which in the case of a link is navigating away from our page, but then send the user to destination 200 milliseconds later anyways. The _trackEvent function now has a chance to fire; if we let the user follow the link right away it might not complete and our event would not be recorded.1

$( document ).ready(
    // target all anchors in list of databases
    $( '#databases-list a' ).on( 'click', databaseTracking )
);

There’s one last step; merely defining the databaseTracking function won’t cause it to execute when we want it to. JavaScript uses event handlers to execute certain functions based on various user actions, such as mousing over or clicking an element. Here, we add click event handlers to all <a> elements in the list of databases. Now whenever a user clicks a link in the databases list (which has a container with id “databases-list”), databaseTracking will run and send data to Google Analytics.

There is a demo on JSFiddle which uses the code above with some sample HTML. Every time you click a link, a pop-up shows you what the _gaq.push array looks like.

Though we used jQuery in our example, any JavaScript library can be used with event tracking.2 The procedure is always the same: write a function that gathers data to send back to Google Analytics and then add that function as a handler to an appropriate event, such as click or mouseover, on an element.

For another example, complete with code samples, see the article “Discovering Digital Library User Behavior with Google Analytics” in Code4Lib Journal. In it, Kirk Hess of the University of Illinois Urbana-Champaign details how to use event tracking to see how often external links are clicked or files are downloaded. While these events are particularly meaningful to digital libraries, most libraries offer PDFs or other documents online.

Some Ideas

The true power of Event Tracking is that it does not have to be limited to the mere clicking of hyperlinks; any interaction which JavaScript knows about can be recorded and categorized. Google’s own Event Tracking Guide uses the example of a video player, recording when control buttons like play, pause, and fast forward are activated. Here are some more obvious use cases for event tracking:

  • Track video plays on particular pages; we may already know how many views a video gets, but how many come from particular embedded instances of the video?
  • Clicking to external content, such as a vendor’s database or another library’s study materials.
  • If there is a print or “download to PDF” button on our site, we can track each time it’s clicked. Unfortunately, only Internet Explorer and Firefox (versions >= 6.0) have an onbeforeprint event in JavaScript which could be used to detect when a user hits the browser’s native print command.
  • Web applications are particularly suited to event tracking. Many modern web apps have a single page architecture, so while the user is constantly clicking and interacting within the app they rarely generate typical interaction statistics like pageviews or exits.

 

Notes
  1. There is a discussion on the best way to delay outbound links enough to record them as events. A Google Analytics support page condones the setTimeout approach. For other methods, there are threads on StackOverflow and various blog posts around the web. Alternatively, we could use the onmousedown event which fires slightly earlier than onclick but also might record false positives due to click-and-drag scrolling.
  2. Below is an attempt at rewriting the jQuery tracking code in pure JavaScript. It will only work in modern browsers because of use of querySelectorAll, parentElement, and previousElementSibling. Versions of Internet Explorer prior to 9 also use a unique attachEvent syntax for event handlers. Yes, there’s a reason people use libraries to do anything the least bit sophisticated with JavaScript.
function databaseTracking  ( event ) {
        var destination = event.target.href,
            resource = event.target.innerHTML,
            section = "none",
            highlighted = event.target.className.match( /highlighted/ ) ? 1: 0;

        // getting a parent element's nearest <h2> sibling is non-trivial without a library
        var currentSibling = event.target.parentElement;
        while ( currentSibling !== null ) {
            if ( currentSibling.tagName !== "H2" ) {
                currentSibling = currentSibling.previousElementSibling;
            }
            else {
                section = currentSibling.innerHTML;
                currentSibling = null;
            }
        }

        _gaq.push( [ '_trackEvent', 'database', resource, section, highlighted ] );

        // delay navigation to ensure event is recorded
        setTimeout( function () {
            window.location = destination;
        }, 200 );
        event.preventDefault();
    }

document.addEventListener( 'DOMContentLoaded', function () {
        var dbLinks = document.querySelectorAll( '#databases-list a' ),
            len = dbLinks.length;
        for ( i = 0; i < len; i++ ) {
            dbLinks[ i ].addEventListener( 'click', databaseTracking, false );
        }
    }, false );
Association of College & Research Libraries. (n.d.). ACRL Value of Academic Libraries. Retrieved January 12, 2013, from http://www.acrl.ala.org/value/
Event Tracking – Web Tracking (ga.js) – Google Analytics — Google Developers. (n.d.). Retrieved January 12, 2013, from https://developers.google.com/analytics/devguides/collection/gajs/eventTrackerGuide
Hess, K. (2012). Discovering Digital Library User Behavior with Google Analytics. The Code4Lib Journal, (17). Retrieved from http://journal.code4lib.org/articles/6942
Marek, K. (2011). Using Web Analytics in the Library a Library Technology Report. Chicago, IL: ALA Editions. Retrieved from http://public.eblib.com/EBLPublic/PublicView.do?ptiID=820360
Sattler, K., & Richard, J. (2012, October 30). Learning Web Analytics from the LITA 2012 National Forum Pre-conference. ACRL TechConnect Blog. Blog. Retrieved January 18, 2013, from http://acrl.ala.org/techconnect/?p=2133
Tracking Code: Event Tracking – Google Analytics — Google Developers. (n.d.). Retrieved January 12, 2013, from https://developers.google.com/analytics/devguides/collection/gajs/methods/gaJSApiEventTracking
window.onbeforeprint – Document Object Model (DOM) | MDN. (n.d.). Mozilla Developer Network. Retrieved January 12, 2013, from https://developer.mozilla.org/en-US/docs/DOM/window.onbeforeprint

An Elevator Pitch for File Naming Conventions

As a curator and a coder, I know it is essential to use naming conventions.  It is important to employ a consistent approach when naming digital files or software components such as modules or variables. However, when a student assistant asked me recently why it was important not to use spaces in our image file names, I struggled to come up with an answer.  “Because I said so,” while tempting, is not really an acceptable response.  Why, in fact, is this important?  For this blog entry, I set out to answer this question and to see if, along the way, I could develop an “elevator pitch” – a short spiel on the reasoning behind file naming conventions.

The Conventions

As a habit, I implore my assistants and anyone I work with on digital collections to adhere to the following when naming files:

  • Do not use spaces or special characters (other than “-” and “_”)
  • Use descriptive file names.  Descriptive file names include date information and keywords regarding the content of the file, within a reasonable length.
  • Date information is the following format: YYYY-MM-DD.

So, 2013-01-03-SmithSculptureOSU.jpg would be an appropriate file name, whereas Smith Jan 13.jpg would not.  But, are these modern practices?  Current versions of Windows, for example, will accept a wide variety of special characters and spaces in naming files, so why is it important to restrict the use of these characters in our work?

The Search Results

A quick Google search finds support for my assertions, though often for very specific cases of file management.  For example, the University of Oregon publishes recommendations on file naming for managing research data.  A similar guide is available from the University of Illinois library, but takes a much more technical, detailed stance on the format of file names for the purposes of the library’s digital content.

The Bentley Historical Library at University of Michigan, however, provides a general guide to digital file management very much in line with my practices: use descriptive directory and file names, avoid special characters and spaces.  In addition, this page discusses avoiding personal names in the directory structure and using consistent conventions to indicate the version of a file.

The Why – Dates

The Bentley page also provides links to a couple of sources which help answer the “why” question.  First, there is the ISO date standard (or, officially, “ISO 8601:2004: Data elements and interchange formats — Information interchange — Representation of dates and times”).  This standard dictates that dates be ordered from largest term to smallest term, so instead of the month-day-year we all wrote on our grade school papers, dates should take the form year-month-day.  Further, since we have passed into a new millennium, a four digit year is necessary.  This provides a consistent format to eliminate confusion, but also allows for file systems to sort the files appropriately.  For example, let’s look at the following three files:

1960-11-01_libraryPhoto.jpg
1977-01-05_libraryPhoto.jpg
2000-05-01_libraryPhoto.jpg

If we expressed those dates in another format, say, month-day-year, they would not be listed in chronological order in a file system sorting alphabetically.  Instead, we would see:

01-05-1977_libraryPhoto.jpg
05-01-2000_libraryPhoto.jpg
11-01-1960_libraryPhoto.jpg

This may not be a problem if you are visually searching through three files, but what if there were 100?  Now, if we only used a two digit year, we would see:

00-05-01_libraryPhoto.jpg
60-11-01_libraryPhoto.jpg
77-01-05_libraryPhoto.jpg

If we did not standardize the number of digits, we might see:

00-5-1_libraryPhoto.jpg
77-1-5_libraryPhoto.jpg
60-11-1_libraryPhoto.jpg

You can try this pretty easily on your own system.  Create three text files with the names above, sort the files by name and check the order.  Imagine the problems this might create for someone trying to quickly locate a file.

You might ask, why include the date at all, when dates are also maintained by the operating system?  There are many situations where the operating system dates are unreliable.  In cases where a file moves to a new drive or computer, for example, the Date Created may reflect the date the file moved to the new system, instead of the initial creation date.  Or, consider the case where a user opens a file to view it and the application changes the Date Modified, even though the file content was not modified.  Lastly, consider our earlier example of a photograph from 1960; the Date Created is likely to reflect the date of digitization.  In each of these examples, it would be helpful to include an additional date in the file name.

The Why – Descriptive File Names

So far we have digressed down a date-specific path.  What about our other conventions?  Why are those important?  Also linked from the Bentley Library and in the Google search results are YouTube videos created by the State Library of North Carolina which answer some of these questions.  The Inform U series on file naming has four parts, and is intended to help users manage personal files.  However, the rationale described in Part 1 for descriptive file names in personal file management also applies in our libraries.

First, we want to avoid the accidental overwriting of files.  Image files can provide a good example here: many cameras use the file naming convention of IMG_1234.jpg.  If this name is unique to the system, that works ok, but in a situation where multiple cameras or scanners are generating files for a digital collection, there is potential for problems.  It is better to batch re-name image files with a more descriptive name. (Tutorials on this can be found all over the web, such as the first item in this list on using Mac’s Automator program to re-name a batch of photos).

Second, we want to avoid the loss of files due to non-descriptive names.  While many operating systems will search the text content of files, naming files appropriately makes for more efficient access.  For example, consider mynotes.docx and 2012-01-05WebMeetingNotes.docx – which file’s contents are easier to ascertain?

I should note, however, that there are cases where non-descriptive file names are appropriate.  The use of a unique identifier as a filename is sometimes a necessary approach.  However, in those cases where you must use a non-descriptive filename, be sure that the file names are unique and in a descriptive directory structure.  Overall, it is important that others in the organization have the same ability to find and use the files we currently manage, long after we have moved on to another institution.

The Why – Special Characters & Spaces

We have now covered descriptive names and reasons for including dates, which leaves us with spaces and special characters to address.  Part 3 of the Inform U video series addresses this as well.  Special characters can designate special meaning to programming languages and operating systems, and might be misinterpreted when included in file names.  For instance, the $ character designates the beginning of variable names in the php programming language and the \ character designates file path locations in the Windows operating system.

Spaces may make things easier for humans to read, but systems generally do better without the spaces.  While operating systems attempt to handle spaces gracefully and generally do so, browsers and software programs are not consistent in how they handle spaces.  For example, consider a file stored in a digital repository system with a space in the file name.  The user downloads the file and their browser truncates the file name after the first space.  This equates to the loss of any descriptive information after the first space.  Plus, the file extension is also removed, which may make it harder for less tech savvy users to use a file.

The Pitch

That example leads us to the heart of the issue: we never know where our files are going to end up, especially files disseminated to the public.  Our content is useless if our users cannot open it due to a poorly formatted file name.  And, in the case of non-public files or our personal archives, it is essential to facilitate the discovery of items in the piles and piles of digital material accumulated every day.

So, do I have my elevator pitch?  I think so.  When asked about file naming standards in the future, I think I can safely reply with the following:  “It is impossible to accurately predict all of the situations in which a file might be used.  Therefore, in the interest of preserving access to digital files, we choose file name components that are least likely to cause a problem in any environment.  File names should provide context and be easily understood by humans and computers, now and in the future.”

And, like a good file name, that is much more effective and descriptive than, “Because I said so.”


What exactly does a fiber-based gigabit speed library app look like?

Mozilla and the National Science Foundation are sponsoring an open round of submissions for developers/app designers to create fiber-based gigabit apps. The detailed contest information is available over at Mozilla Ignite (https://mozillaignite.org/about/). Cash prizes to fund promising start up ideas are being award for a total of $500,000 over three rounds of submissions. Note: this is just the start, and these are seed projects to garner interest and momentum in the area. A recent hackathon in Chattanooga lists out what some coders are envisioning for this space:  http://colab.is/2012/hackers-think-big-at-hackanooga/

The video for the Mozilla Ignite Challenge 2012 on Vimeo is slightly helpful for examples of gigabit speed affordances.

If you’re still puzzled after the video, you are not alone. One of the reasons for the contest is that network designers are not quite sure what immense levels of processing in the network and next generation transfer speeds will really mean.

Consider that best case transfer speeds on a network are somewhere along the lines of 10 megabits per second. There are of course variances of this speed across your home line (it may hover closer to 5 mb/s), but this is pretty much the standard that average subscribers can expect. A gigabit speed rate transfers data at 100 times that speed, 1,000 megabits per second. When a whole community is able to achieve 1,000 megabits upstream and downstream, you basically have no need for things like “streaming” video – the data pipes are that massive.

One theory is that gigabit apps could provide public benefit, solve societal issues and usher in the next generation of the Internet. Think of gigabit speed as the difference between getting water (Internet) through a straw, and getting water (Internet) through a fire-hose. The practicality of this contest is to seed startups with ideas that will in some way impact healthcare (realtime health monitoring), the environment and energy challenges. The local Champaign Urbana municipal gigabit speed fiber cause is noble, as it will provide those in areas without access to broadband an awesome pipeline to the Internet. It is a intergovernmental partnership that aims to serve municipal needs, as well as pave the way for research and industry start-ups.

Here are some attributes that Mozilla Ignite Challenge lists as the possible affordances of fiber based gigabit speed apps:

  • Speed
  • Video
  • Big Data
  • Programmable networks

As I read about the Mozilla Ignite open challenge, I wondered about the possibilities for libraries and as a thought experiment I list out here some ideas for library services that live on gigabit speed networks:

* Consider the video data you could provide access to – in libraries that are stewarding any kind of video gigabit speeds would allow you to provide in-library viewing that has few bottlenecks. A fiber-based gigabit speed video viewing app of all library video content available at once. Think about viewing every video in your collection simultaneously. You could have them playing to multiple clusters (grid videos) in the library at multiple stations. Without streaming.

* Consider Sensors and sensor arrays and fiber. One idea that is promulgated for the use of fiber-based gigabit speed networks are the affordances to monitor large amounts of data in real time. The sensor networks that could be installed around the library facility could help to throttle energy consumption in real time, making the building more energy efficient and less costly to maintain. Such a facilities based app would impact savings on the facilities budgets.

* Consider collaborations among libraries with fiber affordances. Libraries that are linked by fiber-based gigabit speeds would be able to transfer large amounts of data in fractions of what it takes now. There are implications here for data curation infrastructure (http://smartech.gatech.edu/handle/1853/28513).

Another way to approach this problem is by asking: “What problem does gigabit speed networking solve?” One of the problems with the current web is the de facto  latency. Your browser needs to request a page from a server which is then sent back to your client. We’ve gotten so accustomed to this latency that we expect it, but what if pages didn’t have to be requested? What if servers didn’t have to send pages? What if a zero latency web meant we needed a new architecture to take advantage of data possibilities?

Is your library poised to take advantage of increased data transfer? What apps do you want to get funding for?

 


Rapid Prototyping Mobile Augmented Reality Applications

This Fall semester the Undergraduate Library at the University of Illinois at Urbana Champaign along with partners from the Graduate School of Library and Information Science and Computer Science graduate students with experience in programming OpenCV, will begin coding an open source mobile Augmented Reality (AR) app for deeper in-library engagement with both print and physical resources. The funding comes from a recently awarded IMLS Sparks! Grant. Our objectives include the following:

  • Create shelf recognition software for mobile devices that integrate print and digital resources into the on-site library experience and experiment with location based recommendation services.
  • Investigate the potential of creating a system that shows users how they are physically navigating an “idea space.”
  • Complete iterative rapid use studies of mobile software with library patrons and communicate results back to programming staff for incremental app design.
  • Work with our Library IT staff to identify skills and technical infrastructure needed in order to make AR an ongoing part of technology in libraries.
  • Make available the AR apps through the Library’s mobile labs experimental apps area (http://m.library.illinois.edu/labs.asp).

There are multiple problems with access to the variety of collections in our networked era (Lee, 2000) including their highly disparate nature (many vended platforms serving licensed library content) and their increasing intangibility (the move to massively electronic, or e-only access in libraries and information centers). Moreover, library collection developers are faced with the challenge of providing increased access to digital while still maintaining print. Lee (2000) argues for library research redefining library collections as information contexts.

This work will address the contextual information needs of library users while leveraging recent advances in mobile-networked technologies, experimenting with a way to increase access to collections of all types. The research team will deploy, test, and evaluate mobile applications that create novel “augmented book stacks.”

 

(a.) Subject of book stack set is identified by app index, and displayed on interface. (b.) Recommendations (e-book, digital items, or databases) brought onto interface in real-time. (c.) Popular books are indicated on title using circulation data from integrated library system historical circulation count (this can be a Z39.50 call or a pre-loaded circulation report database).

 

To create such applications, researchers will make use of video functionality that augment shelves of interest to a user in the library stacks inserting interactive graphics through the video feed of a phone onto the physical book stack environment in real-time. As a comparison to current state of the art mobile AR apps, like the ShelvAR app in development at Miami University, the proposed system does not require 2D tags as targets on books, but rather uses a combination of computer vision software code for feature detection and optical character recognition (OCR) software to parse the text of titles, call numbers, and subjects on the book stacks. A prototype project for OCR running in Android can be implemented following this tutorial. Our research group does not propose a replication of the state of the art, but will implement a system that pushes forward the state of the art in innovation for research and learning with AR in library stacks.

The project team will experiment with overlaying relevant resources from other parts of the library’s collection such as the library’s licensed set of databases, other Internet based resources, or books that are relevant but not shelved nearby. This augmentation will enhance the serendipitous discovery of books so that items relevant to a user’s location, but not shelved near her can be brought into the browsing experience; with this technology books that are checked out, or otherwise unavailable can still be made useful to a users information search. Our staff will experiment with system features that create “idea spaces” for the user, which will serve to help students and library users exploit previous discovery routes and spaces in the book stacks. The premise of “idea spaces” comes from an unspoken assumption among librarians: the intellectual organization of items in library collections are valuable constructs. By presenting graphical overlays of the subject areas of the collection, we make this assumption explicit and assert that as a user navigates the geographic spaces of a library collection, they are actually navigating intellectual spaces. With a user location is paired an idea (or set of related ideas), delivered in our proposed system with a graphical overly in the video feed. The user’s location, her context in the collection, is the query point for the idea spaces system.

This experiment will be valuable for all libraries that support print and digital resources. Underscoring this work is the overarching concern with making all library collections more accessible. Researchers will undertake rapid prototyping (as a test case for the chosen method see: Jones & Richey, 2000) of the augmented reality feature set in order to understand user preferences of mobile interfaces that best support location-based recommendations, and make all results of this experimentation including software code and computing workflows freely available. Such experimentation could lead to profound changes in the way people research and learn in library spaces.

Grant activities will begin in October 2012 and conclude September 2013. The evaluation plan for the grant is a systematic measurement of project outputs against the stated goals with the resulting evaluative outputs communicating what worked and was useful for library patrons in AR apps. By operationalizing a rapid evaluation of augmented reality services the research team hopes to identify the fail points for mobile services in this domain in addition to the most desired and useful feature set for all augmented reality systems in library book stacks.

Cited

Jones, T. & Richey, R.  (2000) “Rapid Prototyping methodology in action: a developmental study,” Educational Technology Research and Development 48, 63-80.

Lee, H. (2000), “What is a collection?” Journal of the American Society for Information Science, 51 (12) 1106-1113.

Suggested Reading

Regarding collocation objectives in library science see: Svenonius, E. (2000), The Intellectual Foundation of Information Organization, MIT Press, Cambridge, MA, pp.21-22

See also

Additional sample code for image processing with Android devices available here, courtesy of openly available lecture notes from Stanford’s Digital Image Processing Course EE368.

Forthcoming this October, a paper detailing additional AR use cases in library services: Hahn, J. (2012). Mobile augmented reality applications for library services. New Library World 113 (9/10)


2012 Eyeo Festival

Tuesday June 5th through Friday June 8th 2012, 500 creatives from numerous fields such as, computer science, art, design, data visualization, gathered together to listen, converse, and participate in the second Eyeo Festival. Held in Minneapolis, MN at the Walker Art Center, the event organizers created an environment of learning, exchange, exploration, and fun. There were various workshops with some top names leading the way. Thoughtfully curated presentations throughout the day complemented keynotes held nightly in party-like atmospheres: Eyeo was an event not to be missed. Ranging from independent artists to the highest levels of innovative companies, Eyeo offered inspiration on many levels.

Why the Eyeo Festival?
As I began to think about what I experienced at the Eyeo Festival, I struggled to express exactly how impactful this event was for me and those I connected with. In a way, Eyeo is like TED and in fact, many presenters have given TED talks. Eyeo has a more targeted focus on art, design, data, and creative code but it is also so much more than that. With an interactive art and sound installation, Zygotes, by Tangible Interaction kicking off the festival, though the video is a poor substitute to actually being there, it still evokes a sense of wonder and possibility. I strongly encourage anyone who is drawn to design, data, art, interaction or to express their creativity through code to attend this outstanding creative event and follow the incredible people that make up the impressive speaker list.

I went to the Eyeo Festival because I like to seek out what professionals in other fields are doing. I like staying curious and stretching outside my comfort zone in big ways, surrounding myself with people doing things I don’t understand, and then trying to understand them. Over the years I’ve been to many library conferences and there are some amazing events with excellent programming but they are, understandably, very library-centric. So, to challenge myself, I decided to go to a conference where there would be some content related to libraries but that was not a library conference. There are many individuals and professions outside of libraries that care about many of the same values and initiatives we do, that work on similar kinds of problems, and have the same drive to make the world a better place. So why not talk to them, ask questions, learn, and see what their perspective is? How do they approach and solve problems? What is their process in creating? What is their perspective and attitude? What kind of communities are they part of and work with?

I was greatly inspired by the group of librarians who have attended the SXSWi Festival which has grown further over the years. There are a now a rather large number of librarians speaking about and advocating for libraries in such an innovative and elevated platform. There is even a Facebook Group where professionals working in libraries, archives, and museums can connect with each other for encouragement, support, and collaborations in relation to SXSWi. Andrea Davis, Reference & Instruction Librarian at the Dudley Knox Library, Naval Postgraduate School in Monterey, CA, has been heavily involved in offering leadership in getting librarians to collaborate at SXSW. She states, “I’ve found it absolutely invigorating to get outside of library circles to learn from others, and to test the waters on what changes and effects are having on those not so intimately involved in libraries. Getting outside of library conferences keeps the blood flowing across tech, publishing, education. Insularity doesn’t do much for growth and learning.”

I’ve also been inspired by librarians who have been involved in the TED community, such as Janie Herman and her leadership with Princeton Public Library’s local TEDx in addition to her participation in the TEDxSummit in Doha, Qatar. Additionally, Chrystie Hill, the Community Relations Director at OCLC, has given more than one TedX talk about libraries. Seeing our library colleagues represent our profession in arenas broader than libraries is energizing and infectious.

Librarians having a seat at the table and a voice at two of the premier innovative gatherings in the world is powerful. This concept of librarians embedding themselves in communities outside of librarianship has been discussed in a number of articles including The Undergraduate Science Librarian and In the Library With the Lead Pipe.

Highlights
Rather than giving detailed comprehensive coverage of Eyeo, you’ll see a glimpse of a few presentations plus a number of resources so that you can see for yourself some of the amazing, collaborative work being done. Presenter’s names link to the full talk that you can watch for yourself. Because a lot of the work being done is interactive and participatory in some way, I encourage you to seek these projects out and interact with them. The organizers are in the midst of processing a lot of videos and putting them up on the Eyeo Festival Vimeo channel; I highly recommend watching them and checking back for more.

Ben Fry
Principal of Fathom, a Boston based design and data visualization firm, and co-initiator of the programming language Processing, Ben Fry’s work in data visualization and design is worth delving into. In his Eyeo presentation, 3 Things, the project that most stood out was the digitization project Fathom produced for GE: http://fathom.info/latest/category/ge. Years of annual reports were beautifully digitized and incorporated into an interactive web application they built from scratch. When faced with scanning issues, they built a tool that improved the scanned results.

Jer Thorp
Data artist in residence for the New York Times, and former geneticist, Jer Thorp’s range in working with data, art, and design is far and wide. Thorp is one of the few founders of the Eyeo Festival and in his presentation Near/Far he discussed several data visualization projects with the focus on storytelling. The two main pieces that stood out from Jer’s talk was his encouragement to dive into data visualization. He even included 10 year old, Theodore Zaballos’ handmade visualization of The Illiad which was rather impressive. The other piece that stood out was his focus on data visualization in context to location and people owning their own data versus a third party. This lead into the Open Paths project he showcased. He has also presented to librarians at the Canadian library conference, Access 2011.

Jen Lowe
Jen Lowe was by far the standout from all of the amazing Ignite Eyeo talks. She spoke about how people are intrinsically inspired by storytelling and the need for those working with data to focus on storytelling through the use of visualizing data and the story it tells. She works for the Open Knowledge Foundation in addition to running Datatelling and she has her library degree (she’s one of us!).

Jonathan Harris
Jonathan Harris gave one of the most personal and poignant presentations at Eyeo. In a retrospective of his work, Jonathan covered years of work interwoven with personal stories from his life. Jonathan is an artist and designer and his work life and personal life are rarely separated. Each project began with the initial intention and ended with a more critical inward examination from the artist. The presentation led to his most recent endeavor, the Cowbird project, where storytelling once again emerges strongly. In describing this project he focused on the idea that technology and software could be used for good, in a more human way, created by “social engineers” to build a community of storytellers. He describes Cowbird as “a community of storytellers working to build a public library of human experience.”

Additional people + projects to delve into:

Fernanda Viegas and Martin Wattenberg of the Google Big Picture data visualization group. Wind Map: http://hint.fm/wind/

Kyle McDonald: http://kylemcdonald.net/

Tahir Temphill: http://tahirhemphill.com/ and his latest work, Hip Hop Word Count: http://staplecrops.com/index.php/hiphop_wordcount/

Julian Oliver: http://julianoliver.com/

Nicholas Felton of Facebook: http://feltron.com/

Aaron Koblin of the Google Data Arts Group: http://www.aaronkoblin.com/ and their latest project with the Tate Modern: http://www.exquisiteforest.com/

Local Projects: http://localprojects.net/

Oblong Industries: http://oblong.com/

Eyebeam Art + Technology Center: http://eyebeam.org/

What can libraries get from the Eyeo Festival?

Libraries and library work are everywhere at this conference. That this eclectic group of creative people were often thinking about and producing work similar to librarians is thrilling. There is incredible potential for libraries to embrace some of the concepts and problems in many of the presentations I saw and conversations I was part of. There are multiple ways that libraries could learn from and perhaps participate in this broader community and work across fields.

People love libraries and these attendees were no exception. There were attendees from numerous private/corporate companies, newspapers, museums, government, libraries, and more. I was not the only library professional in attendance so I suspect those individuals might see the potential I see, which I also find really exciting. The drive behind every presenter and attendee was by far creativity in some form, the desire to make something, and communicate. The breadth of creativity and imagination that I saw reminded me of a quote from David Lankes in his keynote from the New England Library Association Annual Conference:

“What might kill our profession is not ebooks, Amazon or Google, but a lack of imagination. We must envision a bright future for librarians and the communities they serve, then fight to make that vision a reality. We need a new activist librarianship focused on solving the grand challenges of our communities. Without action we will kill librarianship.”

If librarianship is in need of more imagination and perhaps creativity too, there is a world of wonder out there in terms of resources to help us achieve this vision.

The Eyeo Festival is but one place where we can become inspired, learn, and dream and then bring that experience back to our libraries and inject our own imagination, ideas, experimentation, and creativity into the work we do. By doing the most creative, imaginative library work we can do will inspire our communities; I have seen it first hand. Eyeo personally taught me that I need to fail more, focus more, make more, and have more fun doing it all.