Themes from the Knight Foundation’s News Challenge Grant Competition

This year’s Knight Foundation’s News Challenge project grant competition focused on the premise that libraries are “key for improving Americans’ ability to know about and to be involved with what takes place around them.”  While the Knight Foundation’s mission traditionally focuses on journalism and media, the Foundation has funded several projects related to libraries in the past, such as projects enabling Chicago Public Library and New York Public Library to lend Wi-Fi hot-spots to patrons, Jason Griffey’s LibraryBox Project, and a DPLA project to clarify intellectual property rights for libraries sharing digital materials.  Winners receive a portion of around $5 million in funding, and typical awards range between $200,000 and $400,000, with some additional funding available for smaller (~$35,000) start-up projects.

The News Challenge this year called for ideas that “leverage libraries as a platform to build more knowledgeable communities”.  There were over 600 proposals in this years’ News Challenge (all of which you can find here), and 42 proposals made it into the semi-finals, with winners to be announced January 30th, 2015.  While not all of the proposed ideas specifically involve digital technologies, most propose the creation of some kind of application or space where library users can access and learn about new technologies and digital skills.

Notably, while the competition was open to anyone (not just libraries or librarians), around 30 of the 42 finalist projects have at least one librarian or library organization sponsor on the team.1  Nearly all the projects involve the intersection of multiple disciplines, including history, journalism, engineering, education/instructional design, music, and computer science.  Three general themes seem to have emerged from this year’s News Challenge finalists:  1) maker spaces tailored to specific community needs; 2) libraries innovating new ways to publish, curate, and lend DRM-free ebooks and other content; and 3) facilitating the preservation of born-digital user-generated histories.

MakerSpaces Made for Communities

Given how popular makerspaces have become in libraries2 over the past several years, it’s not surprising to see that many News Challenge proposals  seek funding for the space and/or equipment to create library makerspaces.  What is interesting about many of these makerspace proposals is that many of them highlight the need to develop makerspaces that are specifically relevant to to the interests and needs of the local communities a library serves.

For example, one proposal out of Philadelphia – Libraries as Hip-Hop Techspace – proposes equipping a library space with tools for learning about and creating digital music via hip-hop.  Another proposal out of Vermont focuses on teaching digital literacies and skillsets via makerspace technologies in rural communities.  Of course, many proposals include 3D printers, but what stands out about proposals that have made it through to the finals (like this one that proposes 3D printed books for blind children, for example, or this one that proposes a tiny makerspace for a small community) are those that emphasize how the project would use 3D printing and associated technologies in an innovative way that is still meaningful to specific community learning needs and interests.

One theme that runs through these maker-related proposals, whether they come from cities or rural locations, is the relevance of these spaces to the business and economic needs of the communities in which they are proposed.  Several entries point out the potential for library maker-spaces to be entrepreneurial incubators designed to enable users to develop, prototype, and market their own software or other products.  Many of these proposals specifically mention how the current “digital divide” is, at its heart, an economic divide – such as this maker space proposal from San Jose, which argues that “In order for the next generation of Silicon Valley’s leaders to rise from its neighborhoods, access to industry knowledge and tools should begin early to inspire participation, experimentation, and collaboration – essential skills of the thriving economy in the area.”  Creativity and experimentation are increasingly essential skills in the labor force, and these proposals highlight how a library’s role in fostering creative expression ultimately provides an economic benefit.

Publishing and Promoting DRM-Free Content

Like maker spaces, libraries serving as publishers is not a new trend.3  What is interesting about Knight Foundation proposals that that center on this topic are the projects that position libraries as publishers of content that otherwise would struggle to get published in the current marketplace, while recognizing the desire for libraries to more easily lend digital content to their users.

A great example of this is a proposal by the developers of JukePop to leverage “libraries’ ebook catalogs so they become THE publishing platform for indie authors looking to be discovered by the most avid readers”  JukePop is a publishing platform for indie authors that enables distribution of ebooks – often in serial form – to readers who can provide feedback to authors.  Downloads from library websites are DRM-free.4 Other proposals – like this proposal for improving the licensing models of indie games – also have a core value of figuring out new ways to streamline user access to content while providing benefits and compensation to creators.

A somewhat different – and super interesting – manifestation of this theme is the Gitenberg project, which proposes utilizing Github as a platform for version control to enable libraries to contribute to improving digital manuscripts and metadata for Project Gutenberg.  The project already has a repository that you can start forking from and making pull requests to improve Project Gutenberg data.

Telling Community Stories through Born-Digital Media Preservation

Libraries and archives have always held the role of preserving the cultural heritage of the communities they serve, but face challenges in easily  gathering, preserving, and curating born-digital media.  While individual users capture huge amounts of media on mobile devices, as the Open Archive proposal puts it, “the most common destination for this media is on social media platforms that can chill free speech and are not committed to privacy, authentication, or long-term preservation.”  StoryCorps, for example, proposes the creation of better tools and distributed education for libraries and archives to record and preserve diverse community stories through interviews.  CurateScape “seeks to remake public life through a distributed and participatory network of digital storytelling,” emphasizing the need for technology that can be adopted by libraries to document local community histories.  The Internet Archive proposes more streamlined tools and frameworks to enable users to more easily submit content for preservation to archive.org.  The Open Archive project, also associated with the Internet Archive, emphasizes empowering users through their local libraries and archives to capture and submit their media to community collections.  Recognizing that there is still an enormous amount of pre-digital content waiting to be preserved New York Public Library seeks to “democratize the digitization process” through mobile and distributed digitization tools.

The element that seems to motivate many of these proposals is an emphasis on unique, community-centric collections that tell a story.  They focus on the importance of local institutions as connection points for users to share their content, and their experiences, with the world, while also documenting the context surrounding those experiences.  Libraries and archives are uniquely positioned as the logical place to document local histories in the interest of long-term preservation, but need better solutions.  It’s definitely exciting to see the energy behind these proposals and the innovative solutions that are on the horizon.

Start Thinking about Next Year!

Full disclosure:  I was part of a team that submitted a proposal to the News Challenge, and while we’ve been notified it won’t be funded, it was a fantastic learning experience.  I definitely made some great connections with other Challenge participants (many tweets have been exchanged!) – and it’s not surprising to me that even in a competition, library people find ways to work together and collaborate.  While winners for this years’ competition won’t be announced until January 30th, and the specific “challenge” prompt for the next competition won’t be identified until late next year (and may not be specifically library-related), I would definitely encourage anyone with a good, relevant idea to think about applying to next years’ News Challenge.

There’s also some interesting stuff to be found in the “Inspiration” section of the News Challenge, where people could submit articles, discussions, and ideas relevant to this year’s Challenge.  One of the articles linked to in that section is the transcript of a 2013 speech by author Neil Gaiman, which features a lot of lovely bits of motivation for every library enthusiast – like this one, which I think captures a central thread running through almost every News Challenge proposal:  “A library is a place that is a repository of information and gives every citizen equal access to it… It’s a community space. It’s a place of safety, a haven from the world. It’s a place with librarians in it. What the libraries of the future will be like is something we should be imagining now.”

Notes

  1. The projects that do not have direct library/librarian involvement listed on the project team – including “Cowbird” (https://newschallenge.org/challenge/libraries/evaluation/building-libraries-of-human-experience-transforming-america-s-libraries-into-community-storytelling-centers-for-the-digital-age) and “Libraries as Hip-Hop TechSpace” (https://newschallenge.org/challenge/libraries/refinement/libraries-as-hip-hop-techspace) are definitely interesting to read.  I do think they provide a unique perspective on how the ethos and mission of libraries is perceived by those not necessarily embedded in the profession – and I’m encouraged that these proposals do show a fairly clear understanding of the directions libraries are moving in.
  2. http://acrl.ala.org/techconnect/?p=2340
  3. See http://lj.libraryjournal.com/2014/03/publishing/the-public-library-as-publisher and http://dx.doi.org/10.3998/3336451.0017.207
  4. Santa Clara County Library district is a development partner with JukePop – you can check out their selection of JukePop-sourced DRM-free titles available here:  http://www.sccl.org/Browse/eBooks-Downloads/Episodic-Fiction

This is How I Work (Lauren Magnuson)

Editor’s Note: This post is part of ACRL TechConnect’s series by our regular and guest authors about The Setup of our work.

Lauren Magnuson, @lpmagnuson

Location: Los Angeles, CA

Current Gig:

Systems & Emerging Technologies Librarian, California State University Northridge (full-time)

Development Coordinator, Private Academic Library Network of Indiana (PALNI) Consortium (part-time, ~10/hrs week)

Current Mobile Device: iPhone 4.  I recently had a chance to upgrade from an old slightly broken iPhone 4, so I got….another iPhone4.  I pretty much only use my phone for email and texting (and rarely, phone calls), so even an old iPhone is kind of overkill for me.

Current Computer:

  • Work:  work-supplied HP Z200 Desktop, Windows 7, dual monitors
  • Home: (for my part-time gig): Macbook Air 11”

Current Tablet: iPad 2, work-issued, never used

One word that best describes how you work: relentlessly

What apps/software/tools can’t you live without?

  • Klok – This is time-tracking software that allows you to ‘clock-in’ when working on a project.  I use it primarily to track time spent working my part-time gig.  My part-time gig is hourly, so I need to track all the time I spend working that job.  Because I love the work I do for that job, I also need to make sure I work enough hours at my full-time job.  Klok allows me to track hours for both and generate Excel timesheets for billing.  I use the free version, but the pro version looks pretty cool as well.
  • Trello – I use this for the same reasons everyone else does – it’s wonderfully simple but does exactly what I need to do.  People often drop by my office to describe a problem to me, and unless I make a Trello card for it, the details of what needs to be done can get lost.  I also publish my CSUN Trello board publically and link it from my email signature.
  • Google Calendar – I stopped using Outlook for my primary job and throw everything into Google Calendar now.  I also dig Google Calendar’s new feature that integrates with Gmail so that hotel reservations and flights are automatically added to your Google Calendar.
  • MAMP/XAMPP – I used to only do development work on my Macbook Air with MAMP and Terminal, which meant I carted it around everywhere – resulting in a lot of wear and tear.  I’ve stopped doing that and invested some time in in setting up a development environment with XAMPP and code libraries on my Windows desktop.  Obviously I then push everything to remote git repositories so that I can pull code from either machine to work on it whether I’m at home or at work.
  • Git (especially Git Shell, which comes with Git for Windows) – I was initially intimidated about learning git – it definitely takes some trial and error to get used to the commands and how fetching/pulling/forking/merging all work together.  But I’m really glad I took the time to get comfortable with it.  I use both GitHub (for code that actually works and is shared publically) and BitBucket (for hacky stuff that doesn’t work yet and needs to be in a private repo).
  • Oxygen XML Editor – I don’t always work with XML/XSLT, but when I have to, Oxygen makes it (almost) enjoyable.
  • YouMail – This is a mobile app that, in the free version, sends you an email every time you have a voicemail or missed call on your phone.  At work, my phone is usually buried in the nether-regions of of my bag, and I usually keep it on silent, so I probably won’t be answering my mobile at work.  YouMail allows me to not worry where my phone is or if I’m missing any calls.  (There is a Pro version that transcribes your voicemail that I do not pay for, but seems like it might be cool if you need that kind of thing).
  • Infinite Storm – It rarely rains in southern California.  Sometimes you just need some weather to get through the day.  This mobile app makes rain and thunder sounds.

Physical:

  • Post It notes (though I’m trying to break this habit)
  • Basic Logitech headset for webinars / Google hangouts.  I definitely welcome suggestions for a headset that is more comfortable – the one I have weirdly crushes my ears.
  • A white board I use to track information literacy sessions that I teach

What’s your workspace like?

I’m on the fourth floor of the Oviatt Library at CSUN, which is a pretty awesome building.  Fun fact:  the library building was the shooting location for Star Fleet Academy scenes in JJ Abrams’ 2009 Star Trek movie, (but I guess it got destroyed by Romulans because they have a different Academy in Into Darkness):

Oviatt Library as Star Fleet Academy

My office has one of the very few windows available in the building, which I’m ambivalent about.  I truly prefer working in a cave-like environment with only the warm glow of my computer screen illuminating the space, but I also do enjoy the sunshine.

I have nothing on my walls and keep few personal effects in my office – I try to keep things as minimal as possible.  One thing I do have though is my TARDIS fridge, which I keep well-stocked with caffeinated beverages (yes, it does make the whoosh-whoosh sound, and I think it is actually bigger on the inside).

tardis

I am a fan of productivity desktop wallpapers – I’m using these right now, which help peripherally see how much time has elapsed when I’m really in the zone.

When I work from home, I mostly work from my living room couch.

What’s your best time saving trick  When I find I don’t know how to do (like when I recently had to wrangle my head around Fedora Commons content models, or learning Ruby on Rails for Hydra), I assign myself some ‘homework’ to read about it later rather than trying to learn the new thing during working hours.  This helps me avoid getting lost in a black hole of Stack Overflow for several hours a day.

What’s your favorite to do list manager Trello

Besides your phone and computer, what gadget can’t you live without?

Mr. Coffee programmable coffee maker

What everyday thing are you better at than everyone else? Troubleshooting

What are you currently reading?  I listen to audiobooks I download from LAPL (Thanks, LAPL!), and I particularly like British mystery series.  To be honest, I kind of tune them out when I listen to them at work, but they keep the part of my brain that likes to be distracted occupied.

In print, I’m currently reading:

What do you listen to while at work?  Mostly EDM now, which is pretty motivating and helps me zone in on whatever I’m working on.  My favorite Spotify station is mostly Deadmau5.

Are you more of an introvert or an extrovert? Introvert

What’s your sleep routine like?  I love sleep.  It is my hobby.  Usually I sleep from around 11 PM to 7 AM; but my ideal would be sleeping between like 9 PM and 9 AM.  Obviously that would be impractical.

Fill in the blank: I’d love to see _________ answer these same questions.  David Walker @ the CSU Chancellor’s Office

What’s the best advice you’ve ever received? 

Do, or Do Not, There is no Try.
Applies equally to using the Force and programming.

Analyzing EZProxy Logs

Analyzing EZProxy logs may not be the most glamorous task in the world, but it can be illuminating. Depending on your EZProxy configuration, log analysis can allow you to see the top databases your users are visiting, the busiest days of the week, the number of connections to your resources occurring on or off-campus, what kinds of users (e.g., staff or faculty) are accessing proxied resources, and more.

What’s an EZProxy Log?

EZProxy logs are not significantly different from regular server logs.  Server logs are generally just plain text files that record activity that happens on the server.  Logs that are frequently analyzed to provide insight into how the server is doing include error logs (which can be used to help diagnose problems the server is having) and access logs (which can be used to identify usage activity).

EZProxy logs are a kind of modified access log, which record activities (page loads, http requests, etc.) your users undertake while connected in an EZProxy session. This article will briefly outline five potential methods for analyzing EZProxy logs:  AWStats, Piwik, EZPaarse, a custom Python script for parsing starting-point URLS (SPU) logs, and a paid option called Splunk.

The ability of  any log analyzer will of course depend upon how your EZProxy log directives are configured.  You will need to know your LogFormat and/or LogSPU directives in order to configure most log file analyzing solutions.  In EZProxy, you can see how your logs are formatted in config.txt/ezproxy.cfg by looking for the LogFormat directive, 1  e.g.,

LogFormat %h %l %u %t “%r” %s %b “%{user-agent}i”

and / or, to log Starting Point URLs (SPUs):

LogSPU -strftime log/spu/spu%Y%m.log %h %l %u %t “%r” %s %b “%{ezproxy-groups}i”

Logging Starting Point URLs can be useful because those tend to be users either clicking into a database or the full-text of an article, but no activity after that initial contact is logged.  This type of logging does not log extraneous resource loading, such as loading scripts and images – which often clutter up your traditional LogFormat logs and lead to misleadingly high hits.  LogSPU directives can be defined in addition to traditional LogFormat to provide two different possible views of your users’ data.  SPULogs can be easier to analyze and give more interesting data, because they can give a clearer picture of which links and databases are most popular  among your EZProxy users.  If you haven’t already set it up, SPULogs can be a very useful way to observe general usage trends by database.

You can find some very brief anonymized EZProxy log sample files on Gist:

On a typical EZProxy installation, historical monthly logs can be found inside the ezproxy/log directory.  By default they will rotate out every 12 months, so you may only find the past year of data stored on your server.

AWStats

Get It:  http://www.awstats.org/#DOWNLOAD

Best Used With:  Full Logs or SPU Logs

Code / Framework:  Perl

    An example AWStats monthly history report. Can you tell when our summer break begins?
An example AWStats monthly history report. Can you tell when our summer break begins?

AWStats Pros:

  • Easy installation, including on localhost
  • You can define your unique LogFormat easily in AWStats’ .conf file.
  • Friendly, albeit a little bit dated looking, charts show overall usage trends.
  • Extensive (but sometimes tricky) customization options can be used to more accurately represent sometimes unusual EZProxy log data.
Hourly traffic distribution in AWStats.  While our traffic peaks during normal working hours, we have steady usage going on until about 1 AM, after which point it crashes pretty hard.  We could use this data to determine  how much virtual reference staffing we should have available during these hours.
Hourly traffic distribution in AWStats. While our traffic peaks during normal working hours, we have steady usage going on until about Midnight, after which point it crashes pretty hard. We could use this data to determine how much virtual reference staffing we should have available during these hours.

 

AWStats Cons:

  • If you make a change to .conf files after you’ve ingested logs, the changes do not take effect on already ingested data.  You’ll have to re-ingest your logs.
  • Charts and graphs are not particularly (at least easily) customizable, and are not very modern-looking.
  • Charts are static and not interactive; you cannot easily cross-section the data to make custom charts.

Piwik

Get It:  http://piwik.org/download/

Best Used With:  SPULogs, or embedded on web pages web traffic analytic tool

Code / Framework:  Python

piwik visitor dashboard
The Piwik visitor dashboard showing visits over time. Each point on the graph is interactive. The report shown actually is only displaying stats for a single day. The graphs are friendly and modern-looking, but can be slow to load.

Piwik Pros:

  • The charts and graphs generated by Piwik are much more attractive and interactive than those produced by AWStats, with report customizations very similar to what’s available in Google Analytics.
  • If you are comfortable with Python, you can do additional customizations to get more details out of your logs.
Piwik file ingestion in PowerShell
To ingest a single monthly log took several hours. On the plus side, with this running on one of Lauren’s monitors, anytime someone walked into her office they thought she was doing something *really* technical.

Piwik Cons:

  • By default, parsing of large log files seems to be pretty slow, but performance may depend on your environment, the size of your log files and how often you rotate your logs.
  • In order to fully take advantage of the library-specific information your logs might contain and your LogFormat setup, you might have to do some pretty significant modification of Piwik’s import_logs.py script.
When looking at popular pages in Piwik you’re somewhat at the mercy that the subdirectories of databases have meaningful labels; luckily EBSCO does, as shown here.  We have a lot of users looking at EBSCO Ebooks, apparently.
When looking at popular pages in Piwik you’re somewhat at the mercy that the subdirectories of database URLs have meaningful labels; luckily EBSCO does, as shown here. We have a lot of users looking at EBSCO Ebooks, apparently.

EZPaarse

Get Ithttp://analogist.couperin.org/ezpaarse/download

Best Used With:  Full Logs or SPULogs

Code / Framework:  Node.js

ezPaarse’s friendly drag and drop interface.  You can also copy/paste lines for your logs to try out the functionality by creating an account at http://ezpaarse.couperin.org.
ezPaarse’s friendly drag and drop interface. You can also copy/paste lines for your logs to try out the functionality by creating an account at http://ezpaarse.couperin.org.

EZPaarse Pros:

  • Has a lot of potential to be used to analyze existing log data to better understand e-resource usage.
  • Drag-and-drop interface, as well as copy/paste log analysis
  • No command-line needed
  • Its goal is to be able to associate meaningful metadata (domains, ISSNs) to provide better electronic resource usage statistics.
ezPaarse Excel output generated from a sample log file, showing type of resource (article, book, etc.) ISSN, publisher, domain, filesize, and more.
ezPaarse Excel output generated from a sample log file, showing type of resource (article, book, etc.) ISSN, publisher, domain, filesize, and more.

EZPaarse Cons:

  • In Lauren’s testing, we couldn’t get of the logs to ingest correctly (perhaps due to a somewhat non-standard EZProxy logformat) but the samples files provided worked well. UPDATE 11/26:  With some gracious assistance from EZPaarse’s developers, we got EZPaarse to work!  It took about 10 minutes to process 2.5 million log lines, which is pretty awesome. Lesson learned – if you get stuck, reach out to ezpaarse [at] couperin.org or tweet for help @ezpaarse.  Also be sure to try out some of the pre-defined parameters set up by other institutions under Parameters. Check out the comments below for some more detail from ezpaarse’s developers.
  • Output is in Excel Sheets rather than a dashboard-style format – but as pointed out in the comments below, you can optionally output the results in JSON.

Write Your Own with Python

Get Started With:  https://github.com/robincamille/ezproxy-analysis/blob/master/ezp-analysis.py

Best used with: SPU logs

Code / Framework:  Python

code
Screenshot of a Python script, available at Robin Davis’ Github

 

Custom Script Pros:

  • You will have total control over what data you care about. DIY analyzers are usually written up because you’re looking to answer a specific question, such as “How many connections come from within the Library?”
  • You will become very familiar with the data! As librarians in an age of user tracking, we need to have a very good grasp of the kinds of data that our various services collect from our patrons, like IP addresses.
  • If your script is fairly simple, it should run quickly. Robin’s script took 5 minutes to analyze almost 6 years of SPU logs.
  • Your output will probably be a CSV, a flexible and useful data format, but could be any format your heart desires. You could even integrate Python libraries like Plotly to generate beautiful charts in addition to tabular data.
  • If you use Python for other things in your day-to-day, analyzing structured data is a fun challenge. And you can impress your colleagues with your Pythonic abilities!

 

Action shot: running the script from the command line. (Source)
Action shot: running the script from the command line.

Custom Script Cons:

  • If you have not used Python to input/output files or analyze tables before, this could be challenging.
  • The easiest way to run the script is within an IDE or from the command line; if this is the case, it will likely only be used by you.
  • You will need to spend time ascertaining what’s what in the logs.
  • If you choose to output data in a CSV file, you’ll need more elbow grease to turn the data into a beautiful collection of charts and graphs.
output
Output of the sample script is a labeled CSV that divides connections by locations and user type (student or faculty). (Source)

Splunk (Paid Option)

Best Used with:  Full Logs and SPU Logs

Get It (as a free trial):  http://www.splunk.com/download

Code / Framework:  Various, including Python

A Splunk distribution showing traffic by days of the week.  You can choose to visualize this data in several formats, such as a bar chart or scatter plot.  Notice that this chart was generated by a syntactical query in the upper left corner:  host=lmagnuson| top limit=20 date_wday
A Splunk distribution showing traffic by days of the week. You can choose to visualize this data in several formats, such as a bar chart or scatter plot. Notice that this chart was generated by a syntactical query in the upper left corner: host=lmagnuson| top limit=20 date_wday

Splunk Pros:  

  • Easy to use interface, no scripting/command line required (although command line interfacing (CLI) is available)
  • Incredibly fast processing.  As soon as you import a file, splunk begins ingesting the file and indexing it for searching
  • It’s really strong in interactive searching.  Rather than relying on canned reports, you can dynamically and quickly search by keywords or structured queries to generate data and visualizations on the fly.
Here's a search for log entries containing a URL (digital.films.com), which Splunk uses to create a chart showing the hours of the day that this URL is being accessed.  This particular database is most popular around 4 PM.
Here’s a search for log entries containing a URL (digital.films.com), which Splunk uses to display a chart showing the hours of the day that this URL is being accessed. This particular database is most popular around 4 PM.

Splunk Cons:

    • It has a little bit of a learning curve, but it’s worth it for the kind of features and intelligence you can get from Splunk.
    • It’s the only paid option on this list.  You can try it out for 60 days with up to 500MB/day a day, and certain non-profits can apply to continue using Splunk under the 500MB/day limit.  Splunk can be used with any server access or error log, so a library might consider partnering with other departments on campus to purchase a license.2

What should you choose?

It depends on your needs, but AWStats is always a tried and true easy to install and maintain solution.  If you have the knowledge, a custom Python script is definitely better, but obviously takes time to test and develop.  If you have money and could partner with others on your campus (or just need a one-time report generated through a free trial), Splunk is very powerful, generates some slick-looking charts, and is definitely work looking into.  If there are other options not covered here, please let us know in the comments!

About our guest author: Robin Camille Davis is the Emerging Technologies & Distance Services Librarian at John Jay College of Criminal Justice (CUNY) in New York City. She received her MLIS from the University of Illinois Urbana-Champaign in 2012 with a focus in data curation. She is currently pursuing an MA in Computational Linguistics from the CUNY Graduate Center.

Notes
  1. Details about LogFormat and what each %/lettter value means can be found at http://www.oclc.org/support/services/ezproxy/documentation/cfg/logformat.en.html; LogSPU details can be found http://oclc.org/support/services/ezproxy/documentation/cfg/logspu.en.html
  2. Another paid option that offers a free trial, and comes with extensions made for parsing EZProxy logs, is Sawmill: https://www.sawmill.net/downloads.html

Hacking in Python with PyMARC

marc icon with an arrow pointing to a spreadsheet icon

The pymarc Python library is an extremely handy library that can be used for writing scripts to read, write, edit, and parse MARC records.  I was first introduced to pymarc through Heidi Frank’s excellent Code4Lib journal article on using pymarc in conjunction with MARCedit.1.  Here at ACRL TechConnect, Becky Yoose has also written about some cool applications of pymarc.

In this post, I’m going to walk through a couple of applications of using pymarc to make comma-separated (.csv) files for batch loading in DSpace and in the KBART format (e.g., for OCLC KnowledgeBase purposes).  I know what you’re thinking – why write a script when I can use MARCedit for that?  Among MARCedit’s many indispensable features is its Export Tab-Delimited Data feature. 2.  But there might be a couple of use cases where scripts can come in handy:

  • Your MARC records lack data that you need in the tab-delimited or .csv output – for example, a field used for processing in whatever system you’re loading the data into that isn’t found in the MARC record source.
  • Your delimited file output requires things like empty fields or parsed, modified fields.  MARCedit is great for exporting whole MARC fields and subfields into delimited files, but you may need to pull apart a MARC element or parse a MARC element out with some additional data.  Pymarc is great for this.
  • You have to process on a lot of records frequently, and just want something tailor-made for your use case.  While you can certainly use MARCedit to save a frequently used export template, editing that saved template when you need to change the output parameters of your delimited file isn’t easy.  With a Python script, you can easily edit your script if your delimited file requirements change.

 

Some General Notes about Python

Python is a useful scripting language to know for a lot of reasons, and for small scripting projects can have a fairly shallow learning curve.  In order to start using Python, you’ll need to set your development environment.  Macs already come with Python installed, but to double-check which version you have installed, launch Terminal and type Python -v.  In a Windows environment, you’ll need to do a few more steps, detailed here.  Full Python documentation for 2.x can be found here (though personally, I find it a little dense at times!), and CodeAcademy features some pretty extensive tutorials to help you learn the syntax quickly.  Google also has some pretty extensive tutorials on Python.  Personally, I find it easiest to learn when I have a project that actually means something to me, so if you’re familiar with MARC records,  just downloading the Python scripts mentioned in this post below and learning how to run them can be a good start.

Spacing and indentation in Python is very important, so if you’re getting errors running scripts make sure that your conditional statements are indented properly 3. Also the version of Python on your machine makes a really big difference, and the version will determine whether your code runs successfully.  These examples have all been tested with Python 2.6 and 2.7, but not with Python 3 or above.  I find that Python 2.x has more help resources out on the web, which is nice to be able to take advantage of when you’re first starting out.

Use Case 1:  MARC to KBART

The complete script described below, along with some sample MARC records, is on GitHub.

The KBART format is a NISO/United Kingdom Serials Group (UKSG) initiative designed to standardize information for use with Knowledge Base systems.  The KBART format is a series of standardized fields that can be used to identify serial coverage (e.g., start date and end date) URLs, publisher information, and more.4  Notably, it’s the required format for adding and modifying customized collections in OCLC’s Knowledge Base. 5.

In this use case, I’m using OCLC’s modified KBART file – most of the fields are KBART standard but a few fields are specific to OCLC’s Knowledge Base, e.g., oclc_collection_name.6.  Typically, these KBART files would be created either manually, or generated by using some VLOOKUP Excel magic with an existing KBART file 7.  In my use case, I needed to batch migrate a bunch of data stored in MARC records to the OCLC Knowledge Base, and these particular records didn’t always correspond neatly to existing OCLC Knowledge Base Collections.  For example, one collection we wanted to manage in the OCLC Knowledge Base was the library’s print holdings so that these titles were displayed in OCLC’s A-Z Journal lookup tool.

First, I’ll need a few helper libraries for my script:

import csv
from pymarc import MARCReader
from os import listdir
from re import search

Then, I’ll declare the source directory where my MARC records are (in this case, in the same directory the script lives in a folder called marc) and instruct the script to go find all the .mrc files in that directory.  Note that this enables the script to process either a single file of multiple MARC records, or multiple distinct files, all at once.

# change the source directory to whatever directory your .mrc files are in
SRC_DIR = 'marc/'

The script identifies MARC records by looking for all files ending in .mrc using the Python filter function.  Within the filter function, lambda creates a one-off anonymous function to define the filter parameters: 8

# get a list of all .mrc files in source directory 
file_list = filter(lambda x: search('.mrc', x), listdir(SRC_DIR))

Now I’ll define the output parameters.  I need a tab-delimited file, not a comma-delimited file, but either way I’ll use the csv.writer function to create a .txt file and define the tab delimiter (\t).  I also don’t really want the fields quoted unless there’s a special character or something that might cause a problem reading the file, so I’ll set quoting to minimal:

#create tab delimited text file that quotes if special characters are present
csv_out = csv.writer(open('kbart.txt', 'w'), delimiter = '\t', 
quotechar = '"', quoting = csv.QUOTE_MINIMAL)


And I’ll also create the header row, which includes the KBART fields required by the OCLC Knowledge Base:

#create the header row
csv_out.writerow(['publication_title', 'print_identifier', 
'online_identifier', 'date_first_issue_online', 
'num_first_vol_online', 'num_first_issue_online', 
'date_last_issue_online', 'num_last_vol_online', 
'num_last_issue_online', 'title_url', 'first_author', 
'title_id', 'coverage_depth', 'coverage_notes', 
'publisher_name', 'location', 'title_notes', 'oclc_collection_name', 
'oclc_collection_id', 'oclc_entry_id', 'oclc_linkscheme', 
'oclc_number', 'action'])

Next, I’ll start a loop for writing a row into the tab-delimited text file for every MARC record found.

for item in file_list:
 fd = file(SRC_DIR + '/' + item, 'r')
 reader = MARCReader(fd)

By default, we’ll need to set each field’s value to blank (”), so that errors are avoided if the value is not present in the MARC record.  We can set all the values to blank to start with in one statement:

for record in reader:
    publication_title = print_identifier = online_identifier 
    = date_first_issue_online = num_first_vol_online 
    = num_first_issue_online = date_last_issue_online 
    = num_last_vol_online = num_last_issue_online 
    = title_url = first_author = title_id = coverage_depth 
    = coverage_notes = publisher_name = location = title_notes 
    = oclc_collection_name = oclc_collection_id = oclc_entry_id 
    = oclc_linkscheme = oclc_number = action = ''

Now I can start pulling in data from MARC fields.  At its simplest, for each field defined in my CSV, I can pull data out of MARC fields using a construction like this:

    #title_url
    if record ['856'] is not None:
      title_url = record['856']['u']

I can use generic python string parsing tools to clean up fields.  For example, if I need to strip the ending / out of a MARC 245$a field, I can use .rsplit to return everything before that final slash (/):

 
   # publication_title
    if record['245'] is not None:
      publication_title = record['245']['a'].rsplit('/', 1)[0]
      if record['245']['b'] is not None:
          publication_title = publication_title + " " + record['245']['b']

Also note that once you’ve declared a variable value (like publication_title) you can re-use that variable name to add more onto the string, as I’ve done with the 245$b subtitle above.

As is usually the case for MARC records, the trickiest business has to do with parsing out serial ranges.  The KBART format is really designed to capture, as accurately as possible, thresholds for beginning and ending dates.  MARC makes this…complicated.  Many libraries might use the 866 summary field to establish summary ranges, but there are varying local practices that determine what these might look like.  In my case, the minimal information I was looking for included beginning and ending years, and the records I was processing with the script stored summary information fairly cleanly in the 863 and 866 fields:

    # date_first_issue_online
    if record ['866'] is not None:
      date_first_issue_online = record['863']['a'].rsplit('-', 1)[0]
 
    # date_last_issue_online
    if record ['866'] is not None:
      date_last_issue_online = record['866']['a'].rsplit('-', 1)[-1]

Where further adjustments were necessary, and I couldn’t easily account for the edge cases programmatically, I imported the KBART file into OpenRefine for further cleanup.  A sample KBART file created by this script is available here.

Use Case 2:  MARC to DSpace Batch Upload

Again, the complete script for transforming MARC records for DSpace ingest is on GitHub.

This use case is very similar to the KBART scenario.  We use DSpace as our institutional repository, and we had about 2000 historical theses and dissertation MARC  records to transform and ingest into DSpace.  DSpace facilitates batch-uploading metadata as CSV files according to the RFC4180 format, which basically means all the field elements use double quotes. 9  While the fields being pulled from are different, the structure of the script is almost exactly the same.

When we first define the CSV output, we need to ensure that quoting is used for all elements – csv.QUOTE_ALL:

csv_out = csv.writer(open('output/theses.txt', 'w'), delimiter = '\t', 
quotechar = '"', quoting = csv.QUOTE_ALL)

The other nice thing about using Python for this transformation is the ability to program in static text that would be the same for all the lines in the CSV file.  For example, I parsed together a more friendly display of the department the thesis was submitted to like so:

# dc.contributor.department
if record ['690']['x'] is not None:
dccontributordepartment = ('UniversityName.  Department of ') + 
record['690']['x'].rsplit('.', 1)[0]

You can view a sample output file created by this script on here on Github.

Other PyMARC Projects

There are lots of other, potentially more interesting things that can be done with pymarc, although certainly one of its most common applications is converting MARC to more flexible formats, such as MARCXML 10.  If you’re interested in learning more,  join the pymarc Google Group, where you can often get help by the original pymarc developers (Ed Summers, Mark Matienzo, Gabriel Farrell and Geoffrey Spear).

Notes
  1. Frank, Heidi. “Augmenting the Cataloger’s Bag of Tricks: Using MarcEdit, Python, and pymarc for Batch-Processing MARC Records Generated from the Archivists’ Toolkit.” Code4Lib Journal, 20 (2013):  http://journal.code4lib.org/articles/8336
  2. https://www.youtube.com/watch?v=qkzJmNOvY00
  3. For the specifications on indentation, take a look at http://legacy.python.org/dev/peps/pep-0008/#id12
  4. http://www.uksg.org/kbart/s5/guidelines/data_fields
  5. http://oclc.org/content/dam/support/knowledge-base/kb_modify.pdf
  6.  A sample blank KBART Excel sheet for OCLC’s Knowledge Base can be found here:  http://www.oclc.org/support/documentation/collection-management/kb_kbart.xlsx.  KBART files must be saved as tab-delimited .txt files prior to upload, but are obviously more easily edited manually in Excel
  7. If you happen to be in need of this for OCLC’s KB, or  are in need of comparing two Excel files, here’s a walkthrough of how to use VLOOKUP to select owned titles from a large OCLC KBART file: http://youtu.be/mUhkMzpPnBE
  8. A nice tutorial on using lambda and filter together can be found here:  http://www.u.arizona.edu/~erdmann/mse350/topics/list_comprehensions.html#filter
  9.  https://wiki.duraspace.org/display/DSDOC18/Batch+Metadata+Editing
  10.  http://docs.evergreen-ils.org/2.3/_migrating_your_bibliographic_records.html

Two Free Methods for Sharing Google Analytics Data Visualizations on a Public Dashboard

UPDATE (October 15th, 2014):

OOCharts as an API and service, described below, will be shut down November 15th, 2014.  More information about the decision by the developers to shut down the service is here. It’s not entirely surprising that the service going away, considering the Google SuperProxy option described in this post.  I’m leaving all the instructions here for OOCharts for posterity, but as of November 15th you can only use the SuperProxy or build your own with the Core Reporting API.

At this point, Google Analytics is arguably the standard way to track website usage data.  It’s easy to implement, but powerful enough to capture both wide general usage trends and very granular patterns.  However, it is not immediately obvious how to share Google Analytics data either with internal staff or external stakeholders – both of whom often have widespread demand for up-to-date metrics about library web property performance.  While access to Google Analytics can be granted by Google Analytics administrators to individual Google account-holders through the Google Analytics Admin screen, publishing data without requiring authentication requires some intermediary steps.

There are two main free methods of publishing Google Analytics visualizations to the web, and both involve using the Google Analytics API: OOCharts and the Google Analytics superProxy.  Both methods rely upon an intermediary service to retrieve data from the API and cache it, both to improve retrieval time and to avoid exceeding limits in API requests.1  The first method – OOCharts – requires much less time to get running initially. However, OOCharts’ long-standing beta status, and its status as a stand-alone service has less potential for long-term support than the second method, the Google Analytics superProxy.  For that reason, while OOCharts is certainly easier to set up, the superProxy method is definitely worth the investment in time (depending on what your needs).  I’ll cover both methods.

OOCharts Beta

OOCharts’ service facilitates the creation and storage of Google Analytics API Keys, which are required for sending secure requests to the API in order to retrieve data.

When setting up an OOCharts account, create your account utilizing the email address you use to access Google Analytics.  For example, if you log into Google Analytics using admin@mylibrary.org, I suggest you use this email address to sign up for OOCharts.  After creating an account with OOCharts, you will be directed to Google Analytics to authorize the OOCharts service to access your Google Analytics data on your behalf.

Let OOCharts gather Google Analytics Data on your Behalf.
Let OOCharts gather Google Analytics Data on your Behalf.

After authorizing the service, you will be able to generate an API key for any of the properties your to which your Google Analytics account has access.

In the OOCharts interface, click your name in the upper right corner, and then click inside the Add API Key field to see a list of available Analytics Properties from your account.

 

Once you’ve select a property, OOCharts will generate a key for your OOCharts application.  When you go back to the list of API Keys, you’ll see your keys along with property IDs (shown in brackets after the URL of your properties, e.g., [9999999].

APIkeys
Your API Keys now show your key values and your Google Analytics property IDs, both of which you’ll need to create visualizations with the OOCharts library.

 

Creating Visualizations with the OOCharts JavaScript Library

OOCharts appears to have started as a simple chart library for visualization data returned from the Google Analytics API. After you have set up your OOCharts account, download the front-end JavaScript library (available at http://docs.oocharts.com/) and upload it to your server or localhost.   Navigate to the /examples directory and locate the ‘timeline.html’ file.  This file contains a simplified example that displays web visits over time in Google Analytics’ familiar timeline format.

The code for this page is very simple, and contains two methods – JavaScript-only and with HTML Attributes – for creating OOCharts visualizations.  Below, I’ve separated out the required elements for both methods.  While either methods will work on their own, using HTML attributes allows for additional customizations and styling:

JavaScript-Only
		<h3>With JS</h3>
		<div id='chart'></div>
		
		<script src='../oocharts.js'></script>
		<script type="text/javascript">

			window.onload = function(){

				oo.setAPIKey("{{ YOUR API KEY }}");

				oo.load(function(){

					var timeline = new oo.Timeline("{{ YOUR PROFILE ID }}", "180d");

					timeline.addMetric("ga:visits", "Visits");

					timeline.addMetric("ga:newVisits", "New Visits");

					timeline.draw('chart');

				});
			};

		</script>
With HTML Attributes:
<h3>With HTML Attributes</h3>
	<div data-oochart='timeline' 
        data-oochart-metrics='ga:visits,Visits,ga:newVisits,New Visits' 
        data-oochart-start-date='180d' 
        data-oochart-profile='{{ YOUR PROFILE ID }}'></div>
		
			<script src='../oocharts.js'></script>
			<script type="text/javascript">

			window.onload = function(){

				oo.setAPIKey("{{ YOUR API KEY }}");

				oo.load(function(){
				});
			};

		</script>

 

For either method, plugin {{ YOUR API KEY }} where indicated with the API Key generated in OOCharts and replace {{ YOUR PROFILE ID }} with the associated eight-digit profile ID.  Load the page in your browser, and you get this:

With the API Key and Profile ID in place, the timeline.html example looks like this.  In this example I also adjusted the date paramter (30d by default) to 180d for more data.
With the API Key and Profile ID in place, the timeline.html example looks like this. In this example I also adjusted the date parameter (30d by default) to 180d for more data.

This example shows you two formats for the chart – one that is driven solely by JavaScript, and another that can be customized using HTML attributes.  For example, you could modify the <div> tag to include a style attribute or CSS class to change the width of the chart, e.g.:

<h3>With HTML Attributes</h3>

<div style=”width:400px” data-oochart='timeline' 
data-oochart-start-date='180d' 
data-oochart-metrics='ga:visits,Visits,ga:newVisits,New Visits' 
data-oochart-profile='{{ YOUR PROFILE ID }}'></div>

Here’s the same example.html file showing both the JavaScript-only format and the HTML-attributes format, now with a bit of styling on the HTML attributes chart to make it smaller:

You can use styling to adjust the HTML attributes example.
You can use styling to adjust the HTML attributes example.

Easy, right?  So what’s the catch?

OOCharts only allows 10,000 requests a month – which is even easier to exceed than the 50,000 limit on the Google Analytics API.  Each time your page loads, you use another request.  Perhaps more importantly, your Analytics API key and profile ID are pretty much ‘out there’ for the world to see if they view your page source, because those values are stored in your client-side JavaScript2.  If you’re making a private intranet for your library staff, that’s probably not a big deal; but if you want to publish your dashboard fully to the public, you’ll want to make sure those values are secure.  You can do this with the Google Analytics superProxy.

Google Analytics superProxy

In 2013, Google Analytics released a method of accessing Google Analytics API data that doesn’t require end-users to authenticate in order to view data, known as the Google Analytics superProxy.  Much like OOCharts, the superProxy facilitates the creation of a query engine that retrieves Google Analytics statistics through the Google Analytics Core Reporting API, caches the statistics in a separate web application service, and enables the display of Google Analytics data to end users without requiring individual authentication. Caching the data has the additional benefit of ensuring that your application will not exceed the Google Core Reporting API request limit of 50,000 requests each day. The superProxy can be set up to refresh a limited number of times per day, and most dashboard applications only need a daily refresh of data to stay current.
The required elements of this method are available on the superProxy Github page (Google Analytics, “Google Analytics superProxy”). There are four major parts to the setup of the superProxy:

  1. Setting up Google App Engine hosting,
  2. Preparing the development environment,
  3. Configuring and deploying the superProxy to Google’s App Engine Appspot host; and
  4. Writing and scheduling queries that will be used to populate your dashboard.
Set up Google App Engine hosting

First, your Google Analytics account credentials to access the Google App Engine at https://appengine.google.com/start. The superProxy application you will be creating will be freely hosted by the Google App Engine. Create your application and designate an Application Identifier that will serve as the endpoint domain for queries to your Google Analytics data (e.g., mylibrarycharts.appspot.com).

Create your App Engine Application
Create your App Engine Application

You can leave the default authentication option, Open to All Google Users, selected.  This setting only reflects access to your App Engine administrative screen and does not affect the ability for end-users to view the dashboard charts you create.  Only those Google users who have been authorized to access Google Analytics data will be able to access any Google Analytics information through the Google App Engine.

Ensure that API access to Google Analytics is turned on under the Services pane of the Google Developer’s Console. Under APIs and Auth for your project, visit the APIs menu and ensure that the Analytics API is turned on.

Turn on the Google Analytics API.
Turn on the Google Analytics API.  Make sure the name under the Projects label in the upper left corner is the same as your newly created Google App Engine project (e.g., librarycharts).

 

Then visit the Credentials menu to set up an OAuth 2.0 Client ID. Set the Authorized JavaScript Origins value to your appspot domain (e.g., http://mylibrarycharts.appspot.com). Use the same value for the Authorized Redirect URI, but add /admin/auth to the end (e.g., http://mylibrarycharts.appspot.com/admin/auth). Note the OAuth Client ID, OAuth Client Secret, and OAuth Redirect URI that are stored here, as you will need to reference them later before you deploy your superProxy application to the Google App Engine.

Finally, visit the Consent Screen menu and choose an email address (such as your Google account email address), fill in the product name field with your Application Identifier (e.g., mylibrarycharts) and save your settings. If you do not include these settings, you may experience errors when accessing your superProxy application admin menu.

Prepare the Development Environment

In order to configure superProxy and deploy it to Google App Engine you will need Python 2.7 installed and the Google App Engine Launcher (AKA the Google App Engine SDK).  Python just needs to be installed for the App Engine Launcher to run; don’t worry, no Python coding is required.

Configure and Deploy the superProxy

The superProxy application is available from the superProxy Github page. Download the .zip files and extract them onto your computer into a location you can easily reference (e.g., C:/Users/yourname/Desktop/superproxy or /Applications/superproxy). Use a text editor such as Notepad or Notepad++ to edit the src/app.yaml to include your Application ID (e.g., mylibrarycharts). Then use Notepad to edit src/config.py to include the OAuth Client ID, OAuth Client Secret, and the OAuth Redirect URI that were generated when you created the Client ID in the Google Developer’s Console under the Credentials menu. Detailed instructions for editing these files are available on the superProxy Github page.

After you have edited and saved src/app.yaml and src/config.py, open the Google App Engine Launcher previously downloaded. Go to File > Add Existing Application. In the dialogue box that appears, browse to the location of your superProxy app’s /src directory.

To upload your superProxy application, use the Google App Engine Launcher and browse to the /src directory where you saved and configured your superProxy application.
To upload your superProxy application, use the Google App Engine Launcher and browse to the /src directory where you saved and configured your superProxy application.

Click Add, then click the Deploy button in the upper right corner of the App Engine Launcher. You may be asked to log into your Google account, and a log console may appear informing you of the deployment process. When deployment has finished, you should be able to access your superProxy application’s Admin screen at http://[yourapplicationID].appspot.com/admin, replacing [yourapplicationID] with your Application Identifier.

Creating superProxy Queries

SuperProxy queries request data from your Google Analytics account and return that data to the superProxy application. When the data is returned, it is made available to an end-point that can be used to populate charts, graphs, or other data visualizations. Most data available to you through the Google Analytics native interface is available through superProxy queries.

An easy way to get started with building a query is to visit the Google Analytics Query Explorer. You will need to login with your Google Analytics account to use the Query Explorer. This tool allows you to build an example query for the Core Reporting API, which is the same API service that your superProxy application will be using.

Google Analytics API Query Explorer
Running example queries through the Google Analytics Query Explorer can help you to identify the metrics and dimensions you would like to use in superProxy queries. Be sure to note the metrics and dimensions you use, and also be sure to note the ids value that is populated for you when using the API Explorer.

 

When experimenting with the Google Analytics Query explorer, make note of all the elements you use in your query. For example, to create a query that retrieves the number of users that visited your site between July 4th and July 18th 2014, you will need to select your Google Account, Property and View from the drop-down menus, and then build a query with the following parameters:

  • ids = this is a number (usually 8 digits) that will be automatically populated for you when you choose your Google Analytics Account, Property and View. The ids value is your property ID, and you will need this value later when building your superProxy query.
  • dimensions = ga:browser
  • metrics = ga:users
  • start-date = 07-04-2014
  • end-date = 07-18-2014

You can set the max-results value to limit the number of results returned. For queries that could potentially have thousands of results (such as individual search terms entered by users), limiting to the top 10 or 50 results will retrieve data more quickly. Clicking on any of the fields will generate a menu from which you can select available options. Click Get Data to retrieve Google Analytics data and verify that your query works

Successful Google Analytics Query Explorer query result showing visits by browser.
Successful Google Analytics Query Explorer query result showing visits by browser.

After building a successful query, you can replicate the query in your superProxy application. Return to your superProxy application’s admin page (e.g., http://[yourapplicationid].appspot.com/admin) and select Create Query. Name your query something to make it easy to identify later (e.g., Users by Browser). The Refresh Interval refers to how often you want the superProxy to retrieve fresh data from Google Analytics. For most queries, a daily refresh of the data will be sufficient, so if you are unsure, set the refresh interval to 86400. This will refresh your data every 86400 seconds, or once per day.

Create a superProxy Query
Create a superProxy Query

We can reuse all of the elements of queries built using the Google Analytics API Explorer to build the superProxy query encoded URI.  Here is an example of an Encoded URI that queries the number of users (organized by browser) that have visited a web property in the last 30 days (you’ll need to enter your own profile ID in the ids value for this to work):

https://www.googleapis.com/analytics/v3/data/ga?ids=ga:99999991
&metrics=ga:users&dimensions=ga:browser&max-results=10
&start-date={30daysago}&end-date={today}

Before saving, be sure to run Test Query to see a preview of the kind of data that is returned by your query. A successful query will return a json response, e.g.:

{"kind": "analytics#gaData", "rows": 
[["Amazon Silk", "8"], 
["Android Browser", "36"], 
["Chrome", "1456"], 
["Firefox", "1018"], 
["IE with Chrome Frame", "1"], 
["Internet Explorer", "899"], 
["Maxthon", "2"], 
["Opera", "7"], 
["Opera Mini", "2"], 
["Safari", "940"]], 
"containsSampledData": false, 
"totalsForAllResults": {"ga:users": "4398"}, 
"id": "https://www.googleapis.com/analytics/v3/data/ga?ids=ga:84099180&dimensions=ga:browser&metrics=ga:users&start-date=2014-06-29&end-date=2014-07-29&max-results=10", 
"itemsPerPage": 10, "nextLink": "https://www.googleapis.com/analytics/v3/data/ga?ids=ga:84099180&dimensions=ga:browser&metrics=ga:users&start-date=2014-07-04&end-date=2014-07-18&start-index=11&max-results=10", 
"totalResults": 13, "query": {"max-results": 10, "dimensions": "ga:browser", "start-date": "2014-06-29", "start-index": 1, "ids": "ga:84099180", "metrics": ["ga:users"], "end-date": "2014-07-18"}, 
"profileInfo": {"webPropertyId": "UA-9999999-9", "internalWebPropertyId": "99999999", "tableId": "ga:84099180", "profileId": "9999999", "profileName": "Library Chart", "accountId": "99999999"}, 
"columnHeaders": [{"dataType": "STRING", "columnType": "DIMENSION", "name": "ga:browser"}, {"dataType": "INTEGER", "columnType": "METRIC", "name": "ga:users"}], 
"selfLink": "https://www.googleapis.com/analytics/v3/data/ga?ids=ga:84099180&dimensions=ga:browser&metrics=ga:users&start-date=2014-06-29&end-date=2014-07-29&max-results=10"}

Once you’ve tested a successful query, save it, which will allow the json string to become accessible to an application that can help to visualize this data. After saving, you will be directed to the management screen for your API, where you will need to click Activate Endpoint to begin publishing the results of the query in a way that is retrievable. Then click Start Scheduling so that the query data is refreshed on the schedule you determined when you built the query (e.g., once a day). Finally, click Refresh Data to return data for the first time so that you can start interacting with the data returned from your query. Return to your superProxy application’s Admin page, where you will be able to manage your query and locate the public end-point needed to create a chart visualization.

Using the Google Visualization API to visualize Google Analytics data

Included with the superProxy .zip file downloaded to your computer from the Github repository is a sample .html page located under /samples/superproxy-demo.html. This file uses the Google Visualization API to generate two pie charts from data returned from superProxy queries. The Google Visualization API is a service that can ingest raw data (such as json arrays that are returned by the superProxy) and generate visual charts and graphs. Save superproxy-demo.html onto a web server or onto your computer’s localhost.  We’ll set up the first pie chart to use the data from the Users by Browser query saved in your superProxy app.

Open superproxy-demo.html and locate this section:

var browserWrapper = new google.visualization.ChartWrapper({
// Example Browser Share Query
"containerId": "browser",
// Example URL: http://your-application-id.appspot.com/query?id=QUERY_ID&format=data-table-response
"dataSourceUrl": "REPLACE WITH Google Analytics superProxy PUBLIC URL, DATA TABLE RESPONSE FORMAT",
"refreshInterval": REPLACE_WITH_A_TIME_INTERVAL,
"chartType": "PieChart",
"options": {
"showRowNumber" : true,
"width": 630,
"height": 440,
"is3D": true,
"title": "REPLACE WITH TITLE"
}
});

Three values need to be modified to create a pie chart visualization:

  • dataSourceUrl: This value is the public end-point of the superProxy query you have created. To get this value, navigate to your superProxy admin page and click Manage Query on the Users by Browser query you have created. On this page, right click the DataTable (JSON Response) link and copy the URL (Figure 8). Paste the copied URL into superproxy-demo.html, replacing the text REPLACE WITH Google Analytics superProxy PUBLIC URL, DATA TABLE FORMAT. Leave quotes around the pasted URL.
Right-click the DataTable (JSON Response) link and copy the URL to your clipboard. The copied link will serve as the dataSouceUrl value in superproxy-demo.html.
Right-click the DataTable (JSON Response) link and copy the URL to your clipboard. The copied link will serve as the dataSouceUrl value in superproxy-demo.html.
  • refreshInterval – you can leave this value the same as the refresh Interval of your superProxy query (in seconds) – e.g., 86400.
  • title – this is the title that will appear above your pie chart, and should describe the data your users are looking at – e.g., Users by Browser.

Save the modified file to your server or local development environment, and load the saved page in a browser.  You should see a rather lovely pie chart:

Your pie chart's data will refresh automatically on the refresh schedule you set in your query.
Your pie chart’s data will refresh automatically based upon the Refresh Interval you specify in your superProxy query and your page’s JavaScript parameters.

That probably seemed like a lot of work just to make a pie chart.  But now that your app is set up, making new charts from your Google Analytics data just involves visiting your App Engine site, scheduling a new query, and referencing that with the Google Visualization API.  To me, the Google superProxy method has three distinct advantages over the simpler OOCharts method:

  • Security – Users won’t be able to view your API Keys by viewing the source of your dashboard’s web page
  • Stability – OOCharts might not be around forever.  For that matter, Google’s free App Engine service might not be around forever, but betting on Google is [mostly] a safe bet
  • Flexibility – You can create a huge range of queries, and test them out easily using the API explorer, and the Google Visualization API has extensive documentation and a fairly active user group from whom to gather advice and examples.

 

Notes

  1. There is a 50,000 request per day limit on the Analytics API.  That sounds like a lot, but it’s surprisingly easy to exceed. Consider creating a dashboard with 10 charts, each making a call to the Analytics API.  Without a service that caches the data, the data is refreshed every time a user loads a page.  After just 5,000 visits to the page (which makes 10 API calls – one for each chart – each time the page is loaded), the API limit is exceeded:  5,000 page loads x 10 calls per page = 50,000 API requests.
  2. You can use pre-built OOCharts Queries – https://app.oocharts.com/mc/query/list – to hide your profile ID (but not your API Key). There are many ways to minify and obfuscate client-side JavaScript to make it harder to read, but it’s still pretty much accessible to someone who wants to get it

Getting Started with APIs

There has been a lot of discussion in the library community regarding the use of web service APIs over the past few years.  While APIs can be very powerful and provide awesome new ways to share, promote, manipulate and mashup your library’s data, getting started using APIs can be overwhelming.  This post is intended to provide a very basic overview of the technologies and terminology involved with web service APIs, and provides a brief example to get started using the Twitter API.

What is an API?

First, some definitions.  One of the steepest learning curves with APIs involves navigating the terminology, which unfortunately can be rather dense – but understanding a few key concepts makes a huge difference:

  • API stands for Application Programming Interface, which is a specification used by software components to communicate with each other.  If (when?) computers become self-aware, they could use APIs to retrieve information, tweet, post status updates, and essentially run most day-to-do functions for the machine uprising. There is no single API “standard” though one of the most common methods of interacting with APIs involves RESTful requests.
  • REST / RESTful APIs  – Discussions regarding APIs often make references to “REST” or “RESTful” architecture.  REST stands for Representational State Transfer, and you probably utilize RESTful requests every day when browsing the web. Web browsing is enabled by HTTP (Hypertext Transfer Protocol) – as in http://example.org.  The exchange of information that occurs when you browse the web uses a set of HTTP methods to retrieve information, submit web forms, etc.  APIs that use these common HTTP methods (sometimes referred to as HTTP verbs) are considered to be RESTful.  RESTful APIs are simply APIs that leverage the existing architecture of the web to enable communication between machines via HTTP methods.

HTTP Methods used by RESTful APIs

Most web service APIs you will encounter utilize at the core the following HTTP methods for creating, retrieving, updating, and deleting information through that web service.1  Not all APIs allow each method (at least without authentication) but some common methods for interacting with APIs include:

    • GET – You can think of GET as a way to “read” or retrieve information via an API.  GET is a good starting point for interacting with an API you are unfamiliar with.  Many APIs utilize GET, and GET requests can often be used without complex authentication.  A common example of a GET request that you’ve probably used when browsing the web is the use of query strings in URLs (e.g., www.example.org/search?query=ebooks).
    • POST – POST can be used to “write” data over the web.  You have probably generated  POST requests through your browser when submitting data on a web form or making a comment on a forum.  In an API context, POST can be used to request that an API’s server accept some data contained in the POST request – Tweets, status updates, and other data that is added to a web service often utilize the POST method.
    • PUT – PUT is similar to POST, but can be used to send data to a web service that can assign that data a unique uniform resource identifier (URI) such as a URL.  Like POST, it can be used to create and update information, but PUT (in a sense) is a little more aggressive. PUT requests are designed to interact with a specific URI and can replace an existing resource at that URI or create one if there isn’t one.
    • DELETE – DELETE, well, deletes – it removes information at the URI specified by the request.  For example, consider an API web service that could interact with your catalog records by barcode.2 During a weeding project, an application could be built with DELETE that would delete the catalog records as you scanned barcodes.3

Understanding API Authentication Methods

To me, one of the trickiest parts of getting started with APIs is understanding authentication. When an API is made available, the publishers of that API are essentially creating a door to their application’s data.  This can be risky:  imagine opening that door up to bad people who might wish to maliciously manipulate or delete your data.  So often APIs will require a key (or a set of keys) to access data through an API.

One helpful way to contextualize how an API is secured is to consider access in terms of identification, authentication, and authorization.4  Some APIs only want to know where the request is coming from (identification), while others require you to have a valid account (authentication) to access data.  Beyond authentication, an API may also want to ensure your account has permission to do certain functions (authorization).  For example, you may be an authenticated user of an API that allows you to make GET requests of data, but your account may still not be authorized to make POST, PUT, or DELETE requests.

Some common methods used by APIs to store authentication and authorization include OAuth and WSKey:

  • OAuth – OAuth is a widely used open standard for authorization access to HTTP services like APIs.5  If you have ever sent a tweet from an interface that’s not Twitter (like sharing a photo directly from your mobile phone) you’ve utilized the OAuth framework.  Applications that already store authentication data in the form of user accounts (like Twitter and Google) can utilize their existing authentication structures to assign authorization for API access.  API Keys, Secrets, and Tokens can be assigned to authorized users, and those variables can be used by 3rd party applications without requiring the sharing of passwords with those 3rd parties.
  • WSKey (Web Services Key) – This is an example from OCLC, that is conceptually very similar to OAuth.  If you have an OCLC account (either through worldcat.org or oclc.org account) you can request key access.  Your authorization – in other words, what services and REST requests you are permitted to access – may be dependent upon your relationship with an OCLC member organization.

Keys, Secrets, Tokens?  HMAC?!

API authorization mechanisms often require multiple values in order to successfully interact with the API.  For example, with the Twitter API, you may be assigned an API Key and a corresponding Secret.  The topic of secret key authentication can be fairly complex,6 but fundamentally a Key and its corresponding Secret are used to authenticate requests in a secure encrypted fashion that would be difficult to guess or decrypt by malicious third-parties.  Multiple keys may be required to perform particular requests – for example, the Twitter API requires a key and secret to access the API itself, as well as a token and secret for OAuth authorization.

Probably the most important thing to remember about secrets is to keep them secret.  Do not share them or post them anywhere, and definitely do not store secret values in code uploaded to Github 7 (.gitignore – a method to exclude files from a git repository – is your friend here). 8  To that end, one strategy that is used by RESTful APIs to further secure secret key value is an HMAC header (hash-based method authentication code).  When requests are sent, HMAC uses your secret key to sign the request without actually passing the secret key value in the request itself. 9

Case Study:  The Twitter API

It’s easier to understand how APIs work when you can see them in action.  To do these steps yourself, you’ll need a Twitter account.  I strongly recommend creating a Twitter account separate from your personal or organizational accounts for initial experimentation with the API.  This code example is a very simple walkthrough, and does not cover securing your applications’ server (and thus securing the keys that may be stored on that server).  Anytime you authorize access to a Twitter account to API access you may be exposing it to some level of vulnerability.  At the end of the walkthrough, I’ll list the steps you would need to take if your account does get compromised.

1.  Activate a developer account

Visit dev.twitter.com and click the sign in area in the upper right corner.  Sign in with your Twitter account. Once signed in, click on your account icon (again in the upper right corner of the page) and then select the My Applications option from the drop-down menu.

Screenshot of the Twitter Developer Network login screen

2.  Get authorization

In the My applications area, click the Create New App button, and then fill out the required fields (Name, Description, and Website where the app code will be stored).  If you don’t have a web server, don’t worry, you can still get started testing out the API without actually writing any code.

3.  Get your keys

After you’ve created the application and are looking at its settings, click on the API Keys tab.  Here’s where you’ll get the values you need.  Your API Access Level is probably limited to read access only.  Click the “modify app permissions” link to set up read and write access, which will allow you to post through the API.  You will have to associate a mobile phone number with your Twitter account to get this level of authorization.

Screenshot of Twitter API options that allow for configuraing API read and write access.

Scroll down and note that in addition to an API Key and Secret, you also have an access token associated with OAUTH access.  This Token Key and Secret are required to authorize account activity associated with your Twitter user account.

4.  Test Oauth Access / Make a GET call

From the application API Key page, click the Test OAuth button.  This is a good way to get a sense of the API calls.  Leave the key values as they are on the page, and scroll down to the Request Settings Area.  Let’s do a call to return the most recent tweet from our account.

With the GET request checked, enter the following values:

Request URI:

Request Query (obviously replace yourtwitterhandle with… your actual Twitter handle):

  • screen_name=yourtwitterhandle&count=1

For example, my GET request looks like this:

Screenshot of the GET request setup screen for OAuth testing.

Click “See OAuth signature for this request”.  On the next page, look for the cURL request.  You can copy and paste this into a terminal or console window to execute the GET request and see the response (there will be a lot more of response text than what’s posted here):

* SSLv3, TLS alert, Client hello (1):
[{"created_at":"Sun Apr 20 19:37:53 +0000 2014","id":457966401483845632,
"id_str":"457966401483845632",
"text":"Just Added: The Fault in Our Stars by John Green; 
2nd Floor PZ7.G8233 Fau 2012","

As you can see, the above response to my cURL request includes the text of my account’s last tweet:

image00

What to do if your Twitter API Key or OAuth Security is Compromised

If your Twitter account suddenly starts tweeting out spammy “secrets to weight loss success” that you did not authorize (or other tweets that you didn’t write), your account has been compromised.  If you can still login with your username and password, it’s likely that your OAuth Keys have been compromised.  If you can’t log in, your account has probably been hacked.10  Your account can be compromised if you’ve authorized a third party app to tweet, but if your Twitter account has an active developer application on dev.twitter.com, it could be your own application’s key storage that’s been compromised.

Here are the immediate steps to take to stop the spam:

  1. Revoke access to third party apps under Settings –> Apps.  You may want to re-authorize them later – but you’ll probably want to reset the password for the third-party accounts that you had authorized.
  2. If you have generated API keys, log into dev.twitter.com and re-generate your API Keys and Secrets and your OAuth Keys and Secrets.  You’ll have to update any apps using the keys with the new key and secret information – but only if you have verified the server running the app hasn’t also been compromised.
  3. Reset your Twitter account password.11
5.  Taking it further:  Posting a new titles Twitter feed

So now you know a lot about the Twitter API – what now?  One way to take this further might involve writing an application to post new books that are added to your library’s collection.  Maybe you want to highlight a particular subject or collection – you can use some text output from your library catalog to post the title, author, and call number of new books.

The first step to such an application could involve creating an app that can post to the Twitter API.  If you have access to a  server that can run PHP, you can easily get started by downloading this incredibly helpful PHP wrapper.

Then in the same directory create two new files:

  • settings.php, which contains the following code (replace all the values in quotes with your actual Twitter API Key information):
<?php

$settings = array {
 ‘oath_access_token’ => “YOUR_ACCESS_TOKEN”,
 ‘oath_access_token_secret’ => “YOUR_ACCESS_TOKEN_SECRET”,
 ‘consumer_key’ => “YOUR_API_KEY”,
 ‘consumer_secret’ => “YOUR_API_KEY_SECRET”,
);

?>
  • and twitterpost.php, which has the following code, but swap out the values of ‘screen_name’ with your Twitter handle, and change the ‘status’ value if desired:
<?php

//call the PHP wrapper and your API values
require_once('TwitterAPIExchange.php');
include 'settings.php';

//define the request URL and REST request type
$url = "https://api.twitter.com/1.1/statuses/update.json";
$requestMethod = "POST";

//define your account and what you want to tweet
$postfields = array(
  'screen_name' => 'YOUR_TWITTER_HANDLE',
  'status' => 'This is my first API test post!'
);

//put it all together and build the request
$twitter = new TwitterAPIExchange($settings);
echo $twitter->buildOauth($url, $requestMethod)
->setPostfields($postfields)
->performRequest();

?>

Save the files and run the twitterpost.php page in your browser. Check the Twitter account referenced by the screen_name variable.  There should now be a new post with the contents of the ‘status’ value.

This is just a start – you would still need to get data out of your ILS and feed it to this application in some way – which brings me to one final point.

Is there an API for your ILS?  Should there be? (Answer:  Yes!)

Getting data out of traditional, legacy ILS systems can be a challenge.  Extending or adding on to traditional ILS software can be impossible (and in some cases may have been prohibited by license agreements).  One of the reasons for this might be that the architecture of such systems was designed for a world where the kind of data exchange facilitated by RESTful APIs didn’t yet exist.  However, there is definitely a major trend by ILS developers to move toward allowing access to library data within ILS systems via APIs.

It can be difficult to articulate exactly why this kind of access is necessary – especially when looking toward the future of rich functionality in emerging web-based library service platforms.  Why should we have to build custom applications using APIs – shouldn’t our ILS systems be built with all the functionality we need?

While libraries should certainly promote comprehensive and flexible architecture in the ILS solutions they purchase, there will almost certainly come a time when no matter how comprehensive your ILS is, you’re going to wonder, “wouldn’t it be nice if our system did X”?  Moreover, consider how your patrons might use your library’s APIs; for example, integrating your library’s web services other apps and services they already to use, or to build their own applications with your library web services. If you have web service API access to your data – bibliographic, circulation, acquisition data, etc. – you have the opportunity to meet those needs and to innovate collaboratively.  Without access to your data, you’re limited to the development cycle of your ILS vendor, and it may be years before you see the functionality you really need to do something cool with your data.  (It may still be years before you can find the time to develop your own app with an API, but that’s an entirely different problem.)

Examples of Library Applications built using APIs and ILS API Resources

Further Reading

Michel, Jason P. Web Service APIs and Libraries. Chicago, IL:  ALA Editions, 2013. Print.

Richardson, Leonard, and Michael Amundsen. RESTful Web APIs. Sebastopol, Calif.: O’Reilly, 2013.

 

About our Guest Author:

Lauren Magnuson is Systems & Emerging Technologies Librarian at California State University, Northridge and a Systems Coordinator for the Private Academic Library Network of Indiana (PALNI).  She can be reached at lauren.magnuson@csun.edu or on Twitter @lpmagnuson.

 

Notes

  1. create, retrieve, update, and delete is sometimes referred to by acronym: CRUD
  2. For example, via the OCLC Collection Management API: http://www.oclc.org/developer/develop/web-services/wms-collection-management-api.en.html
  3. For more detail on these and other HTTP verbs, http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
  4. https://blog.apigee.com/detail/do_you_need_api_keys_api_identity_vs._authorization
  5. Google, for example: https://developers.google.com/accounts/docs/OAuth2
  6. To learn a lot more about this, check out this web series: http://www.youtube.com/playlist?list=PLB4D701646DAF0817
  7. http://www.securityweek.com/github-search-makes-easy-discovery-encryption-keys-passwords-source-code
  8. Learn more about .gitignore here:  https://help.github.com/articles/ignoring-files
  9. An nice overview of HMAC is here: http://www.wolfe.id.au/2012/10/20/what-is-hmac-and-why-is-it-useful
  10. Here’s what to do if you’re account is hacked and you can’t log in:  https://support.twitter.com/articles/185703-my-account-has-been-hacked
  11. More information, and further steps you can take are here:  https://support.twitter.com/articles/31796-my-account-has-been-compromised