Hacking in Python with PyMARC


The pymarc Python library is an extremely handy tool for writing scripts that read, write, edit, and parse MARC records.  I was first introduced to pymarc through Heidi Frank’s excellent Code4Lib Journal article on using pymarc in conjunction with MarcEdit.1  Here at ACRL TechConnect, Becky Yoose has also written about some cool applications of pymarc.

In this post, I’m going to walk through a couple of applications of using pymarc to make delimited text files for batch loading – one for DSpace, and one in the KBART format (e.g., for OCLC Knowledge Base purposes).  I know what you’re thinking – why write a script when I can use MarcEdit for that?  Among MarcEdit’s many indispensable features is its Export Tab-Delimited Data feature.2  But there are a couple of use cases where scripts can come in handy:

  • Your MARC records lack data that you need in the tab-delimited or .csv output – for example, a field used for processing in whatever system you’re loading the data into that isn’t found in the MARC record source.
  • Your delimited file output requires things like empty fields or parsed, modified fields.  MarcEdit is great for exporting whole MARC fields and subfields into delimited files, but you may need to pull apart a MARC element or combine a MARC element with some additional data.  pymarc is great for this.
  • You have to process a lot of records frequently, and just want something tailor-made for your use case.  While you can certainly use MarcEdit to save a frequently used export template, editing that saved template when you need to change the output parameters of your delimited file isn’t easy.  With a Python script, you can easily edit your script if your delimited file requirements change.

 

Some General Notes about Python

Python is a useful scripting language to know for a lot of reasons, and for small scripting projects it can have a fairly shallow learning curve.  In order to start using Python, you’ll need to set up your development environment.  Macs already come with Python installed, but to double-check which version you have, launch Terminal and type python --version.  In a Windows environment, you’ll need to take a few more steps, detailed here.  Full Python documentation for 2.x can be found here (though personally, I find it a little dense at times!), and Codecademy features some pretty extensive tutorials to help you learn the syntax quickly.  Google also has some pretty extensive tutorials on Python.  Personally, I find it easiest to learn when I have a project that actually means something to me, so if you’re familiar with MARC records, just downloading the Python scripts mentioned in this post and learning how to run them can be a good start.

Spacing and indentation in Python are very important, so if you’re getting errors running scripts, make sure your conditional statements are indented properly.3  The version of Python on your machine also makes a big difference, and will determine whether your code runs successfully.  These examples have all been tested with Python 2.6 and 2.7, but not with Python 3 or above.  I find that Python 2.x has more help resources out on the web, which is nice to be able to take advantage of when you’re first starting out.
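Since the interpreter version matters this much, a script can check it up front.  Here is a minimal guard, a hypothetical helper just reflecting the 2.6/2.7 testing noted above:

```python
import sys

def version_ok(required_major=2):
    """Return True if the running interpreter's major version matches
    the series this script was tested against (Python 2.6/2.7 here)."""
    return sys.version_info[0] == required_major
```

Calling version_ok() at the top of a script lets you print a friendly warning instead of failing later with a confusing syntax error.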

Use Case 1:  MARC to KBART

The complete script described below, along with some sample MARC records, is on GitHub.

The KBART format is a NISO/United Kingdom Serials Group (UKSG) initiative designed to standardize information for use with knowledge base systems.  The KBART format is a series of standardized fields that can be used to identify serial coverage (e.g., start and end dates), URLs, publisher information, and more.4  Notably, it’s the required format for adding and modifying customized collections in OCLC’s Knowledge Base.5

In this use case, I’m using OCLC’s modified KBART file – most of the fields are KBART standard, but a few are specific to OCLC’s Knowledge Base, e.g., oclc_collection_name.6  Typically, these KBART files would be created either manually or by using some VLOOKUP Excel magic with an existing KBART file.7  In my use case, I needed to batch migrate a bunch of data stored in MARC records to the OCLC Knowledge Base, and these particular records didn’t always correspond neatly to existing OCLC Knowledge Base collections.  For example, one collection we wanted to manage in the OCLC Knowledge Base was the library’s print holdings, so that these titles would be displayed in OCLC’s A-Z journal lookup tool.

First, I’ll need a few helper libraries for my script:

import csv
from pymarc import MARCReader
from os import listdir
from re import search

Then, I’ll declare the source directory where my MARC records live (in this case, a folder called marc in the same directory as the script) and instruct the script to go find all the .mrc files in that directory.  Note that this enables the script to process either a single file of multiple MARC records, or multiple distinct files, all at once.

# change the source directory to whatever directory your .mrc files are in
SRC_DIR = 'marc/'

The script identifies MARC records by looking for all files ending in .mrc using the Python filter function.  Within the filter function, lambda creates a one-off anonymous function to define the filter parameters: 8

# get a list of all .mrc files in source directory 
file_list = filter(lambda x: search('.mrc', x), listdir(SRC_DIR))
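As an aside, the standard library’s glob module can do the same job without filter and lambda; this is a stdlib alternative to the line above, not the original script’s code:

```python
from glob import glob
from os.path import basename

SRC_DIR = 'marc/'

# glob expands the wildcard itself, so no filter/lambda is needed.
file_list = [basename(path) for path in glob(SRC_DIR + '*.mrc')]
```

One subtle difference: re.search('.mrc', x) treats the dot as “any character” and matches anywhere in the name, so a stray file like records.mrc.bak would slip through the filter version but not the glob version.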

Now I’ll define the output parameters.  I need a tab-delimited file, not a comma-delimited file, but either way I’ll use the csv.writer function to create a .txt file and define the tab delimiter (\t).  I also don’t really want the fields quoted unless there’s a special character or something that might cause a problem reading the file, so I’ll set quoting to minimal:

#create tab delimited text file that quotes if special characters are present
csv_out = csv.writer(open('kbart.txt', 'w'), delimiter = '\t', 
quotechar = '"', quoting = csv.QUOTE_MINIMAL)
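To see what QUOTE_MINIMAL actually does, here is a small in-memory demonstration (written in Python 3 syntax with io.StringIO, separate from the script itself):

```python
import csv
import io

buf = io.StringIO()
out = csv.writer(buf, delimiter='\t', quotechar='"',
                 quoting=csv.QUOTE_MINIMAL)

# Only the field that contains the delimiter (a tab) gets quoted;
# the plain field is written bare.
out.writerow(['Plain title', 'Title\twith a tab'])
print(buf.getvalue())
```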


And I’ll also create the header row, which includes the KBART fields required by the OCLC Knowledge Base:

#create the header row
csv_out.writerow(['publication_title', 'print_identifier', 
'online_identifier', 'date_first_issue_online', 
'num_first_vol_online', 'num_first_issue_online', 
'date_last_issue_online', 'num_last_vol_online', 
'num_last_issue_online', 'title_url', 'first_author', 
'title_id', 'coverage_depth', 'coverage_notes', 
'publisher_name', 'location', 'title_notes', 'oclc_collection_name', 
'oclc_collection_id', 'oclc_entry_id', 'oclc_linkscheme', 
'oclc_number', 'action'])

Next, I’ll start a loop for writing a row into the tab-delimited text file for every MARC record found.

for item in file_list:
    fd = open(SRC_DIR + item, 'r')
    reader = MARCReader(fd)

By default, we’ll need to set each field’s value to blank (''), so that errors are avoided if the value is not present in the MARC record.  We can set all the values to blank to start with in one statement:

for record in reader:
    publication_title = print_identifier = online_identifier \
    = date_first_issue_online = num_first_vol_online \
    = num_first_issue_online = date_last_issue_online \
    = num_last_vol_online = num_last_issue_online \
    = title_url = first_author = title_id = coverage_depth \
    = coverage_notes = publisher_name = location = title_notes \
    = oclc_collection_name = oclc_collection_id = oclc_entry_id \
    = oclc_linkscheme = oclc_number = action = ''
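An alternative that avoids the long chained assignment (and the backslash line continuations it needs) is to keep the column names in a list and build a dictionary of blank defaults.  This is a sketch of a different approach, not the original script:

```python
# The KBART columns, in output order.
FIELDS = ['publication_title', 'print_identifier', 'online_identifier',
          'date_first_issue_online', 'num_first_vol_online',
          'num_first_issue_online', 'date_last_issue_online',
          'num_last_vol_online', 'num_last_issue_online', 'title_url',
          'first_author', 'title_id', 'coverage_depth', 'coverage_notes',
          'publisher_name', 'location', 'title_notes',
          'oclc_collection_name', 'oclc_collection_id', 'oclc_entry_id',
          'oclc_linkscheme', 'oclc_number', 'action']

# One statement resets every column to '' for each record.
row = dict.fromkeys(FIELDS, '')
```

The same list can then write both the header (csv_out.writerow(FIELDS)) and each record (csv_out.writerow([row[f] for f in FIELDS])), so the column order lives in exactly one place.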

Now I can start pulling in data from MARC fields.  At its simplest, for each field defined in my CSV, I can pull data out of MARC fields using a construction like this:

    #title_url
    if record['856'] is not None:
      title_url = record['856']['u']

I can use generic python string parsing tools to clean up fields.  For example, if I need to strip the ending / out of a MARC 245$a field, I can use .rsplit to return everything before that final slash (/):

 
    # publication_title
    if record['245'] is not None:
      publication_title = record['245']['a'].rsplit('/', 1)[0]
      if record['245']['b'] is not None:
          publication_title = publication_title + " " + record['245']['b']

Also note that once you’ve declared a variable value (like publication_title) you can re-use that variable name to add more onto the string, as I’ve done with the 245$b subtitle above.

As is usually the case for MARC records, the trickiest business has to do with parsing out serial ranges.  The KBART format is really designed to capture, as accurately as possible, thresholds for beginning and ending dates.  MARC makes this…complicated.  Many libraries might use the 866 summary field to establish summary ranges, but there are varying local practices that determine what these might look like.  In my case, the minimal information I was looking for included beginning and ending years, and the records I was processing with the script stored summary information fairly cleanly in the 863 and 866 fields:

    # date_first_issue_online
    if record['863'] is not None:
      date_first_issue_online = record['863']['a'].rsplit('-', 1)[0]

    # date_last_issue_online
    if record['866'] is not None:
      date_last_issue_online = record['866']['a'].rsplit('-', 1)[-1]
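The rsplit calls above can be sanity-checked on their own with sample summary strings (made up for the illustration):

```python
summary = '1990-2009'

first = summary.rsplit('-', 1)[0]    # everything before the last hyphen
last = summary.rsplit('-', 1)[-1]    # everything after the last hyphen
print(first, last)  # 1990 2009

# A single year with no hyphen passes through unchanged for both ends.
single = '1985'
print(single.rsplit('-', 1)[0], single.rsplit('-', 1)[-1])  # 1985 1985
```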

Where further adjustments were necessary, and I couldn’t easily account for the edge cases programmatically, I imported the KBART file into OpenRefine for further cleanup.  A sample KBART file created by this script is available here.

Use Case 2:  MARC to DSpace Batch Upload

Again, the complete script for transforming MARC records for DSpace ingest is on GitHub.

This use case is very similar to the KBART scenario.  We use DSpace as our institutional repository, and we had about 2,000 historical theses and dissertation MARC records to transform and ingest into DSpace.  DSpace facilitates batch uploading metadata as CSV files in the RFC 4180 format, which basically means all the field elements use double quotes.9  While the fields being pulled from are different, the structure of the script is almost exactly the same.

When we first define the CSV output, we need to ensure that quoting is used for all elements – csv.QUOTE_ALL:

csv_out = csv.writer(open('output/theses.txt', 'w'), delimiter = '\t', 
quotechar = '"', quoting = csv.QUOTE_ALL)
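The only difference from the KBART script is the quoting constant; with csv.QUOTE_ALL every field is wrapped in double quotes whether it needs them or not, which matches the RFC 4180-style output DSpace expects (again, an in-memory demonstration in Python 3 syntax):

```python
import csv
import io

buf = io.StringIO()
out = csv.writer(buf, delimiter='\t', quotechar='"',
                 quoting=csv.QUOTE_ALL)

# Every field is quoted, even ones with no special characters.
out.writerow(['dc.title', 'A perfectly ordinary thesis title'])
print(buf.getvalue())
```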

The other nice thing about using Python for this transformation is the ability to program in static text that would be the same for all the lines in the CSV file.  For example, I parsed together a more friendly display of the department the thesis was submitted to like so:

# dc.contributor.department
if record['690']['x'] is not None:
    dccontributordepartment = ('UniversityName.  Department of ' +
                               record['690']['x'].rsplit('.', 1)[0])

You can view a sample output file created by this script here on GitHub.

Other PyMARC Projects

There are lots of other, potentially more interesting things that can be done with pymarc, although certainly one of its most common applications is converting MARC to more flexible formats, such as MARCXML.10  If you’re interested in learning more, join the pymarc Google Group, where you can often get help from the original pymarc developers (Ed Summers, Mark Matienzo, Gabriel Farrell, and Geoffrey Spear).
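pymarc’s own marcxml module can do that serialization for you, but just to illustrate the shape of a MARCXML datafield, here is a hand-rolled sketch using only the standard library (the 245 field data below is made up for the example):

```python
import xml.etree.ElementTree as ET

# A made-up record: tag -> (indicators, {subfield code: value}).
fields = {'245': ('10', {'a': 'An example title /', 'b': 'a subtitle.'})}

record = ET.Element('record', xmlns='http://www.loc.gov/MARC21/slim')
for tag, (indicators, subfields) in sorted(fields.items()):
    datafield = ET.SubElement(record, 'datafield', tag=tag,
                              ind1=indicators[0], ind2=indicators[1])
    for code, value in sorted(subfields.items()):
        subfield = ET.SubElement(datafield, 'subfield', code=code)
        subfield.text = value

print(ET.tostring(record, encoding='unicode'))
```

In practice you would let pymarc do this for you; the point is only that MARCXML keeps tags, indicators, and subfield codes as attributes, which makes the records far easier to process with standard XML tools.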

Notes
  1. Frank, Heidi. “Augmenting the Cataloger’s Bag of Tricks: Using MarcEdit, Python, and pymarc for Batch-Processing MARC Records Generated from the Archivists’ Toolkit.” Code4Lib Journal, 20 (2013):  http://journal.code4lib.org/articles/8336
  2. https://www.youtube.com/watch?v=qkzJmNOvY00
  3. For the specifications on indentation, take a look at http://legacy.python.org/dev/peps/pep-0008/#id12
  4. http://www.uksg.org/kbart/s5/guidelines/data_fields
  5. http://oclc.org/content/dam/support/knowledge-base/kb_modify.pdf
  6. A sample blank KBART Excel sheet for OCLC’s Knowledge Base can be found here: http://www.oclc.org/support/documentation/collection-management/kb_kbart.xlsx.  KBART files must be saved as tab-delimited .txt files prior to upload, but are obviously more easily edited manually in Excel.
  7. If you happen to be in need of this for OCLC’s KB, or are in need of comparing two Excel files, here’s a walkthrough of how to use VLOOKUP to select owned titles from a large OCLC KBART file: http://youtu.be/mUhkMzpPnBE
  8. A nice tutorial on using lambda and filter together can be found here: http://www.u.arizona.edu/~erdmann/mse350/topics/list_comprehensions.html#filter
  9. https://wiki.duraspace.org/display/DSDOC18/Batch+Metadata+Editing
  10. http://docs.evergreen-ils.org/2.3/_migrating_your_bibliographic_records.html

Using the Stripe API to Collect Library Fines by Accepting Online Payment

Recently, my library has been considering accepting library fine payments online.  Currently, small fines owed by many people are hard to collect.  As a sum, the amount is significant, but an individual fine often does not warrant even the cost of the postage and the staff work that goes into creating and sending the fine notice letter.  Libraries that are able to collect fines through the bursar’s office of their parent institution may have a better chance at collecting those fines; others can only expect patrons to show up with, or mail, a check to clear their fines.  Offering an online payment option for library fines is one way to make library service more user-friendly to those patrons who are too busy to visit the library in person or to mail a check but are willing to pay online with their credit cards.

If you are new to the world of online payment, there are several terms you need to become familiar with.  The following information from an article in Six Revisions is very useful for understanding those terms.1

  • ACH (Automated Clearing House) payments: Electronic credit and debit transfers. Most payment solutions use ACH to send money (minus fees) to their customers.
  • Merchant Account: A bank account that allows a customer to receive payments through credit or debit cards. Merchant providers are required to obey regulations established by card associations. Many processors act as both the merchant account as well as the payment gateway.
  • Payment Gateway: The middleman between the merchant and their sponsoring bank. It allows merchants to securely pass credit card information between the customer and the merchant and also between merchant and the payment processor.
  • Payment Processor: A company that a merchant uses to handle credit card transactions. Payment processors implement anti-fraud measures to ensure that both the front-facing customer and the merchant are protected.
  • PCI (the Payment Card Industry) Compliance: A merchant or payment gateway must set up their payment environment in a way that meets the Payment Card Industry Data Security Standard (PCI DSS).

Often, the same company functions as both the payment gateway and the payment processor, thereby processing the credit card payment securely.  Such a product is called an ‘online payment system.’  Meyer’s article cited above also lists ten popular online payment systems: Stripe, Authorize.Net, PayPal, Google Checkout, Amazon Payments, Dwolla, Braintree, Samurai by FeeFighters, WePay, and 2Checkout.  Bear in mind that different payment gateways, merchant accounts, and bank accounts may or may not work together, that your bank may or may not function as a merchant account, and that your library may or may not have a merchant account.2

Also note that there are fees for using online payment systems like these and that different systems have different fee structures.  For example, Authorize.Net has a $99 setup fee and then charges $20 per month plus a $0.10 per-transaction fee.  Stripe charges 2.9% + $0.30 per transaction with no setup or monthly fees.  Fees for mobile payment solutions with a physical card reader, such as Square, can be considerably higher.
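To see what those rates mean for small fines, here is a quick back-of-the-envelope calculation using the Stripe rate quoted above (a rough sketch – check the vendors’ current pricing pages for real numbers):

```python
def stripe_fee(amount_dollars, percent=0.029, fixed=0.30):
    """Fee for one transaction at the 2.9% + $0.30 rate cited above."""
    return round(amount_dollars * percent + fixed, 2)

# The smaller the fine, the larger the share the flat $0.30 eats.
for fine in (2.00, 5.00, 25.00):
    fee = stripe_fee(fine)
    print('$%.2f fine -> $%.2f fee (%.0f%% of the payment)'
          % (fine, fee, 100.0 * fee / fine))
```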

Among various online payment systems, I picked Stripe because it was recommended on the Code4Lib listserv.  One advantage of using Stripe is that it acts as both the payment gateway and the merchant account, which means your library does not have to have a merchant account to accept payment online.  Another big advantage is that you do not have to worry about the PCI compliance of your website, because the Stripe API uses a clever way to send the sensitive credit card information to the Stripe server while keeping your local server, on which your payment form sits, completely blind to such sensitive data.  I will explain this in more detail later in this post.

Below I will share some of the code that I have used to set up Stripe as my library’s online payment option for testing.  This may be of interest to you if you are thinking about offering online payment as an option for your patrons, or if you are simply interested in how an online payment API works.  Even if your library doesn’t need to collect fines online, an online payment option can be a handy tool for a small-scale fundraising drive or donations.

The first step to making Stripe work is getting API keys.  You do not have to create an account to get API keys for testing, but if you are going to work on your code for more than a day, it’s probably worth getting an account.  The Stripe API has excellent documentation.  I read the ‘Getting Started’ section and then jumped over to the ‘Examples’ section, which can quickly get you off the ground (https://stripe.com/docs/examples).  I found an example by Daniel Schröter on GitHub, listed in Stripe’s Examples section, and decided to test it out (https://github.com/myg0v/Simple-Bootstrap-Stripe-Payment-Form).  Most of the time, getting example code running requires some probing and tweaking – downloading the required libraries, sorting out the paths in the code, and adding API keys – but this one required relatively little work.

Now, let’s take a look at the form that this code creates.

[Screenshot: the payment form created by the example code]

In order to create a form of my own for testing, I decided to change a few things in the code.

  1. Add Patron & Payment Details.
  2. Allow custom amount for payment.
  3. Change the currency from Euro to US dollars.
  4. Configure the validation for new fields.
  5. Hide the payment form once the charge goes through instead of showing the payment form below the payment success message.

[Screenshot: the HTML for the modified payment form]

Step 4 can be done as follows.  The client-side validation is performed by the Bootstrap Validator jQuery plugin, so you need to get the syntax right for the code, which now has new fields, to work properly.
[Screenshot: the Bootstrap Validator configuration for the new fields]

This is the JavaScript that sends the data submitted to your payment form to the Stripe server.  First, include the Stripe.js library (line 24), then jQuery, Bootstrap, the Bootstrap Form Helpers plugin, and the Bootstrap Validator plugin (lines 25-28).  The next block of code is an event handler for the form, which sends the payment information to Stripe via AJAX when the form is submitted.  Stripe validates the payment information and then returns a token that identifies this particular transaction.

[Screenshot: the JavaScript that sends the form data to Stripe]

When the token is received, this code calls the function stripeResponseHandler(), which checks that the Stripe server did not return any error upon receiving the payment information and, if no error was returned, attaches the token to the form and submits the form.

[Screenshot: the stripeResponseHandler() function]

The server-side PHP script then checks whether the Stripe token has been received and, if so, creates a charge and sends it to Stripe, as shown below.  I am using PHP here, but the Stripe API supports many languages other than PHP, such as Ruby and Python, so you have many options.  The real payment amount appears here as part of the charge array in line 326.  If the charge succeeds, the payment success message is stored in a div to be displayed.

[Screenshot: the server-side PHP script that creates the charge]
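The same server-side step could be sketched in Python (Stripe also publishes an official Python library).  The helper below only builds the parameter dictionary for a charge – the field names reflect Stripe’s charge API at the time of writing, and the function is illustrative rather than production code:

```python
def build_charge_params(token, amount_dollars, description=''):
    """Turn submitted form values into the parameter dict for a charge.
    Stripe expects the amount as an integer number of cents, so $12.50
    becomes 1250. Raises if the Stripe.js token never arrived."""
    if not token:
        raise ValueError('No Stripe token received from the payment form')
    return {
        'amount': int(round(amount_dollars * 100)),
        'currency': 'usd',
        'card': token,  # the one-time token returned by Stripe.js
        'description': description,
    }
```

With the official library, this dictionary would be passed to stripe.Charge.create() after setting your secret API key; wrapping that call in a try/except is where you would build the success (or error) message to display.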

The reason you do not have to worry about PCI compliance with Stripe is that the Stripe API receives the payment information via AJAX, and the input fields for sensitive information have no name attribute or value.  (See below for the Card Holder Name and Card Number fields as an example.)  By omitting the name attribute and value, the local server where the online form sits has no means to retrieve the information submitted through those input fields.  Since sensitive information never touches the local server, PCI compliance for the local server becomes no concern.  To clarify, not all fields in the payment form need to be deprived of the name attribute – only the sensitive fields that you do not want your web server to have access to need to be protected this way.  Here, for example, I am assigning the name attribute and value to fields such as name and e-mail in order to use them later to send an e-mail receipt.


[Screenshot: the Card Holder Name and Card Number input fields, which have no name attributes]

Now, the modified form has ‘Fee Category,’ a custom ‘Payment Amount,’ and some other information relevant to my library’s billing purposes.

[Screenshot: the modified payment form]

When the payment succeeds, the page changes to display the following message.

[Screenshot: the payment success message]

Stripe provides a number of fake card numbers for testing, so you can simulate various cases of failure.  The Stripe website also displays all payments and the tokens and charges associated with those payments, which greatly helps troubleshooting.  One thing I noticed while troubleshooting is that Stripe’s logs sometimes lag behind: when a payment succeeds, the associated token and charge may not appear under the “Logs” section immediately.  But the payment itself shows up in the log right away, so you know that the associated token and charge will eventually appear.

[Screenshot: recent payments listed in the Stripe dashboard]

Once you are ready to test real payment transactions, you need to flip the switch from TEST to LIVE located in the top left corner.  You will also need to replace your ‘TESTING’ API keys (both secret and public) with those for ‘LIVE’ transactions.  One more thing needed before your library can get paid real money online is setting up SSL (Secure Sockets Layer) for your live online payment page.  This is not required for testing but is necessary for processing live payment transactions.  It is not very complicated work, so don’t be discouraged at this point: you just have to buy a security certificate and install it on your web server.  Speak to your system administrator about getting SSL set up for your payment page.  More information about setting up SSL can be found in the Stripe documentation I just linked above.

My library has not yet gone live with this online payment option.  Before we do, I may make some more modifications to the code to better fit the staff workflow, which is still being mapped out.  I am also planning to place the online payment page behind the university’s Shibboleth authentication, both to cut down on spam and to save library patrons some tedious data entry by pulling their information – name, university e-mail, student/faculty/staff ID number – directly from the campus directory exposed through Shibboleth and inserting it into the payment form fields automatically.

In this post, I have described my experience testing out the Stripe API as an online payment solution.  As I have mentioned above, however, there are many other online payment systems out there.  Depending on your library’s environment and financial setup, different solutions may work better than others.  To me, not having to worry about PCI compliance by using Stripe was a big plus.  If your library accepts online payment, please share in the comments what solution you chose and what factors led you to that particular online payment system.

* This post has been based upon my recent presentation, “Accepting Online Payment for Your Library and ‘Stripe’ as an Example”, given at the Code4Lib DC Unconference. Slides are available at the link above.

Notes
  1. Meyer, Rosston. “10 Excellent Online Payment Systems.” Six Revisions, May 15, 2012. http://sixrevisions.com/tools/online-payment-systems/.
  2. Ullman, Larry. “Introduction to Stripe.” Larry Ullman, October 10, 2012. http://www.larryullman.com/2012/10/10/introduction-to-stripe/.

Bootstrap Responsibly

Bootstrap is the most popular front-end framework used for websites.  An estimate by meanpath several months ago put it firmly behind 1% of the web – for good reason: Bootstrap makes it relatively painless to puzzle together a pretty awesome plug-and-play, component-rich site.  Its modularity is its key feature; it was developed so Twitter could rapidly spin up internal microsites and dashboards.

Oh, and it’s responsive. This is kind of a thing. There’s not a library conference today that doesn’t showcase at least one talk about responsive web design. There’s a book, countless webinars, courses, whole blogs dedicated to it (ahem), and more. The pressure for libraries to have responsive, usable websites can seem to come more from the likes of us than from the patronbase itself, but don’t let that discredit it. The trend is clear and it is only a matter of time before our libraries have their mobile moment.

Library websites that aren’t responsive feel dated, and more importantly they are missing an opportunity to reach a bevy of mobile-only users that in 2012 already made up more than a quarter of all web traffic. Library redesigns are often quickly pulled together in a rush to meet the growing demand from stakeholders, pressure from the library community, and users. The sprint makes the allure of frameworks like Bootstrap that much more appealing, but Bootstrapped library websites often suffer the cruelest of responsive ironies:

They’re not mobile-friendly at all.

Assumptions that Frameworks Make

Let’s take a step back and consider whether using a framework is the right choice at all.  A front-end framework like Bootstrap is a Lego set with all the pieces conveniently packed.  It comes with a series of templates, a blown-out stylesheet, and scripts tuned to the environment that let users essentially copy and paste fairly complex web machinery into being.  Carousels, tabs, responsive dropdown menus, all sorts of buttons, alerts for every occasion, gorgeous galleries, and very smart decisions made by a robust team of far more capable developers than we are.

Except for the specific layout and the content, every Bootstrapped site is essentially a complete organism years in the making. This is also the reason that designers sometimes scoff, joking that these sites look the same. Decked-out frameworks are ideal for rapid prototyping with a limited timescale and budget because the design decisions have by and large already been made. They assume you plan to use the framework as-is, and they don’t make customization easy.

In fact, Bootstrap’s guide points out that any customization is better suited to be cosmetic than a complete overhaul. The trade-off is that Bootstrap is otherwise complete. It is tried, true, usable, accessible out of the box, and only waiting for your content.

Not all Responsive Design is Created Equal

It is still common to hear the selling point for a swanky new site is that it is “responsive down to mobile.” The phrase probably rings a bell. It describes a website that collapses its grid as the width of the browser shrinks until its layout is appropriate for whatever screen users are carrying around. This is kind of the point – and cool, as any of us with a browser-resizing obsession could tell you.

Today, “responsive down to mobile” has a lot of baggage. Let me explain: it represents a telling and harrowing ideology that for these projects mobile is the afterthought when mobile optimization should be the most important part. Library design committees don’t actually say aloud or conceive of this stuff when researching options, but it should be implicit. When mobile is an afterthought, the committee presumes users are more likely to visit from a laptop or desktop than a phone (or refrigerator). This is not true.

See, a website, responsive or not, originally laid out for a 1366×768 desktop monitor in the designer’s office, wistfully depends on visitors with that same browsing context. If it looks good in-office and loads fast, then looking good and loading fast must be the default. “Responsive down to mobile” is divorced from the reality that a similarly wide screen is not the common denominator. As such, responsive down to mobile sites have a superficial layout optimized for the developers, not the user.

In a recent talk at the An Event Apart conference in Atlanta, Georgia, Mat Marquis stated that 72% of responsive websites send the same assets to mobile devices as they do to desktops, and this is largely contributing to the web feeling slower.  While setting img { width: 100%; } will scale media to fit snugly to the container, it still sends the same high-resolution image to a 320px-wide phone as to a 720px-wide tablet.  A 1.6mb page loads differently on a phone than on the machine it was designed on.  The digital divide with which librarians are so familiar is certainly nowhere near closed, and while internet access is increasingly available, its ubiquity doesn’t translate to speed:

  1. 50% of users ages 12-29 are “mostly mobile” users, and you know what wireless connections are like,
  2. even so, the weight of the average website (currently 1.6mb) is increasing.

Last December, analysis of data from pagespeed quantiles during an HTTP Archive crawl tried to determine how fast the web was getting slower. The fastest sites are slowing at a greater rate than the big bloated sites, likely because the assets we send–like increasingly high resolution images to compensate for increasing pixel density in our devices–are getting bigger.

The havoc this wreaks on the load times of “mobile friendly” responsive websites is detrimental. Why? Well, we know that

  • users expect a mobile website to load as fast on their phone as it does on a desktop,
  • three-quarters of users will give up on a website if it takes longer than 4 seconds to load,
  • the optimistic average load time for just a 700kb website on 3G is more like 10-12 seconds

eep O_o.

A Better Responsive Design

So there was a big change to Bootstrap in August 2013 when it was restructured from a “responsive down to mobile” framework to “mobile-first.” It has also been given a simpler, flat design, which has 100% faster paint time – but I digress. “Mobile-first” is key. Emblazon this over the door of the library web committee. Strike “responsive down to mobile.” Suppress the record.

Technically, “mobile-first” describes the structure of the stylesheet using CSS3 Media Queries, which determine when certain styles are rendered by the browser.

.example {
  styles: these load first;
}

@media screen and (min-width: 48em) {
  .example {
    styles: these load once the screen is 48 ems wide;
  }
}

The most basic styles are loaded first.  As more space becomes available, designers can assume (sort of) that the user’s device has a little extra juice and that their connection may be better, so they start adding pizzazz.  One might make the decision that, hey, most of the devices less than 48em wide (768px with a base font size of 16px) are probably touch-only, so let’s not load any hover effects until the screen is wider.

Nirvana

In a literal sense, mobile-first is asset management. More than that, mobile-first is this philosophical undercurrent, an implicit zen of user-centric thinking that aligns with libraries’ missions to be accessible to all patrons. Designing mobile-first means designing to the lowest common denominator: functional and fast on a cracked Blackberry at peak time; functional and fast on a ten year old machine in the bayou, a browser with fourteen malware toolbars trudging through the mire of a dial-up connection; functional and fast [and beautiful?] on a 23″ iMac. Thinking about the mobile layout first makes design committees more selective of the content squeezed on to the front page, which makes committees more concerned with the quality of that content.

The Point

This is the important statement that Bootstrap now makes. It expects the design committee to think mobile-first. It comes with all the components you could want, but they want you to trim the fat.

Future Friendly Bootstrapping

This is what you get in the stock Bootstrap:

  • buttons, tables, forms, icons, etc. (97kb)
  • a theme (20kb)
  • javascripts (30kb)
  • oh, and jQuery (94kb)

That’s almost 250kb of website. This is like a browser eating a brick of Mackinac Island Fudge – and this high calorie bloat doesn’t include images. Consider that if the median load time for a 700kb page is 10-12 seconds on a phone, half that time with out-of-the-box Bootstrap is spent loading just the assets.

While it’s not totally deal-breaking, 100kb is 5x as much CSS as an average site should have, as well as 15%-20% of what all the assets on an average page should weigh. – Josh Broton

To put this in context, I like to fall back on Ilya Grigorik’s example comparing load time to user reaction in his talk “Breaking the 1000ms Time to Glass Mobile Barrier.” If the site loads in just 0-100 milliseconds, this feels instant to the user. By 100-300ms, the site already begins to feel sluggish. At 300-1000ms, uh – is the machine working? After 1 second there is a mental context switch, which means that the user is impatient, distracted, or consciously aware of the load-time. After 10 seconds, the user gives up.

By choosing not to pare down, your Bootstrapped Library starts off on the wrong foot.

The Temptation to Widgetize

Even though Bootstrap provides modals, tabs, carousels, autocomplete, and other modules, this doesn’t mean a website needs to use them. Bootstrap lets you tailor which jQuery plugins are included in the final script. The hardest part of any redesign is letting quality content determine the tools, not letting the ability to tabularize or scrollspy become an excuse to implement them. Oh, don’t Google those. I’ll touch on tabs and scrollspy in a few minutes.

I am going to be super presumptuous now and walk through the total Bootstrap package, then make recommendations for lightening the load.

Transitions

Transitions.js is a fairly lightweight CSS transition polyfill. What this means is that the script checks to see if your user’s browser supports CSS Transitions, and if it doesn’t then it simulates those transitions with javascript. For instance, CSS transitions often handle the smooth, uh, transition between colors when you hover over a button. They are also a little more than just pizzazz. In a recent article, Rachel Nabors shows how transition and animation increase the usability of the site by guiding the eye.

With that said, CSS Transitions have pretty good browser support and they probably aren’t crucial to the functionality of the library website on IE9.

Recommendation: Don’t Include.

Modals

“Modals” are popup windows. There are plenty of neat things you can do with them. At the same time, modals are a pain to design consistently for every browser. Let Bootstrap do that heavy lifting for you.

Recommendation: Include

Dropdown

It’s hard for a library website design committee to conclude without a lot of links in the menu bar. Dropdown menus are kind of tricky to code, and Bootstrap does a really nice job keeping them a consistent and responsive experience.

Recommendation: Include

Scrollspy

If you have a fixed sidebar or menu that follows the user as they read, scrollspy.js can highlight the section of that menu you are currently viewing. This is useful if your site has a lot of long-form articles, or if it is a one-page app that scrolls forever. I’m not sure this describes many library websites, but even if it does, you probably want more functionality than Scrollspy offers. I recommend jQuery-Waypoints – but only if you are going to do something really cool with it.

Recommendation: Don’t Include

Tabs

Tabs are a good way to break up a lot of content without actually putting it on another page. A lot of libraries use some kind of tab widget to handle the different search options. If you are writing guides or tutorials, tabs could be a nice way to display the text.

Recommendation: Include

Tooltips

Tooltips are often descriptive popup bubbles for a section, option, or icon requiring more explanation. Tooltips.js helps handle the predictable positioning of the tooltip across browsers. With that said, I don’t think tooltips are that engaging; they’re sometimes appropriate, but you definitely used to see more of them in the past. Your library’s time is better spent de-jargoning any content that would warrant a tooltip. Need a tooltip? Why not just make whatever needs the tooltip more obvious O_o?

Recommendation: Don’t Include

Popover

Even fancier tooltips.

Recommendation: Don’t Include

Alerts

Alerts.js lets your users dismiss alerts that you might put in the header of your website. It’s always a good idea to give users some kind of control over these things. Better they read and dismiss than get frustrated from the clutter.

Recommendation: Include

Collapse

The collapse plugin allows for accordion-style sections for content similarly distributed as you might use with tabs. The ease-in-ease-out animation triggers motion-sickness and other aaarrghs among users with vestibular disorders. You could just use tabs.

Recommendation: Don’t Include

Button

Button.js gives a little extra jolt to Bootstrap’s buttons, allowing them to communicate an action or state. By that I mean: imagine you fill out a reference form and click “submit.” Button.js will put a little loader icon in the button itself and change the text to “sending ….” This way, users are told that the process is running, and maybe they won’t feel compelled to click and click and click until the page refreshes. This is a good thing.

Recommendation: Include

Carousel

Carousels are the most popular design element on the web. They let a website slideshow content like upcoming events or new material. Carousels exist because design committees must be appeased. There are all sorts of reasons why you probably shouldn’t put a carousel on your website: they are largely inaccessible, have low engagement, are slooooow, and kind of imply that libraries hate their patrons.

Recommendation: Don’t Include.

Affix

I’m not exactly sure what this does. I think it’s a fixed-menu thing. You probably don’t need this. You can use CSS.

Recommendation: Don’t Include

Now, Don’t You Feel Better?

Just compare the bootstrap.js and bootstrap.min.js files between out-of-the-box Bootstrap and a build tailored to the recommendations above. Even setting aside the differences in the CSS, the weight of the images not included in a carousel (not to mention the unquantifiable amount of pain you would have inflicted), the numbers are telling:

File              Before  After
bootstrap.js      54kb    19kb
bootstrap.min.js  29kb    10kb

So, Bootstrap Responsibly

There is more to say. When bouncing this topic around Twitter a while ago, Jeremy Prevost pointed out that Bootstrap’s minified assets can be gzipped down to about 20kb total. This is the right way to serve assets from any framework. It requires an Apache config or .htaccess rule. Here is the .htaccess file used in HTML5 Boilerplate. You’ll find it well commented and modular: go ahead and just copy and paste the parts you need. You can eke out even more performance by “lazy loading” scripts at a given time, but that is a little out of the scope of this post.
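For a sense of what that rule looks like, here is a minimal sketch assuming Apache with mod_deflate enabled (HTML5 Boilerplate’s version covers many more MIME types and edge cases, so prefer it in production):

```apache
# Compress text-based assets before they go over the wire
<IfModule mod_deflate.c>
  AddOutputFilterByType DEFLATE text/html text/css application/javascript
</IfModule>
```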

Here’s the thing: when we talk about having good library websites we’re mostly talking about the look. This is the wrong discussion. Web designs driven by anything but the content the library already has make grasping assumptions about how slick it would look to have this killer carousel, these accordions, nifty tooltips, and of course a squishy responsive design. Subsequently, these responsive sites miss the point: if anything, they’re mobile unfriendly.

Much of the time, a responsive library website is used as a marker that such-and-such site is credible and not irrelevant; as such, the website reflects a lack of purpose (e.g., “this website needs to increase library-card registration”). Pairing a superficial understanding of responsive web design with an easy-to-grab framework makes the patron the least priority.

 

About Our Guest Author :

Michael Schofield is a front-end librarian in south Florida, where it is hot and rainy – always. He tries to do neat things there. You can hear him talk design and user experience for libraries on LibUX.


Two Free Methods for Sharing Google Analytics Data Visualizations on a Public Dashboard

UPDATE (October 15th, 2014):

OOCharts as an API and service, described below, will be shut down November 15th, 2014.  More information about the developers’ decision to shut down the service is here. It’s not entirely surprising that the service is going away, considering the Google superProxy option described in this post.  I’m leaving all the instructions for OOCharts here for posterity, but as of November 15th you can only use the superProxy or build your own with the Core Reporting API.

At this point, Google Analytics is arguably the standard way to track website usage data.  It’s easy to implement, but powerful enough to capture both wide general usage trends and very granular patterns.  However, it is not immediately obvious how to share Google Analytics data either with internal staff or external stakeholders – both of whom often have widespread demand for up-to-date metrics about library web property performance.  While access to Google Analytics can be granted by Google Analytics administrators to individual Google account-holders through the Google Analytics Admin screen, publishing data without requiring authentication requires some intermediary steps.

There are two main free methods of publishing Google Analytics visualizations to the web, and both involve using the Google Analytics API: OOCharts and the Google Analytics superProxy.  Both methods rely upon an intermediary service to retrieve data from the API and cache it, both to improve retrieval time and to avoid exceeding limits in API requests.1  The first method – OOCharts – requires much less time to get running initially. However, OOCharts’ long-standing beta status and its position as a stand-alone service give it less potential for long-term support than the second method, the Google Analytics superProxy.  For that reason, while OOCharts is certainly easier to set up, the superProxy method is definitely worth the investment in time (depending on your needs).  I’ll cover both methods.

OOCharts Beta

OOCharts’ service facilitates the creation and storage of Google Analytics API Keys, which are required for sending secure requests to the API in order to retrieve data.

When setting up an OOCharts account, create your account utilizing the email address you use to access Google Analytics.  For example, if you log into Google Analytics using admin@mylibrary.org, I suggest you use this email address to sign up for OOCharts.  After creating an account with OOCharts, you will be directed to Google Analytics to authorize the OOCharts service to access your Google Analytics data on your behalf.

Let OOCharts gather Google Analytics Data on your Behalf.

After authorizing the service, you will be able to generate an API key for any of the properties to which your Google Analytics account has access.

In the OOCharts interface, click your name in the upper right corner, and then click inside the Add API Key field to see a list of available Analytics Properties from your account.

 

Once you’ve selected a property, OOCharts will generate a key for your OOCharts application.  When you go back to the list of API Keys, you’ll see your keys along with property IDs (shown in brackets after the URL of your properties, e.g., [9999999]).


Your API Keys now show your key values and your Google Analytics property IDs, both of which you’ll need to create visualizations with the OOCharts library.

 

Creating Visualizations with the OOCharts JavaScript Library

OOCharts appears to have started as a simple chart library for visualizing data returned from the Google Analytics API. After you have set up your OOCharts account, download the front-end JavaScript library (available at http://docs.oocharts.com/) and upload it to your server or localhost.   Navigate to the /examples directory and locate the ‘timeline.html’ file.  This file contains a simplified example that displays web visits over time in Google Analytics’ familiar timeline format.

The code for this page is very simple, and contains two methods – JavaScript-only and with HTML attributes – for creating OOCharts visualizations.  Below, I’ve separated out the required elements for both methods.  While either method will work on its own, using HTML attributes allows for additional customizations and styling:

JavaScript-Only

<h3>With JS</h3>
<div id='chart'></div>

<script src='../oocharts.js'></script>
<script type="text/javascript">
  window.onload = function(){
    oo.setAPIKey("{{ YOUR API KEY }}");
    oo.load(function(){
      var timeline = new oo.Timeline("{{ YOUR PROFILE ID }}", "180d");
      timeline.addMetric("ga:visits", "Visits");
      timeline.addMetric("ga:newVisits", "New Visits");
      timeline.draw('chart');
    });
  };
</script>
With HTML Attributes:

<h3>With HTML Attributes</h3>
<div data-oochart='timeline'
     data-oochart-metrics='ga:visits,Visits,ga:newVisits,New Visits'
     data-oochart-start-date='180d'
     data-oochart-profile='{{ YOUR PROFILE ID }}'></div>

<script src='../oocharts.js'></script>
<script type="text/javascript">
  window.onload = function(){
    oo.setAPIKey("{{ YOUR API KEY }}");
    oo.load(function(){
    });
  };
</script>

 

For either method, plug in {{ YOUR API KEY }} where indicated with the API Key generated in OOCharts and replace {{ YOUR PROFILE ID }} with the associated eight-digit profile ID.  Load the page in your browser, and you get this:

With the API Key and Profile ID in place, the timeline.html example looks like this. In this example I also adjusted the date parameter (30d by default) to 180d for more data.

This example shows you two formats for the chart – one that is driven solely by JavaScript, and another that can be customized using HTML attributes.  For example, you could modify the <div> tag to include a style attribute or CSS class to change the width of the chart, e.g.:

<h3>With HTML Attributes</h3>

<div style="width:400px" data-oochart='timeline' 
data-oochart-start-date='180d' 
data-oochart-metrics='ga:visits,Visits,ga:newVisits,New Visits' 
data-oochart-profile='{{ YOUR PROFILE ID }}'></div>

Here’s the same example.html file showing both the JavaScript-only format and the HTML-attributes format, now with a bit of styling on the HTML attributes chart to make it smaller:

You can use styling to adjust the HTML attributes example.

Easy, right?  So what’s the catch?

OOCharts only allows 10,000 requests a month – which is even easier to exceed than the 50,000 limit on the Google Analytics API.  Each time your page loads, you use another request.  Perhaps more importantly, your Analytics API key and profile ID are pretty much ‘out there’ for the world to see if they view your page source, because those values are stored in your client-side JavaScript.2  If you’re making a private intranet for your library staff, that’s probably not a big deal; but if you want to publish your dashboard fully to the public, you’ll want to make sure those values are secure.  You can do this with the Google Analytics superProxy.

Google Analytics superProxy

In 2013, Google Analytics released a method of accessing Google Analytics API data that doesn’t require end-users to authenticate in order to view data, known as the Google Analytics superProxy.  Much like OOCharts, the superProxy facilitates the creation of a query engine that retrieves Google Analytics statistics through the Google Analytics Core Reporting API, caches the statistics in a separate web application service, and enables the display of Google Analytics data to end users without requiring individual authentication. Caching the data has the additional benefit of ensuring that your application will not exceed the Google Core Reporting API request limit of 50,000 requests each day. The superProxy can be set up to refresh a limited number of times per day, and most dashboard applications only need a daily refresh of data to stay current.
The required elements of this method are available on the superProxy Github page (Google Analytics, “Google Analytics superProxy”). There are four major parts to the setup of the superProxy:

  1. Setting up Google App Engine hosting,
  2. Preparing the development environment,
  3. Configuring and deploying the superProxy to Google’s App Engine Appspot host; and
  4. Writing and scheduling queries that will be used to populate your dashboard.
Set up Google App Engine hosting

First, use your Google Analytics account credentials to access the Google App Engine at https://appengine.google.com/start. The superProxy application you will be creating will be freely hosted by the Google App Engine. Create your application and designate an Application Identifier that will serve as the endpoint domain for queries to your Google Analytics data (e.g., mylibrarycharts.appspot.com).

Create your App Engine Application

You can leave the default authentication option, Open to All Google Users, selected.  This setting only reflects access to your App Engine administrative screen and does not affect the ability for end-users to view the dashboard charts you create.  Only those Google users who have been authorized to access Google Analytics data will be able to access any Google Analytics information through the Google App Engine.

Ensure that API access to Google Analytics is turned on under the Services pane of the Google Developer’s Console. Under APIs and Auth for your project, visit the APIs menu and ensure that the Analytics API is turned on.

Turn on the Google Analytics API.  Make sure the name under the Projects label in the upper left corner is the same as your newly created Google App Engine project (e.g., librarycharts).

 

Then visit the Credentials menu to set up an OAuth 2.0 Client ID. Set the Authorized JavaScript Origins value to your appspot domain (e.g., http://mylibrarycharts.appspot.com). Use the same value for the Authorized Redirect URI, but add /admin/auth to the end (e.g., http://mylibrarycharts.appspot.com/admin/auth). Note the OAuth Client ID, OAuth Client Secret, and OAuth Redirect URI that are stored here, as you will need to reference them later before you deploy your superProxy application to the Google App Engine.

Finally, visit the Consent Screen menu and choose an email address (such as your Google account email address), fill in the product name field with your Application Identifier (e.g., mylibrarycharts) and save your settings. If you do not include these settings, you may experience errors when accessing your superProxy application admin menu.

Prepare the Development Environment

In order to configure superProxy and deploy it to Google App Engine you will need Python 2.7 installed and the Google App Engine Launcher (AKA the Google App Engine SDK).  Python just needs to be installed for the App Engine Launcher to run; don’t worry, no Python coding is required.

Configure and Deploy the superProxy

The superProxy application is available from the superProxy Github page. Download the .zip files and extract them onto your computer into a location you can easily reference (e.g., C:/Users/yourname/Desktop/superproxy or /Applications/superproxy). Use a text editor such as Notepad or Notepad++ to edit the src/app.yaml to include your Application ID (e.g., mylibrarycharts). Then use Notepad to edit src/config.py to include the OAuth Client ID, OAuth Client Secret, and the OAuth Redirect URI that were generated when you created the Client ID in the Google Developer’s Console under the Credentials menu. Detailed instructions for editing these files are available on the superProxy Github page.

After you have edited and saved src/app.yaml and src/config.py, open the Google App Engine Launcher previously downloaded. Go to File > Add Existing Application. In the dialogue box that appears, browse to the location of your superProxy app’s /src directory.

To upload your superProxy application, use the Google App Engine Launcher and browse to the /src directory where you saved and configured your superProxy application.

Click Add, then click the Deploy button in the upper right corner of the App Engine Launcher. You may be asked to log into your Google account, and a log console may appear informing you of the deployment process. When deployment has finished, you should be able to access your superProxy application’s Admin screen at http://[yourapplicationID].appspot.com/admin, replacing [yourapplicationID] with your Application Identifier.

Creating superProxy Queries

SuperProxy queries request data from your Google Analytics account and return that data to the superProxy application. When the data is returned, it is made available to an end-point that can be used to populate charts, graphs, or other data visualizations. Most data available to you through the Google Analytics native interface is available through superProxy queries.

An easy way to get started with building a query is to visit the Google Analytics Query Explorer. You will need to login with your Google Analytics account to use the Query Explorer. This tool allows you to build an example query for the Core Reporting API, which is the same API service that your superProxy application will be using.

Google Analytics API Query Explorer

Running example queries through the Google Analytics Query Explorer can help you to identify the metrics and dimensions you would like to use in superProxy queries. Be sure to note the metrics and dimensions you use, and also be sure to note the ids value that is populated for you when using the API Explorer.

 

When experimenting with the Google Analytics Query explorer, make note of all the elements you use in your query. For example, to create a query that retrieves the number of users that visited your site between July 4th and July 18th 2014, you will need to select your Google Account, Property and View from the drop-down menus, and then build a query with the following parameters:

  • ids = this is a number (usually 8 digits) that will be automatically populated for you when you choose your Google Analytics Account, Property and View. The ids value is your property ID, and you will need this value later when building your superProxy query.
  • dimensions = ga:browser
  • metrics = ga:users
  • start-date = 2014-07-04
  • end-date = 2014-07-18

You can set the max-results value to limit the number of results returned. For queries that could potentially have thousands of results (such as individual search terms entered by users), limiting to the top 10 or 50 results will retrieve data more quickly. Clicking on any of the fields will generate a menu from which you can select available options. Click Get Data to retrieve Google Analytics data and verify that your query works.

Successful Google Analytics Query Explorer query result showing visits by browser.

After building a successful query, you can replicate the query in your superProxy application. Return to your superProxy application’s admin page (e.g., http://[yourapplicationid].appspot.com/admin) and select Create Query. Name your query something to make it easy to identify later (e.g., Users by Browser). The Refresh Interval refers to how often you want the superProxy to retrieve fresh data from Google Analytics. For most queries, a daily refresh of the data will be sufficient, so if you are unsure, set the refresh interval to 86400. This will refresh your data every 86400 seconds, or once per day.

Create a superProxy Query

We can reuse all of the elements of queries built using the Google Analytics API Explorer to build the superProxy query encoded URI.  Here is an example of an Encoded URI that queries the number of users (organized by browser) that have visited a web property in the last 30 days (you’ll need to enter your own profile ID in the ids value for this to work):

https://www.googleapis.com/analytics/v3/data/ga?ids=ga:99999991
&metrics=ga:users&dimensions=ga:browser&max-results=10
&start-date={30daysago}&end-date={today}

Before saving, be sure to run Test Query to see a preview of the kind of data that is returned by your query. A successful query will return a json response, e.g.:

{"kind": "analytics#gaData", "rows": 
[["Amazon Silk", "8"], 
["Android Browser", "36"], 
["Chrome", "1456"], 
["Firefox", "1018"], 
["IE with Chrome Frame", "1"], 
["Internet Explorer", "899"], 
["Maxthon", "2"], 
["Opera", "7"], 
["Opera Mini", "2"], 
["Safari", "940"]], 
"containsSampledData": false, 
"totalsForAllResults": {"ga:users": "4398"}, 
"id": "https://www.googleapis.com/analytics/v3/data/ga?ids=ga:84099180&dimensions=ga:browser&metrics=ga:users&start-date=2014-06-29&end-date=2014-07-29&max-results=10", 
"itemsPerPage": 10, "nextLink": "https://www.googleapis.com/analytics/v3/data/ga?ids=ga:84099180&dimensions=ga:browser&metrics=ga:users&start-date=2014-07-04&end-date=2014-07-18&start-index=11&max-results=10", 
"totalResults": 13, "query": {"max-results": 10, "dimensions": "ga:browser", "start-date": "2014-06-29", "start-index": 1, "ids": "ga:84099180", "metrics": ["ga:users"], "end-date": "2014-07-18"}, 
"profileInfo": {"webPropertyId": "UA-9999999-9", "internalWebPropertyId": "99999999", "tableId": "ga:84099180", "profileId": "9999999", "profileName": "Library Chart", "accountId": "99999999"}, 
"columnHeaders": [{"dataType": "STRING", "columnType": "DIMENSION", "name": "ga:browser"}, {"dataType": "INTEGER", "columnType": "METRIC", "name": "ga:users"}], 
"selfLink": "https://www.googleapis.com/analytics/v3/data/ga?ids=ga:84099180&dimensions=ga:browser&metrics=ga:users&start-date=2014-06-29&end-date=2014-07-29&max-results=10"}

Once you’ve tested a successful query, save it, which will allow the json string to become accessible to an application that can help to visualize this data. After saving, you will be directed to the management screen for your API, where you will need to click Activate Endpoint to begin publishing the results of the query in a way that is retrievable. Then click Start Scheduling so that the query data is refreshed on the schedule you determined when you built the query (e.g., once a day). Finally, click Refresh Data to return data for the first time so that you can start interacting with the data returned from your query. Return to your superProxy application’s Admin page, where you will be able to manage your query and locate the public end-point needed to create a chart visualization.

Using the Google Visualization API to visualize Google Analytics data

Included with the superProxy .zip file downloaded to your computer from the Github repository is a sample .html page located under /samples/superproxy-demo.html. This file uses the Google Visualization API to generate two pie charts from data returned from superProxy queries. The Google Visualization API is a service that can ingest raw data (such as json arrays that are returned by the superProxy) and generate visual charts and graphs. Save superproxy-demo.html onto a web server or onto your computer’s localhost.  We’ll set up the first pie chart to use the data from the Users by Browser query saved in your superProxy app.

Open superproxy-demo.html and locate this section:

var browserWrapper = new google.visualization.ChartWrapper({
  // Example Browser Share Query
  "containerId": "browser",
  // Example URL: http://your-application-id.appspot.com/query?id=QUERY_ID&format=data-table-response
  "dataSourceUrl": "REPLACE WITH Google Analytics superProxy PUBLIC URL, DATA TABLE RESPONSE FORMAT",
  "refreshInterval": REPLACE_WITH_A_TIME_INTERVAL,
  "chartType": "PieChart",
  "options": {
    "showRowNumber": true,
    "width": 630,
    "height": 440,
    "is3D": true,
    "title": "REPLACE WITH TITLE"
  }
});

Three values need to be modified to create a pie chart visualization:

  • dataSourceUrl: This value is the public end-point of the superProxy query you have created. To get this value, navigate to your superProxy admin page and click Manage Query on the Users by Browser query you have created. On this page, right click the DataTable (JSON Response) link and copy the URL (Figure 8). Paste the copied URL into superproxy-demo.html, replacing the text REPLACE WITH Google Analytics superProxy PUBLIC URL, DATA TABLE FORMAT. Leave quotes around the pasted URL.
Right-click the DataTable (JSON Response) link and copy the URL to your clipboard. The copied link will serve as the dataSourceUrl value in superproxy-demo.html.

  • refreshInterval – you can leave this value the same as the refresh Interval of your superProxy query (in seconds) – e.g., 86400.
  • title – this is the title that will appear above your pie chart, and should describe the data your users are looking at – e.g., Users by Browser.

Save the modified file to your server or local development environment, and load the saved page in a browser.  You should see a rather lovely pie chart:

Your pie chart’s data will refresh automatically based upon the Refresh Interval you specify in your superProxy query and your page’s JavaScript parameters.

That probably seemed like a lot of work just to make a pie chart.  But now that your app is set up, making new charts from your Google Analytics data just involves visiting your App Engine site, scheduling a new query, and referencing that with the Google Visualization API.  To me, the Google superProxy method has three distinct advantages over the simpler OOCharts method:

  • Security – Users won’t be able to view your API Keys by viewing the source of your dashboard’s web page
  • Stability – OOCharts might not be around forever.  For that matter, Google’s free App Engine service might not be around forever, but betting on Google is [mostly] a safe bet
  • Flexibility – You can create a huge range of queries and test them out easily using the API Explorer, and the Google Visualization API has extensive documentation and a fairly active user group from whom to gather advice and examples.

 

Notes

  1. There is a 50,000 request per day limit on the Analytics API.  That sounds like a lot, but it’s surprisingly easy to exceed. Consider creating a dashboard with 10 charts, each making a call to the Analytics API.  Without a service that caches the data, the data is refreshed every time a user loads a page.  After just 5,000 visits to the page (which makes 10 API calls – one for each chart – each time the page is loaded), the API limit is exceeded:  5,000 page loads x 10 calls per page = 50,000 API requests.
  2. You can use pre-built OOCharts Queries – https://app.oocharts.com/mc/query/list – to hide your profile ID (but not your API Key). There are many ways to minify and obfuscate client-side JavaScript to make it harder to read, but it’s still pretty much accessible to someone who wants to get it.

A Short and Irreverently Non-Expert Guide to CSS Preprocessors and Frameworks

It took me a long time to wrap my head around what a CSS preprocessor was and why you might want to use one. I knew everyone was doing it, but for the amount of CSS I was doing, it seemed like overkill. And one more thing to learn! When you are a solo web developer/librarian, it’s always easier to clutch desperately at the things you know well and not try to add more complexity. So this post is for those of you who are in my old position. If you’re already an expert, you can skip the post and go straight to the comments to sell us on your own favorite tools.

The idea, by the way, is that you will be able to get one of these up and running today. I promise it’s that easy (assuming you have access to install software on your computer).

Ok, So Why Should I Learn This?

Creating a modern and responsive website requires so many calculations, CSS adjustments, and other considerations that it’s not really possible to do (especially when you are a solo web developer) without building on work that lots of other people have done. Even if you aren’t doing anything all that sophisticated in your web design, you probably want at minimum a nicely proportioned columnar layout with colors and typefaces that are easy to adapt as needed. Bonus points if your site is responsive so it looks good on tablets and phones. All of that takes a lot of math and careful planning to accomplish, and requires much attention to documentation if you want others to be able to maintain the site.

Frameworks and preprocessors take care of many development challenges. They do things like provide responsive columnar layouts that are customizable to your design needs, provide “mixins” of code others have written, allow you to nest elements rather than writing CSS with selectors piled on selectors, and create variables for any elements you might repeat such as typefaces and colors. Once you figure it out, it’s pretty addictive to figure out where you can be more efficient in your CSS. Then if your institution switches the shade of red used for active links, H1s, and footer text, you only have to change one variable to update this rather than trying to find all the places that color appears in your stylesheets.

What You Should Learn

If I’ve convinced you this is something worth putting time into, now you have to figure out where you should spend that time. This post will outline one set of choices, but you don’t need to stick with them as you get more involved and learn more.

Sometimes these choices are made for you by whatever language or content management system you are using, and usually one will dictate another. For instance, if you choose a framework that uses Sass, you should probably learn Sass. If the theme you’ve chosen for your content management system already includes preprocessor files, you’ll save yourself lots of time just by going with what that theme has chosen. It’s also important to note that you can use a preprocessor with any project, even just a series of flat HTML files, and in that case definitely use whatever you prefer.

I’m going to point out just a few of each here, since these are the ones I’ve used and am familiar with. There are lots and lots of all of these things out there, and I would welcome people with specific experience to describe it in the comments.

Before you get too scared about the languages these tools use, it’s really only relevant to beginners insofar as you have to have one of these languages installed on your system to compile your Sass or Less files into CSS.

Preprocessors

Sass was developed in 2006. It can be written with Sass or SCSS syntax. Sass runs on Ruby.

Less was developed in 2009. It was originally written in Ruby, but is now written in JavaScript.

A Completely Non-Comprehensive List of CSS Frameworks

Note! You don’t have to use a framework for every project. If you have a two column layout with a simple breakpoint and don’t need all the overhead, feel free to not use one. But for most projects, frameworks provide things like responsive layouts and useful features (for instance tabs, menus, image galleries, and various CSS tricks) you might want to build in your site without a lot of thought.

  • Compass (runs on Sass)
  • Bootstrap (runs on Less)
  • Foundation (runs on Sass)
  • There are a million other ones out there too. Google search for css frameworks gets “About 4,860,000 results”. So I’m not kidding about needing to figure it out based on the needs of your project and what preprocessor you prefer.

A Quick and Dirty Tutorial for Getting These Up and Running

Let’s try out Sass and Compass. I started working with them when I was theming Omeka, and I run them on my standard issue work Windows PC, so I know you can do it too!

Ingredients:

Stir to combine.

Ok it’s not quite that easy. But it’s not a lot harder.

  1. Sass has great installation instructions. I’ve always run it from the command line without a fuss, and I don’t even particularly care for the command line. So feel free to follow those instructions.
  2. If you don’t have Ruby installed already (likely on a standard issue work Windows PC–it is already installed on Macs), install it. Sass suggests RubyInstaller, and I second that suggestion. There are some potentially confusing or annoying things about doing it this way, but don’t worry about them until one pops up.
  3. If you are on Windows, open up a new command line window by typing cmd in your start search bar. If you are on a Mac, open up your Terminal app.
  4. Now all you have to do is install Sass with one line of code in your command line client/terminal window: gem install sass. If you are on a Mac, you might get an error message and have to use the command sudo gem install sass.
  5. Next install Compass. You’ve got it, just type gem install compass.
  6. Read through this tutorial on Getting Started with Sass and Compass. It’s basically what I just said, but will link you out to a bunch of useful tutorials.
  7. If you decide the command line isn’t for you, you might want to look into other options for installing and compiling your files. The Sass installation instructions linked above give a few suggestions.

Working on a Project With These Tools

When you start a project, Compass will create a series of Sass/SCSS files (you choose which syntax) that you will edit and then compile into CSS files. The easiest way to get started is to head over to the Compass installation page and use their menu to generate the command line text you’ll need.

For this example, I’ll create a new project called acrltest by typing compass create acrltest. In the image below you’ll see what this looks like. This text also provides some information about what to do next, so you may want to copy and save it for later.

compass1

You’ll now have a new directory with several recommended SCSS files, but you are by no means limited to these. Sass provides the @import command, which imports additional SCSS files; these can be full SCSS files or “partials”, which let you define styles without generating an additional CSS file. Partial filenames start with an underscore. The normal practice is to have a partial _base.scss file where you identify base styles and variables, which you then import into your screen.scss file to use whatever you have defined there.

Here’s what this looks like.

On _base.scss, I’ve created two variables.

$headings: #345fff;
$links: #c5ff34;

Now on my screen.scss file, I’ve imported my file and can use these variables.

@import "base";

a {
  color: $links;
}

h1 {
  color: $headings;
}

Now this doesn’t do anything yet since there are no CSS files that you can use on the web. Here’s how you actually make the CSS files.

Open up your command prompt or terminal window and change to the directory your Compass project is in. Then type compass watch, and Compass will compile your SCSS files to CSS. Every time you save your work, the CSS will be updated, which is very handy.

compass2

Now you’ll have something like this in the screen.css file in the stylesheets directory:

 

/* line 9, ../sass/screen.scss */
a {
  color: #c5ff34;
}

/* line 13, ../sass/screen.scss */
h1 {
  color: #345fff;
}

Note that you will have comments inserted telling you where this all came from.

Next Steps

This was an extremely basic overview, and there’s a lot more I didn’t cover–one of the major things being all the possibilities that frameworks provide. But I hope you start to get why this can ultimately make your life easier.

If you have a little experience with this stuff, please share it in the comments.


The Tools Behind Doge Decimal Classification

Four months ago, I made Doge Decimal Classification, a little website that translates Dewey Decimal class names into “Doge” speak à la the meme. It’s a silly site, for sure, but I took the opportunity to learn several new tools. This post will briefly detail each and provide resources for learning more.

Node & NPM

Because I’m curious about the framework, I chose to write Doge Decimal Classification in Node.js. Node is a relatively recent platform that expands the capabilities of JavaScript; rather than running in a browser, Node lets you write server-side software or command line utilities.

For Doge Decimal Classification (henceforth just DDC because, let’s face it, it’s pretty much obsoleted Dewey at this point), I wanted to write two pieces: a module and a website. A module is a reusable piece of code which provides some functionality; in this case, doge-speak Dewey class names. My website is a consumer of that module: it takes the module’s output and dresses it up in Comic Sans and bright colors. By separating the two pieces, I can work on them independently and perhaps reuse the DDC module elsewhere, for instance as a command-line tool.

For the module, I needed to write an NPM package, which turned out to be fairly straightforward. NPM is Node’s package manager; it provides a central repository for thousands of modules. Skipping over the boring part where I write code that actually does something, an NPM package is defined by a package.json file that contains metadata. That metadata tells NPM how users can find, install, and use my module. You can read my module’s package.json and some of its meaning may be self-evident, but here are a few pieces worth explaining:

  • the main field determines the entry point of my module, so when another app uses it (which, in Node, is done via a line like var ddc = require('dogedc');), Node knows to load and execute a certain file
  • the dependencies field lists any packages which my package uses (it just so happens I didn’t use any) while devDependencies contains packages which anyone who wanted to work on developing the dogedc module itself would need (e.g. to run tests or automate other tedious tasks)
  • the repository field tells NPM what version control software I’m using and where to download the source code for my package (in this case, on GitHub)
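
Pulling those pieces together, a bare-bones package.json covering the fields above might look something like the following (the values here are illustrative stand-ins, not the actual dogedc metadata):

```json
{
  "name": "dogedc",
  "version": "1.0.0",
  "description": "such Dewey, very decimal, wow",
  "main": "index.js",
  "devDependencies": {
    "nodeunit": "*"
  },
  "repository": {
    "type": "git",
    "url": "https://github.com/example/dogedc.git"
  }
}
```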

Once I had my running code and package.json file in place, all I needed to do was run npm publish inside my project and it was published on NPM for anyone to install. Magic! When I update dogedc, either by adding new features or squashing bugs, all I need to do is run npm publish once again.

One final nicety of NPM that’s worth describing is the npm link command. While I’m working on my module, I want to be able to use it in the website, but also develop new features and fiddle with it in other contexts. It doesn’t make much sense to repeatedly install it from NPM everywhere that I use it, especially when I’m debugging new code which I don’t want to publish yet. Instead, npm link tells NPM to use my local copy of dogedc as if it was installed globally from NPM. This lets me preview the experience of someone consuming the latest version of my module without actually pushing that version out to the world.

Learn more: What is Node.js & why do I care?, How to create a NodeJS NPM package, NPM Developer Guide, & How do I get started with Node.js

Doge-ification

Now for perhaps the only amusing part of this post: how I went about translating Dewey Decimal classes into doge. The doge meme consists almost entirely of two-word phrases along the lines of “Much zombies. Such death. So amaze” (to quote from a Walking Dead themed meme). The only common one-word phrase is “wow” which is sprinkled liberally in most Doge images.

Reading through a few memes, the approach to taking a class name and turning it into doge becomes clear: split the name into a series of two-word doge phrases with a random adjective inserted before each noun. So “Extraterrestrial Worlds” (999) becomes something like “Many extraterrestrial. Much worlds.” Because the meme ignores most grammar conventions, we don’t need to worry about mismatching count and noncount nouns with appropriate adjectives. So, for instance, even though “worlds” is a count noun (you can say “one world, two worlds”) we can use the noncount adjective “much” to modify it.

There are a few more steps we need to take to create proper doge phrases. First of all, how do we know what adjectives to use? After viewing a bunch of memes, I decided on a list of the most commonly used ones: many, much, so, such, and very. Secondly, our simple procedure might work on “Extraterrestrial Worlds” just fine, but what about “Library and Information Sciences”? The resulting “Many library. Much and. So information. Such sciences.” doesn’t look quite right. We need to strip out small words like conjunctions because they’re not used in doge.

So the final algorithm resembles:

  • Make the entire class name lowercase
  • Strip out stop words and punctuation
  • Split the string into an array on spaces (in JavaScript, this is the string’s split(" ") method)
  • Loop over the array, each time adding a Doge word in front of the current word and then a period
  • Flip a coin to decide whether or not to add “Wow” on the end when we’re done
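
Putting the steps above together, a minimal sketch of the algorithm might look like this (the word lists are abbreviated, and this version cycles through the adjectives instead of picking them randomly so the output is predictable; the real dogedc module differs in its details):

```javascript
// abbreviated, illustrative word lists
var DOGE_WORDS = ["many", "much", "so", "such", "very"];
var STOP_WORDS = ["and", "or", "of", "the", "a", "an", "to", "in"];

function capitalize(s) {
  return s.charAt(0).toUpperCase() + s.slice(1);
}

function dogeify(className) {
  var words = className
    .toLowerCase()                // 1. make the class name lowercase
    .replace(/[^a-z0-9 ]/g, "")  // 2. strip punctuation
    .split(" ")                   // 3. split into an array on spaces
    .filter(function (word) {     //    and strip out the stop words
      return word.length > 0 && STOP_WORDS.indexOf(word) === -1;
    });
  // 4. put a doge adjective in front of each word, then a period
  var phrases = words.map(function (word, i) {
    return capitalize(DOGE_WORDS[i % DOGE_WORDS.length]) + " " + word + ".";
  });
  // 5. a real version would flip a coin here to decide whether to add "Wow"
  return phrases.join(" ");
}

console.log(dogeify("Library and Information Sciences"));
// → Many library. Much information. So sciences.
```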

There are still a few weaknesses here. Most notably, if we use an unassigned class number like 992 the phrase “Not assigned or no longer used” comes out terribly (“Much not. Many assigned. So no. Much longer. Very used.”) simply because it’s a more traditional sentence, not a noun-laden class name. To write this algorithm in a robust way, one would have to incorporate natural language processing to identify which words should be stripped out, or perhaps be more thoughtful about which adjectives modify which nouns. That seems like a fun project for another day, but it was too much for my DDC module.

Learn more: The article A Linguist Explains the Grammar of Doge. Wow was useful in understanding doge.

Express

With my module in hand, I chose to use Express for the foundation of my site because it’s the most popular web framework for Node right now. Express is similar to Ruby on Rails or Django for Python. 1 Express isn’t a CMS like WordPress or Drupal; you can’t just install it and have a running blog working in a few minutes. Instead, Express provides tools and conventions for writing a web app. It scaffolds you over some of the busywork involved but does not do everything for you.

To get started with Express, you can generate a basic template for a site with a command line tool courtesy of NPM:

$ # first we install this tool globally (the -g flag)
$ npm install -g express-generator
$ # creates a ddc folder & puts the site template in it
$ express ddc

   create : ddc
   create : ddc/package.json
   create : ddc/app.js
   …a whole bunch more of this…

   install dependencies:
     $ cd ddc && npm install

   run the app:
     $ DEBUG=ddc ./bin/www
$ # just following the instructions above…
$ cd ddc && npm install

Since the Doge Decimal Classification site is so simple, the initial template that Express provides created about 80% of the final code for the site. Mostly by looking through the generated project’s structure, I figured out what I needed to change and where I should make edits, such as the stark CSS that comes with a new Express site.

There’s a lot to Express so I won’t cover the framework in detail, but to give a short example of how it works I’ll discuss routing. Routing is the practice of determining what content to serve when someone visits a particular URL. Express organizes routing into two parts, one of which occurs in an “app.js” file and the other inside a “routes” directory. Here’s an example:

app.js:

// in app.js, I tell Express which routes to use for
// certain requests. Below means "for GET requests to
// the web site's root, use index"
app.get("/", routes.index);
// for GET requests anywhere else, use the "number" route
// ":number" becomes a special token I can use later
app.get("/:number", routes.number);

routes/index.js:

// inside my routes/index.js file, I define the routes.
// The "request" and "response" parameters below
// correspond to the user's request and what I choose
// to send back to them
exports.index = function(request, response){
    response.render("index", {
        // a random number from 0 to 999
        number: Math.floor(Math.random()*1000)
    });
};

exports.number = function(request, response){
    // here request.params.number is that ":number"
    // from app.js above
    response.render("index", {
        number: request.params.number
    });
};

Routing is how information from a user’s request—such as visiting dogedc.herokuapp.com/020—gets passed into my website and used to generate specific content. I use it to render the specific class number that someone chooses to put onto the URL, but you can imagine better use cases like delivering a user’s profile with a route like app.get('/users/:user', routes.profile). There’s another piece here in how that “number” information gets used to generate HTML, but we’ve talked about Express enough.

Learn more: The Dead-Simple Step-by-Step Guide for Front-End Developers to Getting Up and Running with Node.JS, Express, Jade, and MongoDB takes you through a more involved example of using Express with the popular MongoDB database. Introduction to Express by the ever-excellent NetTuts is also a decent intro. 2

Social Media <meta>s

In my website, I also included Facebook Open Graph and Twitter Card metadata using HTML meta tags. These ensure that I have greater control over how Facebook, Twitter, and other social media platforms present DDC. For instance, someone could share a link to the site in Twitter without including my username. With Twitter cards, I can associate the site with my account, which causes my name to display beside it. In general, the link looks better; an image and tagline I specify also appear. Compare the two tweets in the screenshot below, one which links to a Tumblr (which doesn’t have my <meta> tags on it) and one where Twitter Cards are in effect:

2 tweets about DDC
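
The tags themselves are just a handful of <meta> elements in the page’s <head>. A sketch of what such markup can look like (the handle, image paths, and description below are illustrative placeholders, not the DDC site’s actual tags):

```html
<!-- Twitter Card metadata; values here are placeholders -->
<meta name="twitter:card" content="summary">
<meta name="twitter:site" content="@yourhandle">
<meta name="twitter:title" content="Doge Decimal Classification">
<meta name="twitter:description" content="Dewey Decimal class names, in doge speak">
<meta name="twitter:image" content="http://example.com/doge.png">

<!-- Facebook Open Graph metadata -->
<meta property="og:title" content="Doge Decimal Classification">
<meta property="og:type" content="website">
<meta property="og:url" content="http://dogedc.herokuapp.com/">
<meta property="og:image" content="http://example.com/doge.png">
```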

Learn more: Gearing Up Your Sites for Sharing with Twitter & Facebook Meta Tags

Heroku

There’s one final piece to the DDC site: deployment and hosting. I’ve written the web app and thanks to Express I can even run it locally with a simple npm start command from inside its directory. But how do I share it with the world?

Since I chose to write Doge Decimal Classification in Node, my hosting options were limited. Node is a new technology and it’s not yet available everywhere. If I really wanted something easy to publish, I would’ve used PHP or pure client-side JavaScript. But I purposefully set out to learn new tools, and that’s in large part why I chose to host my app on Heroku.

Heroku is a popular cloud hosting platform. It’s oriented around a suite of command line tools which push new versions of your website to the cloud, as opposed to a traditional web host which might use FTP, a web-based administrative dashboard like Plesk or cPanel, or SSH to provide access to a remote server. Instead, Heroku builds upon the fact that most developers are already versioning their apps with git. Rather than add another new technology to the stack that you have to learn, Heroku works via a smart integration with git.

Once you have the Heroku command line tools installed, all you need to do is write a short “Procfile” which tells Heroku how to start your app (my Procfile is a single line, “web: npm start”). From there, you can run heroku create to start your app and then git push heroku master to push your code live. This makes publishing an app a cinch and Heroku has a free tier which has suited my simple site just fine. If we wanted to scale up, however, Heroku makes increasing the power of our server a matter of running a single command.

Learn more: Getting Started with Heroku; there’s a more specific article for starting a Node project

In Conclusion

There are actually a couple more things I learned in writing Doge Decimal Classification, such as writing unit tests with Nodeunit and using Jade for templates, but this post has gone on long enough. While building DDC, there were many different technologies I could’ve chosen to utilize rather than the ones I selected here. Why Node and Express over Ruby and Sinatra? Because I felt like it. There are entirely too many programming languages, frameworks, and tools for any one person to become familiar with them all. What’s more, I’m not sure when I’ll get a chance to use these tools again, as they’re not common to any of the library technology I work with. But I found using a fun side project to level up in new areas much worthwhile, very enjoy.

Notes

  1. Technically, Rails and Django are more full-featured than Express, which is perhaps more similar to the smaller Sinatra or Flask frameworks for Ruby and Python respectively. The general idea of all these frameworks is the same, though; they provide useful functionality for writing web apps in a particular programming language. Some are just more “batteries included” than others.
  2. There are many more fine Express tutorials out there, but try to stick to recent ones; Express is being developed at a rapid pace and chances are if you run npm install express at the beginning of a tutorial, you’ll end up with a more current version than the one being covered, with subtle but important differences. Be cognizant of that possibility and install the appropriate version by running npm install express@3.3.3 (if 3.3.3 is the version you want).

Websockets For Real-time And Interactive Interfaces

TL;DR WebSockets allows the server to push up-to-date information to the browser without the browser making a new request. Watch the videos below to see the cool things WebSockets enables.

Real-Time Technologies

You are on a Web page. You click on a link and you wait for a new page to load. Then you click on another link and wait again. It may only be a second or a few seconds before the new page loads after each click, but it still feels like it takes way too long for each page to load. The browser always has to make a request and the server gives a response. This client-server architecture is part of what has made the Web such a success, but it is also a limitation of how HTTP works. Browser request, server response, browser request, server response….

But what if you need a page to provide up-to-the-moment information? Reloading the page for new information is not very efficient. What if you need to create a chat interface or to collaborate on a document in real-time? HTTP alone does not work so well in these cases. When a server gets updated information, HTTP provides no mechanism to push that message to clients that need it. This is a problem because you want to get information about a change in chat or a document as soon as it happens. Any kind of lag can disrupt the flow of the conversation or slow down the editing process.

Think about when you are tracking a package you are waiting for. You may have to keep reloading the page for some time until there is any updated information. You are basically manually polling the server for updates. Using XMLHttpRequest (XHR) (also commonly known as Ajax) has been a popular way to try to work around the limitations of HTTP somewhat. After the initial page load, JavaScript can be used to poll the server for any updated information without user intervention.

Using JavaScript in this way you can still use normal HTTP and almost simulate getting a real-time feed of data from the server. After the initial request for the page, JavaScript can repeatedly ask the server for updated information. The browser client still makes a request and the server responds, and the request can be repeated. Because this cycle is all done with JavaScript it does not require user input, does not result in a full page reload, and the amount of data returned from the server can be minimal. In the case where there is no new data to return, the server can just respond with something like, “Sorry. No new data. Try again.” The browser then repeats the poll, trying again and again until there is some new data with which to update the page, and then goes back to polling once more.
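
The polling cycle just described can be sketched roughly as follows (the function names are illustrative; a real page would make the request with XMLHttpRequest):

```javascript
// "getUpdates" stands in for the Ajax/XHR request: it calls back with
// new data, or with null when the server has nothing new to report.
function pollOnce(getUpdates, onNewData) {
  getUpdates(function (data) {
    if (data !== null) {
      onNewData(data); // update the page with the new information
    }
    // on null ("Sorry. No new data."), do nothing and wait for the next tick
  });
}

// repeat the request on a timer, with no user intervention required
function startPolling(getUpdates, onNewData, intervalMs) {
  return setInterval(function () {
    pollOnce(getUpdates, onNewData);
  }, intervalMs);
}
```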

This kind of polling has been implemented in many different ways, but all polling methods still have some queuing latency. Queuing latency is the time a message has to wait on the server before it can be delivered to the client. Until recently there has not been a standardized, widely implemented way for the server to send messages to a browser client as soon as an event happens. The server would always have to sit on the information until the client made a request. But there are a couple of standards that do allow the server to send messages to the browser without having to wait for the client to make a new request.

Server Sent Events (aka EventSource) is one such standard. Once the client initiates the connection with a handshake, Server Sent Events allows the server to continue to stream data to the browser. This is a true push technology. The limitation is that only the server can send data over this channel. In order for the browser to send any data to the server, the browser would still need to make an Ajax/XHR request. EventSource also lacks support even in some recent browsers like IE11.

WebSockets allows for full-duplex communication between the client and the server. The client does not have to open up a new connection to send a message to the server which saves on some overhead. When the server has new data it does not have to wait for a request from the client and can send messages immediately to the client over the same connection. Client and server can even be sending messages to each other at the same time. WebSockets is a better option for applications like chat or collaborative editing because the communication channel is bidirectional and always open. While there are other kinds of latency involved here, WebSockets solves the problem of queuing latency. Removing this latency concern is what is meant by WebSockets being a real-time technology. Current browsers have good support for WebSockets.
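
On the client side, the browser’s WebSocket API makes this straightforward. Here is a hedged sketch of a client for something like a live image feed (the URL and message format are made up for illustration, not any particular application’s protocol):

```javascript
// Connect to a (hypothetical) server and react to pushed messages.
function watchViews(url, onImage) {
  var socket = new WebSocket(url); // one persistent, full-duplex connection

  // the client can send data over the same channel...
  socket.onopen = function () {
    socket.send(JSON.stringify({ subscribe: "image-views" }));
  };

  // ...and the server can push messages the moment something happens,
  // with no new request from the browser and no queuing latency
  socket.onmessage = function (event) {
    var data = JSON.parse(event.data);
    onImage(data.imageUrl);
  };

  return socket;
}
```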

Using WebSockets solves some real problems on the Web, but how might libraries, archives, and museums use them? I am going to share details of a couple applications from my work at NCSU Libraries.

Digital Collections Now!

When Google Analytics first turned on real-time reporting it was mesmerizing. I could see what resources on the NCSU Libraries’ Rare and Unique Digital Collections site were being viewed at the exact moment they were being viewed. Or rather I could view the URL for the resource being viewed. I happened to notice that there would sometimes be multiple people viewing the same resource at the same time. This gave me some hint that someone’s social share or forum post was getting a lot of click-throughs at that very moment. Or sometimes there would be a story in the news and we had an image of one of the people involved. I could then follow up and see examples of where we were being effective with search engine optimization.

The Rare & Unique site has a lot of visual resources like photographs and architectural drawings. I wanted to see the actual images that were being viewed. The problem, though, was that Google Analytics does not have an easy way to click through from a URL to the resource on your site. I would have to retype the URL, copy and paste the part of the URL path, or do a search for the resource identifier. I just wanted to see the images now. (OK, this first use case was admittedly driven by one of the great virtues of a programmer–laziness.)

My first attempt at this was to create a page that would show the resources which had been viewed most frequently in the past day and past week. To enable this functionality, I added some custom logging that is saved to a database. Every view of every resource would just get a little tick mark that would be tallied up occasionally. These pages showing the popular resources of the moment are then regenerated every hour.

It was not a real-time view of activity, but it was easy to implement and it did answer a lot of questions for me about what was most popular. Some images are regularly in the group of the most-viewed images. I learned that people often visit the image of the NC State men’s basketball 1983 team roster which went on to win the NCAA tournament. People also seem to really like the indoor pool at the Biltmore estate.

Really Real-Time

Now that I had this logging in place I set about to make it really real-time. I wanted to see the actual images being viewed at that moment by a real user. I wanted to serve up a single page and have it be updated in real-time with what is being viewed. And this is where the persistent communication channel of WebSockets came in. WebSockets allows the server to immediately send these updates to the page to be displayed.

People have told me they find this real-time view to be addictive. I found it to be useful. I have discovered images I never would have seen or even known how to search for before. At least for me this has been an effective form of serendipitous discovery. I also have a better sense of what different traffic volumes actually feel like on a good day. You too can see what folks are viewing in real-time now. And I have written up some more details on how this is all wired up together.

The Hunt Library Video Walls

I also used WebSockets to create interactive interfaces on the Hunt Library video walls. The Hunt Library has five large video walls created with Christie MicroTiles. These very large displays each have their own affordances based on the technologies in the space and the architecture. The Art Wall is above the single service point just inside the entrance of the library and is visible from outside the doors on that level. The Commons Wall is in front of a set of stairs that also function as coliseum-like seating. The Game Lab is within a closed space and already set up with various game consoles.

Listen to Wikipedia

When I saw and heard the visualization and sonification Listen to Wikipedia, I thought it would be perfect for the iPearl Immersion Theater. Listen to Wikipedia visualizes and sonifies data from the stream of edits on Wikipedia. The size of the bubbles is determined by the size of the change to an entry, and the sound changes in pitch based on the size of the edit. Green circles show edits from unregistered contributors, and purple circles mark edits performed by automated bots. (These automated bots are sometimes used to integrate library data into Wikipedia.) A bell signals an addition to an entry. A string pluck is a subtraction. New users are announced with a string swell.

The original Listen to Wikipedia (L2W) is a good example of the use of WebSockets for real-time displays. Wikipedia publishes all edits for every language into IRC channels. A bot called wikimon monitors each of the Wikipedia IRC channels and watches for edits. The bot then forwards the information about the edits over WebSockets to the browser clients on the Listen to Wikipedia page. The browser then takes those WebSocket messages and uses the data to create the visualization and sonification.

As you walk into the Hunt Library almost all traffic goes past the iPearl Immersion Theater. The one feature that made this space perfect for Listen to Wikipedia was that it has sound and, depending on your tastes, L2W can create pleasant ambient sounds1. I began by adjusting the CSS styling so that the page would fit the large wall. Besides setting the width and height, I adjusted the size of the fonts. I added some text to a panel on the right explaining what folks are seeing and hearing. On the left is now text asking passersby to interact with the wall and a list of the languages currently being watched for updates.

One feature of the original L2W that we wanted to keep was the ability to change which languages are being monitored and visualized. Each language can individually be turned off and on. During peak times the English Wikipedia alone can sound cacophonous. An active bot can make lots of edits of all roughly similar sizes. You can also turn off or on changes to Wikidata which collects structured data that can support Wikipedia entries. Having only a few of the less frequently edited languages on can result in moments of silence punctuated by a single little dot and small bell sound.

We wanted to keep the ability to change the experience and actually get a feel for the torrent or trickle of Wikipedia edits and allow folks to explore what that might mean. We currently have no input device for directly interacting with the Immersion Theater wall. For L2W the solution was to allow folks to bring their own devices to act as a remote control. We encourage passersby to interact with the wall with a prominent message. On the wall we show the URL to the remote control. We also display a QR code version of the URL. To prevent someone in New Zealand from controlling the Hunt Library wall in Raleigh, NC, we use a short-lived, three-character token.
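A short-lived token along these lines can be sketched in Python. This is a hypothetical illustration, not the production code: the three-character length is from the post, but the alphabet, the five-minute lifetime, and the function names are my own assumptions.

```python
import secrets
import string
import time

# Illustrative choices: lowercase letters and digits, five-minute lifetime.
ALPHABET = string.ascii_lowercase + string.digits
TOKEN_TTL_SECONDS = 5 * 60

def new_token(length=3):
    """Generate a random short token and the time at which it expires."""
    token = "".join(secrets.choice(ALPHABET) for _ in range(length))
    return token, time.time() + TOKEN_TTL_SECONDS

def is_valid(token, expires_at, now=None):
    """A token is only honored until its expiry time passes."""
    now = time.time() if now is None else now
    return now < expires_at
```

The server would embed the current token in the remote-control URL shown on the wall, and reject control messages that arrive with an expired or unknown token.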

Because we were uncertain how best to allow a visitor to kick off an interaction, we included both a URL and QR code. They each have slightly different URLs so that we can track use. We were surprised to find that most of the interactions began with scanning the QR code. Currently 78% of interactions begin with the QR code. We suspect that we could increase the number of visitors interacting with the wall if there were other simpler ways to begin the interaction. For bring-your-own-device remote controls we are interested in how we might use technologies like Bluetooth Low Energy within the building for a variety of interactions with the surroundings and our services.

The remote control Web page is a list of big checkboxes next to each of the languages. Clicking on one of the languages turns its stream on or off on the wall (connects or disconnects one of the WebSockets channels the wall is listening on). The change happens almost immediately with the wall showing a message and removing or adding the name of the language from a side panel. We wanted this to be at least as quick as the remote control on your TV at home.

The quick interaction is possible because of WebSockets. Both the browser page on the wall and the remote control client listen on another WebSockets channel for such messages. This means that as soon as the remote control sends a message to the server, it can be sent immediately to the wall and the change reflected. If the wall were using polling to get changes, there would potentially be more latency before a change registered on the wall. The remote control client also listens on a WebSockets channel for updates, so feedback can be displayed to the user as soon as the change has actually been made.

Having the remote control listen for messages from the server also serves another purpose. If more than one person enters the space to control the wall, what is the correct way to handle that situation? If there are two users, how do you accurately represent the current state on the wall for both users? Maybe once the first user begins controlling the wall it locks out other users. This would work, but then how long do you lock others out? It could be frustrating for a user to have launched their QR code reader, lined up the QR code in their camera, and scanned it only to find that they are locked out and unable to control the wall. What I chose to do instead was to have every message of every change go via WebSockets to every connected remote control. In this way it is easy to keep the remote controls synchronized. Every change on one remote control is quickly reflected on every other remote control instance. This prevents most cases where the remote controls might get out of sync. While there is still the possibility of a race condition, it becomes less likely with the real-time connection and is harmless. Besides not having to lock anyone out, it also seems like a lot more fun to notice that others are controlling things as well–maybe it even makes the experience a bit more social. (Although, can you imagine how awful it would be if everyone had their own TV remote at home?)
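Independent of any particular WebSocket library, the pattern of sending every change to every connected remote control boils down to a broadcast hub. Here is a minimal sketch of that pattern; the class and method names are illustrative, and the fake client stands in for a real WebSocket connection:

```python
class BroadcastHub:
    """Fan every published message out to all connected clients."""

    def __init__(self):
        self.clients = []

    def connect(self, client):
        self.clients.append(client)

    def disconnect(self, client):
        self.clients.remove(client)

    def publish(self, message):
        # Every remote control receives every change, so they all stay in sync
        # without needing to lock anyone out.
        for client in self.clients:
            client.send(message)

class FakeClient:
    """Stand-in for a WebSocket connection; records what it receives."""

    def __init__(self):
        self.received = []

    def send(self, message):
        self.received.append(message)
```

With two connected remotes, publishing a change like `{"language": "en", "enabled": False}` delivers it to both, which is why each remote can immediately reflect what the other has done.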

I also thought it was important for something like an interactive exhibit around Wikipedia data to provide the user some way to read the entries. From the remote control the user can get to a page which lists the same stream of edits that are shown on the wall. The page shows the title for the most recently edited entry at the top of the page and pushes others down the page. The titles link to the current revision for that page. This page just listens to the same WebSockets channels as the wall does, so the changes appear on the wall and remote control at the same time. Sometimes the stream of edits can be so fast that it is impossible to click on an interesting entry. A button allows the user to pause the stream. When an intriguing title appears on the wall or there is a large edit to a page, the viewer can pause the stream, find the title, and click through to the article.

The reaction from students and visitors has been fun to watch. The enthusiasm has had unexpected consequences. For instance one day we were testing L2W on the wall and noting what adjustments we would want to make to the design. A student came in and sat down to watch. At one point they opened up their laptop and deleted a large portion of a Wikipedia article just to see how large the bubble on the wall would be. Fortunately the edit was quickly reverted.

We have also seen the L2W exhibit pop up on social media. This Instagram video was posted with the comment, “Reasons why I should come to the library more often. #huntlibrary.”

This is people editing–Oh, someone just edited Home Alone–editing Wikipedia in this exact moment.

The original Listen to Wikipedia is open source. I have also made the source code for the Listen to Wikipedia exhibit and remote control application available. You would likely need to change the styling to fit whatever display you have.

Other Examples

I have also used WebSockets for some other fun projects. The Hunt Library Visualization Wall has a unique columnar design, and I used it to present images and video from our digital special collections in a way that allows users to change the exhibit. For the Code4Lib talk this post is based on, I developed a template for creating slide decks that include audience participation and synchronized notes via WebSockets.

Conclusion

The Web is now a better development platform for creating real-time and interactive interfaces. WebSockets provides the means for sending real-time messages between servers, browser clients, and other devices. This opens up new possibilities for what libraries, archives, and museums can do to provide up to the moment data feeds and to create engaging interactive interfaces using Web technologies.

If you would like more technical information about WebSockets and these projects, please see the materials from my Code4Lib 2014 talk (including speaker notes) and some notes on the services and libraries I have used. There you will also find a post with answers to the (serious) questions I was asked during the Code4Lib presentation. I’ve also posted some thoughts on designing for large video walls.

Thanks: Special thanks to Mike Nutt, Brian Dietz, Yairon Martinez, Alisa Katz, Brent Brafford, and Shirley Rodgers for their help with making these projects a reality.


About Our Guest Author: Jason Ronallo is the Associate Head of Digital Library Initiatives at NCSU Libraries and a Web developer. He has worked on lots of other interesting projects. He occasionally writes for his own blog Preliminary Inventory of Digital Collections.

Notes

  1. Though honestly, listening to Listen to Wikipedia that much while developing the Immersion Theater display drove me crazy.

Analyzing Usage Logs with OpenRefine

Background

Like a lot of librarians, I have access to a lot of data, and sometimes no idea how to analyze it. When I learned about linked data and the ability to search against data sources with a piece of software called OpenRefine, I wondered if it would be possible to match our users’ discovery layer queries against the Library of Congress Subject Headings. From there I could use the linking in LCSH to find the Library of Congress Classification, and then get an overall picture of the subjects our users were searching for. As with many research projects, it didn’t really turn out like I anticipated, but it did open further areas of research.

At California State University, Fullerton, we use an open source application called Xerxes, developed by David Walker at the CSU Chancellor’s Office, in combination with the Summon API. Xerxes acts as an interface for any number of search tools, including Solr, federated search engines, and most of the major discovery service vendors. We call it the Basic Search, and it’s incredibly popular with students, with over 100,000 searches a month and growing. It’s also well-liked – in a survey, about 90% of users said they found what they were looking for. We have monthly files of our users’ queries, so I had all of the data I needed to go exploring with OpenRefine.

OpenRefine

OpenRefine is an open source tool that deals with data in a very different way than typical spreadsheets. It has been mentioned in TechConnect before, and Margaret Heller’s post, “A Librarian’s Guide to OpenRefine” provides an excellent summary and introduction. More resources are also available on Github.

One of the most powerful things OpenRefine does is to allow queries against open data sets through a function called reconciliation. In the open data world, reconciliation refers to matching the same concept among different data sets, although in this case we are matching unknown entities against “a well-known set of reference identifiers” (Re-using Cool URIs: Entity Reconciliation Against LOD Hubs).

Reconciling Against LCSH

In this case, we’re reconciling our discovery layer search queries with LCSH. This basically means it’s trying to match the entire user query (e.g. “artist” or “cost of assisted suicide”) against what’s included in the LCSH linked open data. According to the LCSH website this includes “all Library of Congress Subject Headings, free-floating subdivisions (topical and form), Genre/Form headings, Children’s (AC) headings, and validation strings* for which authority records have been created. The content includes a few name headings (personal and corporate), such as William Shakespeare, Jesus Christ, and Harvard University, and geographic headings that are added to LCSH as they are needed to establish subdivisions, provide a pattern for subdivision practice, or provide reference structure for other terms.”

I used the directions at Free Your Metadata to point me in the right direction. One note: the steps below apply to OpenRefine 2.5 and version 0.8 of the RDF extension. OpenRefine 2.6 requires version 0.9 of the RDF extension. Or you could use LODRefine, which bundles some major extensions and which I hear is great, though I haven’t personally tried it. The basic process shouldn’t change too much.

(1) Import your data

OpenRefine has quite a few file type options, so your format is likely already supported.

 Screenshot of importing data

(2) Clean your data

In my case, this involves deduplicating by timestamp and trimming leading and trailing whitespace. You can also remove stray punctuation, numbers, and even extremely short queries (<2 characters).
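These cleaning steps can also be done outside OpenRefine. A minimal Python sketch, with the exact rules (trim, drop queries under two characters, deduplicate) as illustrative assumptions:

```python
def clean_queries(queries, min_length=2):
    """Trim whitespace, drop extremely short queries, and deduplicate."""
    seen = set()
    cleaned = []
    for q in queries:
        q = q.strip()  # remove leading and trailing whitespace
        if len(q) < min_length:
            continue  # drop extremely short queries
        if q in seen:
            continue  # remove duplicates
        seen.add(q)
        cleaned.append(q)
    return cleaned
```

For example, `clean_queries(["  artist ", "artist", "a"])` keeps a single trimmed `"artist"` and drops the one-character query.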

(3) Add the RDF extension.

If you’ve done it correctly, you should see an RDF dropdown next to Freebase.

Screenshot of correctly installed RDF extension

(4) Decide which data you’d like to search on.

In this example, I’ve decided to use just queries that are four words or fewer, and removed duplicate search queries. (Xerxes handles facet clicks as if they were separate searches, so there are many duplicates; I usually don’t remove them, though, unless they happen at nearly the same time.) I’ve also experimented with limiting to 10 or 15 characters, but there were not many more matches with 15 characters than 10, even though the data set was much larger. It depends on how much computing time you want to spend…it’s really a personal choice. In this case, I chose 4 words because of my experience with 15 characters – longer does not necessarily translate into more matches. A cursory glance at LCSH left me with the impression that the vast majority of headings (not including subdivisions, since they’d be searched individually) were 4 words or less. This, of course, means that your data with more than 4 words is unusable – more on that later.

Screenshot of adding a column based on word count using ngrams

(5) Go!

Shows OpenRefine reconciling

(6) Now you have your queries that were reconciled against LCSH, so you can limit to just those.

Screenshot of limiting to reconciled queries

Finding LC Classification

First, you’ll need to extract the cell.recon.match.id – the ID for the matched query that in the case of LCSH is the URI of the concept.

Screenshot of using cell.recon.match.id to get URI of concept

At this point you can choose whether to grab the HTML or the JSON, and create a new column based on this one by fetching URLs. I’ve never been able to get the parseJson() function to work correctly with LC’s JSON outputs, so for both HTML and JSON I’ve just regexed the raw output to isolate the classification. For more on regex see Bohyun Kim’s previous TechConnect post, “Fear No Longer Regular Expressions.”

On the raw HTML, the easiest way to do it is to transform the cells or create a new column with:

replace(partition(value,/<li property="madsrdf:classification">(<[^>]+>)*([A-Z]{1,2})/)[1],/<li property="madsrdf:classification">(<[^>]+>)*([A-Z]{1,2})/,"$2")

Screenshot of using regex to get classification

You’ll note this will only pull out the first classification given, even if some have multiple classifications. That was a conscious choice for me, but obviously your needs may vary.
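Outside of OpenRefine, the same first-classification extraction can be sketched with Python’s `re` module. The pattern mirrors the GREL expression above; the sample HTML is a made-up stand-in for the Library of Congress page markup, not a real response:

```python
import re

# Match the madsrdf:classification list item, skip any nested tags, and
# capture the one- or two-letter LC class, as in the GREL expression.
PATTERN = re.compile(
    r'<li property="madsrdf:classification">(?:<[^>]+>)*([A-Z]{1,2})'
)

def first_classification(html):
    """Return the first LC classification found in the HTML, or None."""
    match = PATTERN.search(html)
    return match.group(1) if match else None

# Hypothetical markup with two classifications; only the first is returned.
sample = (
    '<li property="madsrdf:classification"><a href="#">PN</a></li>'
    '<li property="madsrdf:classification">Z</li>'
)
```

As with the GREL version, `first_classification(sample)` pulls out only the first classification (`"PN"` here), even though the sample contains two.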

(Also, although I’m only concentrating on classification for this project, there’s a huge amount of data that you could work with – you can see an example URI for Acting to see all of the different fields).

Once you have the classifications, you can export to Excel and create a pivot table to count the instances of each, and you get a pretty table.

Table of LC Classifications

Caveats & Further Explorations

As you can guess by the y-axis in the table above, the number of matches is a very small percentage of actual searches. First I limited to keyword searches (as opposed to title/subject), then of those only ones that were 4 or fewer words long (about 65% of keyword searches). Of those, only about 1000 of the 26000 queries matched, and resulted in about 360 actual LC Classifications. Most months I average around 500, but in this example I took out duplicates even if they were far apart in time, just to experiment.

One thing I haven’t done but am considering is allowing matches that aren’t 100%. From my example above, there are another 600 or so queries that matched at 50-99%. This could significantly increase the number of matches and thus give us more classifications to work with.

Some of this is related to the types of searches that students are doing (see Michael J DeMars’ and my presentation “Making Data Less Daunting” at Electronic Resources & Libraries 2014, which this article grew out of, for some crazy examples) and some to the way that LCSH is structured. I chose LCSH because I could get linked to the LC Classification and thus get a sense of the subjects, but I’m definitely open to ideas. If you know of a better linked data source, I’m all ears.

I must also note that this is a pretty inefficient way of matching against LCSH. If you know of a way I could download the entire set, I’m interested in investigating that way as well.

Another approach that I will explore is moving away from reconciliation with LCSH (which is really more appropriate for a controlled vocabulary) to named-entity extraction, which takes natural language inputs and tries to recognize or extract common concepts (name, place, etc). Here I would use it as a first step before trying to match against LCSH. Free Your Metadata has a new named-entity extraction extension for OpenRefine, so I’ll definitely explore that option.

Planned Research

In the end, although this is interesting, does it actually mean anything? My next step with this dataset is to take a subset of the search queries and assign classification numbers. Over the course of several months, I hope to see if what I’ve pulled in automatically resembles the hand-classified data, and then draw conclusions.

So far, most of the peaks are expected – psychology and nursing are quite strong departments. There are some surprises though – education has been consistently underrepresented, based on both our enrollment numbers and when you do word counts (see our presentation for one month’s top word counts). Education students have a robust information literacy program. Does this mean that education students do complex searches that don’t match LCSH? Do they mostly use subject databases? Once again, an area for future research, should these automatic results match the classifications I do by hand.

What do you think? I’d love to hear your feedback or suggestions.

About Our Guest Author

Jaclyn Bedoya has lived and worked on three continents, although currently she’s an ER Librarian at CSU Fullerton. It turns out that growing up in Southern California spoils you, and she’s happiest being back where there are 300 days of sunshine a year. Also Disneyland. Reach her @spamgirl on Twitter or jaclynbedoya@gmail.com


My First Hackathon & WikipeDPLA

Almost two months ago, I attended my first hackathon during ALA’s Midwinter Meeting. Libhack was coordinated by the Library Code Year Interest Group. Much credit is due to coordinators Zach Coble, Emily Flynn, Jesse Saunders, and Chris Strauber. The University of Pennsylvania graciously hosted the event in their Van Pelt Library.

What’s a hackathon? It’s a short event, usually a day or two, wherein coders and other folks get together to produce software. Hackathons typically work on a particular problem, application, or API (a source of structured data). LibHack focused on APIs from two major library organizations: OCLC and the Digital Public Library of America (DPLA).

Impressions & Mixed Content

Since this was my first hackathon and the gritty details below may be less than relevant to all our readers, I will front-load my general impressions of Libhack rather than talk about the code I wrote. First of all, splitting the hackathon into two halves focused on different APIs and catering to different skill levels worked well. There were roughly equal numbers of participants in both the structured, beginner-oriented OCLC group and the more independent DPLA group.

Having representatives from both of the participating institutions was wonderful. While I didn’t take advantage of the attending DPLA staff as much as I should have, it was great to have a few people to answer questions. What’s more, I think DPLA benefited from hearing about developers’ experiences with their API. For instance, there are a few metadata fields in their API which might contain an array or a string depending upon the record. If an application assumes one or the other, chances are it breaks at some point and the programmer has to locate the error and write code that handles either data format.

Secondly, the DPLA API is currently available only over unencrypted HTTP. Thus due to the mixed content policies of web browsers it is difficult to call the HTTP API on HTTPS pages. For the many HTTP sites on the web this isn’t a concern, but I wanted to call the DPLA API from Wikipedia which only serves content over HTTPS. To work around this limitation, users have to manually override mixed content blocking in their browser, a major limitation for my project. DPLA already had plans to roll out an HTTPS API, but I think hearing from developers may influence its priority.

Learn You Some Lessons

Personally, I walked away from Libhack with a few lessons. First of all, I had to throw away my initial code before creating something useful. While I had a general idea in mind—somehow connect DPLA content related to a given Wikipedia page—I wasn’t sure what type of project I should create. I started writing a command-line tool in Python, envisioning a command that could be passed a Wikipedia URL or article title and return a list of related items in the DPLA. But after struggling with a pretty unsatisfying project for a couple hours, including a detour into investigating the MediaWiki API, I threw everything aside and took a totally different approach by building a client-side script meant to run in a web browser. In the end, I’m a lot happier with the outcome after my initial failure. I love the command line, but the appeal of such a tool would be niche at best. What I wrote has a far broader appeal.1

Secondly, I worked closely with Wikipedian Jake Orlowitz.2 While he isn’t a coder, his intimate knowledge of Wikipedia was invaluable for our end product. Whenever I had a question about Wikipedia’s inner workings or needed someone to bounce ideas off of, he was there. While I blindly started writing some JavaScript without a firm idea of how we could embed it onto Wikipedia pages, it was Jake who pointed me towards User Scripts and created an excellent installation tour.3 In other groups, I heard people discussing metadata, subject terms, and copyright. I think that having people of varied expertise in a group is advantageous when compared with a group solely composed of coders. Many hackathons explicitly state that non-programmers are welcome and with good reason; experts can outline goals, consider end-user interactions, and interpret API results. These are all invaluable contributions which are also hard to do with one’s face buried in a code editor.

While I did enjoy my hackathon experience, I was expecting a bit more structure and larger project groups. I arrived late, which doubtless didn’t help, but the DPLA groups were very fragmented. Some projects were only individuals, while others (like ours) were pairs. I had envisioned groups of at least four, where perhaps one person would compose plans and documentation, another would design a user interface, and the remainder would write back-end code. I can’t say that I was at all disappointed, but I could have benefited from the perspectives of a larger group.

What is WikipeDPLA?

So what did we build at Libhack anyway? As previously stated, we made a Wikipedia user script. I’ve dubbed it WikipeDPLA, though you can find it as FindDPLA on Wikipedia. Once installed, the script will query DPLA’s API on each article you visit, inserting related items towards the top.

WikipeDPLA in action

How does it work?

Here’s a step-by-step walkthrough of how WikipeDPLA works:

When you visit a Wikipedia article, it collects a few pieces of information about the article by copying text from the page’s HTML: the article’s title, any “X redirects here” notices, and the article’s categories.

First, WikipeDPLA constructs a DPLA query using the article’s title. Specifically, it constructs a JSONP query. JSONP is a means of working around the web’s same-origin policy which lets scripts manipulate data on other web pages. It works by including a script tag with a specially constructed URL containing a reference to one of your JavaScript functions:

<script src="//example.com/jsonp-api?q=search+term&callback=parseResponse"></script>

In responding to this request, the API plays a little trick; it doesn’t just return raw data, since that would be invalid JavaScript and thus cause a parsing error in the browser. Instead, it wraps the data in the function we’ve provided it. In the example above, that’s parseResponse:

parseResponse({
    "results": [
        {"title": "Searcher Searcherson",
        "id": 123123,
        "genre": "Omphaloskepsis"},
        {"title": "Terminated Term",
        "id": 321321,
        "genre": "Literalism"}
    ]
});

This is valid JavaScript; parseResponse receives an object which contains an array of search result records, each with some minimal metadata. This pattern has the handy feature that, as soon as our query results are available, they’re immediately passed to our callback function.
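On the server side, the JSONP "trick" amounts to nothing more than wrapping the JSON payload in the callback name the client supplied. A hypothetical sketch of that wrapping step (the function name is mine, not DPLA’s):

```python
import json

def jsonp_wrap(callback, data):
    """Serialize data as JSON and wrap it in the client's callback name,
    producing a string that is valid JavaScript when executed."""
    return "{}({});".format(callback, json.dumps(data))
```

Calling `jsonp_wrap("parseResponse", {"results": []})` yields `parseResponse({"results": []});`, which the browser executes as an ordinary script, handing the data straight to the callback.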

WikipeDPLA’s equivalent of parseResponse looks to see if there are any results. If the article’s title doesn’t return any results, then it’ll try again with any alternate titles culled from the article’s redirection notices. If those queries are also fruitless, it starts to go through the article’s categories.

Once we’ve guaranteed that we have some results from DPLA, we parse the API’s metadata into a simpler subset. This subset consists of the item’s title, a link to its content, and an “isImage” Boolean value noting whether or not the item is an image. With this simpler set of data in hand, we loop through our results to build a string of HTML which is then inserted onto the page. Voilà! DPLA search results in Wikipedia.
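That parse-then-render step can be sketched in Python. The field names mirror the simplified subset described above, but the HTML structure is an illustrative guess, not WikipeDPLA’s actual markup:

```python
import html

def render_results(items):
    """Build an HTML list from simplified records of the form
    {"title": ..., "link": ..., "isImage": ...}."""
    parts = []
    for item in items:
        # Flag image items so the page can distinguish them.
        icon = " [image]" if item["isImage"] else ""
        parts.append('<li><a href="{}">{}</a>{}</li>'.format(
            html.escape(item["link"], quote=True),
            html.escape(item["title"]),
            icon,
        ))
    return "<ul>" + "".join(parts) + "</ul>"
```

The resulting string is what gets inserted into the article page; escaping the title and link guards against any markup lurking in the metadata.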

Honing

After putting the project together, I continued to refine it. I used the “isImage” Boolean to put a small image icon next to an item’s link. Then, after the hackathon, I noticed that my script was a nuisance if a user started reading a page anywhere other than at its start. For instance, if you start reading the Barack Obama article at the Presidency section, you will read for a moment and then suddenly be jarred as the DPLA results are inserted up top and push the rest of the article’s text down the page. In order to mitigate this behavior, we need to know if the top of the article is in view before inserting our results HTML. I used a jQuery visibility plug-in and an event listener on window scroll events to fix this.

Secondly, I was building a project with several targets: a user script for Wikipedia, a Grease/Tampermonkey user script4, and a (as yet inchoate) browser extension. To reuse the same basic JavaScript but in these different contexts, I chose to use the make command. Make is a common program used for projects which have multiple platform targets. It has an elegantly simple design: when you run make foo inside of a directory, make looks in a file named “makefile” for a line labelled “foo:” and then executes the shell command on the subsequent line. So if I have the following makefile:

hello:
    echo 'hello world!'

bye:
    echo 'goodbye!'

clean:
    rm *.log *.cache

Inside the same directory as this makefile, the commands make hello, make bye, and make clean respectively would print “hello world!” to my terminal, print “goodbye!”, and delete all files ending in extension “log” or “cache”. This contrived example doesn’t help much, but in my project I can run something like make userscript and the Grease/Tampermonkey script is automatically produced by prepending some header text to the main WikipeDPLA script. Similarly, make push produces all the various platform targets and then pushes the results up to the GitHub repo, saving a significant amount of typing on the command line.

These bits of trivia about interface design and tooling allude to a more important idea: it’s vital to choose projects that help you learn, particularly in a low-stakes environment like a hackathon. No one expects greatness from a product duct taped together in a few hours, so seize the opportunity to practice rather than aim for perfection. I didn’t have to write a makefile, but I chose to spend time familiarizing myself with a useful tool.

What’s Next?

While I am quite happy with my work at Libhack I do have plans for improvement. My main goal is to turn WikipeDPLA into a browser extension, for Chrome and perhaps Firefox. An extension offers a couple advantages: it can avoid the mixed-content issue with DPLA’s HTTP-only API5 and it is available even for users who aren’t logged in to Wikipedia. It would also be nice to expand my approach to encompass other major digital library APIs, such as Europeana or Australia’s Trove.

And, of course, I want to attend more hackathons. Libhack was a very positive event for me, both in terms of learning and producing something useful, so I’m encouraged and hope other library conferences offer collaborative coding opportunities.

Other Projects

Readers should head over to LITA Blog where organizer Zach Coble has a report on libhack which details several other projects created at the Midwinter hackathon. Or you could just follow @HistoricalCats on Twitter.

Notes

  1. An aside related to learning to program: being a relatively new coder, I often think about advice I can give others looking to start coding. One common question is “what language should I learn first?” There’s one stock response: that it’s important not to worry too much about this choice because learning the fundamentals of one language will enable you to learn others quickly. But that dodges the question, because what people want to hear is a proper noun like “Ruby” or “Python” or “JavaScript.” And JavaScript, despite not being nearly as user friendly as those other two options, is a great starting place because it lets you work on the web with little effort. All of this is to say: if I didn’t know JavaScript fairly well, I would not have been able to make something so useful.
  2. Shameless plug: Jake works on the Wikipedia Library, an interesting project that aims to connect Wikipedian researchers with source material, from subscription databases and open access repositories alike.
  3. User Scripts are pieces of JavaScript that a user can choose to insert whenever they are signed into and browsing Wikipedia. They’re similar to Greasemonkey user scripts, except the scripts only apply to Wikipedia. These scripts can do anything from customize the site’s appearance to insert new content, which is exactly what we did.
  4. Greasemonkey is the Firefox add-on for installing scripts that run on specified sites or pages; Tampermonkey is an analogous extension for Chrome.
  5. How’s that for acronyms?

Bash Scripting: automating repetitive command line tasks

Introduction

One of my current tasks is to develop workflows for digital preservation procedures. These workflows begin with the acquisition of files – either disk images or logical file transfers – both of which end up on a designated server. Once acquired, the images (or files) are checked for viruses. If clean, they are bagged using BagIt and then copied over to a different server for processing.1 This work is all done at the command line, and as you can imagine, it gets quite repetitive. It’s also a bit error-prone since our file naming conventions include a 10-digit media ID number, which is easily mistyped. So once all the individual processes were worked out, I decided to automate things a bit by placing the commands into a single script. I should mention here that I’m no Linux whiz; I use it as needed, which sometimes is daily, sometimes not. This is the first time I’ve ever tried to tie commands together in a Bash script, but I figured previous programming experience would help.

Creating a Script

To get started, I placed all the virus check commands for disk images into a script. These commands are different than logical file virus checks since the disk has to be mounted to get a read. This is a pretty simple step – first add:

#!/bin/bash

as the first line of the file (this line should not be indented or have any other whitespace in front of it). This tells the kernel which interpreter to invoke, in this case, Bash. You could substitute the path to another interpreter for other types of scripts – for a Python script, #!/usr/bin/env python is the usual choice, since the interpreter’s location varies from system to system.

Next, I changed the file permissions to make the script executable:

chmod +x myscript
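
To see the whole cycle end to end, here’s a minimal, self-contained sketch (the script name myscript.sh and the temp directory are just placeholders): it writes a one-line script, marks it executable with chmod +x, and runs it.

```shell
# Write a throwaway script to a temp directory, make it executable,
# and run it -- the same create/chmod/run cycle described above.
tmpdir=$(mktemp -d)
cat > "$tmpdir/myscript.sh" <<'EOF'
#!/bin/bash
echo "script ran"
EOF
chmod +x "$tmpdir/myscript.sh"
output=$("$tmpdir/myscript.sh")
echo "$output"
rm -r "$tmpdir"
```

One gotcha: running a script sitting in your current directory requires the ./ prefix (./myscript.sh), because the current directory usually isn’t on $PATH.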

I separated the virus check commands so that I could test those out and make sure they were working as expected before delving into other script functions.

Here’s what my initial script looked like (comments are preceded by a #):

#!/bin/bash

#mount disk
sudo mount -o ro,loop,nodev,noexec,nosuid,noatime /mnt/transferfiles/2013_072/2013_072_DM0000000001/2013_072_DM0000000001.iso /mnt/iso

#check to verify mount
mountpoint -q /mnt/iso && echo "mounted" || echo "not mounted"

#call the Clam AV program to run the virus check
clamscan -r /mnt/iso > "/mnt/transferfiles/2013_072/2013_072_DM0000000001/2013_072_DM0000000001_scan_test.txt"

#unmount disk
sudo umount /mnt/iso

#check disk unmounted
mountpoint -q /mnt/iso && echo "mounted" || echo "not mounted"

All those options on the mount command? They give me the peace of mind that accessing the disk will in no way alter it (or affect the host server), thus preserving its authenticity as an archival object. You may also be wondering about the use of “&&” and “||”.  These function as conditional AND and OR operators, respectively. So “&&” tells the shell to run the first command, AND if that’s successful, to run the second command. Conversely, “||” tells the shell to run the first command, OR if that fails, to run the second command. So the mount check command can be read as: check to see if the directory at /mnt/iso is a mountpoint. If the mount is successful, then echo “mounted.” If it’s not, echo “not mounted.”
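
The short-circuit behavior is easy to try without mounting anything. This sketch swaps mountpoint for a plain directory test (it assumes /tmp exists and /no/such/dir does not):

```shell
# && runs the second command only if the first succeeds;
# || runs the second command only if the first fails.
result1=$([ -d /tmp ] && echo "mounted" || echo "not mounted")
result2=$([ -d /no/such/dir ] && echo "mounted" || echo "not mounted")
echo "$result1"   # the directory test succeeds, so the && branch fires
echo "$result2"   # the directory test fails, so the || branch fires
```

One caveat worth knowing: a && b || c isn’t a true if/else, because c also runs if b itself fails. For a simple echo like this, that difference doesn’t matter.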

Adding Variables

You may have noted that the script above only works on one disk image (2013_072_DM0000000001.iso), which isn’t very useful. I created variables for the accession number, the digital media number, and the file extension, since they all change depending on the disk image information. The file naming convention we use for disk images is consistent and strict. The top-level directory is the accession number. Within that, each disk image acquired from that accession is stored within its own directory, named using its assigned number. The disk image is then named by a combination of the accession number and the disk image number. Yes, it’s repetitive, but it keeps track of where things came from and links to data we have stored elsewhere. Given that these disks may not be processed for 20 years, such redundancy is preferred.

At first I thought the accession number, digital media number, and extension variables would be best entered at the initial run command; type in one line to run many commands. Each variable is separated by a space, the .iso at the end is the extension for an optical disk image file:

$ ./virus_check.sh 2013_072 DM0000000001 .iso

In Bash, arguments passed to a script are available as $1 for the first argument, $2 for the second, and so on. This actually tripped me up for a day or so. I initially thought the $1, $2, etc. variable names used by the book I was referencing were for examples only, and that the first variables I referenced in the script would automatically map in order, so if 2013_072 was the first argument and $accession was the first variable, $accession = 2013_072 (much like when you pass in a parameter to a Python function). Then I realized there was a reason that more than one reference book and/or site used the $1, $2, $3 system for variables passed in as command line arguments. I assigned each to its proper variable, and things were rolling again.
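
The mapping is easy to verify with a function, which receives positional parameters the same way a script does (the function name show_args is hypothetical):

```shell
# Positional parameters: $1, $2, $3 hold the arguments in order,
# and each must be explicitly assigned to a named variable.
show_args () {
    accession=$1
    digital_media=$2
    extension=$3
    echo "${accession} ${digital_media} ${extension}"
}
line=$(show_args 2013_072 DM0000000001 .iso)
echo "$line"
```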

#!/bin/bash

#assign command line variables
accession=$1
digital_media=$2
extension=$3

#mount disk
sudo mount -o ro,loop,nodev,noexec,nosuid,noatime /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} /mnt/iso

Note: variable names are often presented without curly braces; it’s recommended to place them in curly braces when adjacent to other strings.2
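
A quick demonstration of why the braces matter when a variable butts up against an underscore:

```shell
accession=2013_072
digital_media=DM0000000001
# Without braces, the shell reads "$accession_" as a variable named
# accession_ (underscores are legal in variable names), which is unset,
# so the accession number silently vanishes from the result.
unbraced="$accession_$digital_media"
braced="${accession}_${digital_media}"
echo "$unbraced"   # DM0000000001
echo "$braced"     # 2013_072_DM0000000001
```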

Reading Data

After testing the script a bit, I realized I didn’t like passing the variables in via the command line. I kept making typos, and it was annoying not to have any data validation done in a more interactive fashion. I reconfigured the script to prompt the user for input:

read -p "Please enter the accession number" accession

read -p "Please enter the digital media number" digital_media

read -p "Please enter the file extension, include the preceding period" extension

After reviewing some test options, I decided to test that $accession and $digital_media were valid directories, and that the combo of all three variables was a valid file. This test seems more conclusive than simply testing whether or not the variables fit the naming criteria, but it does mean that if the data entered is invalid, the feedback given to the user is limited. I’m considering adding tests for naming criteria as well, so that the user knows when the error is due to a typo vs. a non-existent directory or file. I also didn’t want the code to simply quit when one of the variables is invalid – that’s not very user-friendly. I decided to ask the user for input until valid input was received.

read -p "Please enter the accession number" accession
until [ -d /mnt/transferfiles/${accession} ]; do
     read -p "Invalid. Please enter the accession number." accession
done

read -p "Please enter the digital media number" digital_media
until [ -d /mnt/transferfiles/${accession}/${accession}_${digital_media} ]; do
     read -p "Invalid. Please enter the digital media number." digital_media
done

read -p  "Please enter the file extension, include the preceding period" extension
until [ -e /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} ]; do
     read -p "Invalid. Please enter the file extension, including the preceding period" extension
done
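
For the naming-criteria tests I’m considering, Bash’s [[ =~ ]] regex matching would be one way to tell a typo apart from a missing directory. The patterns below are assumptions based on the examples in this post (a four-digit year, underscore, and three digits for the accession; “DM” plus ten digits for the media ID), so they’d need adjusting to the real conventions:

```shell
# Hypothetical format checks; each function returns success (0) on a
# match and failure (1) otherwise, so they compose with until loops.
valid_accession () {
    [[ $1 =~ ^[0-9]{4}_[0-9]{3}$ ]]
}
valid_media () {
    [[ $1 =~ ^DM[0-9]{10}$ ]]
}
valid_accession "2013_072" && echo "accession format ok"
valid_media "DM0000000001" && echo "media format ok"
valid_media "DM123" || echo "media format rejected"
```

A prompt loop could then tell the user “that doesn’t look like an accession number” when the format check fails, before the directory test is even reached.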

Creating Functions

You may have noted that the command used to test if a disk is mounted or not is called twice. This is done on purpose, as it was a test I found helpful when running the virus checks; the virus check runs and outputs a report even if the disk isn’t mounted. Occasionally disks won’t mount for various reasons. In such cases, the resulting report will state that it scanned no data, which is confusing because the disk itself possibly could have contained no data. Testing if it’s mounted eliminates that confusion. The command is repeated after the disk has been unmounted, mainly because I found it easy to forget the unmount step, and testing helps reinforce good behavior. Given that the command is repeated twice, it makes sense to make it a function rather than duplicate it.

check_mount () {
     #checks to see if disk is mounted or not
     mountpoint -q /mnt/iso && echo "mounted" || echo "not mounted"
}

Lastly, I created a function for the input variables. I’m sure there are prettier, more concise ways of writing this function, but since it’s still being refined and I’m still learning Bash scripting, I decided to leave it for now. I did want it placed in its own function because I’m planning to add additional code that will notify me if the virus check is positive and exit the program, or if it’s negative, bag the disk image and corresponding files, and copy them over to another server where they’ll wait for further processing.

get_image () {
     #gets data from user, validates it    
     read -p "Please enter the accession number" accession
     until [ -d /mnt/transferfiles/${accession} ]; do
          read -p "Invalid. Please enter the accession number." accession
     done

     read -p "Please enter the digital media number" digital_media
     until [ -d /mnt/transferfiles/${accession}/${accession}_${digital_media} ]; do
          read -p "Invalid. Please enter the digital media number." digital_media
     done

     read -p  "Please enter the file extension, include the preceding period" extension
     until [ -e /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} ]; do
          read -p "Invalid. Please enter the file extension, including the preceding period" extension
     done
}

Resulting (but not final!) Script

#!/bin/bash
#takes accession number, digital media number, and extension as variables to mount a disk image and run a virus check

check_mount () {
     #checks to see if disk is mounted or not
     mountpoint -q /mnt/iso && echo "mounted" || echo "not mounted"
}

get_image () {
     #gets disk image data from user, validates info
     read -p "Please enter the accession number" accession
     until [ -d /mnt/transferfiles/${accession} ]; do
          read -p "Invalid. Please enter the accession number." accession
     done

     read -p "Please enter the digital media number" digital_media
     until [ -d /mnt/transferfiles/${accession}/${accession}_${digital_media} ]; do
          read -p "Invalid. Please enter the digital media number." digital_media
     done

     read -p  "Please enter the file extension, include the preceding period" extension
     until [ -e /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} ]; do
          read -p "Invalid. Please enter the file extension, including the preceding period" extension
     done
}

get_image

#mount disk
sudo mount -o ro,loop,nodev,noexec,nosuid,noatime /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} /mnt/iso

check_mount

#run virus check
sudo clamscan -r /mnt/iso > "/mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}_scan_test.txt"

#unmount disk
sudo umount /mnt/iso

check_mount

Conclusion

There’s a lot more I’d like to do with this script. In addition to what I’ve already mentioned, I’d love to enable it to run over a range of digital media numbers, since they often are sequential. It also doesn’t stop if the disk isn’t mounted, which is an issue. But I thought it served as a good example of how easy it is to take repetitive command line tasks and turn them into a script. Next time, I’ll write about the second phase of development, which will include combining this script with another one, virus scan reporting, bagging, and transfer to another server.
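
As a sketch of the range idea, sequential media IDs can be generated with printf’s zero-padding and looped over with seq. The bounds and the echo placeholder here are hypothetical, standing in for the mount/scan/unmount steps:

```shell
# Build 10-digit media IDs DM0000000001 through DM0000000003 and loop
# over them; a real run would mount, scan, and unmount each image.
first=1
last=3
ids=""
for n in $(seq "$first" "$last"); do
    digital_media=$(printf 'DM%010d' "$n")
    ids="$ids $digital_media"
    echo "would process ${digital_media}"
done
```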

Suggested References

An A-Z Index of the Bash command line for Linux

The books I used, both are good for basic command line work, but they only include a small section for actual scripting:

Barrett, Daniel J., Linux Pocket Guide. O’Reilly Media, 2004.

Shotts, Jr., William E. The Linux Command Line: A complete introduction. no starch press. 2012.

The book I wished I used:

Robbins, Arnold and Beebe, Nelson H. F. Classic Shell Scripting. O’Reilly Media, 2005.

Notes

  1. Logical file transfers often arrive in bags, which are then validated and virus checked.
  2. Linux Pocket Guide