The Tools Behind Doge Decimal Classification

Four months ago, I made Doge Decimal Classification, a little website that translates Dewey Decimal class names into “Doge” speak à la the meme. It’s a silly site, for sure, but I took the opportunity to learn several new tools. This post will briefly detail each and provide resources for learning more.

Node & NPM

Because I’m curious about the framework, I chose to write Doge Decimal Classification in Node.js. Node is a relatively recent programming framework that expands the capabilities of JavaScript; rather than running in a browser, Node lets you write server-side software or command line utilities.

For Doge Decimal Classification (henceforth just DDC because, let’s face it, it’s pretty much obsoleted Dewey at this point), I wanted to write two pieces: a module and a website. A module is a reusable piece of code which provides some functionality; in this case, doge-speak Dewey class names. My website is a consumer of that module, it takes the module’s output and dresses it up in Comic Sans and bright colors. By dividing the two pieces a bit, I can work on them separately and perhaps reuse the DDC module elsewhere, for instance as a command-line tool.

For the module, I needed to write a NPM package, which turned out to be fairly straightforward. NPM is Node’s package manager, it provides a central repository for thousands of modules. Skipping over the boring part where I write code that actually does something, an NPM package is defined by a package.json file that contains metadata. That metadata tells NPM how users can find, install, and use my module. You can read my module’s package.json and some of its meaning may be self-evident, but here are a few pieces worth explaining:

  • the main field determines what the entry point of my module is, so when another app uses it (which, in Node, is done via a line like “var ddc = require('dogedc');") Node knows to load and execute a certain file
  • the dependencies field lists any packages which my package uses (it just so happens I didn’t use any) while devDependencies contains packages which anyone who wanted to work on developing the dogedc module itself would need (e.g. to run tests or automate other tedious tasks)
  • the repository field tells NPM what version control software I’m using and where to download the source code for my package (in this case, on GitHub)

Once I had my running code and package.json file in place, all I needed to do was run npm publish inside my project and it was published on NPM for anyone to install. Magic! When I update dogedc, either by adding new features or squashing bugs, all I need to do is run npm publish once again.

One final nicety of NPM that’s worth describing is the npm link command. While I’m working on my module, I want to be able to use it in the website, but also develop new features and fiddle with it in other contexts. It doesn’t make much sense to repeatedly install it from NPM everywhere that I use it, especially when I’m debugging new code which I don’t want to publish yet. Instead, npm link tells NPM to use my local copy of dogedc as if it was installed globally from NPM. This lets me preview the experience of someone consuming the latest version of my module without actually pushing that version out to the world.

Learn more: What is Node.js & why do I care? , How to create a NodeJS NPM package, NPM Developer Guide, & How do I get started with Node.js

Doge-ification

Now for perhaps the only amusing part of this post: how I went about translating Dewey Decimal classes into doge. The doge meme consists almost entirely of two-word phrases along the lines of “Much zombies. Such death. So amaze” (to quote from a Walking Dead themed meme). The only common one-word phrase is “wow” which is sprinkled liberally in most Doge images.

Reading through a few memes, the approach to taking a class name and turning it into doge becomes clear: split the name into a series of two-word doge phrases with a random adjective inserted before each noun. So “Extraterrestial Worlds” (999) becomes something like “Many extraterrestial. Much worlds.” Because the meme ignores most grammar conventions, we don’t need to worry about mismatching count and noncount nouns with appropriate adjectives. So, for instance, even though “worlds” is a count noun (you can say “one world, two worlds”) we can use the noncount adjective “much” to modify it.

There are a few more steps we need to take to create proper doge phrases. First of all, how do we know what adjectives to use? After viewing a bunch of memes, I decided on a list of the most commonly used ones: many, much, so, such, and very. Secondly, our simple procedure might work on “Extraterrestial Worlds” just fine, but what about “Library and Information Sciences”? The resulting “Many library. Much and. So information. Such sciences.” doesn’t look quite right. We need to strip out small words like conjunctions because they’re not used in doge.

So the final algorithm resembles:

  • Make the entire class name lowercase
  • Strip out stop words and punctuation
  • Split the string into an array on spaces (in JavaScript this is “array.split(" ")")
  • Loop over the array, each time adding a Doge word in front of the current word and then a period
  • Flip a coin to decide whether or not to add “Wow” on the end when we’re done

There are still a few weaknesses here. Most notably, if we use an unassigned class number like 992 the phrase “Not assigned or no longer used” comes out terribly (“Much not. Many assigned. So no. Much longer. Very used.”) simply because it’s a more traditional sentence, not a noun-laden class name. To write this algorithm in a robust way, one would have to incorporate natural language processing to identify which words should be stripped out, or perhaps be more thoughtful about which adjectives modify which nouns. That seems like a fun project for another day, but it was too much for my DDC module.

Learn more: The article A Linguist Explains the Grammar of Doge. Wow was useful in understanding doge.

Express

With my module in hand, I chose to use Express for the foundation of my site because it’s the most popular web framework for Node right now. Express is similar to Ruby on Rails or Django for Python. 1 Express isn’t a CMS like WordPress or Drupal; you can’t just install it and have a running blog working in a few minutes. Instead, Express provides tools and conventions for writing a web app. It scaffolds you over some of the busywork involved but does not do everything for you.

To get started with Express, you can generate a basic template for a site with a command line tool courtesy of NPM:

$ # first we install this tool globally (the -g flag)
$ npm install -g express-generator
$ # creates a ddc folder & puts the site template in it
$ express ddc

   create : ddc
   create : ddc/package.json
   create : ddc/app.js
   …a whole bunch more of this…

   install dependencies:
     $ cd ddc && npm install

   run the app:
     $ DEBUG=ddc ./bin/www
$ # just following the instructions above…
$ cd ddc && npm install

Since the Doge Decimal Classification site is simplistic, the initial template that Express provides created about 80% of the final code for the site. Mostly by looking through the generated project’s structure, I figured out what I needed to change and where I should make edits, such as the stark CSS that comes with a new Express site.

There’s a lot to Express so I won’t cover the framework in detail, but to give a short example of how it works I’ll discuss routing. Routing is the practice of determining what content to serve when someone visits a particular URL. Express organizes routing into two parts, one of which occurs in an “app.js” file and the other inside a “routes” directory. Here’s an example:

app.js:

// in app.js, I tell Express which routes to use for
// certain requests. Below means "for GET requests to
// the web site's root, use index"
app.get("/", routes.index);
// for GET requests anywhere else, use the "number" route
// ":number" becomes a special token I can use later
app.get("/:number", routes.number);

routes/index.js:

// inside my routes/index.js folder, I define routes.
// The "request" and "response" parameters below
// correspond to the users' request and what I choose
// to send back to them
exports.index = function(request, response){
        response.render("index", {
            // a random number from 0 to 999
            number: Math.floor(Math.random()*1000)
        });
    });
};

exports.number = function(request, response){
     // here req.params.number is that ":number"
     // from app.js above
        response.render("index", {
            number: request.params.number
        });
    });
};

Routing is how information from a user’s request—such as visiting dogedc.herokuapp.com/020—gets passed into my website and used to generate specific content. I use it to render the specific class number that someone chooses to put onto the URL, but you can imagine better use cases like delivering a user’s profile with a route like “app.get('/users/:user', routes.profile)“. There’s another piece here in how that “number” information gets used to generate HTML, but we’ve talked about Express enough.

Learn more: The Dead-Simple Step-by-Step Guide for Front-End Developers to Getting Up and Running with Node.JS, Express, Jade, and MongoDB takes you through a more involved example of using Express with the popular MongoDB database. Introduction to Express by the ever-excellent NetTuts is also a decent intro. 2

Social Media <meta>s

In my website, I also included Facebook Open Graph and Twitter Card metadata using HTML meta tags. These ensure that I have greater control over how Facebook, Twitter, and other social media platforms present DDC. For instance, someone could share a link to the site in Twitter without including my username. With Twitter cards, I can associate the site with my account, which causes my name to display beside it. In general, the link looks better; an image and tagline I specify also appear. Compare the two tweets in the screenshot below, one which links to a Tumblr (which doesn’t have my <meta> tags on it) and one where Twitter Cards are in effect:

2 tweets about DDC

Learn more: Gearing Up Your Sites for Sharing with Twitter & Facebook Meta Tags

Heroku

There’s one final piece to the DDC site: deployment and hosting. I’ve written the web app and thanks to Express I can even run it locally with a simple npm start command from inside its directory. But how do I share it with the world?

Since I chose to write Doge Decimal Classification in Node, my hosting options were limited. Node is a new technology and it’s not available on everywhere. If I really wanted something easy to publish, I would’ve used PHP or pure client-side JavaScript. But I purposefully set out to learn new tools and that’s in large part why I chose to host my app on Heroku.

Heroku is a popular cloud hosting platform. It’s oriented around a suite of command line tools which push new versions of your website to the cloud, as opposed to a traditional web host which might use FTP, a web-based administrative dashboard like Plesk or cPanel, or SSH to provide access to a remote server. Instead, Heroku builds upon the fact that most developers are already versioning their apps with git. Rather than add another new technology to the stack that you have to learn, Heroku works via a smart integration with git.

Once you have the Heroku command line tools installed, all you need to do is write a short “Procfile” which tells Heroku how to start your app (my Procfile is a single line, “web: npm start”). From there, you can run heroku create to start your app and then git push heroku master to push your code live. This makes publishing an app a cinch and Heroku has a free tier which has suited my simple site just fine. If we wanted to scale up, however, Heroku makes increasing the power of our server a matter of running a single command.

Learn more: Getting Started with Heroku; there’s a more specific article for starting a Node project

In Conclusion

There are actually a couple more things I learned in writing Doge Decimal Classification, such as writing unit tests with Nodeunit and using Jade for templates, but this post has gone on long enough. While building DDC, there were many different technologies I could’ve chosen to utilize rather than the ones I selected here. Why Node and Express over Ruby and Sinatra? Because I felt like it. There are entirely too many programming languages, frameworks, and tools for any one person to become familiar with them all. What’s more, I’m not sure when I’ll get a chance to these tools again, as they’re not common to any of the library technology I work with. But I found using a fun side project to level up in new areas much worthwhile, very enjoy.

Notes

  1. Technically, Rails and Django are more full-featured than Express, which is perhaps more similar to the smaller Sinatra or Flask frameworks for Ruby and Python respectively. The general idea of all these frameworks is the same, though; they provide useful functionality for writing web apps in a particular programming language. Some are just more “batteries included” than others.
  2. There are many more fine Express tutorials out there, but try to stick to recent ones; Express is being developed at a rapid pace and chances are if you run npm install express at the beginning of a tutorial, you’ll end up with a more current version than the one being covered, with subtle but important differences. Be cognizant of that possibility and install the appropriate version by running npm install express@3.3.3 (if 3.3.3 is the version you want).

Websockets For Real-time And Interactive Interfaces

TL;DR WebSockets allows the server to push up-to-date information to the browser without the browser making a new request. Watch the videos below to see the cool things WebSockets enables.

Real-Time Technologies

You are on a Web page. You click on a link and you wait for a new page to load. Then you click on another link and wait again. It may only be a second or a few seconds before the new page loads after each click, but it still feels like it takes way too long for each page to load. The browser always has to make a request and the server gives a response. This client-server architecture is part of what has made the Web such a success, but it is also a limitation of how HTTP works. Browser request, server response, browser request, server response….

But what if you need a page to provide up-to-the-moment information? Reloading the page for new information is not very efficient. What if you need to create a chat interface or to collaborate on a document in real-time? HTTP alone does not work so well in these cases. When a server gets updated information, HTTP provides no mechanism to push that message to clients that need it. This is a problem because you want to get information about a change in chat or a document as soon as it happens. Any kind of lag can disrupt the flow of the conversation or slow down the editing process.

Think about when you are tracking a package you are waiting for. You may have to keep reloading the page for some time until there is any updated information. You are basically manually polling the server for updates. Using XMLHttpRequest (XHR) (also commonly known as Ajax) has been a popular way to try to work around the limitations of HTTP somewhat. After the initial page load, JavaScript can be used to poll the server for any updated information without user intervention.

Using JavaScript in this way you can still use normal HTTP and almost simulate getting a real-time feed of data from the server. After the initial request for the page, JavaScript can repeatedly ask the server for updated information. The browser client still makes a request and the server responds, and the request can be repeated. Because this cycle is all done with JavaScript it does not require user input, does not result in a full page reload, and the amount of data which is returned from the server can be minimal. In the case where there is no new data to return, the server can just respond with something like, “Sorry. No new data. Try again.” And then the browser repeats the polling–tries again and again until there is some new data to update the page. And then goes back to polling again.

This kind of polling has been implemented in many different ways, but all polling methods still have some queuing latency. Queuing latency is the time a message has to wait on the server before it can be delivered to the client. Until recently there has not been a standardized, widely implemented way for the server to send messages to a browser client as soon as an event happens. The server would always have to sit on the information until the client made a request. But there are a couple of standards that do allow the server to send messages to the browser without having to wait for the client to make a new request.

Server Sent Events (aka EventSource) is one such standard. Once the client initiates the connection with a handshake, Server Sent Events allows the server to continue to stream data to the browser. This is a true push technology. The limitation is that only the server can send data over this channel. In order for the browser to send any data to the server, the browser would still need to make an Ajax/XHR request. EventSource also lacks support even in some recent browsers like IE11.

WebSockets allows for full-duplex communication between the client and the server. The client does not have to open up a new connection to send a message to the server which saves on some overhead. When the server has new data it does not have to wait for a request from the client and can send messages immediately to the client over the same connection. Client and server can even be sending messages to each other at the same time. WebSockets is a better option for applications like chat or collaborative editing because the communication channel is bidirectional and always open. While there are other kinds of latency involved here, WebSockets solves the problem of queuing latency. Removing this latency concern is what is meant by WebSockets being a real-time technology. Current browsers have good support for WebSockets.

Using WebSockets solves some real problems on the Web, but how might libraries, archives, and museums use them? I am going to share details of a couple applications from my work at NCSU Libraries.

Digital Collections Now!

When Google Analytics first turned on real-time reporting it was mesmerizing. I could see what resources on the NCSU Libraries’ Rare and Unique Digital Collections site were being viewed at the exact moment they were being viewed. Or rather I could view the URL for the resource being viewed. I happened to notice that there would sometimes be multiple people viewing the same resource at the same time. This gave me some hint that today someone’s social share or forum post was getting a lot of click throughs right now. Or sometimes there would be a story in the news and we had an image of one of the people involved. I could then follow up and see examples of where we were being effective with search engine optimization.

The Rare & Unique site has a lot of visual resources like photographs and architectural drawings. I wanted to see the actual images that were being viewed. The problem, though, was that Google Analytics does not have an easy way to click through from a URL to the resource on your site. I would have to retype the URL, copy and paste the part of the URL path, or do a search for the resource identifier. I just wanted to see the images now. (OK, this first use case was admittedly driven by one of the great virtues of a programmer–laziness.)

My first attempt at this was to create a page that would show the resources which had been viewed most frequently in the past day and past week. To enable this functionality, I added some custom logging that is saved to a database. Every view of every resource would just get a little tick mark that would be tallied up occasionally. These pages showing the popular resources of the moment are then regenerated every hour.

It was not a real-time view of activity, but it was easy to implement and it did answer a lot of questions for me about what was most popular. Some images are regularly in the group of the most-viewed images. I learned that people often visit the image of the NC State men’s basketball 1983 team roster which went on to win the NCAA tournament. People also seem to really like the indoor pool at the Biltmore estate.

Really Real-Time

Now that I had this logging in place I set about to make it really real-time. I wanted to see the actual images being viewed at that moment by a real user. I wanted to serve up a single page and have it be updated in real-time with what is being viewed. And this is where the persistent communication channel of WebSockets came in. WebSockets allows the server to immediately send these updates to the page to be displayed.

People have told me they find this real-time view to be addictive. I found it to be useful. I have discovered images I never would have seen or even known how to search for before. At least for me this has been an effective form of serendipitous discovery. I also have a better sense of what different traffic volume actually feels like on good day. You too can see what folks are viewing in real-time now. And I have written up some more details on how this is all wired up together.

The Hunt Library Video Walls

I also used WebSockets to create interactive interfaces on the the Hunt Library video walls. The Hunt Library has five large video walls created with Cristie MicroTiles. These very large displays each have their own affordances based on the technologies in the space and the architecture. The Art Wall is above the single service point just inside the entrance of the library and is visible from outside the doors on that level. The Commons Wall is in front of a set of stairs that also function as coliseum-like seating. The Game Lab is within a closed space and already set up with various game consoles.


Listen to Wikipedia

When I saw and heard the visualization and sonification Listen to Wikipedia, I thought it would be perfect for the iPearl Immersion Theater. Listen to Wikipedia visualizes and sonifies data from the stream of edits on Wikipedia. The size of the bubbles is determined by the size of the change to an entry, and the sound changes in pitch based on the size of the edit. Green circles show edits from unregistered contributors, and purple circles mark edits performed by automated bots. (These automated bots are sometimes used to integrate library data into Wikipedia.) A bell signals an addition to an entry. A string pluck is a subtraction. New users are announced with a string swell.

The original Listen to Wikipedia (L2W) is a good example of the use of WebSockets for real-time displays. Wikipedia publishes all edits for every language into IRC channels. A bot called wikimon monitors each of the Wikipedia IRC channels and watches for edits. The bot then forwards the information about the edits over WebSockets to the browser clients on the Listen to Wikipedia page. The browser then takes those WebSocket messages and uses the data to create the visualization and sonification.

As you walk into the Hunt Library almost all traffic goes past the iPearl Immersion Theater. The one feature that made this space perfect for Listen to Wikipedia was that it has sound and, depending on your tastes, L2W can create pleasant ambient sounds1. I began by adjusting the CSS styling so that the page would fit the large. Besides setting the width and height, I adjusted the size of the fonts. I added some text to a panel on the right explaining what folks are seeing and hearing. On the left is now text asking passersby to interact with the wall and the list of languages currently being watched for updates.

One feature of the original L2W that we wanted to keep was the ability to change which languages are being monitored and visualized. Each language can individually be turned off and on. During peak times the English Wikipedia alone can sound cacophonous. An active bot can make lots of edits of all roughly similar sizes. You can also turn off or on changes to Wikidata which collects structured data that can support Wikipedia entries. Having only a few of the less frequently edited languages on can result in moments of silence punctuated by a single little dot and small bell sound.

We wanted to keep the ability to change the experience and actually get a feel for the torrent or trickle of Wikipedia edits and allow folks to explore what that might mean. We currently have no input device for directly interacting with the Immersion Theater wall. For L2W the solution was to allow folks to bring their own devices to act as a remote control. We encourage passersby to interact with the wall with a prominent message. On the wall we show the URL to the remote control. We also display a QR code version of the URL. To prevent someone in New Zealand from controlling the Hunt Library wall in Raleigh, NC, we use a short-lived, three-character token.

Because we were uncertain how best to allow a visitor to kick off an interaction, we included both a URL and QR code. They each have slightly different URLs so that we can track use. We were surprised to find that most of the interactions began with scanning the QR code. Currently 78% of interactions begin with the QR code. We suspect that we could increase the number of visitors interacting with the wall if there were other simpler ways to begin the interaction. For bring-your-own-device remote controls we are interested in how we might use technologies like Bluetooth Low Energy within the building for a variety of interactions with the surroundings and our services.

The remote control Web page is a list of big checkboxes next to each of the languages. Clicking on one of the languages turns its stream on or off on the wall (connects or disconnects one of the WebSockets channels the wall is listening on). The change happens almost immediately with the wall showing a message and removing or adding the name of the language from a side panel. We wanted this to be at least as quick as the remote control on your TV at home.

The quick interaction is possible because of WebSockets. Both the browser page on the wall and the remote control client listen on another WebSockets channel for such messages. This means that as soon as the remote control sends a message to the server it can be sent immediately to the wall and the change reflected. If the wall were using polling to get changes, then there would potentially be more latency before a change registered on the wall. The remote control client also uses WebSockets to listen on a channel waiting for updates. This allows feedback to be displayed to the user once the change has actually been made. This feedback loop communication happens over WebSockets.

Having the remote control listen for messages from the server also serves another purpose. If more than one person enters the space to control the wall, what is the correct way to handle that situation? If there are two users, how do you accurately represent the current state on the wall for both users? Maybe once the first user begins controlling the wall it locks out other users. This would work, but then how long do you lock others out? It could be frustrating for a user to have launched their QR code reader, lined up the QR code in their camera, and scanned it only to find that they are locked out and unable to control the wall. What I chose to do instead was to have every message of every change go via WebSockets to every connected remote control. In this way it is easy to keep the remote controls synchronized. Every change on one remote control is quickly reflected on every other remote control instance. This prevents most cases where the remote controls might get out of sync. While there is still the possibility of a race condition, it becomes less likely with the real-time connection and is harmless. Besides not having to lock anyone out, it also seems like a lot more fun to notice that others are controlling things as well–maybe it even makes the experience a bit more social. (Although, can you imagine how awful it would be if everyone had their own TV remote at home?)

I also thought it was important for something like an interactive exhibit around Wikipedia data to provide the user some way to read the entries. From the remote control the user can get to a page which lists the same stream of edits that are shown on the wall. The page shows the title for the most recently edited entry at the top of the page and pushes others down the page. The titles link to the current revision for that page. This page just listens to the same WebSockets channels as the wall does, so the changes appear on the wall and remote control at the same time. Sometimes the stream of edits can be so fast that it is impossible to click on an interesting entry. A button allows the user to pause the stream. When an intriguing title appears on the wall or there is a large edit to a page, the viewer can pause the stream, find the title, and click through to the article.

The reaction from students and visitors has been fun to watch. The enthusiasm has had unexpected consequences. For instance one day we were testing L2W on the wall and noting what adjustments we would want to make to the design. A student came in and sat down to watch. At one point they opened up their laptop and deleted a large portion of a Wikipedia article just to see how large the bubble on the wall would be. Fortunately the edit was quickly reverted.

We have also seen the L2W exhibit pop up on social media. This Instagram video was posted with the comment, “Reasons why I should come to the library more often. #huntlibrary.”

This is people editing–Oh, someone just edited Home Alone–editing Wikipedia in this exact moment.

The original Listen to Wikipedia is open source. I have also made the source code for the Listen to Wikipedia exhibit and remote control application available. You would likely need to change the styling to fit whatever display you have.

Other Examples

I have also used WebSockets for some other fun projects. The Hunt Library Visualization Wall has a unique columnar design, and I used it to present images and video from our digital special collections in a way that allows users to change the exhibit. For the Code4Lib talk this post is based on, I developed a template for creating slide decks that include audience participation and synchronized notes via WebSockets.

Conclusion

The Web is now a better development platform for creating real-time and interactive interfaces. WebSockets provides the means for sending real-time messages between servers, browser clients, and other devices. This opens up new possibilities for what libraries, archives, and museums can do to provide up to the moment data feeds and to create engaging interactive interfaces using Web technologies.

If you would like more technical information about WebSockets and these projects, please see the materials from my Code4Lib 2014 talk (including speaker notes) and some notes on the services and libraries I have used. There you will also find a post with answers to the (serious) questions I was asked during the Code4Lib presentation. I’ve also posted some thoughts on designing for large video walls.

Thanks: Special thanks to Mike Nutt, Brian Dietz, Yairon Martinez, Alisa Katz, Brent Brafford, and Shirley Rodgers for their help with making these projects a reality.


About Our Guest Author: Jason Ronallo is the Associate Head of Digital Library Initiatives at NCSU Libraries and a Web developer. He has worked on lots of other interesting projects. He occasionally writes for his own blog Preliminary Inventory of Digital Collections.

Notes

  1. Though honestly Listen to Wikipedia drove me crazy listening to it so much as I was developing the Immersion Theater display.

Analyzing Usage Logs with OpenRefine

Background

Like a lot of librarians, I have access to a lot of data, and sometimes no idea how to analyze it. When I learned about linked data and the ability to search against data sources with a piece of software called OpenRefine, I wondered if it would be possible to match our users’ discovery layer queries against the Library of Congress Subject Headings. From there I could use the linking in LCSH to find the Library of Congress Classification, and then get an overall picture of the subjects our users were searching for. As with many research projects, it didn’t really turn out like I anticipated, but it did open further areas of research.

At California State University, Fullerton, we use an open source application called Xerxes, developed by David Walker at the CSU Chancellor’s Office, in combination with the Summon API. Xerxes acts as an interface for any number of search tools, including Solr, federated search engines, and most of the major discovery service vendors. We call it the Basic Search, and it’s incredibly popular with students, with over 100,000 searches a month and growing. It’s also well-liked – in a survey, about 90% of users said they found what they were looking for. We have monthly files of our users’ queries, so I had all of the data I needed to go exploring with OpenRefine.

OpenRefine

OpenRefine is an open source tool that deals with data in a very different way than typical spreadsheets. It has been mentioned in TechConnect before, and Margaret Heller’s post, “A Librarian’s Guide to OpenRefine” provides an excellent summary and introduction. More resources are also available on Github.

One of the most powerful things OpenRefine does is to allow queries against open data sets through a function called reconciliation. In the open data world, reconciliation refers to matching the same concept among different data sets, although in this case we are matching unknown entities against “a well-known set of reference identifiers” (Re-using Cool URIs: Entity Reconciliation Against LOD Hubs).

Reconciling Against LCSH

In this case, we’re reconciling our discovery layer search queries with LCSH. This basically means it’s trying to match the entire user query (e.g. “artist” or “cost of assisted suicide”) against what’s included in the LCSH linked open data. According to the LCSH website this includes “all Library of Congress Subject Headings, free-floating subdivisions (topical and form), Genre/Form headings, Children’s (AC) headings, and validation strings* for which authority records have been created. The content includes a few name headings (personal and corporate), such as William Shakespeare, Jesus Christ, and Harvard University, and geographic headings that are added to LCSH as they are needed to establish subdivisions, provide a pattern for subdivision practice, or provide reference structure for other terms.”

I used the directions at Free Your Metadata to point me in the right direction. One note: the steps below apply to OpenRefine 2.5 and version 0.8 of the RDF extension. OpenRefine 2.6 requires version 0.9 of the RDF extension. Or you could use LODRefine, which bundles some major extensions and I hear is great, but personally haven’t tried. The basic process shouldn’t change too much.

(1) Import your data

OpenRefine has quite a few file type options, so your format is likely already supported.

 Screenshot of importing data

(2) Clean your data

In my case, this involves deduplicating by timestamp and removing leading and trailing whitespaces. You can also remove weird punctuation, numbers, and even extremely short queries (<2 characters).

(3) Add the RDF extension.

If you’ve done it correctly, you should see an RDF dropdown next to Freebase.

Screenshot of correctly installed RDF extension

(4) Decide which data you’d like to search on.

In this example, I’ve decided to use just queries that are less than or equal to four words, and removed duplicate search queries. (Xerxes handles facet clicks as if they were separate searches, so there are many duplicates. I usually don’t, though, unless they happen at nearly the same time). I’ve also experimented with limiting to 10 or 15 characters, but there were not many more matches with 15 characters than 10, even though the data set was much larger. It depends on how much computing time you want to spend…it’s really a personal choice. In this case, I chose 4 words because of my experience with 15 characters – longer does not necessarily translate into more matches. A cursory glance at LCSH left me with the impression that the vast majority of headings (not including subdivisions, since they’d be searched individually) were 4 words or less. This, of course, means that your data with more than 4 words is unusable – more on that later.

Screenshot of adding a column based on word count using ngrams

(5) Go!

Shows OpenRefine reconciling

(6) Now you have your queries that were reconciled against LCSH, so you can limit to just those.

Screenshot of limiting to reconciled queries

Finding LC Classification

First, you’ll need to extract the cell.recon.match.id – the ID for the matched query that in the case of LCSH is the URI of the concept.

Screenshot of using cell.recon.match.id to get URI of concept

At this point you can choose whether to grab the HTML or the JSON, and create a new column based on this one by fetching URLs. I’ve never been able to get the parseJson() function to work correctly with LC’s JSON outputs, so for both HTML and JSON I’ve just regexed the raw output to isolate the classification. For more on regex see Bohyun Kim’s previous TechConnect post, “Fear No Longer Regular Expressions.”

On the raw HTML, the easiest way to do it is to transform the cells or create a new column with:

replace(partition(value,/<li property=”madsrdf:classification”>(<[^>]+>)*([A-Z]{1,2})/)[1],/<li property=”madsrdf:classification”>(<[^>]+>)*([A-Z]{1,2})/,”$2″).

Screenshot of using regex to get classification

You’ll note this will only pull out the first classification given, even if some have multiple classifications. That was a conscious choice for me, but obviously your needs may vary.

(Also, although I’m only concentrating on classification for this project, there’s a huge amount of data that you could work with – you can see an example URI for Acting to see all of the different fields).

Once you have the classifications, you can export to Excel and create a pivot table to count the instances of each, and you get a pretty table.

Table of LC Classifications

Caveats & Further Explorations

As you can guess by the y-axis in the table above, the number of matches is a very small percentage of actual searches. First I limited to keyword searches (as opposed to title/subject), then of those only ones that were 4 or fewer words long (about 65% of keyword searches). Of those, only about 1000 of the 26000 queries matched, and resulted in about 360 actual LC Classifications. Most months I average around 500, but in this example I took out duplicates even if they were far apart in time, just to experiment.

One thing I haven’t done but am considering is allowing matches that aren’t 100%. From my example above, there are another 600 or so queries that matched at 50-99%. This could significantly increase the number of matches and thus give us more classifications to work with.

Some of this is related to the types of searches that students are doing (see Michael J DeMars’ and my presentation “Making Data Less Daunting” at Electronic Resources & Libraries 2014, which this article grew out of, for some crazy examples) and some to the way that LCSH is structured. I chose LCSH because I could get linked to the LC Classification and thus get a sense of the subjects, but I’m definitely open to ideas. If you know of a better linked data source, I’m all ears.

I must also note that this is a pretty inefficient way of matching against LCSH. If you know of a way I could download the entire set, I’m interested in investigating that way as well.

Another approach that I will explore is moving away from reconciliation with LCSH (which is really more appropriate for a controlled vocabulary) to named-entity extraction, which takes natural language inputs and tries to recognize or extract common concepts (name, place, etc). Here I would use it as a first step before trying to match against LCSH. Free Your Metadata has a new named-entity extraction extension for OpenRefine, so I’ll definitely explore that option.

Planned Research

In the end, although this is interesting, does it actually mean anything? My next step with this dataset is to take a subset of the search queries and assign classification numbers. Over the course of several months, I hope to see if what I’ve pulled in automatically resembles the hand-classified data, and then draw conclusions.

So far, most of the peaks are expected – psychology and nursing are quite strong departments. There are some surprises though – education has been consistently underrepresented, based on both our enrollment numbers and when you do word counts (see our presentation for one month’s top word counts). Education students have a robust information literacy program. Does this mean that education students do complex searches that don’t match LCSH? Do they mostly use subject databases? Once again, an area for future research, should these automatic results match the classifications I do by hand.

What do you think? I’d love to hear your feedback or suggestions.

About Our Guest Author

Jaclyn Bedoya has lived and worked on three continents, although currently she’s an ER Librarian at CSU Fullerton. It turns out that growing up in Southern California spoils you, and she’s happiest being back where there are 300 days of sunshine a year. Also Disneyland. Reach her @spamgirl on Twitter or jaclynbedoya@gmail.com


My First Hackathon & WikipeDPLA

Almost two months ago, I attended my first hackathon during ALA’s Midwinter Meeting. Libhack was coordinated by the Library Code Year Interest Group. Much credit is due to coordinators Zach Coble, Emily Flynn, Jesse Saunders, and Chris Strauber. The University of Pennsylvania graciously hosted the event in their Van Pelt Library.

What’s a hackathon? It’s a short event, usually a day or two, wherein coders and other folks get together to produce software. Hackathons typically work on a particular problem, application, or API (a source of structured data). LibHack focused on APIs from two major library organizations: OCLC and the Digital Public Library of America (DPLA).

Impressions & Mixed Content

Since this was my first hackathon and the gritty details below may be less than relevant to all our readers, I will front-load my general impressions of Libhack rather than talk about the code I wrote. First of all, splitting the hackathon into two halves focused on different APIs and catering to different skill levels worked well. There were roughly equal numbers of participants in both the structured, beginner-oriented OCLC group and the more independent DPLA group.

Having representatives from both of the participating institutions was wonderful. While I didn’t take advantage of the attending DPLA staff as much as I should have, it was great to have a few people to answer questions. What’s more, I think DPLA benefited from hearing about developers’ experiences with their API. For instance, there are a few metadata fields in their API which might contain an array or a string depending upon the record. If an application assumes one or the other, chances are it breaks at some point and the programmer has to locate the error and write code that handles either data format.

Secondly, the DPLA API is currently available only over unencrypted HTTP. Thus due to the mixed content policies of web browsers it is difficult to call the HTTP API on HTTPS pages. For the many HTTP sites on the web this isn’t a concern, but I wanted to call the DPLA API from Wikipedia which only serves content over HTTPS. To work around this limitation, users have to manually override mixed content blocking in their browser, a major limitation for my project. DPLA already had plans to roll out an HTTPS API, but I think hearing from developers may influence its priority.

Learn You Some Lessons

Personally, I walked away from Libhack with a few lessons. First of all, I had to throw away my initial code before creating something useful. While I had a general idea in mind—somehow connect DPLA content related to a given Wikipedia page—I wasn’t sure what type of project I should create. I started writing a command-line tool in Python, envisioning a command that could be passed a Wikipedia URL or article title and return a list of related items in the DPLA. But after struggling with a pretty unsatisfying project for a couple hours, including a detour into investigating the MediaWiki API, I threw everything aside and took a totally different approach by building a client-side script meant to run in a web browser. In the end, I’m a lot happier with the outcome after my initial failure. I love the command line, but the appeal of such a tool would be niche at best. What I wrote has a far broader appeal.1

Secondly, I worked closely with Wikipedian Jake Orlowitz.2 While he isn’t a coder, his intimate knowledge of Wikipedia was invaluable for our end product. Whenever I had a question about Wikipedia’s inner workings or needed someone to bounce ideas off of, he was there. While I blindly starting writing some JavaScript without a firm idea of how we could embed it onto Wikipedia pages, it was Jake who pointed me towards User Scripts and created an excellent installation tour.3 In other groups, I heard people discussing metadata, subject terms, and copyright. I think that having people of varied expertise in a group is advantageous when compared with a group solely composed of coders. Many hackathons explicitly state that non-programmers are welcome and with good reason; experts can outline goals, consider end-user interactions, and interpret API results. These are all invaluable contributions which are also hard to do with one’s face buried in a code editor.

While I did enjoy my hackathon experience, I was expecting a bit more structure and larger project groups. I arrived late, which doubtless didn’t help, but the DPLA groups were very fragmented. Some projects were only individuals, while others (like ours) were pairs. I had envisioned groups of at least four, where perhaps one person would compose plans and documentation, another would design a user interface, and the remainder would write back-end code. I can’t say that I was at all disappointed, but I could have benefited from the perspectives of a larger group.

What is WikipeDPLA?

So what did we build at Libhack anyway? As previously stated, we made a Wikipedia user script. I’ve dubbed it WikipeDPLA, though you can find it as FindDPLA on Wikipedia. Once installed, the script will query DPLA’s API on each article you visit, inserting related items towards the top.

WikipeDPLA in action

How does it work?

Here’s a step-by-step walkthrough of how WikipeDPLA works:

When you visit a Wikipedia article, it collects a few pieces of information about the article by copying text from the page’s HTML: the article’s title, any “X redirects here” notices, and the article’s categories.

First, WikipeDPLA constructs a DPLA query using the article’s title. Specifically, it constructs a JSONP query. JSONP is a means of working around the web’s same-origin policy which lets scripts manipulate data on other web pages. It works by including a script tag with a specially constructed URL containing a reference to one of your JavaScript functions:

<script src="//example.com/jsonp-api?q=search+term&callback=parseResponse"></script>

In responding to this request, the API plays a little trick; it doesn’t just return raw data, since that would be invalid JavaScript and thus cause a parsing error in the browser. Instead, it wraps the data in the function we’ve provided it. In the example above, that’s parseResponse:

parseResponse({
    "results": [
        {"title": "Searcher Searcherson",
        "id": 123123,
        "genre": "Omphaloskepsis"},
        {"title": "Terminated Term",
        "id": 321321,
        "genre": "Literalism"}
    ]
});

This is valid JavaScript; parseResponse receives an object which contains an array of search result records, each with some minimal metadata. This pattern has the handy feature that, as soon as our query results are available, they’re immediately passed to our callback function.

WikipeDPLA’s equivalent of parseResponse looks to see if there are any results. If the article’s title doesn’t return any results, then it’ll try again with any alternate titles culled from the article’s redirection notices. If those queries are also fruitless, it starts to go through the article’s categories.

Once we’ve guaranteed that we have some results from DPLA, we parse the API’s metadata into a simpler subset. This subset consists of the item’s title, a link to its content, and an “isImage” Boolean value noting whether or not the item is an image. With this simpler set of data in hand, we loop through our results to build a string of HTML which is then inserted onto the page. Voìla! DPLA search results in Wikipedia.

Honing

After putting the project together, I continued to refine it. I used the “isImage” Boolean to put a small image icon next to an item’s link. Then, after the hackathon, I noticed that my script was a nuisance if a user started reading a page anywhere other than at its start. For instance, if you start reading the Barack Obama article at the Presidency section, you will read for a moment and then suddenly be jarred as the DPLA results are inserted up top and push the rest of the article’s text down the page. In order to mitigate this behavior, we need to know if the top of the article is in view before inserting our results HTML. I used a jQuery visibility plug-in and an event listener on window scroll events to fix this.

Secondly, I was building a project with several targets: a user script for Wikipedia, a Grease/Tampermonkey user script4, and a (as yet inchoate) browser extension. To reuse the same basic JavaScript but in these different contexts, I chose to use the make command. Make is a common program used for projects which have multiple platform targets. It has an elegantly simple design: when you run make foo inside of a directly, make looks in a file named “makefile” for a line labelled “foo:” and then executes the shell command on the subsequent line. So if I have the following makefile:

hello:
    echo 'hello world!'

bye:
    echo 'goodbye!'

clean:
    rm *.log *.cache

Inside the same directory as this makefile, the commands make hello, make bye, and make clean respectively would print “hello world!” to my terminal, print “goodbye!”, and delete all files ending in extension “log” or “cache”. This contrived example doesn’t help much, but in my project I can run something like make userscript and the Grease/Tampermonkey script is automatically produced by prepending some header text to the main WikipeDPLA script. Similarly, make push produces all the various platform targets and then pushes the results up to the GitHub repo, saving a significant amount of typing on the command line.

These bits of trivia about interface design and tooling allude to a more important idea: it’s vital to choose projects that help you learn, particularly in a low-stakes environment like a hackathon. No one expects greatness from a product duct taped together in a few hours, so seize the opportunity to practice rather than aim for perfection. I didn’t have to write a makefile, but I chose to spend time familiarizing myself with a useful tool.

What’s Next?

While I am quite happy with my work at Libhack I do have plans for improvement. My main goal is to turn WikipeDPLA into a browser extension, for Chrome and perhaps Firefox. An extension offers a couple advantages: it can avoid the mixed-content issue with DPLA’s HTTP-only API5 and it is available even for users who aren’t logged in to Wikipedia. It would also be nice to expand my approach to encompassing other major digital library APIs, such as Europeana or Australia’s Trove.

And, of course, I want to attend more hackathons. Libhack was a very positive event for me, both in terms of learning and producing something useful, so I’m encouraged and hope other library conferences offer collaborative coding opportunities.

Other Projects

Readers should head over to LITA Blog where organizer Zach Coble has a report on libhack which details several other projects created at the Midwinter hackathon. Or you could just follow @HistoricalCats on Twitter.

Notes

  1. An aside related to learning to program: being a relatively new coder, I often think about advice I can give others looking to start coding. One common question is “what language should I learn first?” There’s one stock response, that it’s important not to worry too much about this choice because learning the fundamentals of one language will enable you to learn others quickly. But that dodges the question, because what people want to hear is a proper noun like “Ruby” or “Python” or “JavaScript.” And JavaScript, despite not being nearly as user friendly as those other two options, is a great starting place because it lets you work on the web with little effort. All of this to say; if I didn’t know JavaScript fairly well, I would not have been able to make something so useful.
  2. Shameless plug: Jake works on the Wikipedia Library, an interesting project that aims to connect Wikipedian researchers with source material, from subscription databases and open access repositories alike.
  3. User Scripts are pieces of JavaScript that a user can choose to insert whenever they are signed into and browsing Wikipedia. They’re similar to Greasemonkey user scripts, except the scripts only apply to Wikipedia. These scripts can do anything from customize the site’s appearance to insert new content, which is exactly what we did.
  4. Greasemonkey is the Firefox add-on for installing scripts that run on specified sites or pages; Tampermonkey is an analogous extension for Chrome.
  5. How’s that for acronyms?

Bash Scripting: automating repetitive command line tasks

Introduction

One of my current tasks is to develop workflows for digital preservation procedures. These workflows begin with the acquisition of files – either disk images or logical file transfers – both of which end up on a designated server. Once acquired, the images (or files) are checked for viruses. If clean, they are bagged using Bagit and then copied over to a different server for processing.1 This work is all done at the command line, and as you can imagine, it gets quite repetitive. It’s also a bit error-prone since our file naming conventions include a 10-digit media ID number, which is easily mistyped. So once all the individual processes were worked out, I decided to automate things a bit by placing the commands into a single script. I should mention here that I’m no Linux whiz- I use it as needed which sometimes is daily, sometimes not. This is the first time I’ve ever tried to tie commands together in a Bash script, but I figured previous programming experience would help.

Creating a Script

To get started, I placed all the virus check commands for disk images into a script. These commands are different than logical file virus checks since the disk has to be mounted to get a read. This is a pretty simple step – first add:

#!/bin/bash

as the first line of the file (this line should not be indented or have any other whitespace in front of it). This tells the kernel which kind of interpreter to invoke, in this case, Bash. You could substitute the path to another interpreter, like Python, for other types of scripts: #!/bin/python.

Next, I changed the file permissions to make the script executable:

chmod +x myscript

I separated the virus check commands so that I could test those out and make sure they were working as expected before delving into other script functions.

Here’s what my initial script looked like (comments are preceded by a #):

#!/bin/bash

#mount disk
sudo mount -o ro,loop,nodev,noexec,nosuid,noatime /mnt/transferfiles/2013_072/2013_072_DM0000000001/2013_072_DM0000000001.iso /mnt/iso

#check to verify mount
mountpoint -q /mnt/iso && echo "mounted" || "not mounted"

#call the Clam AV program to run the virus check
clamscan -r /mnt/iso > "/mnt/transferfiles/2013_072/2013_072_DM0000000001/2013_072_DM0000000001_scan_test.txt"

#unmount disk
sudo umount /mnt/iso

#check disk unmounted
mountpoint -q /mnt/iso && echo "mounted" || "not mounted"

All those options on the mount command? They give me the piece of mind that accessing the disk will in no way alter it (or affect the host server), thus preserving its authenticity as an archival object. You may also be wondering about the use of “&&” and “||”.  These function as conditional AND and OR operators, respectively. So “&&” tells the shell to run the first command, AND if that’s successful, it will run the second command. Conversely, the “||” tells the shell to run the first command OR if that fails, run the second command. So the mount check command can be read as: check to see if the directory at /mnt/iso is a mountpoint. If the mount is successful, then echo “mounted.” If it’s not, echo “not mounted.” More on redirection.

Adding Variables

You may have noted that the script above only works on one disk image (2013_072_DM0000000001.iso), which isn’t very useful. I created variables for the accession number, the digital media number, and the file extension, since they all changed depending on the disk image information.The file naming convention we use for disk images is consistent and strict. The top level directory is the accession number. Within that, each disk image acquired from that accession is stored within it’s own directory, named using it’s assigned number. The disk image is then named by a combination of the accession number and the disk image number. Yes, it’s repetitive, but it keeps track of where things came from and links to data we have stored elsewhere. Given that these disks may not be processed for 20 years, such redundancy is preferred.

At first I thought the accession number, digital media number, and extension variables would be best entered at the initial run command; type in one line to run many commands. Each variable is separated by a space, the .iso at the end is the extension for an optical disk image file:

$ ./virus_check.sh 2013_072 DM0000000001 .iso

In Bash, scripts run with arguments are named $1 for the first variable, $2 for the second, and so on. This actually tripped me up for a day or so. I initially thought the $1, $2, etc. variables names used by the book I was referencing were for examples only, and that the first variables I referenced in the script would automatically map in order, so if 2013_072 was the first argument, and $accession was the first variable, $accession = 2013_072 (much like when you pass in a parameter to a Python function). Then I realized there was a reason that more than one reference book and/or site used the $1, $2, $3 system for variables passed in as command line arguments. I assigned each to it’s proper variable, and things were rolling again.

#!/bin/bash

#assign command line variables
$1=$accession
$2=$digital_media
$3=$extension

#mount disk
sudo mount -o ro,loop,noatime /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} /mnt/iso<span style="line-height: 1.5em;"> </span>

Note: variables names are often presented without curly braces; it’s recommended to place them in curly braces when adjacent to other strings.2

Reading Data

After testing the script a bit, I realized I  didn’t like passing the variables in via the command line. I kept making typos, and it was annoying not to have any data validation done in a more interactive fashion. I reconfigured the script to prompt the user for input:

read -p "Please enter the accession number" accession

read -p "Please enter the digital media number" digital_media

read -p "Please enter the file extension, include the preceding period" extension

After reviewing some test options, I decided to test that $accession and $digital_media were valid directories, and that the combo of all three variables was a valid file. This test seems more conclusive than simply testing whether or not the variables fit the naming criteria, but it does mean that if the data entered is invalid, the feedback given to the user is limited. I’m considering adding tests for naming criteria as well, so that the user knows when the error is due to a typo vs. a non-existent directory or file. I also didn’t want the code to simply quit when one of the variables is invalid – that’s not very user-friendly. I decided to ask the user for input until valid input was received.

read -p "Please enter the accession number" accession
until [ -d /mnt/transferfiles/${accession} ]; do
     read -p "Invalid. Please enter the accession number." accession
done

read -p "Please enter the digital media number" digital_media
until [ -d /mnt/transferfiles/${accession}/${accession}_${digital_media} ]; do
     read -p "Invalid. Please enter the digital media number." digital_media
done

read -p  "Please enter the file extension, include the preceding period" extension
until [ -e/mnt/transferfiles/${accession}/${accession}_${digital_media}/
${accession}_${digital_media}${extension} ]; do
     read -p "Invalid. Please enter the file extension, including the preceding period" extension
done

Creating Functions

You may have noted that the command used to test if a disk is mounted or not is called twice. This is done on purpose, as it was a test I found helpful when running the virus checks; the virus check runs and outputs a report even if the disk isn’t mounted. Occasionally disks won’t mount for various reasons. In such cases, the resulting report will state that it scanned no data, which is confusing because the disk itself possibly could have contained no data. Testing if it’s mounted eliminates that confusion. The command is repeated after the disk has been unmounted, mainly because I found it easy to forget the unmount step, and testing helps reinforce good behavior. Given that the command is repeated twice, it makes sense to make it a function rather than duplicate it.

check_mount () {
     #checks to see if disk is mounted or not
     mountpoint -q /mnt/iso && echo "mounted" || "not mounted" 
}

Lastly, I created a function for the input variables. I’m sure there’s prettier, more concise ways of writing this function, but since it’s still being refined and I’m still learning Bash scripting, I decided to leave it for now. I did want it placed in it’s own function because I’m planning to add additional code that will notify me if the virus check is positive and exit the program, or if it’s negative, bag the disk image and corresponding files, and copy them over to another server where they’ll wait for further processing.

get_image () {
     #gets data from user, validates it    
     read -p "Please enter the accession number" accession
     until [ -d /mnt/transferfiles/${accession} ]; do
          read -p "Invalid. Please enter the accession number." accession
     done

     read -p "Please enter the digital media number" digital_media
     until [ -d /mnt/transferfiles/${accession}/${accession}_${digital_media} ]; do
          read -p "Invalid. Please enter the digital media number." digital_media
     done

     read -p  "Please enter the file extension, include the preceding period" extension
     until [ -e /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} ]; do
          read -p "Invalid. Please enter the file extension, including the preceding period" extension
     done
}

Resulting (but not final!) Script

#!/bin/bash
#takes accession number, digital media number, and extension as variables to mount a disk image and run a virus check

check_mount () {
     #checks to see if disk is mounted or not
     mountpoint -q /mnt/iso && echo "mounted" || "not mounted"
}

get_image () {
     #gets disk image data from user, validates info
     read -p "Please enter the accession number" accession
     until [ -d /mnt/transferfiles/${accession} ]; do
          read -p "Invalid. Please enter the accesion number." accession
     done

     read -p "Please enter the digital media number" digital_media
     until [ -d /mnt/transferfiles/${accession}/${accession}_${digital_media} ]; do
          read -p "Invalid. Please enter the digital media number." digital_media
     done

     read -p  "Please enter the file extension, include the preceding period" extension
     until [ -e/mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} ]; do
          read -p "Invalid. Please enter the file extension, including the preceding period" extension
     done
}

get_image

#mount disk
sudo mount -o ro,loop,noatime /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} /mnt/${extension} 

check_mount

#run virus check
sudo clamscan -r /mnt/iso > "/mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}_scan_test.txt"

#unmount disk
sudo umount /mnt/iso

check_mount

Conclusion

There’s a lot more I’d like to do with this script. In addition to what I’ve already mentioned, I’d love to enable it to run over a range of digital media numbers, since they often are sequential. It also doesn’t stop if the disk isn’t mounted, which is an issue. But I thought it served as a good example of how easy it is to take repetitive command line tasks and turn them into a script. Next time, I’ll write about the second phase of development, which will include combining this script with another one, virus scan reporting, bagging, and transfer to another server.

Suggested References

An A-Z Index of the Bash command line for Linux

The books I used, both are good for basic command line work, but they only include a small section for actual scripting:

Barrett, Daniel J., Linux Pocket Guide. O’Reilly Media, 2004.

Shotts, Jr., William E. The Linux Command Line: A complete introduction. no starch press. 2012.

The book I wished I used:

Robbins, Arnold and Beebe, Nelson H. F. Classic Shell Scripting. O’Reilly Media, 2005.

Notes

  1. Logical file transfers often arrive in bags, which are then validated and virus checked.
  2. Linux Pocket Guide

What is Node.js & why do I care?

At its simplest, Node.js is server-side JavaScript. JavaScript is a popular programming language, but it almost always runs inside a web browser. So JavaScript can, for instance, manipulate the contents of this page by being included inside <script> tags, but it doesn’t get to play around with the files on our computer or tell our server what HTML to send like the PHP that runs this WordPress blog.

Node is interesting for more than just being on the server side. It provides a new way of writing web servers while using an old UNIX philosophy. Hopefully, by the end of this post, you’ll see its potential and how it differentiates itself from other programming environments and web frameworks.

Hello, World

To start, let’s do some basic Node programming. Head over to nodejs.org and click Install.1 Once you’ve run the installer, a node executable will be available for you on the command line. Any script you pass to node will be interpreted and the results displayed. Let’s do the classic “hello world” example. Create a new file in a text editor, name it hello.js, and put the following on its only line:

console.log('Hello, world!');

If you’ve written JavaScript before, you may recognize this already. console.log is a common debugging method which prints strings to your browser’s JavaScript console. In Node, console.log will output to your terminal. To see that, open up a terminal (on Mac, you can use Terminal.app while on Windows both cmd.exe and PowerShell will work) and navigate to the folder where you put hello.js. Your terminal will likely open in your user’s home folder; you can change directories by typing cd followed by a space and the subdirectory you want to go inside. For instance, if I started at “C:\users\me” I could run cd Documents to enter “C:\users\me\Documents”. Below, we open a terminal, cd into the Documents folder, and run our script to see its results.

$ cd Documents
$ node hello.js
Hello, world!

That’s great and all, but it leaves a lot to be desired. Let’s do something a little more sophisticated; let’s write a web server which responds “Hello!” to any request sent to it. Open a new file up, name it server.js, and write this inside:

var http = require('http');
http.createServer(handleRequest).listen(8888);
function handleRequest (request, response) {
  response.end( 'Hello!' );
}

In our terminal, we can run node server.js and…nothing happens. Our prompt seems to hang, not outputting anything but also not letting us type another command. What gives? Well, Node is running a web server and it’s waiting for responses. Open up your web browser and navigate to “localhost:8888″; the exclamation “Hello!” should appear. In four lines of code, we just wrote an HTTP server. Sure, it’s the world’s dumbest server that only says “Hello!” over and over no matter what we request from it, but it’s still an achievement. If you’re the sort of person who gets giddy at how easy this was, then Node.js is for you.

Let’s walk through server.js line-by-line. First, we import the core HTTP library that comes with Node. The “require” function is a way of loading external modules into your script, similar to how the function of the same name does in Ruby or import in Python. The HTTP library gives us a handy “createServer” method which receives HTTP requests and passes them along to a callback function. On our 2nd line, we call createServer, pass it the function we want to handle incoming requests, and set it to listen for requests sent to port 8888. The choice of 8888 is arbitrary; we could choose any number over 1024, while operating systems often restrict the lower ports which are already in use by specific protocols. Finally, we define our handleRequest callback which will receive a request and response object for each HTTP request. Those objects have many useful properties and methods, but we simply called the response object’s end method which sends a response and optionally accepts some data to put into that response.

The use of callback functions is very common in Node. If you’ve written JavaScript for a web browser you may recognize this style of programming; it’s the same as when you define an event listener which responds to mouse clicks, or assign a function to process the result of an AJAX request. The callback function doesn’t executive synchronously in the same order you wrote it in your code, it waits for some “event” to occur, whether that event is a click or an AJAX request returning data.

In our HTTP server example, we also see a bit of what makes Node different from other server-side languages like PHP, Perl, Python, and Ruby. Those languages typically work with a web server, such as Apache, which passes certain requests over to the languages and serves up whatever they return. Node is a server, it gives you low-level access to the inner workings of protocols like HTTP and TCP. You don’t need to run Apache and have requests sent to Node: it handles them on its own.

Who cares?

Some of you are no doubt wondering: what exactly is the big deal? Why am I reading about this? Surely, the world has enough programming languages, and JavaScript is nothing new, even server-side JavaScript isn’t that new.2 There are already plenty of web servers out there. What need does Node.js fill?

To answer that, we must revisit the origins of Node. The best way to understand is to watch Ryan Dahl present on the impetus for creating Node. He says, essentially, that other programming frameworks are doing IO (input/output) wrong. IO comes in many forms: when you’re reading or writing to a file, when you’re querying databases, and when you’re receiving and sending HTTP requests. In all of these situations, your code asks for data…waits…and waits…and then, once it has the data, it manipulates it or performs some calculation, and then sends it somewhere else…and waits…and waits. Basically, because the code is constantly waiting for some IO operation, it spends most of its time sitting around rather than crunching digits like it wants to. IO operations are commonly the bottlenecks in programs, so we shouldn’t let our code just stop every time they perform one.

Node not only has a beneficial asynchronous programming model but it has developed other advantages as well. Because lots of people already know how to write JavaScript, it’s started up much quicker than languages which are entirely new to developers. It reuses Google Chrome’s V8 as a JavaScript interpreter, giving it a big speed boost. Node’s package manager, NPM, is growing at a tremendous rate, far faster than its sibling package managers for Java, Ruby, and Python. NPM itself is a strong point of Node; it’s learned from other package managers and has many excellent features. Finally, other programming languages were developed to be all-purpose tools. Node, while it does share the same all-purpose utility, is really intended for the web. It’s meant to write web servers and handle HTTP intelligently.

Node also follows many UNIX principles. Doug McIlroy succinctly summarized the UNIX philosophy as “Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.” NPM does a great job letting authors write small modules which work well together. This has been tough previously in JavaScript because web browsers have no “require” function; there’s no native way for modules to define and load their dependencies, which resulted in the popularity of large, complicated libraries.3 jQuery is a good example; it’s tremendously popular and it includes hundreds of functions in its API, while most sites that use it really only need a few. Large, complicated programs are more difficult to test, debug, and reason about, which is why UNIX avoided them.

Many Node modules also support streams that allow you to pipe data through a series of programs. This is analogous to how BASH and other shells let you pipe text from one command to another, with each command taking the output of the last as its input. To visualize this, see Stream Playground written by John Resig, creator of jQuery. Streams allow you to plug-in different functionality in when needed. This pseudocode shows how one might read a CSV from a server’s file system (the core “fs” library stands for “file system”), filter out certain rows, and send it over HTTP:

fs.createReadStream('spreadsheet.csv').pipe(filter).pipe(http);
// Want to compress the response? Just add another pipe.
fs.createReadStream('spreadsheet.csv').pipe(filter).pipe(compressor).pipe(http);

Streams have the advantage of limiting how much memory a program uses because only small portions of data are being operated on at once. Think of the difference between copying a million-line spreadsheet all at once or line-by-line; the second is less likely to crash or run into the limit of how much data the system clipboard can hold.

Libraryland Examples

Node is still very new and there aren’t a lot prominent examples of library usage. I’ll try to present a few, but I think it’s more worth knowing about as a major trend in web development.

Most amusingly, Ed Summers of the Library of Congress and Sean Hannan of Johns Hopkins University made a Cataloging Highscores page that presents original cataloging performed in WorldCat in a retro arcade-style display. This app uses the popular socket.io module that establishes a real-time connection between your browser and the server, a strength of Node. Any web service that needs to be continually updated is a prime candidate for Node.js: current news articles, social media streams, auto-complete suggestions as a user types in search terms, and chat reference all come to mind. In fact, SpringShare’s LibChat uses socket.io as well, though I can’t tell if it’s using Node on the server or PHP. A similar example of real-time updating, also by Ed Summers, is Wikistream which streams the dizzying number of edits happening on various Wikipedias through your browser.4

There was a lightning talk on Node at Code4Lib 2010 which mentions writing a connector to the popular Apache Solr search platform. Aaron Coburn’s proposed talk for Code4Lib 2014 mentions that Amherst is using Node to build the web front-end to their Fedora-based digital library.

Tools You Can Use

With the explosive growth of NPM, there are already tons of useful tools written in Node. While many of these are tools for writing web servers, like Express, some are command line programs you can use to accomplish a variety of tasks.

Yeoman is a scaffolding application that makes it easy to produce various web apps by giving you expert templates. You can install separate generators that produce templates for things like a Twitter Bootstrap site, a JavaScript bookmarklet, a mobile site, or a project using the Angular JavaScript MVC framework. Running yo angular to invoke the Angular generator gives you a lot more than just a base HTML file and some JavaScript libraries; it also provides a series of Grunt tasks for testing, running a development server, and building a site optimized for production. Grunt is another incredibly useful Node project, dubbed “the JavaScript task runner.” It lets you pick from hundreds of community plugins to automate tedious tasks like minifying and concatenating your scripts before deploying a website.

Finally, another tool that I like is phantomas which is a Node project that works with PhantomJS to run a suite of performance tests on a site. It provides more detailed reports than any other performance tool I’ve used, telling you things like how many DOM queries ran and median latency of HTTP requests.

Learn More

Nodeschool.io features a growing number of lessons on using Node. Better yet, the lessons are actually written in Node, so you install them with NPM and verify your results on the command line. There are several topics, from basics to using streams to working with databases.

Nettuts+, always a good place for coding tutorials, has an introduction to Node which takes you from installation to coding a real-time server. If you want to learn about writing a real-time chat application with socket.io, they have a tutorial for that, too.

If you want a broad and thorough overview, there are a few introductory books on Node, with The Node Beginner Book offering several free chapters. O’Reilly’s Node for Front-End Developers is also a good starting point.

How to Node is a popular blog with articles on various topics, though some are too in-depth for beginners. I’d head here if you want to learn more on a specific topic, such as streams, or working with particular databases like MongoDB.

Finally, the Node API docs are a good place to go when you get stuck using a particular core module.

Notes

  1. If you use a package manager, such as Homebrew on Mac OS X or APT on Linux, Node is likely available within it. One caveat I have noticed is that the stock Debian/Ubuntu apt-get install nodejs is a few major versions behind; you may want to add Chris Lea’s PPA to get a current version. If you’re subject to the whims of your IT department, you may need to convince them to install Node for you, or talk to your sysadmin to get it on your server. Since it’s a rather new technology, don’t be surprised if you have to explain what it is and why you want to try it out.
  2. Previous projects, including Rhino from Mozilla and Narwhal, have let people use JavaScript outside the server. Node, however, has caught on far more than either of these projects, for some of the reasons outlined in this post.
  3. RequireJS is one project that’s trying to address this need. The ECMAScript standard that defines JavaScript is also working on native modules but they’re in draft form and it’ll be a long time before all browsers support them.
  4. If you’re curious, the code for both Cataloging Highscores and Wikistream are open source and available on GitHub.

Query a Google Spreadsheet like a Database with Google Visualization API Query Language

Libraries make much use of spreadsheets. Spreadsheets are easy to create, and most library staff are familiar with how to use them. But they can quickly get unwieldy as more and more data are entered. The more rows and columns a spreadsheet has, the more difficult it is to browse and quickly identify specific information. Creating a searchable web application with a database at the back-end is a good solution since it will let users to quickly perform a custom search and filter out unnecessary information. But due to the staff time and expertise it requires, creating a full-fledged searchable web database application is not always a feasible option at many libraries.

Creating a MS Access custom database or using a free service such as Zoho can be an alternative to creating a searchable web database application. But providing a read-only view for MS Access database can be tricky although possible. MS Access is also software locally installed in each PC and therefore not necessarily available for the library staff when they are not with their work PCs on which MS Access is installed. Zoho Creator offers a way to easily convert a spreadsheet into a database, but its free version service has very limited features such as maximum 3 users, 1,000 records, and 200 MB storage.

Google Visualization API Query Language provides a quick and easy way to query a Google spreadsheet and return and display a selective set of data without actually converting a spreadsheet into a database. You can display the query result in the form of a HTML table, which can be served as a stand-alone webpage. All you have to do is to construct a custom URL.

A free version of Google spreadsheet has a limit in size and complexity. For example, one free Google spreadsheet can have no more than 400, 000 total cells. But you can purchase more Google Drive storage and also query multiple Google spreadsheets (or even your own custom databases) by using Google Visualization API Query Language and Google Chart Libraries together.  (This will be the topic of my next post. You can also see the examples of using Google Chart Libraries and Google Visualization API Query Language together in my presentation slides at the end of this post.)

In this post, I will explain the parameters of Google Visualization API Query Language and how to construct a custom URL that will query, return, and display a selective set of data in the form of an HTML page.

A. Display a Google Spreadsheet as an HTML page

The first step is to identify the URL of the Google spreadsheet of your choice.

The URL below opens up the third sheet (Sheet 3) of a specific Google spreadsheet. There are two parameters you need to pay attention inside the URL: key and gid.

https://docs.google.com/spreadsheet/ccc?key=0AqAPbBT_k2VUdDc3aC1xS2o0c2ZmaVpOQWkyY0l1eVE&usp=drive_web#gid=2

This breaks down the parameters in a way that is easier to view:

  • https://docs.google.com/spreadsheet/ccc
    ?key=0AqAPbBT_k2VUdDc3aC1xS2o0c2ZmaVpOQWkyY0l1eVE
    &usp=drive_web

    #gid=2

Key is a unique identifier to each Google spreadsheet. So you need to use that to cretee a custom URL later that will query and display the data in this spreadsheet. Gid specifies which sheet in the spreadsheet you are opening up. The gid for the first sheet is 0; the gid for the third sheet is 2.

Screen Shot 2013-11-27 at 9.44.29 AM

Let’s first see how Google Visualization API returns the spreadsheet data as a DataTable object. This is only for those who are curious about what goes on behind the scenes. You can see that for this view, the URL is slightly different but the values of the key and the gid parameter stay the same.

https://spreadsheets.google.com/tq?&tq=&key=0AqAPbBT_k2VUdDc3aC1xS2o0c2ZmaVpOQWkyY0l1eVE&gid=2

Screen Shot 2013-11-27 at 9.56.03 AM

In order to display the same result as an independent HTML page, all you need to do is to take the key and the gid parameter values of your own Google spreadsheet and construct the custom URL following the same pattern shown below.

  • https://spreadsheets.google.com
    /tq?tqx=out:html&tq=
    &key=0AqAPbBT_k2VUdDc3aC1xS2o0c2ZmaVpOQWkyY0l1eVE
    &gid=2

https://spreadsheets.google.com/tq?tqx=out:html&tq=&key=0AqAPbBT_k2VUdDc3aC1xS2o0c2ZmaVpOQWkyY0l1eVE&gid=2

Screen Shot 2013-11-27 at 9.59.11 AM

By the way, if the URL you created doesn’t work, it is probably because you have not encoded it properly. Try this handy URL encoder/decoder page to encode it by hand or you can use JavaScript encodeURIComponent() function.
Also if you want the URL to display the query result without people logging into Google Drive first, make sure to set the permission setting of the spreadsheet to be public. On the other hand, if you need to control access to the spreadsheet only to a number of users, you have to remind your users to first go to Google Drive webpage and log in with their Google account before clicking your URLs. Only when the users are logged into Google Drive, they will be able see the query result.

B. How to Query a Google Spreadsheet

We have seen how to create a URL to show an entire sheet of a Google spreadsheet as an HTML page above. Now let’s do some querying, so that we can pick and choose what data the table is going to display instead of the whole sheet. That’s where the Query Language comes in handy.

Here is an example spreadsheet with over 50 columns and 500 rows.

  • https://docs.google.com/spreadsheet/ccc?
    key=0AqAPbBT_k2VUdDFYamtHdkFqVHZ4VXZXSVVraGxacEE
    &usp=drive_web
    #gid=0

https://docs.google.com/spreadsheet/ccc?key=0AqAPbBT_k2VUdDFYamtHdkFqVHZ4VXZXSVVraGxacEE&usp=drive_web#gid=0

Screen Shot 2013-11-27 at 10.15.41 AM

What I want to do is to show only column B, C, D, F where C contains ‘Florida.’ How do I do this? Remember the URL we created to show the entire sheet above?

  • https://spreadsheets.google.com/tq?tqx=out:html&tq=&key=___&gid=___

There we had no value for the tq parameter. This is where we insert our query.

Google Visualization API Query Language is pretty much the same as SQL. So if you are familiar with SQL, forming a query is dead simple. If you aren’t SQL is also easy to learn.

  • The query should be written like this:
    SELECT B, C, D, F WHERE C CONTAINS ‘Florida’
  • After encoding it properly, you get something like this:
    SELECT%20B%2C%20C%2C%20D%2C%20F%20WHERE%20C%20CONTAINS%20%27Florida%27
  • Add it to the tq parameter and don’t forget to also specify the key:
    https://spreadsheets.google.com/tq?tqx=out:html&tq=SELECT%20B%2C%20C%2C%20D%2C%20F%20WHERE%20C%20CONTAINS%20%27Florida%27
    &key=0AqAPbBT_k2VUdEtXYXdLdjM0TXY1YUVhMk9jeUQ0NkE

I am omitting the gid parameter here because there is only one sheet in this spreadsheet but you can add it if you would like. You can also omit it if the sheet you want is the first sheet. Ta-da!

Screen Shot 2013-11-27 at 10.26.13 AM

Compare this with the original spreadsheet view. I am sure you can appreciate how the small effort put into creating a URL pays back in terms of viewing an unwieldy large spreadsheet manageable.

You can also easily incorporate functions such as count() or sum() into your query to get an overview of the data you have in the spreadsheet.

  • select D,F count(C) where (B contains ‘author name’) group by D, F

For example, this query above shows how many articles a specific author published per year in each journal. The screenshot of the result is below and you can see it for yourself here: https://spreadsheets.google.com/tq?tqx=out:html&tq=select+D,F,count(C)+where+%28B+contains+%27Agoulnik%27%29+group+by+D,F&key=0AqAPbBT_k2VUdEtXYXdLdjM0TXY1YUVhMk9jeUQ0NkE

Screen Shot 2013-11-27 at 11.34.25 AM

Take this spread sheet as another example.

libbudgetfake

This simple query below displays the library budget by year. For those who are unfamiliar with ‘pivot‘, pivot table is a data summarization tool. The query below asks the spreadsheet to calculate the total of all the values in the B column (Budget amount for each category) by the values found in the C column (Years).

Screen Shot 2013-11-27 at 11.46.49 AM

This is another example of querying the spreadsheet connected to my library’s Literature Search request form. The following query asks the spreadsheet to count the number of literature search requests by Research Topic (=column I) that were received in 2011 (=column G) grouped by the values in the column C, i.e. College of Medicine Faculty or College of Medicine Staff.

  • select C, count(I) where (G contains ’2011′) group by C

litsearch

C. More Querying Options

There are many more things you can do with a custom query. Google has an extensive documentation that is easy to follow: https://developers.google.com/chart/interactive/docs/querylanguage#Language_Syntax

These are just a few examples.

  • ORDER BY __ DESC
    : Order the results in the descending order of the column of your choice. Without ‘DESC,’ the result will be listed in the ascending order.
  • LIMIT 5
    : Limit the number of results. Combined with ‘Order by’ you can quickly filter the results by the most recent or the oldest items.

My presentation slides given at the 2013 LITA Forum below includes more detailed information about Google Visualization API Query Language, parameters, and other options as well as how to use Google Chart Libraries in combination with Google Visualization API Query Language for data visualization, which is the topic of my next post.

Happy querying Google Spreadsheet!

 


Web Scraping: Creating APIs Where There Were None

Websites are human-readable. That’s great for us, we’re humans. It’s not so great for computer programs, which tend to be better at navigating structured data rather than visuals.

Web scraping is the practice of “scraping” information from a website’s HTML. At its core, web scraping lets programs visit and manipulate a website much like people do. The advantage to this is that, while programs aren’t great at navigating the web on their own, they’re really good at repeating things over and over. Once a web scraping script is set up, it can run an operation thousands of times over without breaking a sweat. Compare that to the time and tedium of clicking through a thousand websites to copy-paste the information you’re interested in and you can see the appeal of automation.

Why web scraping?

Why would anybody use web scraping? There are a few good reasons which are, unfortunately, all too common in libraries.

You need an API where there is none.

Many of the web services we subscribe to don’t expose their inner workings via an API. It’s worth taking a moment to explain the term API, which is used frequently but rarely given a better definition beyond the uninformative “Application Programming Interface”.

Let’s consider a common type of API, a search API. When you visit Worldcat and search, the site checks an enormous database of millions of metadata records and returns a nice, visually formatted list of ones relevant to your query. Again, this is great for humans. We can read through the results and pick out the ones we’re interested in. But what happens when we want to repurpose this data elsewhere? What if we want to build a bento search box, displaying results from our databases and Worldcat alongside each other?1 The answer is that we can’t easily accomplish this without an API.

For example, the human-readable results of search engine may look like this:

1. Instant PHP Web Scraping

by Jacob Ward

Publisher: Packt Publishing 2013

2. Scraping by: wage labor, slavery, and survival in early Baltimore

by Seth Rockman

Publisher: Johns Hopkins University Press 2009

That’s fine for human eyes, but for our search application it’s a pain in the butt. Even if we could embed a result like this using an iframe, the styling might not match what we want and the metadata fields might not display in a manner consistent with our other records (e.g. why is the publication year included with publisher?). What an API returns, on the other hand, may look like this:

[
  {
    "title": "Instant PHP Web Scraping",
    "author": "Jacob Ward",
    "publisher": "Packt Publishing",
    "publication_date": "2013"
  },
  {
    "title": "Scraping by: wage labor, slavery, and survival in early Baltimore",
    "author": "Seth Rockman",
    "publisher": "Johns Hopkins University Press",
    "publication_date": "2009"
  }
]

Unless you really love curly braces and quotation marks, that looks awful. But it’s very easy to manipulate in many programming languages. Here’s an incomplete example in Python:

results = json.load( data )
for result in results:
  print result.title + ' - ' + result.author

Here “data” is our search results from above and we can use a function to easily parse that data into a variable. The script then loops over each search result and prints out its title in author in a format like “Instant PHP Web Scraping – Jacob Ward”.

An API is hard to use or doesn’t have the data you need.

Sometimes services do expose their data via an API, but the API has limitations that the human interface of the website doesn’t. Perhaps it doesn’t expose all the metadata which is visible in search results. Fellow Tech Connect author Margaret Heller mentioned that Ulrich’s API doesn’t include subject information, though it’s present in the search results presented to human users.

Some APIs can also be more difficult to use than web scraping. The ILS at my place of work is like this; you have to pay extra to get the API activated and it requires server configuration on a shared server I don’t have access to. The API has strict authentication requirements which are required even for read-only calls (e.g. I’m just accessing publicly-viewable data, not making account changes). The boilerplate code the vendor provides doesn’t work, or rather only works for trivial examples. All these hurdles combine to make scraping the catalog appealing.

As a side effect, how you reconfigure a site’s data might inspire its own API. Are you sorely missing a feature so bad you need to hack around it? Writing a nice proof-of-concept with web scraping might prove that there’s a use case for a particular API feature.

How?

More or less all web scraping works the same way:

  • Use a scripting language to get the HTML of a particular page
  • Find the interesting pieces of a page using CSS, XPath, or DOM traversal—any means of identifying specific HTML elements
  • Manipulate those pieces, extracting the data you need
  • Pipe the data somewhere else, e.g. into another web page, spreadsheet, or script

Let’s go through an example using the Directory of Open Access Journals. Now, the DOAJ has an API of sorts; it supports retrieving metadata via the OAI-PMH verbs. This means a request for a URL like http://www.doaj.org/oai?verb=GetRecord&identifier=18343147&metadataPrefix=oai_dc will return XML with information about one of the DOAJ journals. But OAI-PMH doesn’t support any search APIs; we can use standard identifiers and other means of looking up specific articles or publications, but we can’t do a traditional keyword search.

Libraries, of the code persuasion

Before we get too far, let’s lean on those who came before us. Scraping a website is both a common task and a complex one. Remember last month, when I said that we don’t need to reinvent the wheel in our programming because reusable modules exist for most common tasks? Please let’s not write our own web scraping library from scratch.

Code libraries, which go by different names depending on the language (most amusingly, they’re called “eggs” in Python and “gems” in Ruby), are pre-written chunks of code which help you complete common tasks. Any task which several people have had to do before probably has a library devoted to it. Google searches for “best [insert task] module for [insert language]” typically turn up useful guidance on where to start.

While each language has its own means of incorporating others’ code into your own, they all basically have two steps: 1) download the external library somewhere onto your hard drive or server, often using a command-line tool, and 2) import the code into your script. The external library should have some documentation on how to use it’s special features once you’re imported it.

What does this look like in PHP, the language our example will be in? First, we visit the Simple HTML DOM website on Sourceforge to download a single PHP file. Then, we place that file in the same directory that our scraping script will live. In our scraping script, we write a single line up at the top:

<?php
require_once( 'simple_html_dom.php' );
?>

Now it’s as if the whole contents of the simple_html_dom.php file were in our script. We can use functions and classes which were defined in the other file, such as the file_get_html function which is not otherwise available. PHP actually has a few functions which are used to import code in different ways; the documentation page for the include function describes the basic mechanics.

Web scraping a DOAJ search

While the DOAJ doesn’t have a search API, it does have a search bar which we can manipulate in our scraping. Let’s run a test search, view the HTML source of the result, and identify the elements we’re interested in. First, we visit doaj.org and type in a search. Note the URL:

doaj.org/doaj?func=search&template=&uiLanguage=en&query=librarianship

I’ve highlighted the key-value pairs in the URLs query string, making the keys bold and the values italicized. Here our search term was “librarianship” which is the value associated with the appropriately-named “query” key. If we change the word “librarianship” to a different search term and visit the new URL, we see results for the new term, predictably. With easily hackable URLs like this, it’s easy for us to write a web scraping script. Here’s the first half of our example in PHP:

<?php
// see http://simplehtmldom.sourceforge.net/manual_api.htm for documentation
require_once( 'simple_html_dom.php' );

$base = 'http://www.doaj.org/doaj?func=search&template=&uiLanguage=en&query=';
$query = urlencode( 'librarianship' );

$html = file_get_html( $base . $query );
// to be continued...
?>

So far, everything is straightforward. We insert the web scraping library we’re using, then use what we’ve figured out about the DOAJ URL structure: it has a base which won’t change and a query which we want to change according to our interests. You could have the query come from command-line arguments or web form data like the $_GET array in PHP, but let’s just keep it as a simple string.

We urlencode the string because we don’t want spaces or other illegal characters sneaking their way in there; while the script still works with $query = 'new librarianship' for example, using unencoded text in URLs is a bad habit to get into. Other functions, such as file_get_contents, will produce errors if passed a URL with spaces in it. On the other hand, urlencode( 'new librarianship' ) returns the appropriately encoded string “new+librarianship”. If you do take user input, remember to sanitize it before using it elsewhere.

For the second part, we need to investigate the HTML source of DOAJ’s search results page. Here’s a screenshot and a simplified example of what it looks like:

2 search results from the DOAJ

A couple search results from DOAJ for the term “librarianship”

<div id="result">
  <div class="record" id="record1">
    <div class="imageDiv">
      <img src="/doajImages/journal.gif"><br><span><small>Journal</small></span>
    </div><!-- END imageDiv -->
    <div class="data">
      <a href="/doaj?func=further&amp;passMe=http://www.collaborativelibrarianship.org">
        <b>Collaborative Librarianship</b>
      </a>
      <strong>ISSN/EISSN</strong>: 19437528
      <br><strong>Publisher</strong>: Regis University
      <br><strong>Subject</strong>:
      <a href="/doaj?func=subject&amp;cpId=129&amp;uiLanguage=en">Library and Information Science</a>
      <br><b>Country</b>: United States
      <b>Language</b>: English<br>
      <b>Start year</b> 2009<br>
      <b>Publication fee</b>:
    </div> <!-- END data -->
    <!-- ...more markup -->
  </div> <!-- END record -->
  <div class="recordColored" id="record2">
    <div class="imageDiv">
      <img src="/doajImages/article.png"><br><span><small>Article</small></span>
    </div><!-- END imageDiv -->
    <div class="data">
       <b>Mentoring for Emerging Careers in eScience Librarianship: An iSchool – Academic Library Partnership </b>
      <div style="color: #585858">
        <!-- author (s) -->
         <strong>Authors</strong>:
          <a href="/doaj?func=search&amp;query=au:&quot;Gail Steinhart&quot;">Gail Steinhart</a>
          ---
          <a href="/doaj?func=search&amp;query=au:&quot;Jian Qin&quot;">Jian Qin</a><br>
        <strong>Journal</strong>: <a href="/doaj?func=issues&amp;jId=88616">Journal of eScience Librarianship</a>
        <strong>ISSN/EISSN</strong>: 21613974
        <strong>Year</strong>: 2012
        <strong>Volume</strong>: 1
        <strong>Issue</strong>: 3
        <strong>Pages</strong>: 120-133
        <br><strong>Publisher</strong>: University of Massachusetts Medical School
      </div><!-- End color #585858 -->
    </div> <!-- END data -->
    <!-- ...more markup -->
   </div> <!-- END record -->
   <!-- more records -->
</div> <!-- END results list -->

Even with much markup removed, there’s a lot going on here. We need to zone in on what’s interesting and find patterns in the markup that help us retrieve it. While it may not be obvious from the example above, the title of each search result is contained in a <b> tag towards the beginning of each record (lines 8 and 26 above).

Here’s a sketch of the element hierarchy leading to the title: a <div> with id=”result” > a <div> with a class of either “record” or “recordColored” > a <div> with a class of “data” > possibly an <a> tag (present in the first example, absent in the second) > the <b> tag containing the title. Noting the conditional parts of this hierarchy is important; if we didn’t note that sometimes an <a> tag is present and that the class can be either “record” or “recordColored”, we wouldn’t be getting all the items we want.

Let’s try to return the titles of all search results on the first page. We can use Simple HTML DOM’s find method to extract the content of specific elements using CSS selectors. Now that we know how the results are structured, we can write a more complete example:

<?php
require_once( 'simple_html_dom.php' );

$base = 'http://www.doaj.org/doaj?func=search&template=&uiLanguage=en&query=';
$query = urlencode( 'librarianship' );

$html = file_get_html( $base . $query );

// using our knowledge of the DOAJ results page
$records = $html->find( '.record .data, .recordColored .data' );

foreach( $records as $record ) {
  echo $record->getElementsByTagName( 'b', 0 )->plaintext . PHP_EOL;
}
?>

The beginning remains the same, but this time we actually do something with the HTML. We use find to pull the records which have class “data.” Then we echo the first <b> tag’s text content. The getElementsByTagName method typically returns an array, but if you pass a second integer parameter it returns the array element at that index (0 being the first element in the array, because computer scientists count from zero). The ->plaintext property simply contains the text found in the element, if we echoed the element itself we would see opening and closing <b> tags wrapped around the title. Finally, we append an “end-of-line” (EOL) character just to make the output easier to read.

To see our results, we can run our script on the command line. For Linux or Mac users, that likely means merely opening a terminal (in Applications/Utilities on a Mac) since they come with PHP pre-installed. On Windows, you may need to use WAMP or XAMPP to run PHP scripts. XAMPP gives you a “shell” button to open a terminal, while you can put the PHP executable in your Windows environment variables if you’re using WAMP.

Once you have a terminal open, the php command will execute whatever PHP script you pass it as a parameter. If we run php name-of-our-script.php in the same directory as our script, we see ten search result titles printed to the terminal:

> php doaj-search.php
Collaborative Librarianship
Mentoring for Emerging Careers in eScience Librarianship: An iSchool – Academic Library Partnership
Education for Librarianship in Turkey Education for Librarianship in Turkey
Turkish Librarianship: A Selected Bibliography Turkish Librarianship: A Selected Bibliography
Journal of eScience Librarianship
Editorial: Our Philosophies of Librarianship
Embedded Academic Librarianship: A Review of the Literature
Model Curriculum for 'Oriental Librarianship' in India
A General Outlook on Turkish Librarianship and Libraries
The understanding of subject headings among students of librarianship

This is a simple, not-too-useful example. But it could expanded in many ways. Try copying the script above and attempting some of the following:

  • Make the script return more than the ten items on the first page of results
  • Use some of DOAJ’s advanced search functions, for instance a date limiter
  • Only return journals or articles, not both
  • Return more than just the title of results, for instance the author(s), URLs, or publication date

Accomplishing these tasks involves learning more about DOAJ’s URL and markup structure, but also learning more about the scraping library you’re using.

Common Problems

There are a couple possible hangups when web scraping. First of all, many websites employ user-agent sniffing to serve different versions of themselves to different devices. A user agent is a hideous string of text which web browsers and other HTTP clients use to identify themselves.2 If a site misinterprets our script’s user agent, we may end up on a mobile or other version of a site instead of the desktop one we were expecting. Worse yet, some sites try to prevent scraping by blacklisting certain user agents.

Luckily, most web scraping libraries have tools built in to work around this problem. A nice example is Ruby’s Mechanize, which has an agent.user_agent_alias property which can be set to a number of popular web browsers. When using an alias, our script essentially tells the responding web server that it’s a common desktop browser and thus is more likely to get a standard response.

It’s also routine that we’ll want to scrape something behind authentication. While IP authentication can be circumvented by running scripts from an on-campus connection, other sites may require login credentials. Again, most web scraping libraries already have built-in tools for handling authentication. We can find which form controls on the page we need to fill in, insert your username and password into the form, and then submit it programmatically. Storing a login in a plain text script is never a good idea though, so be careful.

Considerations

Not all web scraping is legitimate. Taking data which is copyrighted and merely re-displaying it on our site without proper attribution is not only illegal, it’s just not being a good citizen of the web. The Wikipedia article on web scraping has a lengthy section on legal issues with a few historical cases from various countries.

It’s worth noting that web scraping can be very brittle, meaning it breaks often and easily. Scraping typically relies on other people’s markup to remain consistent. If just a little piece of HTML changes, our entire script might be thrown off, looking for elements that no longer exist.

One way to counteract this is to write selectors which are as broad as possible. For instance, let’s return to the DOAJ search results markup. Why did we use such a concise CSS selector to find the title when we could have been much more specific? Here’s a more explicit way of getting the same data:

$html->find( 'div#result > div.record > div.data, div#result > div.recordColored > div.data' );

What’s wrong with these selectors? We’re relying on so much more to stay the same. We need: the result wrapper to be a <div>, the result wrapper to have an id of “result”, the record to be a <div>, and the data inside the record to be a <div>. Our use of the child selector “>” means we need the element hierarchy to stay precisely the same. If any of these properties of the DOAJ markup changed, our selector wouldn’t find anything and our script would need to be updated. Meanwhile, our much more generic line still grabs the right information because it doesn’t depend on particular tags or other aspects of the markup remaining constant:

$html->find( '.record .data, .recordColored .data' );

We’re still relying on a few things—we have to, there’s no getting around that in web scraping—but a lot could change and we’d be set. If the DOAJ upgraded to HTML5 tags, swapping out <div> for <article> or <section>, we would be OK. If the wrapping <div> was removed, or had its id change, we’d be OK. If a new wrapper was inserted in between the “data” and “record” <div>, we’d be OK. Our approach is more resilient.

If you did try running our PHP script, you probably noticed it was rather slow. It’s not like typing a query into Google and seeing results immediately. We have to request a page from an external site, which then queries its backend database, processes the results, and displays HTML which we ultimately don’t use, at least not as intended. This highlights that web scraping isn’t a great option for user-facing searches; it can take too long to return results. One option is to cache searches, for instance storing results of previous scrapings in a database and then checking to see if the database has something relevant before resorting to pulling content off an external site.

It’s also worth noting that web scraping projects should try to be reasonable about the number of times they request an external resource. Every time our script pulls in a site’s HTML, it’s another request that site’s server has to process. A site may not have an API because it cannot handle the amount of traffic one would attract. If our web scraping project is going to be sending thousands of requests per hour, we should consider how reasonable that is. A simple email to the third party explaining what we’re doing and the amount of traffic it may generate is a nice courtesy.

Overall, web scraping is handy in certain situations (see below) or for scripts which are run seldom or a single time. For instance, if we’re doing an analysis of faculty citations at our institution, we might not have access to a raw list of citations. But faculty may have university web pages where they list all their publications in a consistent format. We could write a script which only needs to run once, culling a large list of citations for analysis. Once we’ve scraped that information, you could use OpenRefine or other power tools to extract particular journal titles or whatever else we’re interested in.

How is web scraping used in libraries?

I asked Twitter what other libraries are using web scraping for and got a few replies:

@phette23 Pulling working papers off a departmental website for the inst repo. Had to web scrape for metadata.
— Ondatra libskoolicus (@LibSkrat) September 25, 2013

Matthew Reidsma of Grand Valley State University also had several examples:

To fuel a live laptop/iPad availability site by scraping holdings information from the catalog. See the availability site as well as the availability charts for the last few days and the underlying code which does the scraping. This uses the same Simple HTML Dom library as our example above.

It’s also used to create a staff API by scraping the GVSU Library’s Staff Directory and reformatting it; see the code and the result. The result may not look very readable—it’s JSON, a common data format that’s particularly easy to reuse in some languages such as JavaScript—but remember that APIs are for machine-readable data which can be easily reused by programs, not people.

Jacqueline Hettel of Stanford University has a great blog post that describes using a Google Chrome extension and XPath queries to scrape acknowledgments from humanities monographs in Google Books; no coding required! She and Chris Bourg are presenting their results at the Digital Library Federation in November.

Finally, I use web scraping to pull hours information from our main library site into our mobile version. I got tired of updating the hours in two places every time they changed, so now I pull them in using a PHP script. It’s worth noting that this dual-maintenance annoyance is one major reason websites can and should be done in responsive designs.

Most of these library examples are good uses of web scraping because they involve simply transporting our data from one system to another; scraping information from the catalog to display it elsewhere is a prime use case. We own the data, so there are no intellectual property issues, and they’re our own servers so we’re responsible for keeping them up.

Code Libraries

While we’ve used PHP above, there’s no need to limit ourselves to a particular programming language. Here’s a set of popular web scraping choices in a few languages:

To provide a sense of how the different tools above work, I’ve written a series of gists which uses each to scrape titles from the first page of a DOAJ search.

Notes
  1. See the NCSU or Stanford library websites for examples of this search style. Essentially, results from several different search engines—a catalog, databases, the library website, study guides—are all displaying on the same page in seperate “bento” compartments.
  2. The browser I’m in right now, Chrome, has this beauty for a user agent string: “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36″. Yes, that’s right: Mozilla, Mac, AppleWebKit, KHTML, Gecko, Chrome, & Safari all make an appearance.

Demystifying Programming

We talk quite a bit about code here at Tech Connect and it’s not unusual to see snippets of it pasted into a post. But most of us, indeed most librarians, aren’t professional programmers or full-time developers; we had to learn like everyone else. Depending on your background, some parts of coding will be easy to pick up while others won’t make sense for years. Here’s an attempt to explain the fundamental building blocks of programming languages.

The Languages

There are a number of popular programming languages: C, C#, C++, Java, JavaScript, Objective C, Perl, PHP, Python, and Ruby. There are numerous others, but this semi-arbitrary selection cover the ones most commonly in use. It’s important to know that each programming language requires its own software to run. You can write Python code into a text file on a machine that doesn’t have the Python interpreter installed, but you can’t execute it and see the results.

A lot of learners stress over which language to learn first unnecessarily. Once you’ve picked up one language, you’ll understand all of the foundational pieces listed below. Then you’ll be able to transition quickly to another language by understanding a few syntax changes: Oh, in JavaScript I write function myFunction(x) to define a function, while in Python I write def myFunction(x). Programming languages differ in other ways too, but knowing the basics of one provides a huge head start on learning the basics of any other.

Finally, it’s worth briefly distinguishing compiled versus interpreted languages. Code written in a compiled language, such as all the capital C languages and Java, must first be passed to a compiler program which then spits out an executable—think a file ending if .exe if you’re on Windows—that will run the code. Interpreted languages, like Perl, PHP, Python, and Ruby, are quicker to program in because you just pass your code along to an interpreter program which immediately executes it. There’s one fewer step: for a compiled language you need to write code, generate an executable, and then run the executable while interpreted languages sort of skip that middle step.

Compiled languages tend to run faster (i.e. perform more actions or computations in a given amount of time) than interpreted ones, while interpreted ones tend to be easier to learn and more lenient towards the programmer. Again, it doesn’t matter too much which you start out with.

Variables

Variables are just like variables in algebra; they’re names which stand in for some value. In algebra, you might write:

x = 10 + 3

which is also valid code in many programming languages. Later on, if you used the value of x, it would be 13.

The biggest difference between variables in math and in programming is that programming variables can be all sort of things, not just numbers. They can be strings of text, for instance. Below, we combine two pieces of text which were stored in variables:

name = 'cat'
mood = ' is laughing'
both = name + mood

In the above code, both would have a value of ‘cat is laughing’. Note that text strings have to be wrapped in quotes—often either double or single quotes is acceptable—in order to distinguish them from the rest of the code. We also see above that variables can be the product of other variables.

Comments

Comments are pieces of text inside a program which are not interpreted as code. Why would you want to do that? Well, comments are very useful for documenting what’s going on in your code. Even if your code is never going to be seen by anyone else, writing comments helps understand what’s going on if you return to a project after not thinking about it for a while.

// This is a comment in JavaScript; code is below.
number = 5;
// And a second comment!

As seen above, comments typically work by having some special character(s) at the beginning of the line which tells the programming language that the rest of the line can be ignored. Common characters that indicate a line is a comment are # (Python, Ruby), // (C languages, Java, JavaScript, PHP), and /* (CSS, multi-line blocks of comments in many other languages).

Functions

As with variables, functions are akin to those in math: they take an input, perform some calculations with it, and return an output. In math, we might see:

f(x) = (x * 3)/4

f(8) = 6

Here, the first line is a function definition. It defines how many parameters can be passed to the function and what it will do with them. The second line is more akin to a function execution. It shows that the function returns the value 6 when passed the parameter 8. This is really, really close to programming already. Here’s the math above written in Python:

def f(x):
  return (x * 3)/4

f(8)
# which returns the number 6

Programming functions differ from mathematical ones in much the same way variables do: they’re not limited to accepting and producing numbers. They can take all sorts of data—including text—process it, and then return another sort of data. For instance, virtually all programming languages allow you to find the length of a text string using a function. This function takes text input and outputs a number. The combinations are endless! Here’s how that looks in Python:

len('how long?')
# returns the number 9

Python abbreviates the word “length” to simply “len” here, and we pass the text “how long?” to the function instead of a number.

Combining variables and functions, we might store the result of running a function in a variable, e.g. y = f(8) would store the value 6 in the variable y if f(x) is the same as above. This may seem silly—why don’t you just write y = 6 if that’s what you want!—but functions help by abstracting out blocks of code so you can reuse them over and over again.

Consider a program you’re writing to manage the e-resource URLs in your catalog, which are stored in MARC field 856 subfield U. You might have a variable named num_URLs (variable names can’t have spaces, thus the underscore) which represents the number of 856 $u subfields a record has. But as you work on records, that value is going to change; rather than manually calculate it each time and set num_URLs = 3 or num_URLs = 2 you can write a function to do this for you. Each time you pass the function a bibliographic record, it will return the number of 856 $u fields, substantially reducing how much repetitive code you have to write.

Conditionals

Many readers are probably familiar with IFTTT, the “IF This Then That” web service which can glue together various accounts, for instance “If I post a new photo to Instagram, then save it to my Dropbox backup folder.” These sorts of logical connections are essential to programming, because often whether or not you perform a particular action varies depending on some other condition.

Consider a program which counts the number of books by Virginia Woolf in your catalog. You want to count a book only if the author is Virginia Woolf. You can use Ruby code like this:

if author == 'Virginia Woolf'
  total = total + 1
end

There are three parts here: first we specify a condition, then there’s some code which runs only if the condition is true, and then we end the condition. Without some kind of indication that the block of code inside the condition has ended, the entire rest of our program would only run depending on if the variable author was set to the right string of text.

The == is definitely weird to see for the first time. Why two equals? Many programming languages use a variety of double-character comparisons because the single equals already has a meaning: single equals assigns a value to a variable (see the second line of the example above) while double-equals compares two values. There are other common comparisons:

  • != often means “is not equal to”
  • > and < are the typical greater or lesser than
  • >= and <= often mean “greater/lesser than or equal to”

Those can look weird at first, and indeed one of the more common mistakes (made by professionals and newbies alike!) is accidentally putting a single equals instead of a double.[1] While we’re on the topic of strange double-character equals signs, it’s worth pointing out that += and -= are also commonly seen in programming languages. These pairs of symbols respectively add or subtract a given number from a variable, so they do assign a value but they alter it slightly. For instance, above I could have written total += 1 which is identical in outcome as total = total + 1.

Lastly, conditional statements can be far more sophisticated than a mere “if this do that.” You can write code that says “if blah do this, but if bleh do that, and if neither do something else.” Here’s a Ruby script that would count books by Virginia Woolf, books by Ralph Ellison, and books by someone other than those two.

total_vw = 0
total_re = 0
total_others = 0
if author == 'Virginia Woolf'
  total_vw += 1
elsif author == 'Ralph Ellison'
  total_re += 1
else
  total_others += 1
end

Here, we set all three of our totals to zero first, then check to see what the current value of author is, adding one to the appropriate total using a three-part conditional statement. The elsif is short for “else if” and that condition is only tested if the first if wasn’t true. If neither of the first two conditions is true, our else section serves as a kind of fallback.

Arrays

An array is simply a list of variables, in fact the Python language has an array-like data type named “list.” They’re commonly denoted with square brackets, e.g. in Python a list looks like

stuff = [ "dog", "cat", "tree"]

Later, if I want to retrieve a single piece of the array, I just access it using its index wrapped in square brackets, starting from the number zero. Extending the Python example above:

stuff[0]
# returns "dog"
stuff[2]
# returns "tree"

Many programming languages also support associative arrays, in which the index values are strings instead of numbers. For instance, here’s an associative array in PHP:

$stuff = array(
  "awesome" => "sauce",
  "moderate" => "spice",
  "mediocre" => "condiment",
);
echo $stuff["mediocre"];
// prints out "condiment"

Arrays are useful for storing large groups of like items: instead of having three variables, which requires more typing and remembering names, we have just have one array containing everything. While our three natural numbers aren’t a lot to keep track of, imagine a program which deals with all the records in a library catalog, or all the search results returned from a query: having an array to store that large list of items suddenly becomes essential.

Loops

Loops repeat an action a set number of times or until a condition is met. Arrays are commonly combined with loops, since loops make it easy to repeat the same operation on each item in an array. Here’s a concise example in Python which prints every entry in the “names” array to the screen:

names = ['Joebob', 'Suebob', 'Bobob']
for name in names:
  print name

Without arrays and loops, we’d have to write:

name1 = 'Joebob'
name2 = 'Suebob'
name3 = 'Bobob'
print name1
print name2
print name3

You see how useful arrays are? As we’ve seen with both functions and arrays, programming languages like to expose tools that help you repeat lots of operations without typing too much text.

There are a few types of loops, including “for” loops and “while” loops loops. Our “for” loop earlier went through a whole array, printing each item out, but a “while” loop only keeps repeating while some condition is true. Here is a bit of PHP that prints out the first four natural numbers:

$counter = 1;
while ( $counter < 5 ) {
  echo $counter;
  $counter = $counter + 1;
}

Each time we go through the loop, the counter is increased by one. When it hits five, the loop stops. But be careful! If we left off the $counter = $counter + 1 line then the loop would never finish because the while condition would never be false. Infinite loops are another potential bug in a program.

Objects & Object-Oriented Programming

Object-oriented programming (oft-abbreviated OOP) is probably the toughest item in this post to explain, which is why I’d rather people see it in action by trying out Codecademy than read about it. Unfortunately, it’s not until the end of the JavaScript track that you really get to work with OOP, but it gives you a good sense of what it looks like in practice.

In general, objects are simply a means of organizing code. You can group related variables and functions under an object. You make an object inherit properties from another one, if it needs to use all the same variables and functions but also add some of its own.

For example, let’s say we have a program that deals with a series of people, each of which have a few properties like their name and age but also the ability to say hi. We can create a people class which is kind of like a template; it helps us stamp out new copies of objects without rewriting the same code over and over. Here’s an example in JavaScript:

function Person(name, age) {
  this.name = name;
  this.age = age;
  this.sayHi = function() {
    console.log("Hi, I'm " + name + ".");
  };
}

Joebob = new Person('Joebob', 39);
Suebob = new Person('Suebob', 40);
Bobob = new Person('Bobob', 3);
Bobob.sayHi();
// prints "Hi, I'm Bobob."
Suebob.sayHi();
// prints "Hi, I'm Suebob."

Our Person function is essentially a class here; it allows us to quickly create three people who are all objects with the same structure, yet they have unique values for their name and age.[2] The code is a bit complicated and JavaScript isn’t a great example, but basically think of this: if we wanted to do this without objects, we’d end up repeating the content of the Person block of code three times over.

The efficiency gained with objects is similar to how functions save us from writing lots of redundant code; identifying common structures and grouping them together under an object makes our code more concise and easier to maintain as we add new features. For instance, if we wanted to add a myAgeIs function that prints out the person’s age, we could just add it to the Person class and then all our people objects would be able to use it.

Modules & Libraries

Lest you worry that every little detail in your programs must be written from scratch, I should mention that all popular programming languages have mechanisms which allow you to reuse others’ code. Practically, this means that most projects start out by identifying a few fundamental building blocks which already exist. For instance, parsing MARC data is a non-trivial task which takes some serious knowledge both of the data structure and the programming language you’re using. Luckily, we don’t need to write a MARC parsing program on our own, because several exist already:

The Code4Lib wiki has an even more extensive list of options.

In general, it’s best to reuse as much prior work as possible rather than spend time working on problems that have already been solved. Complicated tasks like writing a full-fledged web application take a lot of time and expertise, but code libraries already exist for this. Particularly when you’re learning, it can be rewarding to use a major, well-developed project at first to get a sense of what’s possible with programming.

Attention to Detail

The biggest hangup for new programmers often isn’t conceptual: variables, functions, and these other constructs are all rather intuitive, especially once you’ve tried them a few times. Instead, many newcomers find out that programming languages are very literal and unyielding. They can’t read your mind and are happy to simply give up and spit out errors if they can’t understand what you’re trying to do.

For instance, earlier I mentioned that text variables are usually wrapped in quotes. What happens if I forget an end quote? Depending on the language, the program may either just tell you there’s an error or it might badly misinterpret your code, treating everything from your open quote down to the next instance of a quote mark as one big chunk of variable text. Similarly, accidentally misusing double equals or single equals or any of the other arcane combinations of mathematical symbols can have disastrous results.

Once you’ve worked with code a little, you’ll start to pick up tools that ease a lot of minor issues. Most code editors use syntax highlighting to distinguish different constructs  which helps to aid in error recognition. This very post uses a syntax highlighter for WordPress to color keywords like “function” and distinguish variable names. Other tools can “lint” your code for mistakes or code which, while technically valid, can easily lead to trouble. The text editor I commonly use does wonderful little things like provide closing quotes and parens, highlight lines which don’t pass linting tests, and enable me to test-run selected snippets of code.

There’s lots more…

Code isn’t magic; coders aren’t wizards. Yes, there’s a lot to programming and one can devote a lifetime to its study and practice. There are also thousands of resources available for learning, from MOOCs to books to workshops for beginners. With just a few building blocks like the ones described in this post, you can write useful code which helps you in your work.

Footnotes

[1]^ True story: while writing the very next example, I made this mistake.

[2]^ Functions which create objects are called constructor functions, which is another bit of jargon you probably don’t need to know if you’re just getting started.


Fear No Longer Regular Expressions

Regex, it’s your friend

You may have heard the term, “regular expressions” before. If you have, you will know that it usually comes in a notation that is quite hard to make out like this:

(?=^[0-5\- ]+$)(?!.*0123)\d{3}-\d{3,4}-\d{4}

Despite its appearance, regular expressions (regex) is an extremely useful tool to clean up and/or manipulate textual data.  I will show you an example that is easy to understand. Don’t worry if you can’t make sense of the regex above. We can talk about the crazy metacharacters and other confusing regex notations later. But hopefully, this example will help you appreciate the power of regex and give you some ideas of how to make use of regex to make your everyday library work easier.

What regex can do for you – an example

I looked for the zipcodes for all cities in Miami-Dade County and found a very nice PDF online (http://www.miamifocusgroup.com/usertpl/1vg116-three-column/miami-dade-zip-codes-and-map.pdf). But when I copy and paste the text from the PDF file to my text editor (Sublime), the format immediately goes haywire. The county name, ‘Miami-Dade,’ which should be in one line becomes three lines, each of which lists Miami, -, Dade.

Ugh, right? I do not like this one bit.  So let’s fix this using regular expressions.

(**Click the images to bring up the larger version.)

Screen Shot 2013-07-24 at 1.43.19 PM

Screen Shot 2013-07-24 at 1.43.39 PM

Like many text editors, Sublime offers the find/replace with regex feature. This is a much powerful tool than the usual find/replace in MS Word, for example, because it allows you to match many complex cases that fit a pattern. Regular expressions lets you specify that pattern.

In this example, I capture the three lines each saying Miami,-,Dade with this regex:

Miami\n-\nDade.

When I enter this into the ‘Find What’ field, Sublime starts highlighting all the matches.  I am sure you already guessed that \n means a new line. Now let me enter Miami-Dade in the ‘Replace With’ field and hit ‘Replace All.’

Screen Shot 2013-07-24 at 2.11.43 PM

As you can see below, things are much better now. But I  want each set of three lines – Miami-Dade, zipcode, and city – to be one line and each element to be separated by comma and a space such as ‘Miami-Dade, 33010, Hialeah’. So let’s do some more magic with regex.

Screen Shot 2013-07-24 at 2.18.17 PM

How do I describe the pattern of three lines – Miami-Dade, zipcode, and city? When I look at the PDF, I notice that the zipcode is a 5 digit number and the city name consists of alphabet characters and space. I don’t see any hypen or comma in the city name in this particular case. And the first line is always ‘Miami-Dade.” So the following regular expression captures this pattern.

Miami-Dade\n\d{5}\n[A-Za-z ]+

Can you guess what this means? You already know that \n means a new line. \d{5} means a 5 digit number. So it will match 33013, 33149, 98765, or any number that consists of five digits.  [A-Za-z ] means any alphabet character either in upper or lower case or space (N.B. Note the space at the end right before ‘]’).

Anything that goes inside [ ] is one character. Just like \d is one digit. So I need to specify how many of the characters are to be matched. if I put {5}, as I did in \d{5}, it will only match a city name that has five characters like ‘Miami,’ The pattern should match any length of city name as long as it is not zero. The + sign does that. [A-Za-z ]+ means that any alphabet character either in upper or lower case or space should appear at least or more than once. (N.B. * and ? are also quantifiers like +. See the metacharacter table below to find out what they do.)

Now I hit the “Find” button, and we can see the pattern worked as I intended. Hurrah!

Screen Shot 2013-07-24 at 2.24.47 PM

Now, let’s make these three lines one line each. One great thing about regex is that you can refer back to matched items. This is really useful for text manipulation. But in order to use the backreference feature in regex, we have to group the items with parentheses. So let’s change our regex to something like this:

(Miami-Dade)\n\(d{5})\n([A-Za-z ]+)

This regex shows three groups separated by a new line (\n). You will see that Sublime still matches the same three line sets in the text file. But now that we have grouped the units we want – county name, zipcode, and city name – we can refer back to them in the ‘Replace With’ field. There were three units, and each unit can be referred by backslash and the order of appearance. So the county name is \1, zipcode is \2, and the city name is \3. Since we want them to be all in one line and separated by a comma and a space, the following expression will work for our purpose. (N.B. Usually you can have up to nine backreferences in total from \1 to\9. So if you want to backreference the later group, you can opt not to create a backreference from a group by using (?: ) instead of (). )

\1, \2, \3

Do a few Replaces and then if satisfied, hit ‘Replace All’.

Ta-da! It’s magic.

Screen Shot 2013-07-24 at 2.54.13 PM

Regex Metacharacters

Regex notations look a bit funky. But it’s worth learning them since they enable you to specify a general pattern that can match many different cases that you cannot catch without the regular expression.

We have already learned the four regex metacharacters: \n, \d, { }, (). Not surprisingly, there are many more beyond these. Below is a pretty extensive list of regex metacharacters, which I borrowed from the regex tutorial here: http://www.hscripts.com/tutorials/regular-expression/metacharacter-list.php . I also highly recommend this one-page Regex cheat sheet from MIT (http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf).

Note that \w will match not only a alphabetical character but also an underscore and a number. For example, \w+ matches Little999Prince_1892. Also remember that a small number of regular expression notations can vary depending on what programming language you use such as Perl, JavaScript, PHP, Ruby, or Python.

Metacharacter Description
\ Specifies the next character as either a special character, a literal, a back reference, or an octal escape.
^ Matches the position at the beginning of the input string.
$ Matches the position at the end of the input string.
* Matches the preceding subexpression zero or more times.
+ Matches the preceding subexpression one or more times.
? Matches the preceding subexpression zero or one time.
{n} Matches exactly n times, where n is a non-negative integer.
{n,} Matches at least n times, n is a non-negative integer.
{n,m} Matches at least n and at most m times, where m and n are non-negative integers and n <= m.
. Matches any single character except “\n”.
[xyz] A character set. Matches any one of the enclosed characters.
x|y Matches either x or y.
[^xyz] A negative character set. Matches any character not enclosed.
[a-z] A range of characters. Matches any character in the specified range.
[^a-z] A negative range characters. Matches any character not in the specified range.
\b Matches a word boundary, that is, the position between a word and a space.
\B Matches a nonword boundary. ‘er\B’ matches the ‘er’ in “verb” but not the ‘er’ in “never”.
\d Matches a digit character.
\D Matches a non-digit character.
\f Matches a form-feed character.
\n Matches a newline character.
\r Matches a carriage return character.
\s Matches any whitespace character including space, tab, form-feed, etc.
\S Matches any non-whitespace character.
\t Matches a tab character.
\v Matches a vertical tab character.
\w Matches any word character including underscore.
\W Matches any non-word character.
\un Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol
Matching modes

You also need to know about the Regex matching modes. In order to use these modes, you write your regex as shown above, and then at the end you add one or more of these modes. Note that in text editors, these options often appear as checkboxes and may apply without you doing anything by default.

For example, [d]\w+[g] will match only the three lower case words in ding DONG dang DING dong DANG. On the other hand, [d]\w+[g]/i will match all six words whether they are in the upper or the lower case.

Look-ahead and Look-behind

There are also the ‘look-ahead’ and the ‘look-behind’ features in regular expressions. These often cause confusion and are considered to be a tricky part of regex. So, let me show you a simple example of how it can be used.

Below are several lines of a person’s last name, first name, middle name, separated by his or her department name. You can see that this is a snippet from a csv file. The problem is that a value in one field – the department name- also includes a comma, which is supposed to appear only between different fields not inside a field. So the comma becomes an unreliable separator. One way to solve this issue is to convert this csv file into a tab limited file, that is, using a tab instead of a comma as a field separater. That means that I need to replace all commas with tabs ‘except those commas that appear inside a department field.’

How do I achieve that? Luckily, the commas inside the department field value are all followed by a space character whereas the separator commas in between different fields are not so. Using the negative look-ahead regex, I can successfully specify the pattern of a comma that is not followed by (?!) a space \s.

,(?!\s)

Below, you can see that this regex matches all commas except those that are followed by a space.

lookbehind

For another example, the positive look-ahead regex, Ham(?=burg), will match ‘Ham‘ in Hamburg when it is applied to the text:  Hamilton, Hamburg, Hamlet, Hammock.

Below are the complete look-ahead and look-behind notations both positive and negative.

  • (?=pattern)is a positive look-ahead assertion
  • (?!pattern)is a negative look-ahead assertion
  • (?<=pattern)is a positive look-behind assertion
  • (?<!pattern)is a negative look-behind assertion

Can you think of any example where you can successfully apply a look-behind regular expression? (No? Then check out this page for more examples: http://www.rexegg.com/regex-lookarounds.html)

Now that we have covered even the look-ahead and the look-behind, you should be ready to tackle the very first crazy-looking regex that I introduced in the beginning of this post.

(?=^[0-5\- ]+$)(?!.*0123)\d{3}-\d{3,4}-\d{4}

Tell me what this will match! Post in the comment below and be proud of yourself.

More tools and resources for practicing regular expressions

There are many tools and resources out there that can help you practice regular expressions. Text editors such as EditPad Pro (Windows), Sublime, TextWrangler (Mac OS), Vi, EMacs all provide regex support. Wikipedia (https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features) offers a useful comparison chart of many text editors you can refer to. RegexPal.com is a convenient online Javascript Regex tester. FireFox also has Regular Expressions add-on (https://addons.mozilla.org/en-US/firefox/addon/rext/).

For more tools and resources, check out “Regular Expressions: 30 Useful Tools and Resources” http://www.hongkiat.com/blog/regular-expression-tools-resources/.

Library problems you can solve with regex

The best way to learn regex is to start using it right away every time you run into a problem that can be solved faster with regex. What library problem can you solve with regular expressions? What problem did you solve with regular expressions? I use regex often to clean up or manipulate large data. Suppose you have 500 links and you need to add either EZproxy suffix or prefix to each. With regex, you can get this done in a matter of a minute.

To give you an idea, I will wrap up this post with some regex use cases several librarians generously shared with me. (Big thanks to the librarians who shared their regex use cases through Twitter! )

  • Some ebook vendors don’t alert you to new (or removed!) books in their collections but do have on their website a big A-Z list of all of their titles. For each such vendor, each month, I run a script that downloads that page’s HTML, and uses a regex to identify the lines that have ebook links in them. It uses another regex to extract the useful data from those lines, like URL and Title. I compare the resulting spreadsheet against last month’s (using a tool like diff or vimdiff) to discover what has changed, and modify the catalog accordingly. (@zemkat)
  • Sometimes when I am cross-walking data into a MARC record, I find fields that includes irregular spacing that may have been used for alignment in the old setting but just looks weird in the new one. I use a regex to convert instances of more than two spaces into just two spaces. (@zemkat)
  • Recently while loading e-resource records for government documents, we noted that many of them were items that we had duplicated in fiche: the ones with a call number of the pattern “H, S, or J, followed directly by four digits”. We are identifying those duplicates using a regex for that pattern, and making holdings for those at the same time. (@zemkat)
  • I used regex as part of crosswalking metadata schemas in XML. I changed scientific OME-XML into MODS + METS to include research images into the library catalog.  (@KristinBriney)
  • Parsing MARC just has to be one. Simple things like getting the GMD out of a 245 field require an expression like this: |h\[(.*)\]  MARCedit supports regex search and replace, which catalogers can use. (@phette23)
  • I used regex to adjust the printed label in Millennium depending on several factors, including the material type in the MARC record. (@brianslone )
  • I strip out non-Unicode characters from the transformed finding aids daily with  regex and  replace them with Unicode equivalents. (@bryjbrown)