A Librarian’s Guide to OpenRefine
Posted: May 1, 2013 | Author: Margaret Heller | Filed under: coding, data, workflow | Tags: data, google refine, openrefine
Academic librarians working in technical roles may rarely see stacks of books, but they doubtless see messy digital data on a daily basis. OpenRefine is an extremely useful tool for dealing with this data without sophisticated scripting skills and with a very low learning curve. Once you learn a few tricks with it, you may never need to force a student worker to copy and paste items onto Excel spreadsheets.
As this comparison by the creator of OpenRefine shows, the best use for the tool is to explore and transform data, and it allows you to make edits to many cells and rows at once while still seeing your data. This allows you to experiment and undo mistakes easily, which is a great advantage over databases or scripting where you can’t always see what’s happening or undo the typo you made. It’s also a lot faster than editing cell by cell like you would do with a spreadsheet.
Here’s an example of a project that took hours when I did it in a spreadsheet, and a lot less time when I redid it in Google Refine. One of the quickest things to do with OpenRefine is spot words or phrases that are almost the same, and possibly are the same thing. Recently I needed to turn a large export of data from the catalog into data that I could load into my institutional repository. There were only certain allowed values that could be used in the controlled vocabulary in the repository, so I had to modify the bibliographic data from the catalog (which was of course in more or less proper AACR2 style) to match the vocabularies available in the repository. The problem was that the data I had wasn’t consistent: there were multiple types of abbreviations, extra spaces, extra punctuation, and outright misspellings. An example is the History Department. I can look at “Department of History”, “Dep. of History”, “Dep of Hist.” and tell these are probably all referring to the same thing, but it’s difficult to predict those potential spellings. While I could deal with much of this with regular expressions in a text editor and find and replace in Excel, I kept running into additional problems that I couldn’t spot until I got an error. It took several attempts at loading the data before I cleared out all the errors.
In OpenRefine this is a much simpler task, since you can use it to find everything that probably is the same thing despite slight differences in spelling, punctuation, and spacing. So rather than trying to write a regular expression that accounts for all the differences between “Department of History”, “Dep. of History”, “Dep of Hist.”, you can find all the clusters of text that include those elements and change them all in one shot to “History”. I will have more detailed instructions on how to do this below.
Installation and Basics
Until last October, OpenRefine was called Google Refine, and while content is being moved from the Google Refine page to the OpenRefine page, you should plan to look at both sites. Documentation and video tutorials refer interchangeably to Google Refine and OpenRefine. The official and current documentation is on the OpenRefine GitHub wiki. For specific questions you will probably want to use the OpenRefine Custom Search Engine, which brings together the mix of documentation and tutorials on the web. OpenRefine is a web app that runs on your computer, so you don’t need an internet connection to run it. You can get the installation instructions on this page.
While you can jump in right away and get started playing around, it is well worth your time to watch the tutorial videos, which will cover the basic actions you need to take to start working with data. As I said, the learning curve is low, but not all of the commands will make sense until you see them in action. These videos will also give you an idea of what you might be able to do with a data set you have lying around. You may also want to browse the “recipes” on the OpenRefine site, as well as search online for additional interesting things people have done. You will probably think of more ideas about what to try. The most important thing to know about OpenRefine is that you can undo anything, and go back to the beginning of the project before you messed up.
A basic understanding of the Google Refine Expression Language, or GREL, will improve your ability to work with data. There isn’t a whole lot of detailed documentation, so you should feel free to experiment and see what happens when you try different functions. You will see from the tutorial videos the basics you need to know. Another essential tool is regular expressions. So much of the data you will be starting with is structured data (even if it’s not perfectly structured) that you will need to turn into something else. Regular expressions help you find patterns which you can use to break apart strings into something else. Spending a few minutes understanding regular expression syntax will save hours of inefficient find and replace. There are many tutorials, and my go-to source is this one. The good news for librarians is that if you can construct a Dewey Decimal call number, you can construct a regular expression!
Some ideas for librarians
(A) Typos
Above I described how you would use OpenRefine to clean up messy and inconsistent catalog data. Here’s how to do it. Load in the data, and select “Text Facet” on the column in question. OpenRefine will show clusters of text that are similar and probably the same thing.
Click on Cluster to get a menu for working with multiple values. You can click on the “Merge” check box and then edit the text to whatever you need it to be. You can also edit each text cluster to be the correct text.
You can merge and re-cluster until you have fixed all the typos. Back on the first Text Facet, you can hover over any value to edit it. That way even if the automatic clustering misses some you can edit the errors, or change anything that is the same but you need to look different–for instance, change “Dept. of English” to just “English”.
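To give a feel for why values that differ only in case, punctuation, spacing, or word order end up grouped together, here is a minimal JavaScript sketch of a "fingerprint" style key. It is a simplified illustration of the key-collision idea, not OpenRefine's own code; the function names and the sample values are made up for this example, and the punctuation handling is ASCII-only.

```javascript
// Simplified fingerprint key: values that produce the same key are merge candidates.
function fingerprint(value) {
  const tokens = value
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "") // drop punctuation (ASCII-only simplification)
    .split(/\s+/)
    .filter(Boolean);
  // De-duplicate and sort the tokens so word order does not matter.
  return Array.from(new Set(tokens)).sort().join(" ");
}

// Group values by key and keep only groups with more than one distinct spelling.
function clusterValues(values) {
  const clusters = new Map();
  for (const v of values) {
    const key = fingerprint(v);
    if (!clusters.has(key)) clusters.set(key, []);
    clusters.get(key).push(v);
  }
  return Array.from(clusters.values()).filter(group => new Set(group).size > 1);
}

// Hypothetical sample data.
const sample = ["Dept. of History", "dept of  history", "History, Dept. of", "Dept. of English"];
console.log(clusterValues(sample));
```

Note that a key like this would not group “Department of History” with “Dep. of History”, since the words themselves differ. Cases like that need a different clustering method or a manual edit in the facet, as described above.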
(B) Bibliographies
The main thing that I have used OpenRefine for in my daily work is to change a bibliography in plain text into columns in a spreadsheet that I can run against an API. This was inspired by this article in the Code4Lib Journal: “Using XSLT and Google Scripts to Streamline Populating an Institutional Repository” by Stephen X. Flynn, Catalina Oyler, and Marsha Miles. I wanted to find a way to turn a text CV into something that would work with the SHERPA/RoMEO API, so that I could find out which past faculty publications could be posted in the institutional repository. Since CVs are lists of data presented in a structured format but with some inconsistencies, OpenRefine makes it very easy to present the data in a certain way as well as remove the inconsistencies, and then to extend the data with a web service. This is a very basic set of instructions for how to accomplish this.
The main thing to accomplish is to put the journal title in its own column. Here’s an example citation in APA format, in which I’ve colored all the “separator” punctuation in red:
Heller, M. (2011). A Review of “Strategic Planning for Social Media in Libraries”. Journal of Electronic Resources Librarianship, 24 (4), 339-240)
From the drop-down menu at the top of the column click on “Split into several columns…” from the “Edit Column” menu. You will get a menu like the one below. This example finds the opening parenthesis and removes that in creating a new column. The author’s name is its own column, and the rest of the text is in another column.
The rest of the column works the same way–find the next text, punctuation, or spacing that indicates a separation. You can then rename the column to be something that makes sense. In the end, you will end up with something like this:
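The same separator-by-separator idea can be sketched outside OpenRefine. The JavaScript below is only an illustration of the logic, using the example citation from this post (with straight quotes); the helper names `splitOnce` and `splitCitation` are made up here, and the separators assume a single-author citation whose title contains no period followed by a space.

```javascript
// Split text at the first occurrence of a separator and drop the separator,
// much like splitting one column into two.
function splitOnce(text, separator) {
  const i = text.indexOf(separator);
  if (i === -1) return [text, ""];
  return [text.slice(0, i), text.slice(i + separator.length)];
}

// Apply one split per "separator" punctuation mark, left to right.
function splitCitation(citation) {
  const [author, afterAuthor] = splitOnce(citation, " (");
  const [year, afterYear] = splitOnce(afterAuthor, "). ");
  const [title, afterTitle] = splitOnce(afterYear, ". ");
  const [journal, volumeAndPages] = splitOnce(afterTitle, ", ");
  return { author, year, title, journal, volumeAndPages };
}

// The example citation from the post, page range as printed there.
const citation = 'Heller, M. (2011). A Review of "Strategic Planning for Social Media in Libraries". ' +
  'Journal of Electronic Resources Librarianship, 24 (4), 339-240)';
console.log(splitCitation(citation));
```

Real CVs are rarely this regular, which is why it helps to do the splitting interactively and look at the results after each step.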
When you have the journal titles separate, you may want to cluster the text and make sure that the journals have consistent titles or anything else to clean up the titles. Now you are ready to build on this data by fetching data from a web service. The third video tutorial posted above will explain the basic idea, and this tutorial is also helpful. Use the pull-down menu at the top of the journal column to select “Edit column” and then “Add column by fetching URLs…”. You will get a box that will help you construct the right URL. You need to format your URL in the way required by SHERPA/RoMEO, and will need a free API key. For the purposes of this example, you can use 'http://www.sherpa.ac.uk/romeo/api29.php?ak=[YOUR API KEY HERE]&qtype=starts&jtitle=' + escape(value,'url'). Note that it will give you a preview to see if the URL is formatted in the way you expect. Give your column a name, and set the Throttle delay, which spaces out your requests so that the service does not reject them for arriving too quickly. I found 1000 milliseconds worked fine.
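If it helps to see what the expression is doing, here is a small JavaScript sketch that builds the same kind of URL. It is an illustration only: the function name is made up, the key shown is a placeholder, and `encodeURIComponent` stands in for the role that escape(value,'url') plays in the GREL expression (the exact escaping of spaces may differ between the two).

```javascript
// Base URL as given in the post.
const ROMEO_BASE = "http://www.sherpa.ac.uk/romeo/api29.php";

// Build a "title starts with" query for one journal title.
function buildRomeoUrl(apiKey, journalTitle) {
  return ROMEO_BASE +
    "?ak=" + encodeURIComponent(apiKey) +
    "&qtype=starts" +
    "&jtitle=" + encodeURIComponent(journalTitle); // escape spaces, ampersands, etc.
}

// "YOUR-API-KEY" is a placeholder, not a working key.
console.log(buildRomeoUrl("YOUR-API-KEY", "Journal of Electronic Resources Librarianship"));
```

Escaping matters most for titles that contain characters such as “&”, which would otherwise be read as the start of a new URL parameter.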
After this runs, you will get a new column with the XML returned by SHERPA/RoMEO. You can use this to pull out anything you need, but for this example I want to get pre-archiving and post-archiving policies, as well as the conditions. A quick way to do this is to use the Google Refine Expression Language parseHtml function. To use this, click on “Add column based on this column” from the “Edit Column” menu, and you will get a menu to fill in an expression.
In this example I use the code value.parseHtml().select("prearchiving")[0].htmlText(), which selects just the text from within the prearchiving element. Conditions are a little different, since there are multiple conditions for each journal. In that case, you would use the following syntax (after join you can put whatever separator you want): forEach(value.parseHtml().select("condition"),v,v.htmlText()).join(". ")
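For readers who think more easily in a general-purpose language, the two GREL expressions can be mimicked roughly as follows. This is a sketch under loud assumptions: the XML fragment is made up for illustration (only the element names prearchiving, postarchiving, and condition are taken from this post), the helper name is hypothetical, and a regular expression is used in place of a real XML parser.

```javascript
// Return the text inside every <tag>...</tag> element in an XML string.
// A regular expression is enough for this flat sketch; it is not a real XML parser.
function elementTexts(xml, tag) {
  const pattern = new RegExp("<" + tag + "\\b[^>]*>([\\s\\S]*?)</" + tag + ">", "g");
  const texts = [];
  let match;
  while ((match = pattern.exec(xml)) !== null) {
    texts.push(match[1].trim());
  }
  return texts;
}

// Made-up response fragment, for illustration only.
const sampleXml =
  "<publisher>" +
  "<prearchiving>can</prearchiving>" +
  "<postarchiving>can</postarchiving>" +
  "<conditions>" +
  "<condition>First example condition</condition>" +
  "<condition>Second example condition</condition>" +
  "</conditions>" +
  "</publisher>";

// Single value: take the first match, like select("prearchiving")[0].
const prearchiving = elementTexts(sampleXml, "prearchiving")[0];
// Multiple values: collect them all and join with a separator, like forEach(...).join(". ").
const conditions = elementTexts(sampleXml, "condition").join(". ");
console.log(prearchiving, "|", conditions);
```

The shape is the same as in GREL: pick one element when there is a single value, and loop and join when there can be several.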
So in the end, you will end up with a neatly structured spreadsheet from your original CV with all the bibliographic information in its own column and the publisher conditions listed. You can imagine the possibilities for additional APIs to use–for instance, the WorldCat API could help you determine which faculty published books the library owns.
Once you find a set of actions that gets your desired result, you can save them for the future or to share with others. Click on Undo/Redo and then the Extract option. You will get a description of the actions you took, plus those actions represented in JSON.
Unselect the checkboxes next to any mistakes you made, and then copy and paste the text somewhere you can find it again. I have the full JSON for the example above in a Gist here. Make sure that if you save your JSON publicly you remove your personal API key! When you want to run the same recipe in the future, click on the Undo/Redo tab and then choose Apply. It will run through the steps for you. Note that if you have a mistake in your data you won’t catch it until it’s all finished, so make sure that you check the formatting of the data before running this script.
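One way to avoid publishing your key by accident is to scrub the extracted text before you save it anywhere public. The JavaScript below is a minimal sketch of that idea; the function name and the sample key are made up, and it only assumes that the key appears after `ak=` in a URL, as it does in the expression used earlier in this post.

```javascript
// Replace whatever follows "ak=" (up to the next "&", quote, or space) with a placeholder.
function redactApiKey(text) {
  return text.replace(/([?&]ak=)[^&'"\s]+/g, "$1[YOUR API KEY HERE]");
}

// Hypothetical snippet of extracted text containing a fake key.
const extracted =
  "'http://www.sherpa.ac.uk/romeo/api29.php?ak=abc123XYZ&qtype=starts&jtitle=' + escape(value,'url')";
console.log(redactApiKey(extracted));
```

It is still worth reading through the text yourself before posting it, since a pattern like this only catches keys written in the form it expects.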
Learning More and Giving Back
Hopefully this quick tutorial got you excited about OpenRefine and thinking about what you can do. I encourage you to read through the list of External Resources to get additional ideas, some of which are library related. There is lots more to learn and lots of recipes you can create to share with the library community.
Have you used OpenRefine? Share how you’ve used it, and post your recipes.
- OpenRefine on Twitter. If you post something about OpenRefine on Twitter, they will usually retweet you as a way to showcase what people are doing.
- OpenRefine Google Group
The Setup: What We Use at ACRL TechConnect
Posted: February 25, 2013 | Author: Eric Phetteplace | Filed under: library, management, technology, workflow, writing | Tags: setup, tools
Inspired by “The Setup” a few of us at Tech Connect have decided to share some of our favorite tools and techniques with you. What software, equipment, or time/stress management tools do you love? Leave us a note in the comments.
Eric – Homebrew package manager for OS X
I love Macs. I love their hardware, their operating system, even some of their apps like Garage Band. But there are certain headaches that Mac OS X comes with. While OS X exposes its inner workings via UNIX command line, it doesn’t provide a package manager like the apt of many Linux distros to install and update software.
Enter Homebrew, a lifesaver that’s helped me to up my game on the command line without much ancillary pain. Homebrew helps you find (“brew search php”), install (“brew install phantomjs”), and update (“brew upgrade git”) software from a central repository. I currently have 36 packages installed, among them utilities that Apple neglected to include like wget, programming tools like Node.js, and brilliant timesavers like z, a bookmarking system for the command line. Installing a lot of these tools can be tougher than using them, requiring permissions tweaks and enigmatic incantations. Homebrew makes installation easy and checking thirty-six separate websites for available updates becomes unnecessary.
As a bonus, some Homebrew commands now produce unicode beer mugs.
Updated Homebrew from bad98b12 to 150b5f96.
==> Updated Formulae
autojump berkeley-db gtk+ imagemagick libxml2
==> Upgrading 1 outdated package, with result:
libxml2 2.9.0
==> Upgrading libxml2
==> Downloading ftp://xmlsoft.org/libxml2/libxml2-2.9.0.tar.gz
####################################### 100.0%
==> Patching
patching file threads.c
patching file xpath.c
==> ./configure --prefix=/usr/local/Cellar/libxml2/2.9.0 --without-python
==> make
==> make install
==> Caveats
This formula is keg-only: so it was not symlinked into /usr/local.
==> Summary
/usr/local/Cellar/libxml2/2.9.0: 273 files, 11M, built in 94 seconds
[Note: simulation, not verbatim output]
Magic! And a shameless plug: Homebrew has a Kickstarter currently to help them with some automated tests, so if you use Homebrew consider a donation.
Margaret – Pomodoro Technique/using time wisely
Everyone works differently and has more effective times of day to complete certain types of work. Some people have to start writing first thing in the mornings, others can’t do much of anything that early. For me personally I find late afternoon the most effective time to work on code or technical work—but late afternoon is a time very prone to being distractible. So many funny things have been posted on the internet, and my RSS reader is all full up again. The Pomodoro technique (as well as similar systems) is a promise to yourself that if you just work hard on something for a relatively short amount of time that you will finish it, and then can have a guilt-free break.
Read the website for the full description of how to use this technique and some tools, but here’s the basic idea. You list the tasks you need to do, and then pick a task to work on for 25 minutes. Then you set a timer and start work. After the timer goes off, you get a 5 minute break to do whatever you want, and then after a few Pomodoros you take a longer break. The timer ideally should have a visual component so that you know how much time you have left and remember to stay on task. My personal favorite is focus booster. This is what mine looks like right now:
Note that the line changes color as I get closer to the end. It will become blue and count down my break when that starts. Another one I like a lot, especially when I am not at my own computer is e.ggtimer.com. This is a simple display, and you can bookmark http://e.ggtimer.com/pomodoro to get a Pomodoro started.
I can’t do Pomodoros all day—as a librarian, I need to be available to work with others at certain times—that’s not an interruption, that’s my job. Other times I really need to focus and can’t. This is the best technique to get started—and sometimes once I am started I get so focused on the project that I don’t even notice I am supposed to be on a break.
Jim – Tomcat Server with Jersey servlet: a customizable middleware/API system
The Tomcat/Jersey stack is the backbone of the library’s technology prototyping initiative. With this tool, our staff of research programmers and student programmers can take any webpage/database source and turn it into an API that could then feed into a mobile app, a data feed in a website, or a widget in some other emerging technology. While using and leveraging the Tomcat/Jersey stack does require some Java background, it can be learned in a couple of weeks by anyone who has some scripting and server experience. The hardest part of this whole pipeline is finding enough time to keep cranking out the library APIs. One that I got running over the winter holiday is a feed of group rooms that are available to be checked out/scheduled within the next hour at the library.
The data feed sends back a JSON array of available rooms, like this (abbreviated):
[{"roomName":"Collaboration Room 02 - Undergraduate Library",
"startTime":"10:00 AM",
"endTime":"11:00 AM",
"date":"1/27/2013"}, …
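To show how little code a client needs to consume a feed like this, here is a short JavaScript sketch. It uses only the one record shown above, closed off so that it parses as a complete array; the function name is made up, and how the feed is fetched is left out since the post does not give its address.

```javascript
// The single record from the abbreviated example above, as a complete JSON array.
const feed = '[{"roomName":"Collaboration Room 02 - Undergraduate Library",' +
  '"startTime":"10:00 AM","endTime":"11:00 AM","date":"1/27/2013"}]';

// Turn the feed into one display line per available room.
function describeRooms(jsonText) {
  return JSON.parse(jsonText).map(function (room) {
    return room.roomName + ": " + room.startTime + " to " + room.endTime + " on " + room.date;
  });
}

console.log(describeRooms(feed));
```

A mobile app or website widget would do essentially this, then render the lines however it likes.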
Bohyun – Get into the mood for concentration and focus
I am one of those people who are easily excited by new happenings around me. I am also one of those people who often would do anything but the thing that I must be doing. That is, I am prone to distraction and procrastination. My work often requires focus and concentration but I have an extremely hard time getting into the right mood.

The two tools that I found help me quite a bit are (a) Scribblet and (b) Rainy Mood. Scribblet (http://scribblet.org/) is a simple Javascript bookmarklet that lets you literally scribble on your web browser. If you tend to read more efficiently while annotating, this simple tool will help you a great deal with online reading. Rainy Mood (http://www.rainymood.com/) is a website that displays the window of any rainy day with even the sound of thunder sprinkled in. I tend to get much calmer on a rainy day which can do wonders for my writing and other projects that require a calm and focused state of mind. This tool instantly makes me have a rainy day regardless of the weather.


Meghan – Evernote
Evernote is not a terribly technical tool, but it is one I love and constantly use. It provides the ability for you to take notes, clip items from the web, attach files to notes, organize into notebooks, share notebooks (or keep them private) and search existing notes. It is available to download for desktops but I use the web version primarily, along with the web clipper and the Android app on my phone. Everything syncs together, so it is easy to locate notes from any location. Here are three examples of how it fits into my daily life:
- An enormous pile of classified bookmarks: I am currently trying to get up to speed on Drupal development as well as looking at examples of online image collections and brainstorming for my next TechConnect blog entry. The web clipper allows me to save things into specific piles by using notebooks and then add tags for classification and easier searching. For example, I can classify an issue description or resolution in my web development reference notebook, but tag it with the name of our site that is affected by the issue. This is especially useful when I know I have to change tasks and am likely to navigate away from my tabs in the browser. When I return to the task in a day or so, I can search for the helpful pages I saved. Classifying in notebooks is also good to build a list of sources that I consult every time I do a certain task, like building a server.
- Course and conference notes: Using the web or phone version, I can type notes during a lecture or conference session. I can also attach a pdf of the slides from a presentation for reference later. Frequently, I create a notebook for a particular conference that I can opt to share with others.
- Personal uses: I am learning to cook, and this tool has been really useful. Say I find a great recipe that I decide I want to (try and) make for dinner tonight. Clip the recipe using the web clipper, save it to my recipes notebook and then pull it up on my phone while I’m cooking to follow along (which also explains all the flour on my phone). In a few months if I want to use it again, I’ll probably have to search for it, because all I will remember is that it had chickpeas in it. But, that’s all I have to remember.

There are lots of other add-ins for this application, but I love and use the base service the most often.
Taking Google Forms to the Next Level
Posted: November 26, 2012 | Author: Margaret Heller | Filed under: coding, workflow
Many libraries use Google Forms for collecting information from patrons, particularly for functions like registering for a one-time event or filling out a survey. It’s a popular option because these forms are very easy to set up and start using with no overhead. With a little additional effort and a very small amount of code you can make these forms even more functional.
In this post, we’ll look at the process for adapting a simple library workshop registration form to send a confirmation email and introduce you to the Google Apps Script documentation. This is adapted from a tutorial for creating a help desk application, which you can see here. I talked about the overall process of creating simple applications for free with minimal coding skills at this year’s LITA Forum, and you can see the complete presentation here. In this post I will focus on the Google Forms tricks.
A few things to keep in mind before you get started. Use a library account when you actually deploy the applications, since that will remain “owned” by the library even if the person who creates it moves on. These instructions are also intended for regular “consumer” Google accounts. There are additional tools available for Google Apps business customers, which I don’t address here.
Creating Your Form
Create a form as you normally would. Here’s an example of a simple workshop registration form.
There are a few potential problems with the way this form is set up, but here’s an even bigger problem. Once the person signing up clicks submit, the form disappears, and he receives a page saying “Thank you for registering!”
If this person did not record the workshop, he now has no real idea of what he signed up for. What he intended to do and what he actually did may not be the same thing!
What comes next? You, the librarian hosting the workshop, go into the spreadsheet to see if anyone has signed up. If you want to confirm the sign-up, you can copy the patron’s email address into your email program, and then copy in a message to confirm the sign-up. If you only have a few people signed up, this may not take long, but it adds many unnecessary steps and requires you to remember to do it.
Luckily, Google has provided all the tools you need to add in an email confirmation function, and it’s not hard to use as long as you know some basic Javascript. Let’s look at an example.
Adding in an email confirmation
To access these functions, visit your spreadsheet, and click on Script Editor in the Tools menu.
You will get many options, which you can use, or you can simply create a script for a Blank Project (first option). You will get this in your blank project:
function myFunction() {
}
Change the name of the function to be something meaningful. Now you can fill in the details for the function. Basically we use the built-in Google Spreadsheet functions to grab the value of each column we want to include and store these in a variable. You just put in the column number–but remember we are starting from 0 (which is the Timestamp column in our current example).
function emailConfirm(e) {
  // e.values holds the submitted row; index 0 is the Timestamp column.
  var userEmail = e.values[3];
  var firstName = e.values[1];
  var lastName = e.values[2];
  var workshopDate = e.values[4];
  // Arguments: recipient, subject, body, advanced options.
  MailApp.sendEmail(userEmail,
    "Registration confirmation",
    "Thanks for registering for the library workshop on " + workshopDate + " \n\nYou will " +
    "receive a reminder email 24 hours prior. \n\nLibrary",
    {name:"Library"});
}
The MailApp class is another built-in Google Apps Script service. The sendEmail method takes the following arguments: recipients, subject, body, optAdvancedArgs. You can see in the above example that the userEmail variable (the patron’s email address in the form) is the recipient, the subject is “Registration confirmation”, and the body contains a generic thank you plus the date of the workshop, which we’ve stored in the workshopDate variable. Then we’ve put the name “Library” in the advanced arguments. This is optional, particularly if the email is coming from a library email account.
Note that if a patron hits “reply” to cancel or ask a question, the email will automatically go to the email account that deployed the application. But you may want reply emails to go somewhere else. You can modify the last “advanced” argument to send replies to some other email address with the replyTo argument (the capital T matters). Note that this doesn’t always work, and that people can see that the email comes from elsewhere, so make sure that someone is checking the email account from which the application is deployed.
{name:"Library", replyTo:"mheller@dom.edu"});
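If you want to check your message before any email goes out, one option is to build the message in a plain function and keep the sending in a separate one. The sketch below is one possible arrangement, not part of the original tutorial: `buildConfirmation` is a made-up helper, the column positions follow the example form in this post, the reply address is a placeholder, and the option is spelled `replyTo` as in the Apps Script reference.

```javascript
// Column positions follow the post's example form:
// 0 = Timestamp, 1 = first name, 2 = last name, 3 = email, 4 = workshop date.
function buildConfirmation(values, replyTo) {
  var options = { name: "Library" };
  if (replyTo) {
    options.replyTo = replyTo; // only set when an alternate reply address is wanted
  }
  return {
    to: values[3],
    subject: "Registration confirmation",
    body: "Thanks for registering for the library workshop on " + values[4] + ".",
    options: options
  };
}

// Form-submit handler: this part only runs inside Google Apps Script, where MailApp exists.
function emailConfirm(e) {
  var msg = buildConfirmation(e.values, "library@example.org"); // hypothetical address
  MailApp.sendEmail(msg.to, msg.subject, msg.body, msg.options);
}
```

Because `buildConfirmation` does not touch MailApp, you can call it with a made-up row and log the result to see exactly what a patron would receive.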
Running the script
Once you’ve filled in your script and hit save (it will do a quick debug when you save), you have to set up when the script should run. Select “Current script’s triggers…” from the Resources menu.
Now select the trigger “On form submit”. While you’re here, also click on notifications.
The notifications will tell you any time your script fails to run. For your first script, choose “immediately” so you can see what went wrong if it didn’t work. In the future you can select daily or weekly.
Before you can save either your trigger or failure notifications, you need to authorize that Google can run the script for you.
Now your script will work! Next time a patron fills out your form to register for a workshop, he will receive this email:
Doing More
After working with this very basic script you can explore the Google Apps Script documentation. If you are working with Google Forms, you will find the Spreadsheet Services classes very useful. There are also some helpful tutorials you can work through to learn how to use all the features. This one will teach you how to send emails from the spreadsheet–something you can use when it’s time to remind patrons of which workshops they have signed up for!
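As a starting point for the reminder idea, the selection logic can be written as a plain function that decides which sign-ups are due a reminder. This is only a sketch under stated assumptions: the row shape (`email`, `workshopTime` as a Date, and a `reminded` flag) is hypothetical rather than something Google Forms gives you, and reading rows from the spreadsheet and sending the emails are left to the tutorial mentioned above.

```javascript
var DAY_MS = 24 * 60 * 60 * 1000;

// rows: array of { email, workshopTime (Date), reminded (boolean) }, a hypothetical shape.
// Returns the rows whose workshop starts within the next 24 hours and has not been reminded yet.
function dueForReminder(rows, now) {
  return rows.filter(function (row) {
    var msUntil = row.workshopTime.getTime() - now.getTime();
    return !row.reminded && msUntil > 0 && msUntil <= DAY_MS;
  });
}

// Made-up sign-ups for illustration.
var now = new Date(Date.UTC(2013, 0, 26, 12, 0));
var rows = [
  { email: "a@example.org", workshopTime: new Date(Date.UTC(2013, 0, 27, 10, 0)), reminded: false },
  { email: "b@example.org", workshopTime: new Date(Date.UTC(2013, 0, 29, 10, 0)), reminded: false },
  { email: "c@example.org", workshopTime: new Date(Date.UTC(2013, 0, 27, 10, 0)), reminded: true }
];
console.log(dueForReminder(rows, now).map(function (r) { return r.email; }));
```

Keeping a “reminded” marker is one simple way to avoid emailing the same patron twice if the script runs more than once a day.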
Tablets in Library Workflows: Revolution & Healthy Skepticism
Posted: September 25, 2012 | Author: Nicholas Schiller | Filed under: library, technology, workflow
Tablet Revolution: Healthy Skepticism
Tablets and mobile computing have been the subject of a lot of Internet hype. A quick search for “tablet revolution” will confirm this, but if we’re appropriately skeptical about the hype cycle, we’ll want to test the impact of tablets on our library ourselves. We can do this in a few ways. We can check the literature to see what studies have been done. 1 We can check our web analytics to see which devices are being used to access our web sites. 2 We can also walk the public areas in our libraries and count patrons working on tablets. These investigations can tell us how and how often tablets are being used, but they don’t tell us how or if tablets are revolutionizing library use.
In order to better answer this question, I started a little project. Over the last year, I’ve been using informal methods to track the effects that tablet use has on my work. I secured some equipment funding and acquired an Apple iPad 2 and an Android tablet, the Asus Transformer Prime. I started doing my work on these devices, keeping an eye on how they changed my daily workflow, how suited they were to my daily tasks, and whether or not they increased my productivity or the quality of my work. After a year, I can report that tablets have changed the way I work. Most of the changes are incremental, but there are at least a couple of cases of genuine revolution to report.
Deploying Tablets in my Workflow
As I spent some time doing my work using the tablets, I discovered there were three possible results to my efforts to integrate them into my daily work. Some tasks simply did not translate well to the tablet environment. Other tasks translated fairly seamlessly to the tablet environment; what I could do on a computer I could also do on a tablet. Finally, there were a few cases where the affordances of the tablets (touch interface, networked portability, and app environment) enabled me to do my work in new ways, ways not possible using a traditional workstation or laptop.
The first sort of task, the kind in which tablets failed to produce positive results, tended to involve heavy processing requirements, the need to connect peripheral devices, or complex software programs not ported to mobile apps. Examples included editing image, sound, or video files; analyzing datasets; and creating presentation slides. The tablets lacked the processing power, peripheral interfaces, or fine interface control to make them adequate platforms for the editing tasks. Statistical analysis software shares the same heavy processor requirements and I was unable to find mobile apps equivalent to SPSS or Atlas TI. In the case of presentation slides, all the necessary conditions for success seemed to be present. Keynote for iOS is a great app, but I was never satisfied with the quality of my tablet-created presentations and soon returned to composing my slides in Keynote on my laptop. As a general rule of thumb, I found tasks that require lots of processing power, super-fine input control (fingers and even styli are imprecise on touch screens), or highly-specialized software environments to be poor candidates for moving to tablets.
The majority of my day-to-day work fell into a second set of tasks, for which I could easily replace my traditional computer with a tablet. I discovered that after a little research to find the proper apps and a little time to learn how to use them, a tablet was as good as a computer most of the time. At first, I experimented with treating the tablet as a small portable computer. I acquired Apple’s Bluetooth keyboard and the keyboard dock accessory for the Transformer and was able to do word processing, text editing and coding, email, instant messaging, and pretty much any browser-based activity without significant adjustment. I found text entry without a keyboard to be too clumsy a process for serious work. Tablets are also ideal for server administration, since the computer on the other end handles the heavy lifting. There are SSH, FTP, and text editing apps that make tablets perfect remote administration environments. I also found that text-based tasks like writing, email, chat, and reading, along with most things that are browser-based or whose files live in the cloud or on a server, can be done just as well on a tablet as on a workstation or laptop.
The limitation to this general rule is that in some cases the iPad presented file management difficulties. The iOS defaults push users into using iTunes and iCloud to manage documents. If you like these options, there is no problem. I found these options lacking in flexibility, so I had to engage in a little hackery to get access to the files I needed on the iPad. Dropbox and Evernote are good examples of cloud storage apps that work once you learn how to route all your documents through them. In the end, I found myself preferring apps that access personal cloud space (Jungledisk) or my home NAS storage (Synology DS File) in my workflow. The Transformer Prime required fewer document-flow kluges and its keyboard accessory includes a USB flash-drive interface which is very useful for sharing documents with local colleagues and doesn’t require a fancy workaround.
A second limitation I encountered was in accessing web video content. Not frequently, but often enough to be noticed, certain web video files (Flash encoded) would not play on the iPad. The Android tablet is Flash capable and suffered fewer of these problems. Video isn’t a key part of my workflow, so for me this is merely an annoyance, not a serious hindrance to productivity.
Touching Revolution
Of course, simply duplicating the capabilities of traditional computer environments in a smaller form-factor is not revolutionary. As long as I was using a tablet as if it were a smaller computer, then my work didn’t change, only the tools I was doing it with changed. It was when I started working outside of the keyboard and mouse interface model and started touching my work that new ways of approaching tasks presented themselves. When I started using a stylus to write on the screen of a tablet the revolution became apparent.
When I was an undergraduate, Mortimer Adler and Charles Van Doren’s How to Read a Book 3 was required reading, and their lesson on annotating while we read stuck with me. When it comes to professional development reading, annotation is absolutely necessary to comprehension and integration of content. Thus Amazon’s Kindle reader app for Android and iOS became my favorite ebook platform, due to its superior system for taking and sharing reading notes across platforms. I rely so heavily on annotations that I cannot do my work using ebook platforms that don’t allow me to take notes in the text. In the same vein, I use personal copies of printed books for my research instead of borrowed library copies, because I have to write in the margins to process ideas.
Tablets revolutionized my reading when I discovered PDF annotation apps that allowed me to use a stylus to write on top of documents. Apps like Notetaker HD and iAnnotate for iOS and ezPDF Reader for Android give readers the digital advantage of unlimited amounts of text without the bulk and weight of paper printouts. They also give the reader the analog advantages of free-hand highlighting and writing notes in the margin. Combine these advantages with Zotero-friendly apps such as Zotpad, Zotfile, and Zandy that connect my favorite discovery tool to my tablet and I found myself reading more, taking better notes, and drawing clearer connections between documents. The portability of digital files on a mobile, wirelessly connected device combined with the stylus and touch-screen method of text input enabled me to interact with my reading in ways impossible using either printed paper or a traditional computer monitor and keyboard. Now, my entire library and all of my reading lists came with me everywhere, so I carved out more time to read each week. When I opened a text, I was able to capture my thoughts about the reading more accurately and completely. This wasn’t just reading in a different medium, it was reading by a different method, and it worked better than the way I had been reading before.
Tablets with reading annotation apps revolutionized the way that I read and organized my reading notes, but they had an even bigger impact on the way that I grade student papers. I love teaching, but grading essays is a task that I dread. Essays are heavy and hard to carry around. When I have essays with me, I have a constant and irrational background fear that someone will steal my car and I’ll lose irreplaceable student work. When I started using the tablet, I had my students submit their essays in PDF format. Then, I read their work in a similar manner to my professional reading. I read the essays on a tablet, using the stylus to highlight passages and write feedback in the margins. When I was finished, I could email the document back to the student and also keep an archived copy. This solved a number of paper distribution and unique copy problems. The students got better feedback more quickly and I always had a reference copy if questions arose later in the term.
A Personal Revolution
Taken by themselves, these reading and grading innovations may sound like incremental changes, not revolutions. For example, laptops are quite portable and we’ve had the ability to add notes and comments to PDF documents for a long time. There is no reason I couldn’t adopt this workflow without buying an additional expensive gadget, except that I couldn’t. I tried electronic reading and grading workflows before I had a tablet and rejected them. Reading on a computer monitor and typing comments into a PDF didn’t result in interesting thoughts about the reading. I tried grading by adding comments to PDF documents on a laptop and found my feedback comments to be arid and less helpful than the remarks I wrote in the margins of paper essays, so I switched back to colored pen on paper. These observations are anecdotal and personal, but they accurately describe my experience. With a tablet, the feel of touching a screen and writing with a stylus enabled an organic flow of thoughts from my brain to the text. I can list the affordances of mobile computing that make this possible: ubiquitous wireless broadband networking, touch interface, lightweight and portable devices, a robust app ecology, and cloud storage of documents. The revolution lies in how these technical details combined in my workflow to create an environment where I did better work with fewer distractions and more convenience.
Next Steps
One requirement to justify the time and expense of this project is that I share my findings. This post is an effort in that direction, but I will also be offering a series of faculty workshops on using tablets in academic workflows. I’m planning a workshop where faculty can put their hands on a range of tablet devices, a petting zoo of tablets. There will also be a workshop on reading apps for tablets and one on grading workflows. One challenge to presenting what I’ve learned about tablets is that most of what I have learned is personal. I’ve spoken with scholars who do not share my preference for hand-written thoughts; my workflows are not revolutionary for them. What ultimately may be the most beneficial result of my project is uncovering a method for effectively communicating emerging technology experiences to non-technologically inclined colleagues.
- Pew. Tablet and E-book reader Ownership Nearly Double Over the Holiday Gift-Giving Period. Pew Internet Libraries. http://libraries.pewinternet.org/2012/01/23/tablet-and-e-book-reader-ownership-nearly-double-over-the-holiday-gift-giving-period/. ↩
- Wikipedia contributors. 2012. Mobile web analytics. Wikipedia, the free encyclopedia. Wikimedia Foundation, Inc., September 13. https://en.wikipedia.org/w/index.php?title=Mobile_web_analytics&oldid=510528022. ↩
- Mortimer Adler, How to read a book, Rev. and updated ed. (New York: Simon and Schuster, 1972). ↩
Workflow Automation in Technical Services: Part 2
Posted: April 30, 2012 | Author: Becky Yoose | Filed under: technology, workflow | Tags: automation, cataloging, quality control, technical services, workflow
Note: This is part two of a two-part series on workflow automation in Technical Services. Part one covered what workflow automation is, the process behind it, and an example of an item-level workflow automation process. Part two will discuss batch-level workflow automation and resources/tools for workflow automation.
Last time, we discussed the basics of workflow automation and some examples of item-level automation in cataloging and acquisitions workflows. Automating workflows on an item-to-item basis provides greater consistency and efficiency in daily tasks done by staff, allowing them to spend more time on more complex workflows and tasks that may not be so readily automated. Item level workflow automation can be a low barrier investment in creating a more efficient operation.
Then you have the electronic journals, ebooks, and databases. You have large record files that are tied to physical resources – for example, record downloads from WorldCat Cataloging Partners. And then there are all those records in the system – MARC, XML, whatnot – that have missing or incorrect information (the infamous “dirty data”). Why can’t we just stick with item-level processing for everything?
Item level automation or batch automation?
For item level automation, you have a very granular level of control over the process, dealing with items one at a time. If the items are very similar in nature or have only a couple differences in how each item will be processed, though, then going through each item individually probably doesn’t make a lot of sense. On the other hand, batch processing allows you to go through many items at once, which makes adding or maintaining resources a quicker job than going through item by item. You do give up a certain level of control over details with batch processing, however, which leaves you to decide where the “good enough” marker should go in terms of data quality.
Overall, you want to avoid sub-optimizing your workflow. Sub-optimization happens when a part of an organization focuses on the success of its own area instead of the entire organization’s success [1]. Going through each resource record individually might give you the greatest control over the record, but if you’re going through a file containing 10,000+ records individually, even with an item level automated workflow, the turnaround time for creating access for all those resources will be much longer than if the file were processed at once. However, with the right tools, you can deal with record batches with speed and a good level of control over the data.
MarcEdit is your friend
Many people have at least heard about MarcEdit, or have colleagues who have used it extensively. MarcEdit is a freely available program (for Windows) created by Terry Reese that works with MARC records in a variety of ways. You can add, delete, or modify fields in records, create MARC records from data in spreadsheets, crosswalk to and from the MARC format, split files, join files, generate call numbers, de-duplicate records – and that’s only part of what you can do with MarcEdit. Also, if you find yourself going through the same batch workflow for the same files on a regular basis, MarcEdit’s Script Wizard helps with automating routine batch processing workflows.
Example: Missing 041 1_ subfield h, or, this item is a translation, not in two languages!
Many of you may have moved your older library catalogs to a newer discovery layer; I’ve survived one move at my previous place of work and will probably have another move under my belt soon. One consequence of moving to a new discovery layer is that data ignored by the old layer sticks out like a sore thumb in the new one. This example is one of those dirty data discoveries: a particular MARC variable field incorrectly indicated that an item is in two or more languages when it is actually a translation. Not only do you have unhappy library users who thought you had a copy of The Little Prince in both French and English, but the error exists in a few thousand records, leaving you with a potentially resource-intensive cleanup project.
If you can isolate and export those records in one file (or a couple of files) from your database, then you can use MarcEdit to clean up the field in a relatively short time. Open the file in MarcEdit’s MarcEditor, and make your way to “Edit Subfield” under the Tools menu. Let’s say that there are a lot of records that have engfre in the 041 field and you want to change all the records with that entry at once. Replace the engfre field data with eng$hfre and you’ve taken care of all those records in one pass.
Since you probably have more than engfre in your file, you can use regular expressions in MarcEdit to change multiple fields at once regardless of language code. Using the Find/Replace tool, search for the 041 field subfield a, but this time add your regular expression and mark the “Use regular expression” box. The following expression assumes that the 041 field has two language codes that are three letters in length, so you will have to do a little cleanup after running this replace command to catch fields with three or more language codes as well as two-letter language codes. (h/t to zemkat for the regular expression!)
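The same kind of find-and-replace can be sketched outside MarcEdit as well. The Python snippet below is a rough equivalent and not zemkat's original expression. It assumes the records have been exported in MarcEdit's mnemonic text format, where a 041 field with first indicator 1 and a blank second indicator looks like `=041  1\$aengfre`. The function name `split_translation_codes` is made up for illustration.

```python
import re

# Matches a mnemonic-format 041 line with first indicator 1 whose subfield a
# holds exactly two three-letter language codes run together (e.g. "engfre").
# Assumption: the line looks like "=041  1\$aengfre" with nothing after the codes.
PATTERN = re.compile(r"^(=041  1.\$a)([a-z]{3})([a-z]{3})$", re.MULTILINE)


def split_translation_codes(mnemonic_text: str) -> str:
    """Move the second language code into subfield h (the original language)."""
    return PATTERN.sub(r"\1\2$h\3", mnemonic_text)


sample = "\n".join([
    "=041  1\\$aengfre",      # two codes: rewritten
    "=041  1\\$aengfreger",   # three codes: left alone for manual cleanup
    "=245  14$aThe little prince",
])
print(split_translation_codes(sample))
```

Fields with three or more codes, or with two-letter codes, do not match the pattern and are left untouched, so they still need the manual cleanup described above.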
Libraries and modules and packages, oh my!
What if you’ve been learning some code, or are looking for an excuse to learn? You’re in luck! Some of the common programming languages have tools to deal with MARC data. Rolling your own batch automation scripts and applications allows you the most flexibility in working with other library data formats as well. However, if you haven’t programmed before, choose smaller projects to start. In addition, if the script or application doesn’t work, you’re your own tech support.
Example: Creating order records for patron driven acquisition (PDA) items triggered for purchase
Patron driven acquisition usually involves the ingestion of several hundred to thousands of records into the local database for items that are not technically owned by the library at that point in time. Depending on the PDA vendor one uses, the item is triggered for purchase after it reaches a use threshold (for example, 10 page views). The library will receive an invoice with these purchases, but we will still need to create order records in the system to show that these items have been bought. Considering that in a given week the number of purchases can range from single digits to high double digits, that’s a lot of order records to manually key in.
After dabbling with pymarc at Code4Lib 2010, I thought this would be a good project for learning more about pymarc and Python overall. Here is an outline of the script actions:
- In the trigger report spreadsheet, extract the local control numbers for the items triggered for purchase.
- Execute a SQL query against the local database for our locally developed next generation catalog, matching the local control number and extracting the MARC records from the database.
- In each MARC record:
  - add a 590 and 790 field for donor/fund information
  - add a 949 field containing bibliographic record overlay and the order record creation information for the system, including cost of the item extracted from the spreadsheet.
  - change the 947 field data to indicate that the item has been purchased (for statistical reporting later on)
- Write the MARC records to a file for import into the ILS.
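The outline above could be sketched roughly as follows. This is not the original script: it skips the SQL step and the pymarc calls entirely, and represents each field as a plain (tag, indicators, subfields) tuple so the example runs with only the standard library. Every column name, fund note, and subfield value is a placeholder invented for illustration.

```python
import csv
import io

# Hypothetical trigger report; the column names are assumptions.
TRIGGER_REPORT = """control_number,title,cost
b1234567,Example Title One,45.00
b7654321,Example Title Two,120.50
"""


def read_trigger_report(handle):
    """Step 1: map each local control number to the cost on the report."""
    return {row["control_number"]: row["cost"] for row in csv.DictReader(handle)}


def mark_as_purchased(fields, cost, fund_note="Purchased with PDA funds"):
    """Step 3: return a new field list with the purchase information added.

    Each field is a (tag, indicators, subfields) tuple, where subfields is a
    list of (code, value) pairs. The subfield codes and values used here are
    placeholders, not real local conventions.
    """
    updated = []
    for tag, indicators, subfields in fields:
        if tag == "947":
            # Replace the existing status with a "purchased" marker.
            subfields = [("a", "PDA purchased")]
        updated.append((tag, indicators, subfields))
    updated.append(("590", "  ", [("a", fund_note)]))
    updated.append(("790", "  ", [("a", fund_note)]))
    updated.append(("949", "  ", [("a", "order record data"), ("p", cost)]))
    return updated


costs = read_trigger_report(io.StringIO(TRIGGER_REPORT))
record = [
    ("245", "10", [("a", "Example Title One")]),
    ("947", "  ", [("a", "PDA candidate")]),
]
for field in mark_as_purchased(record, costs["b1234567"]):
    print(field)
```

In a real script the records would come from the database query and be read and written with a MARC library such as pymarc, rather than built by hand as they are here.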
The output file is then uploaded into the ILS manually, which gives staff the chance to address any issues with the records that the system might have before import. Overall, the process from downloading the trigger report spreadsheet to uploading the record file into the ILS takes a few minutes, depending on the size of the file.
Which automation tools and resources to use?
There are a multitude of other automation tools and resources that cannot be fully covered in two blog posts. Your mileage may vary with these tools; you might find Macro Express to be a better fit for your organization than AutoIt, or you find that working with ruby-marc is easier for you than MarcEdit (resource links listed below). The best way to figure out what’s right for you is to play around with various tools and get a feel for them. More often than not, you’ll end up using multiple tools for different levels and types of workflow automation.
Don’t forget about the built-in tools in existing applications as well! Sometimes the best tools for the job are already there for you to take advantage of.
For your convenience, here are the tools mentioned in the two blog posts, including a few others:
- AutoIt
  - AutoHotKey is a similar and popular automation application
- MarcEdit
- Macro Express
  - Keyboard Express is a slimmed down version of ME by the same company
- Macro Scheduler
- The Working with MARC page on the Code4Lib wiki is a good place to start when looking for ways of dealing with MARC data in various languages: http://wiki.code4lib.org/index.php/Working_with_MaRC
[1] http://dictionary.cambridge.org/dictionary/business-english/sub-optimization
Workflow Automation in Technical Services: Part 1
Posted: March 26, 2012 | Author: Becky Yoose | Filed under: workflow | Tags: automation, cataloging, quality control, technical services, workflow
Note: This is part one of a two part series on workflow automation in Technical Services. Part one will cover the what and process of workflow automation and an example of an item level workflow automation process. Part two will discuss batch level workflow automation and resources/tools for workflow automation.
The mysterious door at the library
A majority of you might have passed by this door many times in your library lives. Sometimes it isn’t even a door; maybe a room divider, or an invisible line that runs across the room. In any case, you may have ventured into the space called “Technical Services” (or a similar name), but do you know what goes on there? For most libraries, Technical Services staff acquire, create, and maintain access to library materials, spanning from books and a box of rocks to various electronic databases and digitized local collections. Without them, it would be hard for a library to serve its users: no physical items to borrow, no electronic journals to search for articles, and no metadata in the library discovery layer for users and staff to search for those resources. With the variety of items comes a variety of workflows to process those items, many of which are repeated at various intervals: some once a week, while others are repeated multiple times a day. Staff time and resources are spoken for every time a workflow is repeated. Every time a workflow is manually repeated, less time and fewer resources can be spent on other projects or on new projects that would add value to existing collections or add new collections for library users to use. Technology provides a variety of strategies for workflow automation that reduce time spent on repetitive workflows.
What is workflow automation?
The oversimplified answer to this question is that workflow automation is the process where you have the computer do the things that it can be programmed to do, thereby reducing repetitive manual actions by the staff member.
There are two types of automation to consider when you look at your workflows:
- Data Entry: This type of automation is fairly straightforward, and you’ve probably already done it without realizing it. For example, the automation script completes a form with data that remains the same for each form or types out standard text in an email being sent to a vendor. It is useful for automating repetitive keystrokes, be it system codes, text, or even creating new documents in certain applications, such as an item record. The automation script is hard-coded, meaning that the output of that script will be the same every time you run it.
- Decision Making: This type of automation makes all the decisions for you! Okay, while it won’t make every decision for you, several automation languages and programs can handle fairly complex decision making flowcharts using standard conditionals. For example, if bibliographic record “A” has field “B”, then do action “C”; else do action “D”. As you probably already guessed, this type of automation resembles coding to a certain extent. The automation script that is designed to deal with several possible outcomes is not hard-coded like the data entry script described above.
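The difference between the two types can be shown in a few lines of Python. This is only a sketch: both functions, the email text, and the choice of field and actions are invented for illustration, and an actual macro written in AutoIt or Macro Express would look different.

```python
def standard_vendor_email() -> str:
    """Data entry: hard-coded, so the output is the same on every run."""
    return "Hello, we have not yet received the items on our last order. Thank you!"


def choose_action(record: dict) -> str:
    """Decision making: if the record has field B, do action C; else action D.

    Here "field B" is an 050 call number, "action C" is fast-tracking the item,
    and "action D" is sending it to a cataloger. All three are placeholders.
    """
    if "050" in record:
        return "fast-track to processing"
    return "send to copy cataloging"


print(standard_vendor_email())
print(choose_action({"245": "The little prince", "050": "PQ2637.A274"}))
print(choose_action({"245": "The little prince"}))
```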
What can be automated?
Most Technical Services departments acquire, create, and maintain access to a variety of different formats, from physical to electronic formats. Traditionally, workflows focus on the individual item going through the department and its various teams: acquisitions, cataloging, and processing, for example. With the changeover to electronic formats, workflows are going more towards a batch approach, processing and/or cataloging multiple items (for example, a collection of ebooks) at once.
In addition to adding materials to library collections, a library’s Technical Services staff do a fair amount of database maintenance for the library’s ILS (Integrated Library System). The term “dirty data” is thrown around the TS departments, covering database projects dealing with misspellings, outdated codes, or incorrect codes – anything that could inhibit a library user’s access to the resource.
Why should I automate my workflows?
- Better quality control of workflow and data. Any time you let a human near a workflow, errors can be introduced: incorrect codes, mistyped text, or mishandled items. Having an automated workflow cuts down on the workflow’s fail points and allows for better overall consistency and accuracy.
- Save staff time. You and your staff spend a good amount of time with repetitive keystrokes and decisions. Even small repetitive actions add up during the work day, resulting in hours of valuable staff time and resources. By automating the repetitive actions, you free up staff time to work on more complex workflows which are not as easily automated.
How do you decide what workflows to automate?
- Flowchart your workflow. A simple flowchart from the beginning of the workflow to the end might reveal several places where current manual decision making can be relegated to a script. If a person is currently looking for a code in the order record to figure out what location code they should enter in the item record, the script could be set to do the same.
- What are the patterns? In each step, what data remains constant throughout all items? What codes, phrases, or fields do you insert every time you go through the workflow? Is there a pattern of going from one application to another at the same point in every workflow? One record to another?
- How will the script access the data? Working with a file of MARC records will be different than working with a bibliographic record that is open in your ILS. Having a file of data is easier, but if you’re automating an item-level workflow, you will be dealing with windows that you have to work with. Getting data from a window can be tricky; sometimes you are able to access the data directly, and other times you will have to scrape the screen to get to the data that you want to work on with the script.
Example: Receipt Cataloging
At my former place of work, Technical Services had three levels of cataloging: receipt cataloging, copy cataloging, and original cataloging. All monographs would go through the receipt cataloging process, with some items being bumped to the two higher levels of cataloging. The majority of items that go through receipt cataloging, having met a list of 40+ criteria, are fast-tracked to physical processing, shortening the time between the item arriving at the library and being placed on the shelf, which is the overarching goal of receipt cataloging. The criteria range from determining if the record is DLC (Library of Congress) to determining if the 008, 050, and 260 ‡c dates match in the bibliographic record (if not a conference publication).
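One of the checks mentioned above, whether the dates in the 008, 050, and 260 ‡c agree, could look something like this in Python. This is a sketch and not the AutoIt macro itself. It assumes the three values have already been pulled out of the record as strings, reads Date 1 from positions 07-10 of the 008, and takes the last four-digit year it finds in the other two fields. The function names are made up, and the conference publication exception is not handled.

```python
import re
from typing import Optional

# A four-digit year (1500-2099) that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def last_year(text: str) -> Optional[str]:
    """Return the last four-digit year found in the text, or None."""
    years = YEAR.findall(text)
    return years[-1] if years else None


def dates_match(field_008: str, call_number_050: str, date_260c: str) -> bool:
    """True when 008 Date 1, the 050 date, and the 260 subfield c date agree."""
    date1 = field_008[7:11]  # 008 positions 07-10 hold Date 1
    return date1.isdigit() and date1 == last_year(call_number_050) == last_year(date_260c)


field_008 = "000101s2000    nyu           000 1 eng d"
print(dates_match(field_008, "PQ2637.A274 P413 2000", "c2000."))
print(dates_match(field_008, "PQ2637.A274 P413 2000", "1999."))
```

A record that fails a check like this would be bumped to a cataloger rather than fast-tracked.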
Given that the criteria and the decision making flowchart are fairly standard and straightforward, this workflow was built with automation in mind. My predecessor used Macro Express (ME) for the first version of the receipt cataloging macros. When we got to the point where we were bumping up against ME’s limits, I migrated the macros to AutoIt, where I was able to include many more quality control checks on the bibliographic and item records.
Below is a screencast where I walk through the receipt cataloging process. If I wasn’t explaining what was happening, the whole process would have taken a minute and 10 seconds to complete, a couple of seconds more if the item was bumped to another team in the department. Compared to a five minute turnaround time if our staff manually checked every criterion, the macros allow the department to go through more items during the day with better quality control.
Bonus Example: Ordering from GOBI
Another workflow at my former place of work involved ordering monographs from GOBI. This workflow, unlike receipt cataloging, has a much more complex decision-making flowchart and more exceptions. While I could not automate on the level of receipt cataloging, there were still patterns and routines that I could automate, such as searching the library catalog with information supplied by GOBI, and determining which codes to enter in the 949 field in the OCLC record (for exporting into our database).
Below is a screencast that shows a part of the notification ordering automation script set.
Preview for Part 2
In this post, I mostly covered item level workflow automation possibilities. More Technical Services workflows, however, are changing towards dealing with many items at once. In part 2, I will discuss some examples of batch process automation and several tools (including those mentioned in this post) that can assist in making life easier in Technical Services.