Fear No Longer Regular Expressions

Regex, it’s your friend

You may have heard the term “regular expressions” before. If you have, you know that they usually come in a notation that is quite hard to make out, like this:

(?=^[0-5\- ]+$)(?!.*0123)\d{3}-\d{3,4}-\d{4}

Despite their appearance, regular expressions (regex) are an extremely useful tool for cleaning up and manipulating textual data. I will show you an example that is easy to understand. Don’t worry if you can’t make sense of the regex above. We can talk about the crazy metacharacters and other confusing regex notations later. But hopefully, this example will help you appreciate the power of regex and give you some ideas of how to use regex to make your everyday library work easier.

What regex can do for you – an example

I looked for the zipcodes for all cities in Miami-Dade County and found a very nice PDF online (http://www.miamifocusgroup.com/usertpl/1vg116-three-column/miami-dade-zip-codes-and-map.pdf). But when I copy and paste the text from the PDF file into my text editor (Sublime), the format immediately goes haywire. The county name, ‘Miami-Dade,’ which should be on one line, is split across three lines: Miami, -, and Dade.

Ugh, right? I do not like this one bit.  So let’s fix this using regular expressions.


Screen Shot 2013-07-24 at 1.43.19 PM

Screen Shot 2013-07-24 at 1.43.39 PM

Like many text editors, Sublime offers a find/replace with regex feature. This is a much more powerful tool than the usual find/replace in MS Word, for example, because it allows you to match many complex cases that fit a pattern. Regular expressions let you specify that pattern.

In this example, I capture the three lines each saying Miami,-,Dade with this regex:

Miami\n-\nDade

When I enter this into the ‘Find What’ field, Sublime starts highlighting all the matches.  I am sure you already guessed that \n means a new line. Now let me enter Miami-Dade in the ‘Replace With’ field and hit ‘Replace All.’
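The same find-and-replace can also be tried outside the editor. Below is a minimal sketch using Python's built-in re module; the sample text is a made-up stand-in for the pasted PDF text, not the actual file.

```python
import re

# Hypothetical stand-in for the text pasted from the PDF.
pasted = "Miami\n-\nDade\n33010\nHialeah\nMiami\n-\nDade\n33149\nKey Biscayne\n"

# Same pattern as in the 'Find What' field, same text as in 'Replace With'.
fixed = re.sub(r"Miami\n-\nDade", "Miami-Dade", pasted)
print(fixed)
```

Every three-line ‘Miami / - / Dade’ run in the sample collapses to a single ‘Miami-Dade’ line, while the zipcode and city lines are left alone.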

Screen Shot 2013-07-24 at 2.11.43 PM

As you can see below, things are much better now. But I want each set of three lines – Miami-Dade, zipcode, and city – to be one line, with the elements separated by a comma and a space, such as ‘Miami-Dade, 33010, Hialeah’. So let’s do some more magic with regex.

Screen Shot 2013-07-24 at 2.18.17 PM

How do I describe the pattern of three lines – Miami-Dade, zipcode, and city? When I look at the PDF, I notice that the zipcode is a 5-digit number and the city name consists of alphabetic characters and spaces. I don’t see any hyphen or comma in the city name in this particular case. And the first line is always ‘Miami-Dade.’ So the following regular expression captures this pattern.

Miami-Dade\n\d{5}\n[A-Za-z ]+

Can you guess what this means? You already know that \n means a new line. \d{5} means a 5-digit number. So it will match 33013, 33149, 98765, or any number that consists of five digits. [A-Za-z ] means any alphabetic character, in either upper or lower case, or a space (N.B. Note the space at the end right before ‘]’).

Anything that goes inside [ ] is one character. Just like \d is one digit. So I need to specify how many of the characters are to be matched. If I put {5}, as I did in \d{5}, it will only match a city name that has exactly five characters, like ‘Miami.’ The pattern should match a city name of any length as long as it is not zero. The + sign does that. [A-Za-z ]+ means that an alphabetic character, in either upper or lower case, or a space should appear one or more times. (N.B. * and ? are also quantifiers like +. See the metacharacter table below to find out what they do.)

Now I hit the “Find” button, and we can see the pattern worked as I intended. Hurrah!
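For readers who want to check the pattern without an editor, here is a small sketch in Python. The sample text is invented for illustration; only the pattern comes from the walkthrough above.

```python
import re

# Hypothetical sample in the same three-line layout as the cleaned-up text.
text = "Miami-Dade\n33010\nHialeah\nMiami-Dade\n33149\nKey Biscayne\n"

# County name, newline, exactly five digits, newline, one or more letters/spaces.
pattern = r"Miami-Dade\n\d{5}\n[A-Za-z ]+"

matches = re.findall(pattern, text)
for m in matches:
    print(repr(m))  # repr() makes the embedded newlines visible
```

Because the character class [A-Za-z ] does not include the newline, each match stops at the end of the city line, so ‘Key Biscayne’ (with its space) is captured whole.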

Screen Shot 2013-07-24 at 2.24.47 PM

Now, let’s make these three lines one line each. One great thing about regex is that you can refer back to matched items. This is really useful for text manipulation. But in order to use the backreference feature in regex, we have to group the items with parentheses. So let’s change our regex to something like this:

(Miami-Dade)\n(\d{5})\n([A-Za-z ]+)

This regex shows three groups separated by a new line (\n). You will see that Sublime still matches the same three-line sets in the text file. But now that we have grouped the units we want – county name, zipcode, and city name – we can refer back to them in the ‘Replace With’ field. There are three units, and each unit can be referred to by a backslash and its order of appearance. So the county name is \1, the zipcode is \2, and the city name is \3. Since we want them all on one line, separated by a comma and a space, the following expression will work for our purpose. (N.B. Usually you can have up to nine backreferences in total, from \1 to \9. So if you want to backreference a later group, you can opt not to create a backreference from an earlier group by using (?: ) instead of ( ).)

\1, \2, \3

Do a few Replaces and then if satisfied, hit ‘Replace All’.

Ta-da! It’s magic.

Screen Shot 2013-07-24 at 2.54.13 PM
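The grouping-and-backreference step can be sketched in Python as well. The sample text is again hypothetical; Python's re.sub happens to accept the same \1, \2, \3 notation in its replacement string.

```python
import re

# Hypothetical sample in the three-line layout described above.
text = "Miami-Dade\n33010\nHialeah\nMiami-Dade\n33149\nKey Biscayne\n"

# Three capture groups: county name, five-digit zipcode, city name.
pattern = r"(Miami-Dade)\n(\d{5})\n([A-Za-z ]+)"

# \1, \2 and \3 refer back to the groups in order of appearance.
one_line = re.sub(pattern, r"\1, \2, \3", text)
print(one_line)
```

Each three-line set in the sample becomes a single line of the form ‘Miami-Dade, 33010, Hialeah’.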

Regex Metacharacters

Regex notations look a bit funky. But it’s worth learning them since they enable you to specify a general pattern that can match many different cases that you cannot catch without the regular expression.

We have already learned the four regex metacharacters: \n, \d, { }, (). Not surprisingly, there are many more beyond these. Below is a pretty extensive list of regex metacharacters, which I borrowed from the regex tutorial here: http://www.hscripts.com/tutorials/regular-expression/metacharacter-list.php . I also highly recommend this one-page Regex cheat sheet from MIT (http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf).

Note that \w will match not only an alphabetical character but also an underscore and a number. For example, \w+ matches Little999Prince_1892. Also remember that a small number of regular expression notations can vary depending on what programming language you use, such as Perl, JavaScript, PHP, Ruby, or Python.
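A quick way to see what \w does and does not cover is to try it in Python, as in this small sketch. The second test string is my own addition for contrast.

```python
import re

# \w+ covers letters, digits and the underscore, so the whole string matches.
whole = re.fullmatch(r"\w+", "Little999Prince_1892")
matched = whole.group(0)

# A hyphen is not a word character, so it splits the text into two matches.
parts = re.findall(r"\w+", "Miami-Dade")

print(matched, parts)
```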

Metacharacter Description
\ Specifies the next character as either a special character, a literal, a back reference, or an octal escape.
^ Matches the position at the beginning of the input string.
$ Matches the position at the end of the input string.
* Matches the preceding subexpression zero or more times.
+ Matches the preceding subexpression one or more times.
? Matches the preceding subexpression zero or one time.
{n} Matches exactly n times, where n is a non-negative integer.
{n,} Matches at least n times, n is a non-negative integer.
{n,m} Matches at least n and at most m times, where m and n are non-negative integers and n <= m.
. Matches any single character except “\n”.
[xyz] A character set. Matches any one of the enclosed characters.
x|y Matches either x or y.
[^xyz] A negative character set. Matches any character not enclosed.
[a-z] A range of characters. Matches any character in the specified range.
[^a-z] A negative range of characters. Matches any character not in the specified range.
\b Matches a word boundary, that is, the position between a word and a space.
\B Matches a nonword boundary. ‘er\B’ matches the ‘er’ in “verb” but not the ‘er’ in “never”.
\d Matches a digit character.
\D Matches a non-digit character.
\f Matches a form-feed character.
\n Matches a newline character.
\r Matches a carriage return character.
\s Matches any whitespace character including space, tab, form-feed, etc.
\S Matches any non-whitespace character.
\t Matches a tab character.
\v Matches a vertical tab character.
\w Matches any word character including underscore.
\W Matches any non-word character.
\un Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol.

Matching modes

You also need to know about the regex matching modes. To use these modes, you write your regex as shown above and then add one or more of these modes at the end. Note that in text editors, these options often appear as checkboxes, and some may be switched on by default.

For example, [d]\w+[g] will match only the three lower case words in ding DONG dang DING dong DANG. On the other hand, [d]\w+[g]/i will match all six words whether they are in the upper or the lower case.
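The trailing /i notation comes from languages such as Perl and JavaScript. As a rough equivalent, the sketch below shows the same comparison in Python, where the mode is passed as a flag instead of being written after the pattern.

```python
import re

text = "ding DONG dang DING dong DANG"

# Default mode: case-sensitive, so only the lower-case words match.
lower_only = re.findall(r"[d]\w+[g]", text)

# Case-insensitive mode, Python's counterpart of the /i modifier.
any_case = re.findall(r"[d]\w+[g]", text, flags=re.IGNORECASE)

print(lower_only)
print(any_case)
```

The first call finds three words and the second finds all six.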

Look-ahead and Look-behind

There are also the ‘look-ahead’ and the ‘look-behind’ features in regular expressions. These often cause confusion and are considered to be a tricky part of regex. So, let me show you a simple example of how they can be used.

Below are several lines of a person’s last name, first name, and middle name, along with his or her department name. You can see that this is a snippet from a csv file. The problem is that a value in one field – the department name – also includes a comma, which is supposed to appear only between different fields, not inside a field. So the comma becomes an unreliable separator. One way to solve this issue is to convert this csv file into a tab-delimited file, that is, using a tab instead of a comma as the field separator. That means that I need to replace all commas with tabs ‘except those commas that appear inside a department field.’

How do I achieve that? Luckily, the commas inside the department field value are all followed by a space character, whereas the separator commas between different fields are not. Using the negative look-ahead regex, I can successfully specify the pattern of a comma that is not followed by (?!) a space \s.

,(?!\s)

Below, you can see that this regex matches all commas except those that are followed by a space.

lookbehind
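As a sketch of the comma-to-tab conversion described above, here is the same negative look-ahead applied in Python. The row is invented for illustration and is not taken from the actual csv file.

```python
import re

# Hypothetical row: last name, first name, middle initial, department.
row = "Doe,Jane,Q,Library, Technical Services"

# Replace only the commas that are NOT followed by a whitespace character.
tabbed = re.sub(r",(?!\s)", "\t", row)

# Splitting on tabs now keeps the department name in one piece.
fields = tabbed.split("\t")
print(fields)
```

The comma inside ‘Library, Technical Services’ survives because it is followed by a space, so the row splits into four fields rather than five.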

For another example, the positive look-ahead regex, Ham(?=burg), will match ‘Ham’ in Hamburg when it is applied to the text: Hamilton, Hamburg, Hamlet, Hammock.
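The Hamburg example can be checked with a few lines of Python. This is only a sketch; the point to notice is that the look-ahead tests for ‘burg’ without making it part of the match.

```python
import re

text = "Hamilton, Hamburg, Hamlet, Hammock"

# Plain 'Ham' occurs in all four words.
plain_count = len(re.findall(r"Ham", text))

# With the positive look-ahead, only the 'Ham' followed by 'burg' matches.
m = re.search(r"Ham(?=burg)", text)
matched, start = m.group(0), m.start()

print(plain_count, matched, start)
```

The match is just the three letters ‘Ham’, found at the position where ‘Hamburg’ begins; ‘burg’ itself is not consumed.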

Below are the complete look-ahead and look-behind notations both positive and negative.

  • (?=pattern) is a positive look-ahead assertion
  • (?!pattern) is a negative look-ahead assertion
  • (?<=pattern) is a positive look-behind assertion
  • (?<!pattern) is a negative look-behind assertion

Can you think of any example where you can successfully apply a look-behind regular expression? (No? Then check out this page for more examples: http://www.rexegg.com/regex-lookarounds.html)

Now that we have covered even the look-ahead and the look-behind, you should be ready to tackle the very first crazy-looking regex that I introduced in the beginning of this post.

(?=^[0-5\- ]+$)(?!.*0123)\d{3}-\d{3,4}-\d{4}

Tell me what this will match! Post in the comment below and be proud of yourself.

More tools and resources for practicing regular expressions

There are many tools and resources out there that can help you practice regular expressions. Text editors such as EditPad Pro (Windows), Sublime, TextWrangler (Mac OS), Vi, and Emacs all provide regex support. Wikipedia (https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features) offers a useful comparison chart of many text editors you can refer to. RegexPal.com is a convenient online JavaScript regex tester. Firefox also has a Regular Expressions add-on (https://addons.mozilla.org/en-US/firefox/addon/rext/).

For more tools and resources, check out “Regular Expressions: 30 Useful Tools and Resources” http://www.hongkiat.com/blog/regular-expression-tools-resources/.

Library problems you can solve with regex

The best way to learn regex is to start using it right away every time you run into a problem that can be solved faster with regex. What library problem can you solve with regular expressions? What problem did you solve with regular expressions? I often use regex to clean up or manipulate large amounts of data. Suppose you have 500 links and you need to add either an EZproxy suffix or prefix to each. With regex, you can get this done in about a minute.
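To make the EZproxy case concrete, here is a minimal sketch in Python. The proxy hostname and the two links are hypothetical placeholders; substitute your own institution's prefix and your real list of links.

```python
import re

# Hypothetical proxy prefix; replace with your library's EZproxy login URL.
PROXY_PREFIX = "https://ezproxy.example.edu/login?url="

# Made-up links, one per line, standing in for the 500 real ones.
links = "http://example.org/article/1\nhttps://example.org/ebook/2\n"

# With re.MULTILINE, ^ and $ match at the start and end of every line.
proxied = re.sub(
    r"^(https?://\S+)$",
    lambda m: PROXY_PREFIX + m.group(1),
    links,
    flags=re.MULTILINE,
)
print(proxied)
```

In a text editor the equivalent would be to search for the same pattern and put the prefix followed by \1 in the replace field.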

To give you an idea, I will wrap up this post with some regex use cases several librarians generously shared with me. (Big thanks to the librarians who shared their regex use cases through Twitter! )

  • Some ebook vendors don’t alert you to new (or removed!) books in their collections but do have on their website a big A-Z list of all of their titles. For each such vendor, each month, I run a script that downloads that page’s HTML, and uses a regex to identify the lines that have ebook links in them. It uses another regex to extract the useful data from those lines, like URL and Title. I compare the resulting spreadsheet against last month’s (using a tool like diff or vimdiff) to discover what has changed, and modify the catalog accordingly. (@zemkat)
  • Sometimes when I am cross-walking data into a MARC record, I find fields that include irregular spacing that may have been used for alignment in the old setting but just looks weird in the new one. I use a regex to convert instances of more than two spaces into just two spaces. (@zemkat)
  • Recently while loading e-resource records for government documents, we noted that many of them were items that we had duplicated in fiche: the ones with a call number of the pattern “H, S, or J, followed directly by four digits”. We are identifying those duplicates using a regex for that pattern, and making holdings for those at the same time. (@zemkat)
  • I used regex as part of crosswalking metadata schemas in XML. I changed scientific OME-XML into MODS + METS to include research images into the library catalog.  (@KristinBriney)
  • Parsing MARC just has to be one. Simple things like getting the GMD out of a 245 field require an expression like this: |h\[(.*)\]  MARCedit supports regex search and replace, which catalogers can use. (@phette23)
  • I used regex to adjust the printed label in Millennium depending on several factors, including the material type in the MARC record. (@brianslone )
  • I strip out non-Unicode characters from the transformed finding aids daily with regex and replace them with Unicode equivalents. (@bryjbrown)
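Two of the use cases above are simple enough to sketch in a few lines of Python. The patterns below are my own reading of the descriptions, not the contributors' actual expressions, and the sample field and call numbers are invented.

```python
import re

# Use case: collapse runs of more than two spaces into exactly two spaces.
field = "Title     of  the      work"
tidied = re.sub(r" {3,}", "  ", field)

# Use case: call numbers that start with H, S, or J followed directly by four digits.
call_numbers = ["H1234", "S0042 pt.1", "J12", "Y 4.J 89/1", "HD1234"]
fiche_pattern = re.compile(r"^[HSJ]\d{4}")
fiche_duplicates = [c for c in call_numbers if fiche_pattern.search(c)]

print(tidied)
print(fiche_duplicates)
```

The quantifier {3,} leaves existing double spaces untouched, and the character class [HSJ] followed by \d{4} rejects both ‘J12’ (too few digits) and ‘HD1234’ (a letter where a digit is required).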

 


Python Preconference at ALA Annual: Attendee Perspective

I attended the Library Code Year Interest Group’s preconference on Python, sponsored by LITA (Library Information Technology Association) and ALCTS (Association for Library Collections and Technical Services), at the American Library Association’s conference in Chicago this year. The workshop was taught and staffed by IG members Andromeda Yelton, Becky Yoose, Bohyun Kim, Carli Spina, Eric Phetteplace, Jen Young, and Shana McDanold. It was based on work done by the Boston Python Workshop group, and used tools developed there and by Coding Bat. The preconference was designed to provide a basic introduction to the Python programming language, and succeeded admirably.

Here’s why I think it’s important for librarians to learn to code: it provides options in lots of conversations where librarians have traditionally not had many. One of the lightning talks (by Heidi Frank at NYU) concerned using Python to manipulate MARC records. The tools to do that kind of thing have tended to be either a) the property of vendors who provide the features the majority of their customers want or b) the province of overworked library systems staff. Both scenarios tend to lead to tools that are limiting or aggravating for the needs of almost every individual user. Ever try to change a graphic in your OPAC? Export a report in the file format *you* need? Learning to code is one way to solve those kinds of problems. Code empowers.

The preconference was very self-directed. There was a set of introductory tutorials before a late-morning lecture, then a lovely lunch (provided by the Python Foundation), then the option of spending more time on the morning’s activities or using the morning’s skills to work on one of two projects. The first project, ColorWall, used Python to create a bunch of very pretty cascading displays of color. The second, Wordplay, used it to answer crossword puzzle questions: “How many words fit the pattern E*G****R?” “How many words have all five vowels in order?” My table opted to stick with the morning exercises, and learned an important lesson about what kinds of things Python can infer and what kinds it can’t. Python and your computer are very fast and systematic, but also very literal-minded. I suspect that they’d much rather be doing math than whatever you’re asking them to do for you.
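To give a flavor of the Wordplay questions, here is a small sketch of how they might be answered in Python. It is not the workshop's actual code: the word list is a tiny made-up sample, and I am assuming that each * in the crossword pattern stands for exactly one letter and that "in order" means a, e, i, o, u appearing in that sequence.

```python
import re

# Tiny hypothetical word list; the real project would use a full dictionary.
words = ["ENGINEER", "EGGPLANT", "FACETIOUS", "ABSTEMIOUS", "EDUCATION", "LIBRARY"]

# E*G****R, with each * read as "any single character".
crossword = re.compile(r"E.G....R")

# The five vowels in order, with anything (or nothing) in between.
vowels_in_order = re.compile(r"A.*E.*I.*O.*U")

# A loop over the list: check each word against each pattern.
fits = [w for w in words if crossword.fullmatch(w)]
ordered = [w for w in words if vowels_in_order.search(w)]

print(fits)
print(ordered)
```

With this sample list, only ENGINEER fits the crossword pattern, and FACETIOUS and ABSTEMIOUS contain the five vowels in order.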

My own background is moderately technical. I remember writing BASIC programs for my Commodore 128. I’ve done web design and systems work for a couple of libraries, and I have a lot of experience working with software of various kinds. I used this set of directions to install Linux on the Chromebook I brought to the preconference. I have installed new Linux operating systems on computers dozens of times. This doesn’t scare me:

And I still had the feeling of 8th-grade-math-class dread when I got stuck, as I frequently did, during the second part of the guided introduction. Did I miss something? Am I just not smart enough to do this? The whole litany of self-doubt. I got completely stuck on when to use loops and when not to use them. Loops are one of the most basic Python constructs, a way of telling Python to systematically go down a list of things and do something to each of them. Utterly baffling, because it requires you to ask the question like a computer would, rather than like a human. Totally normal for me and anyone else starting out. Or, in fact, for anyone who programs. Programming is failure. The trick is learning to treat failure as part of the process, rather than as a calamity.

What’s powerful about the approach the IG used is that there were lots of people available to help, about seven teaching assistants to forty attendees. The accompanying documentation was cheerful and clear, and the attitude was “let’s figure this out together”. This is the polar opposite of a common approach to teaching programming, of which the polite paraphrase is “Read The Fine Manual”. My experience with music lessons as an adult and with the programming instruction I’ve looked at is that the (usually) well-meaning people doing the teaching tend to be people who learn best by trying things until they work. Lots of people learn best that way; lots of people do not. And librarians tend more often to be in the second category: wanting a bit of a sense of the forest before dealing with all of the trees individually. There is also a genderedness to the traditional approach. The Boston Python group’s approach (self-directed, with vast amounts of personal help available) was specifically designed to be welcoming to women and newcomers. Used here, it definitely worked. Attendees were 60% female, which is striking for a programming event, even at a library conference.

For me, learning Python is an investment in supporting the digital humanities work I will be increasingly involved in. I’m looking forward to learning how to use it to manipulate and analyze text. As I look more closely, I see that Python has modules for manipulating CSV files. One of my ongoing projects involves mapping indexes like MLA Bibliography to our full-text coverage so my students and faculty know what to expect when they use them. I’ve been using complicated Excel spreadsheets to do this, with only marginally satisfying results. I think that Python will give me better results, and hopefully results I can share with others.

The immediate takeaways for me are more about relationships and affiliation than code, though I do have a structure, in the form of the documentation for the workshop, which I will use to follow up (you can use it, too!). I am lucky enough to be in the Boston area, so I will take advantage of the active Boston Python Meetup group, which has frequent workshops and events for programmers at all levels. Most importantly, I am clear from the workshop that Python is not inherently more complicated to learn than MARC formatting or old-style DIALOG searching. I wouldn’t say the workshop demystified Python for me (there’s still a lot of work for me to do), but I will say that learning a useful amount of Python now seems entirely doable and worthwhile.

Computer code is crucial to the present and future of librarianship. Librarians have a unique facility with explaining the technical to non-technical people. Librarians learning to code together is an investment in ourselves and our profession.

About our guest author:

Chris Strauber is Humanities Research and Instruction Librarian and Coordinator of Instructional Design at Tufts University’s Tisch Library.


Libraries & Privacy in the Internet Age

Recently, we covered library data collection practices with an eye towards identifying what your library really needs to retain. In an era of seemingly comprehensive surveillance, libraries do their best to afford their patrons some privacy. Limiting our circulation statistics is a prime example: while many libraries track how many times a particular item circulates, it’s common practice to delete loan histories in patron records once items have been returned. Thus, in keeping with the Library Code of Ethics, we can “protect each library user’s right to privacy and confidentiality” while at the same time using data to improve our services.

However, not all information lives in books and our privacy protections must stay current with technology. Obfuscating the circulation of physical items is one thing, but what about all of our online resources? Most of the data noted in the data collection post is in and of the digital: web analytics, server logs, and heat maps. Today, people expose more and more of their personal information online and do so mostly on for-profit websites. In this post, I’ll go beyond library-specific data to talk further about how we can offer patrons enhanced privacy even when they’re not using resources we control, such as the library website or ILS.

Public Computers

Libraries are a great bastion of public computer access. We’re pretty much the only institution in modern society that a community can rely upon for free software use and web access. But how much thought do we put into the configuration of our public computers? Are we sure that each user’s session is completely isolated, unable to be accessed by others?

For a while, I tried to do quantitative research on how well libraries handled web browser settings on public computers. I went to whatever libraries I could: public, academic, law, any that would let me in the door and let me sit down at a computer, typically without a library card, which not every library does. If I could get to a machine, I ran a brief audit of sorts, with these being the main items:

  • List the web browsers on the machine, their versions, settings, & any add-ons present
  • Run Mozilla’s Plugin Check to test for outdated plugins, a common security vulnerability for browsers
  • Attempt to install problematic add-ons, such as keyloggers [1]
  • Attempt to change the browser’s settings, e.g. set it to offer to save passwords
  • Close the browser, then reopen it to see if my history and settings changes persisted
  • DELETE ALL THE THINGS

After a while, I gave up on this effort, because I became busy with other projects and never reached a satisfactory sample size. Of the fourteen browsers across six (see what I mean about sample size?) libraries I tested, results were discouraging:

  • 93% (all but one) of browsers were outdated
  • On average, browsers had two plug-ins with known security vulnerabilities and two-and-a-half more which were outdated
  • The majority of browsers (79%) retained their history after being closed
  • A few (36%) offered to remember passwords, which could lead to dangerous accidents on shared computers
  • The majority (86%) had no add-ons installed
  • All but one allowed its settings to be changed
  • All but one allowed arbitrary add-ons to be installed

I understand that IT departments often control public computer settings and that there are issues beyond privacy which dictate these settings. But these are miserable figures by any standard. I encourage all librarians to run similar audits on their public computers and see if any improvements can be made. We’re allowing users’ sessions to bleed over into each other and giving them too much power to monitor each other’s activity. Much as libraries commonly anonymize circulation information to protect patrons from invasive government investigations, we should strive to keep web activities safe with sensible defaults. [2]

Many libraries force users to sign in or reserve a computer. Academic libraries may use Active Directory, wherein students sign in with a common login they use for other services like email, while public libraries may use PC reservation software like EnvisionWare. These approaches go a long way towards isolating user sessions, but at the cost of imposing access barriers and slowing start-up times. Now users need an AD account or library card to use your computers. Furthermore, users don’t always remember to sign off at the end of their session, meaning someone else could still sit down at their machine and potentially access their information. These can seem like unimportant edge cases, but they’re still worthy of consideration. Privacy almost always involves some kind of tradeoff, for users and for libraries. We need to ensure we’re making the right tradeoffs with due diligence.

Proactive Privacy

Libraries needn’t be on the defensive about privacy. We can also proactively help patrons in two ways: by modifying the browsers on our public computers to offer enhanced protections and by educating the public about their privacy.

While providing sensible defaults, such as not offering to remember passwords and preventing the installation of keylogging software, is helpful, it does little to offer privacy above and beyond what one would experience on a personal machine. However, libraries can use a little knowledge and research to offer default settings which are unobtrusive and advantageous. The most obvious example is HTTPS. HTTPS is probably familiar to most people; when you see a lock or other security-connoting icon in your browser’s address bar, it’ll be right alongside a URL that begins with the HTTPS scheme. You can think of the S in HTTPS as standing for “Secure,” meaning your web traffic is encrypted as it goes from node to node in between your browser and the server delivering data.

Banking sites, social media, and indeed most web accounts are commonly accessed over HTTPS connections. They operate rather seamlessly, the same as HTTP connections, with one slight caveat: HTTPS sites don’t load HTTP resources (e.g. if https://example.com happens to include the image http://cats.com/lol.jpg) by default, meaning sometimes pieces of a page are missing or broken. This commonly results in a “mixed content” warning which the user can override, though how intuitive that process is varies widely across browser user interfaces.

In any case, mixed content happens rarely enough that HTTPS, when available, is a no-brainer benefit. But here’s the rub: not all sites default to HTTPS, even if they should. Most notably, Facebook doesn’t. Do you want your patrons logging into Facebook with unencrypted credentials? No, you don’t, because anyone monitoring network traffic, using a tool like Firesheep for instance, can grab and reuse those credentials. So installing an extension like the superlative HTTPS Everywhere [3], which uses a crowdsourced set of formulas to deliver HTTPS sites where available, is of immense benefit to users even though they likely will never notice it.

HTTPS is just a start: there are numerous add-ons which offer security and privacy enhancements, from blocking tracking cookies to the NoScript Security Suite which blocks, well, pretty much everything. How disruptive these add-ons are is variable and putting NoScript or a similar script-blocking tool on public computers is probably a bad idea; it’s simply too strange for unacquainted users to understand. But awareness of these tools is vital and some of the less disruptive ones still offer benefits that the majority of your patrons would enjoy. If you’re on the fence about a particular option, a little targeted usability testing could highlight whether it’s worth it or not.

Education

In terms of education, online privacy is a massively under-taught field. Workshops in public libraries and courses in academic libraries are obvious and in-demand services we can provide. They can cater to users of all skill levels. A basic introduction might appeal to people just beginning to use the web, covering core concepts like HTTPS, session data (e.g. cookies), and the importance of auditing account settings. An advanced workshop could cover privacy software, two-factor authentication, and pivotal extensions that have a more niche appeal.

Password management alone is a rich topic. Why? Because it’s a problem for everyone. Being a modern web user virtually necessitates maintaining a double-digit number of accounts. Password best practices are fairly well-known: use lengthy, unique passwords with a mixture of character types (lowercase and uppercase letters, numbers, and punctuation). Applying them is another matter. Repeating one password across accounts means if one company gets hacked, suddenly all your accounts are potentially vulnerable. Using tricky number-letter replacement strategies can lead to painful forgetting: was it LibrarianFervor with “1”s instead of “i”s, a “3” instead of an “e”, a “0” instead of an “o”, or any combination thereof? Or did I spell it in reverse? These strategies aren’t much more secure and yet they make remembering passwords tough.
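For a workshop, the "mixture of character types" advice can be turned into a small hands-on check. The sketch below is only an illustration: the function name, the minimum length of 12, and the sample passwords are all my own assumptions, and it checks length and character variety only, not uniqueness or overall strength.

```python
import re
import string

# One pattern per character type named in the best practices above.
CHARACTER_TYPES = {
    "lowercase": r"[a-z]",
    "uppercase": r"[A-Z]",
    "digit": r"\d",
    "punctuation": "[" + re.escape(string.punctuation) + "]",
}

def missing_types(password, min_length=12):
    """Return the names of the checks that the password fails.

    min_length=12 is an assumed threshold, not an official standard.
    """
    problems = [
        name for name, pattern in CHARACTER_TYPES.items()
        if not re.search(pattern, password)
    ]
    if len(password) < min_length:
        problems.append("length")
    return problems

print(missing_types("L1brar1anF3rv0r"))
print(missing_types("correct-Horse7battery!"))
```

The letter-swapped LibrarianFervor variant fails only the punctuation check, which shows that passing such a test says nothing about how memorable or how unique a password is.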

Users aren’t to be blamed: creating a well-considered and scalable approach to managing online accounts is difficult. But many wonderful software packages exist for this, e.g. the open source KeePass or paid solutions like 1Password and LastPass. Merely showing users these options and explaining their immense benefits is a public service.

To use a specific example, I co-taught an interdisciplinary course recently with a title broad enough (“The Nature of Knowledge,” try that on for size) that sneaking in privacy, social media, and web browsers was easy. One task I had willing students perform was to install the PrivacyFix extension and then report back on their findings. PrivacyFix analyzes your Google and Facebook settings, telling you how much you’re worth to each company and pointing out places where you might be overexposing your information. It also includes a database of site ratings, judging sites based on how well they handle users’ data.

Our class was as diverse as any at my community college: we had adult students, teenage students, working parents, athletes, future teachers, future nurses, future police officers, black students, white students, Latino students, women, men. And you know what? Virtually everyone was shocked by their findings. They gasped, they changed their settings, they did independent research on online privacy, and at the end of the course they said they still wanted to learn more. I hardly think this class was an anomaly. Americans know they’re being monitored at every turn. They want to share information online but they want to do so intelligently. If we offer them the tools to do so, they’ll jump at the chance.

Exeunt

For those who are curious about browser extensions, I wrote (shameless plug) a RUSQ column on web privacy that covers most of this post but goes further in detail in terms of recommendations. The Sec4Lib listserv is worth keeping an eye on as well, and if you really want to go the extra mile you could attend the Security preconference at the upcoming LITA Forum in November. Online privacy is not likely to get any less complicated in the future, but libraries should see that as an opportunity. We’re uniquely poised, both as information professionals with a devotion to privacy and as providers of public computing services, to tackle this issue. And no one is going to do it for us.

Footnotes

[1]^ Keyloggers are software that records each keystroke. As such, they can be used to find username and password information entered into web forms. I couldn’t find a free keylogger add-on for every browser, so I only tested in browsers which had one available.

[2]^ I have a GitHub repository of what I consider to be sensible defaults for Mozilla Firefox, which happens to store settings in a JavaScript file and thus makes it easy to version and share them. The settings are liberally commented but not tested in production.

[3]^ As you’ll notice if you visit that link, HTTPS Everywhere is only available for Google Chrome and Mozilla Firefox. In my experience, it almost never causes problems, especially with major websites like Facebook, and there are a few similar extensions which one could try, e.g. KB SSL for Chrome. Unfortunately, Internet Explorer has a much weaker add-on ecosystem with no real HTTPS solution that I’m aware of. Safari also has a weak extension ecosystem, though there is at least one HTTPS Everywhere-type option, which I haven’t tried and which has acknowledged limitations.

At the very least, installing HTTPS Everywhere on Firefox and Chrome still helps users who employ those browsers, without affecting users who prefer the others.


Getting Beyond Library Assessment Fatigue: or how SEO taught me to stop whining and love the data

This post is a bit of a thought-experiment. It grew out of a conversation I had with a colleague about something I like to call “assessment fatigue.” I believe we need quality assessment, but I also get extremely tired of hearing about the assessment fad everywhere. My fatigue with assessment-speak has been making it difficult to engage with the real work of assessment, but a recent conversation about Search Engine Optimization and Web Analytics (of all things) is helping me get beyond this. I’m hopeful that by sharing and exploring this thought-arc with you, we can profitably move beyond assessment-speak and assessment-fatigue and on to the thoughtful and intentional work of building library services informed by data.

TLDR: Jump to the list of three rules from SEO that can apply to library assessment fatigue at the bottom of this article.

Assessment Fatigue

Assessment fatigue is the state of not wanting to hear another word about measuring, rubrics, or demonstrating value. I am a frequent sufferer of assessment fatigue, despite the fact that I am convinced that assessment is absolutely necessary to guide the work of the library. I don’t know of a viable alternative to the outcomes-assessment model[1] of goal setting and performance evaluation. I think there is great work out there[2] about how to incorporate assessment into the work of academic libraries. I’ve seen it lead libraries to achieve amazing things and thus I’m a believer in the power of outcomes- and data-driven planning.

I’m also sick to death of hearing about it. It is frighteningly easy to turn talk of assessment into a dry and empty caricature of what it can be. So much so, that I’m usually hesitant to get on board with a new assessment project because they can turn into something out of a Kafka novel or Terry Gilliam movie at the drop of a hat. This gives me a bad attitude and my internal monologue can resemble: “Oh yes, let’s reduce the complexities of academic work to the things that are most easily quantified and then plot our success on a rubric.” or “Let’s reduce information literacy to a standardized test and then make our instruction program teach to that test.” I also hear Leonard Nimoy’s voice from Civilization IV in my head saying “The bureaucracy is expanding to meet the needs of the expanding bureaucracy.” These snarky thoughts are at best unhelpful and at worst get in the way of the work of the library, but I’d be lying if I denied indulging them from time to time. Assessment is undeniably necessary, but it can also be tremendously annoying for the rank and file librarians required to add gathering data to their already over-full workloads.

Happily, I’ve discovered something that rescues me from my whining and helps me engage in useful assessment activities. It comes from, oddly enough, what I’ve learned about Search Engine Optimization (SEO). This connection may appear initially to be tenuous, but using it has been profitable for me and helped both my attitude and my productivity. To help make this all a little more clear I’m going to begin by explaining what I’ve learned through teaching SEO to undergraduates and then I’ll demonstrate that SEO and library assessment share some key characteristics: they both suffer from a bad reputation among those who carry them out, they are both absolutely required in order to do the highest quality work in their respective fields, and both are ultimately justified by the power of data-driven decision making.

Teaching SEO

I include a unit on Search Engine Optimization in a course I teach on information architecture. In the class we cover basic organization theory, database structures, searching databases, search engine structure, searching the web, SEO, and microdata markup. I was reluctant at first to add the SEO unit, because I understood SEO as a largely seedy and underhanded marketing affair. Once I taught it, however, I realized that doing SEO the right way requires a nuanced understanding of how web search works. Students who learned how to do SEO also learned how to search and their insights on web search bled over and made them better database searchers as well.

Quick Primer on Web Search & SEO

What makes students who understand SEO and web architecture more effective database searchers has to do with a little-known detail of full-text keyword searching: by itself, keyword searching doesn’t work very well. More finely put, keyword search works just fine, but the results of these searches, by themselves, aren’t very useful. Finding keyword matches is easy; the real challenge is in packaging the results of a keyword search in a manner that is useful to the searcher. Unlike databases with well-organized metadata structures, keyword searches don’t have a way of telling what keywords mean. Web content has no title, author, or subject fields[3]. So when I search for “blues” the keyword search doesn’t know if I’m looking for colors, moods, music, jeans, cheeses, or the French national football side.

Because of this lack of context, search engines create useful results rankings by treating HTML tags and web structural elements as implicit metadata. If a keyword is found inside a URL, <title> tag, or <h1> tag, the site is ranked more highly than a site where the keyword appears only in the <body> tag. Anchor-link text, the words underlined in blue in a web link, is especially valuable, since it contains another person’s description of what a site is. In the following example, the anchor-link text “The ACRL TechConnect blog, a site about library technology for academic librarians” succinctly and accurately describes the content being linked to. This makes the content more findable to readers using search engines.

<a href="http://acrl.ala.org/techconnect/">The ACRL TechConnect blog, a site about library technology for academic librarians.</a>

Thus, when we code a site or even make a link, we are, in effect, cataloging the web. This is also why we should never use “click here” in our anchor-link text. When we do that we are squandering an opportunity to add descriptive information to the link, and make it more difficult for potential readers to discover our content. The following is the WRONG way to write a web link.

<a href="http://acrl.ala.org/techconnect/">Click here</a> for the ACRL TechConnect blog, a site about library technology for academic librarians.

In this example, the descriptive information is outside the link (outside the <a></a> tags) and thus unrecognisable as descriptive information to a search engine.
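To make the difference between the two links concrete, here is a minimal Python sketch that audits a snippet of HTML for non-descriptive anchor-link text. It uses only the standard library's html.parser module. The class name, the function name, and the list of "vague" phrases are my own assumptions for illustration; this is not how any search engine actually evaluates links.

```python
from html.parser import HTMLParser

# Hypothetical list of phrases treated as non-descriptive; adjust to taste.
VAGUE_PHRASES = {"click here", "here", "read more", "more", "link"}


class AnchorTextAuditor(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, text) pairs
        self._href = None      # href of the <a> currently open, if any
        self._in_link = False
        self._text = []        # text chunks seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_link:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Collapse runs of whitespace so line breaks don't matter.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._in_link = False


def vague_links(html):
    """Return the links whose anchor text says nothing about the target."""
    auditor = AnchorTextAuditor()
    auditor.feed(html)
    auditor.close()
    return [
        (href, text)
        for href, text in auditor.links
        if text.lower().strip(" .!") in VAGUE_PHRASES
    ]


good = (
    '<a href="http://acrl.ala.org/techconnect/">The ACRL TechConnect blog, '
    "a site about library technology for academic librarians.</a>"
)
bad = (
    '<a href="http://acrl.ala.org/techconnect/">Click here</a> '
    "for the ACRL TechConnect blog."
)

print(vague_links(good))  # the descriptive link is not flagged
print(vague_links(bad))   # the "Click here" link is flagged
```

A check like this could be run over a library's web pages to list the links worth rewriting, though deciding what counts as descriptive remains a human judgment.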

Search companies like Ask, Bing, Google, and Yahoo! don’t organize the web; they just capture how users and content creators organize and describe their own and each other’s content. SEO, very basically speaking, is the practice of applying knowledge of web search architecture. When we use short but descriptive text in our URLs, <title> tags, and <h1> tags, and write descriptive anchor-link text (when we practice responsible SEO, in other words), we are performing the public service of making the web more accessible for everyone. Search engine architecture and SEO are, of course, much more complicated than these short paragraphs can detail, but this is the general concept: because there is no standardized way of cataloging pages, search engine companies have found workarounds to make “a vast collection of completely uncontrolled heterogeneous documents” [Brin quote] act like a database. Using that loose metaphor, SEO can be seen as the process of getting web designers to think like catalogers.
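The idea of treating page structure as implicit metadata can be sketched as a toy scoring function in Python. This is purely illustrative: the weights, the example pages, and the example.org URLs are invented, and real search engines use far more signals and do not publish their formulas. The sketch only shows the general principle that a keyword in a URL, <title> tag, or <h1> tag counts for more than one that appears only in the body.

```python
from html.parser import HTMLParser

# Illustrative weights only; real search engines do not publish theirs.
WEIGHTS = {"url": 3, "title": 3, "h1": 2, "body": 1}


class TagTextCollector(HTMLParser):
    """Gather the text found inside <title>, <h1>, and <body>."""

    def __init__(self):
        super().__init__()
        self.text = {"title": [], "h1": [], "body": []}
        self._open = []  # tracked tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.text:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        # Text inside <h1> is also inside <body>, so it counts for both.
        for tag in self._open:
            self.text[tag].append(data.lower())


def score(url, html, keyword):
    """Add up the weights of every place the keyword appears."""
    keyword = keyword.lower()
    collector = TagTextCollector()
    collector.feed(html)
    collector.close()
    total = WEIGHTS["url"] if keyword in url.lower() else 0
    for tag, chunks in collector.text.items():
        if keyword in " ".join(chunks):
            total += WEIGHTS[tag]
    return total


# Two made-up pages that both mention "blues".
music_page = (
    "<html><head><title>Blues music history</title></head>"
    "<body><h1>Delta blues</h1><p>Guitars and harmonicas.</p></body></html>"
)
cheese_page = (
    "<html><head><title>Cheese shop</title></head>"
    "<body><h1>Our cheeses</h1><p>We stock several blues.</p></body></html>"
)

music_score = score("http://example.org/blues", music_page, "blues")
cheese_score = score("http://example.org/cheese", cheese_page, "blues")
print(music_score, cheese_score)  # the music page outranks the cheese page
```

Both pages match the keyword, but only the first one describes itself as being about blues in the places a search engine weighs most heavily, which is exactly the cataloging work that responsible SEO asks of a web designer.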

SEO’s Bad Reputation

When viewed from this perspective, SEO doesn’t seem all that bad. It seems like a natural process of understanding how search engines use web site data and using that knowledge to maximize public access to one’s site data. In reality, it doesn’t always work so cleanly. Ad revenue is based on page views and clicks, so practically speaking, SEO often becomes the process of maximizing revenue by driving traffic to a site by any means. In other words, all too often SEO experts act like marketers, not like catalogers. Because of these abuses SEO is commonly understood as the process of maximizing search ranking regardless of relevance, user intent, or ethics.

If you want to test my hypothesis here, simply send a tweet containing the letters SEO or #seo and examine the quality of your new followers and messages. (Spoiler: you’ll get a lot of spam comments and spam followers so don’t try this at home.)

Of course, SEO doesn’t have to be shady or immoral, but since there are profits to be made by shady and immoral SEO ‘experts’, the field has earned its bad reputation. Any web designer who wants people to find her content needs to perform fundamental SEO tasks, but there is little talk about the process, out of fear of being lumped in with the shady side of things. For me, the best argument for doing SEO is to keep the reason for SEO in the front of my mind: we need to bother with the mess of SEO because SEO is what connects our content to our audience. Because I care about both my audience and my content, I’m willing to do the unpleasant tasks necessary to ensure my audience can find my content.

The Connection Between Library Assessment and SEO

It seems clear to me that library assessment suffers from some of the same reputation problems that SEO has. However integral quality assessment is to library performance, the current fad status of assessment can make it difficult for librarians in their daily work to see any benefits behind the hype. Failures of past assessment fads to bring about promised changes (TQM anyone?) make librarians wary of drinking the assessment Kool-Aid. I’m not focusing here on grumpy or curmudgeonly librarians, but on hard-working professionals who have heard too many assessment pep-talks to get excited about the next project.

This is why I find SEO to be a useful analogue for library assessment. Both SEO and library assessment are things that are absolutely necessary to the success of our efforts, but both are also held in distaste by many of the rank and file who are required to engage in these activities. One key to getting beyond the initial negative reaction and the bad reputations of these activities is to focus on the reasons we have for engaging in them. For example, we do SEO because we want to connect users with our content. We do assessment because we want to make decisions based on data, not whim. We do assessment because we want to know if our efforts to serve our users are actually serving our users. In other words: because I care deeply about providing the highest quality service to our library patrons, I am willing to do underlying work to make certain our efforts are having the desired effects.

Keeping the ultimate goal in mind is not only helpful for setting priorities, but it also helps us govern the potentially insidious natures of both SEO and assessment. By this I mean that if we keep in mind that SEO is about connecting our intended audience to our quality content, we are much less likely to be tempted to engage in unsavory marketing schemes, because those are not about either our intended audience or quality content. In the same vein, if we remain mindful that library assessment is about using data to improve how we serve our users, we are unlikely to take shortcuts such as teaching to a standardized test, choosing only easily quantifiable measures, or assessing only areas of strength. These shortcuts will serve only to undermine that goal in the long run.

Moving Forward

Returning to the conversation with my colleague that sparked this post, after I finished whining about my assessment fatigue, I explained why I felt it was necessary to add a section on web analytics to my SEO course unit. My students worked on a project where they analysed a website for its SEO and made suggestions to improve access to the content. Without the data provided by web analytics, they had no way of knowing if their suggestions made things better, worse, or had no effect. She replied that this is the precise reason that librarians need to collect assessment data. Without assessment data, we have no tools to tell if our work choices improve, worsen, or have no effect on library services. She was, of course, absolutely right. Without quality assessment data, we have no way of knowing whether our decisions about library service lead to increased access to relevant information and improved patron experience.

Three SEO Rules that Apply to Library Assessment

In conclusion and in continuation of the metaphor that library assessment is a lot like SEO, here are three rules from SEO that can speak to our library assessment efforts.

1. Know how search engines function. (Know how accreditation functions.)

If you want people who use search engines to successfully find your site, you have to know how search engines function and incorporate that knowledge into your site design. Similarly, if you are assessing library performance in order to demonstrate value to stakeholders such as accreditors or campus administrators outside the library, you need to know what these bodies value and write your assessment to measure for these values.

2. Know your content and your audience. (Know your library and your users.)

The most common error in SEO efforts is designing to generate maximum traffic to your site. If successful, this approach can generate a large quantity of traffic, made up of visitors who are collectively annoyed to find themselves at your site. The proper approach is to know your content and design your SEO to attract genuinely interested traffic. A similar temptation applies to library assessment. It is possible to skew your analytics to show only amazing success in all areas, but this comes at the cost of gathering useful data about actual library services and at the cost of being able to improve services based on that data. Assessment data is valuable because it tells us about how the library serves our users. Data skewed to show only positive results is useless when it comes to helping the library achieve its mission.

3. Design for humans, not for machines. (Assess for library users.)

This rule sums up the law and the prophets for SEO: design for humans and not for machines. What it means is don’t let your desire for search ranking tempt you into designing an ugly page that your audience hates. Put the people first. When you have a choice to make between a design element that favors human readers and a design element that favors search engine crawlers, ALWAYS choose people. While SEO efforts have a real impact on search ranking, a quality web site is more important for search ranking than quality SEO effort. Similarly, if you find yourself tempted to compromise service to patrons in order to focus on assessment, always err on the side of the patron. Librarian time and attention are finite resources, but if we consciously and consistently prioritize our patrons ahead of our assessment efforts, our assessment efforts will uncover more favorable data than if we put the data ahead of the people we are here to serve.


Advice on Being a Solo Library Technologist

I am an Emerging Technologies Librarian at a small library in the middle of a cornfield. There are three librarians on staff. The vast majority of our books fit on one floor of open stacks. Being so small can pose challenges to a technologist. When I’m banging my head trying to figure out what the heck “this” refers to in a particular JavaScript function, to whom do I turn? That’s but an example of a wide-ranging set of problems:

  • Lack of colleagues with similar skill sets. This has wide-ranging ill effects, from giving me no one to ask questions to or bounce ideas off of, to making it more difficult to sell my ideas.
  • Broad responsibilities that limit time spent on technology
  • Difficulty creating durable projects that can be easily maintained
  • Difficulty determining which projects are appropriate to our scale

Though listservs and online sources alleviate some of these concerns, there’s a certain knack to being a library technologist at a small institution.[1] While I still have a lot to learn, I want to share some strategies that have helped me thus far.

Know Thy Allies

At my current position, it took me a long time to figure out how the college was structured. Who is responsible for managing the library’s public computers? Who develops the website? If I want some assessment data, where do I go? Knowing the responsibilities of your coworkers is vital and effective collaboration is a necessary element of being a technologist. I’ve been very fortunate to work with coworkers who are immensely helpful.

IT Support can help with both your personal workstation and the library’s setup. Remember that IT’s priorities are necessarily adverse to yours: they want to keep everything up and running, you want to experiment and kick the tires. When IT denies a request or takes ages to fix something that seems trivial to you, remember that they’re just as overburdened as you are. Their assistance in installing and troubleshooting software is invaluable. This is a two-way street: you often have valuable insight into how users behave and what setups are most beneficial. Try to give and take, asking for favors at the same time that you volunteer your services.

Institutional Research probably goes by a dozen different names at any given dozen institutions. These names may include “Assessment Office,” “Institutional Computing,” or even the fearsome “Institutional Review Board” of research universities. These are your data collection and management people and—whether you know it or not—they have some great stuff for you. It took me far too long to browse the IR folder on our shared drive which contains insightful survey data from the CCSSE and in-house reports. There’s a post-graduate survey which essentially says “the library here is awesome,” good to have when arguing for funding. But they also help the library work with the assessment data that our college gathers; we hope to identify struggling courses and offer our assistance.

The web designer should be an obvious contact point. Most technology is administered through the web these days—shocking, I know. The webmaster will not only be able to get you access to institutional servers but they may have learned valuable lessons from their own positions. They, too, struggle to complete a wide range of tasks. They have to negotiate many stakeholders who all want a slice of the vaunted homepage, often the subject of territorial battles. They may have a folder of good PR images or a style guide sitting around somewhere; at the very least, some O’Reilly books you want to borrow.

The Learning Management System administrator is similar to the webmaster. They probably have some coding skills and carry an immense, important burden. At my college, we have a slew of educational technologists who work in the “Faculty Development Center” and preside over the LMS. They’re not only technologically savvy, often introducing me to new tools or techniques, but they know how faculty structure their courses and have a handle on pedagogical theory. Their input can not only generate new ideas but help you ground your initiatives in a solid theoretical basis.

Finally, my list of allies is obviously biased towards academic libraries. But public librarians have similar resources available, they just go by different names. Your local government has many of these same positions: data management, web developer, technology guru. Find out who they are and reach out to them. Anyone can look for local hacker/makerspaces or meetups, which can be a great way not only to develop your skills but to meet people who may have brilliant ideas and insight.

Build Sustainably

Building projects that will last is my greatest struggle. It’s not so hard to produce an intricate, beautiful project if I pour months of work into it, but what happens the month after it’s “complete”? A shortage of ideas has never been my problem, it’s finding ones that are doable. Too often, I’ll get halfway into a project and realize there’s simply no way I can handle the upkeep on top of my usual responsibilities, which stubbornly do not diminish. I have to staff a reference desk, teach information literacy, and make purchases for our collection. Those are important responsibilities and they often provide a platform for experimentation, but they’re also stable obligations that cannot be shirked.

One of the best ways to determine if a project is feasible is to look around at what other libraries are doing. Is there an established project—for instance, a piece of open source software with a broad community base—which you can reuse? Or are other libraries devoting teams of librarians to similar tasks? If you’re seeing larger institutions struggle to perfect something, then maybe it’s best to wait until the technology is more mature. On the other hand, dipping your toe in the water can quickly give you a sense of how much time you’ll need to invest. Creating a prototype or bringing coworkers on board at early stages lets you see how much traction you have. If others are resistant or if your initial design is shown to have gaping flaws, perhaps another project is more worthy of your time. It’s an art but often saying no, dropping a difficult initiative, or recognizing that an experiment has failed is the right thing to do.

Documentation, Documentation, Documentation

One of the first items I accomplished on arrival at my current position was setting up a staff-side wiki on PBworks. While I’m still working on getting other staff members to contribute to it (approximately 90% of the edits are mine), it’s been an invaluable information-sharing resource. Part-time staff members in particular have noted how it’s nice to have one consistent place to look for updates and insider information.

How does this relate to technology? In the last couple years, my institution has added or redesigned dozens of major services. I was going to write a ludicrously long list but…just trust me, we’ve changed a lot of stuff. A new technology or service cannot succeed without buy-in, and you don’t get buy-in if no one knows how to use it. You need documentation: well-written, illustrative documentation. I try to keep things short and sweet, providing screencasts and annotated images to highlight important nuances. Beyond helping others, it’s been invaluable to me as well. Remember when I said I wasn’t so great at building sustainably? Well, I’ll admit that there are some workflows or code snippets that are Greek each time I revisit them. Without my own instructions or blocks of comments, I would have to reverse engineer the whole process before I could complete it again.

Furthermore, not all of my fellow staff share my technical skills. I’m comfortable logging into servers, running Drush commands, and analyzing the statistics I collect. And that’s not an indictment of my coworkers; they shouldn’t need to do any of this stuff. But some of my projects are reliant on arcane data schemas or esoteric commands. If I were to win the lottery and promptly retire, sophisticated projects lacking documentation would grind to a halt. Instead, I try to write instructions such that anyone could log in to Drupal and apply module updates, for instance, even if they were previously unfamiliar with the CMS. I feel a lot better knowing that my bus factor is a little higher and that I can perhaps even take a vacation without checking email, some day.

Choose Wisely

The honest truth is that smaller institutions cannot afford to invest in every new and shiny object that crosses their path. I see numerous awesome innovations at other libraries which simply are not wise investments for a college of our size. We don’t have the scale, skills, and budget for much of the technology out there. Even open source solutions are a challenge because they require skill to configure and maintain. Everything I wrote about sustainability and allies is trying to mitigate this lack of scale, but the truth is some things are just not right for us. It isn’t helpful to build projects that only you can continue, or develop ones which require so much attention that other fundamental responsibilities (doubtless less sexy—no less important) fall through the cracks.

I record my personal activities in Remember the Milk, tagging tasks according to topic. What do you think was the tag I used most last year? Makerspace? Linked data? APIs? Node.js? Nope, it was infolit. That is hardly an “emerging” field but it’s a vital aspect of my position nonetheless.

I find that the best way to select amongst initiatives is to work backwards: what is crucial to your library? What are the major challenges, obvious issues that you’re facing? While I would not abandon pet projects entirely, because sometimes they can have surprisingly wide-ranging effects, it helps to ground your priorities properly.[2] Working on a major issue virtually guarantees that your work will attract more support from your institution. You may find more allies willing to help, or at least coworkers who are sympathetic when you plead with them to cover a reference shift or swap an instruction session because you’re overwhelmed. The big issues themselves are easy to find: user experience, ebooks, discovery, digital preservation, {{insert library school course title here}}. At my college, developmental education and information literacy are huge. It’s not hard to align my priorities with the institution’s.

Enjoy Yourself

No doubt working on your own or with relatively little support is challenging and stressful. It can be disappointing to pass up new technologies because they’re too tough to implement, or when a project fails due to one of the bullet points listed above. But being a technologist should always be fun and bring feelings of accomplishment. Try to inject a little levity and experimentation into the places where it’s least expected; who knows, maybe you’ll strike a chord.

There are also at least a couple advantages to being at a smaller institution. For one, you often have greater freedom and less bureaucracy. What a single individual does on your campus may be done by a committee (or even—the horror—multiple committees) elsewhere. As such, building consensus or acquiring approval can be a much simplified process. A few informal conversations can substitute for mountains of policies, forms, meetings, and regulations.

Secondly, workers at smaller places are more likely to be jack-of-all trades librarians. While I’m a technologist, I wear plenty of more traditional librarian hats as well. On the one hand, that certainly means I have less time to devote to each responsibility than a specialist would; on the other, it gives me a uniquely holistic view of the library’s operations. I not only understand how the pieces fit together, but am better able to identify high-level problems affecting multiple areas of service.

I’m still working through a lot of these issues on my own. How do you survive as a library technologist? Is it just as tough being at a large institution? I’m all eyes.

Footnotes

[1]^ Here are a few of my favorite sources for being a technology librarian:

  • Listservs, particularly Code4Lib and Drupal4Lib. Drupal4Lib is a great place to be if you’re using Drupal and are running into issues; there are a lot of “why won’t this work” and “how do you do X at your library” threads and several helpful experts who hang around the list.
  • For professional journals, once again Code4Lib is very helpful. ITAL is also open access and periodically good tech tips appear in C&RL News or C&RL. Part of being at a small institution is being limited to open access journals; these are the ones I read most often.
  • Google. Google is great. For answering factual questions or figuring out what the most common tool is for a particular task, a quick search can almost always turn up the answer. I’d be remiss if I didn’t mention that Google usually leads me to one of a couple excellent sources, like Stack Overflow or the Mozilla Developer Network.
  • Twitter. Twitter is great, too. I follow many innovative librarians but also leading figures in other fields.
  • GitHub. GitHub can help you find reusable code, but there’s also a librarian community and you can watch as they “star” projects and produce new repositories. I find GitHub useful as a set of instructive code; if I’m wondering how to accomplish a task, I can visit a repo that does something similar and learn from how better developers do it.

[2]^ We’ve covered managing side projects and work priorities previously in “From Cool to Useful: Incorporating hobby projects into library work.”