Information Architecture for a Library Website Redesign

My library is about to embark upon a large website redesign during this summer semester. This isn’t going to be just a new layer of CSS, or a minor version upgrade to Drupal, or moving a few pages around within the same general site. No, it’s going to be a huge, sweeping change that affects the whole of our web presence. With such an enormous task at hand, I wanted to discuss some of the tools and approaches that we’re using to make sure the new site meets our needs.

Why Redesign?

I’ve heard the arguments for why a wholesale website redesign is a flawed approach and why we should instead work on our sites continually and iteratively. Continual changes keep problems from building up, and sweeping changes can disrupt users who are accustomed to the old site. The gradual redesign makes a lot of sense to me, but it also seems like a luxury I’ve never had in my library positions.

The primary problem with a series of smaller changes is that the approach assumes a solid foundation to begin with. Our current site, however, has a host of interconnected problems that make tackling any individual issue a challenge. It’s like your holiday lights sitting in a box all year; they’re hopelessly tangled by the time you take them out again.

Our site has decades of discarded, forgotten content. That’s mostly harmless; it’s hard to find and sees virtually no traffic. But it’s still not great to have outdated information scattered around. In particular, I’m not thrilled that a lot of it is static HTML, images, and documents sitting outside our content management system. It’s hard to know how much content we even have because it cannot be managed in one place.

We also fell into a pattern of adding content to the site but never removing or re-organizing existing content. Someone would ask for a button here, or a page dictating a policy there, or a new FAQ entry. Pages that were added didn’t have particular owners responsible for their currency and maintenance; I, as Systems Librarian, was expected to run the technical aspects of the site but also be its primary content editor. That’s simply an impossible task, as I don’t know every detail of the library’s operations or have the time to keep on top of a menagerie of pages of dubious importance.

I tried to create a “website changes form” to manage things, but it didn’t work for staff or for me. The few staff who did fill out the form ended up requesting things that were difficult to do—large theme changes that I wasn’t comfortable making without user testing or approval from our other librarians. What content was added amounted to minor text changes ferried through the form and through me, which slowed the editorial process and reinforced the idea that web content was solely my domain.

To top our content troubles off, we’re also on an unsupported, outdated version of Drupal. Upgrading or switching a CMS isn’t necessarily related to a website redesign. If you have a functional website on a broken piece of software, you probably don’t want to toss out the good with the bad. But in our case, similar to how our ILS migration gave us the opportunity to clean up our bibliographic records, a CMS migration gives us a chance to rebuild a crumbling website. It just doesn’t make sense to invest technical effort in migrating all our existing content when it’s so clearly in need of major structural change.

Card Sort

[Image: cards in the middle of being constructed.]

Not wanting to go into a redesign process blind, we set out to collect data on our current site and how it could be improved. One of the first ways we gathered data was to ask all library staff to perform a card sort. A card sort is an activity wherein pieces of web content are put on cards which can then be placed into categories; the idea is to form a rough information architecture for your site which can dictate structure and main menus. Card sorts can be either open or closed, meaning the categories are either invented by the participants or provided ahead of time.

For our card sort, I made a few decisions. First, I chose an open sort, since we were so uncertain about the categories. Second, I selected web content based on our existing site’s analytics. It was clear to me that our current site was bloated and disorganized; there were pages tucked into the nooks of cyberspace that no one had visited in years, and all sorts of overlapping and unnecessary content. So I selected ≈20 popular pages but also gave each group two pieces of blank paper on which to add whatever content they felt was missing.

Finally, to get as much useful data as possible, I modified the card sort procedure in a couple of ways. I asked people to role-play as different types of stakeholders (graduate & undergraduate students, faculty, administrators) and to justify their decisions from that vantage point. I also had everyone, after sorting was done, put dots on content they felt was important enough for the home page. Since one of our current site’s primary challenges is maintenance, or the lack thereof, I wanted to add one last activity wherein participants would write a “responsible staff member” on each card (e.g. the instruction librarian maintains the instruction policy page). Sadly, we ran out of time and couldn’t do that bit.

The results of the card sort were informative. A few categories emerged consistently across everyone’s sorts: collections, “about us”, policies, and current events/news. We discovered a need for new content to cover workshops, exhibits, and events happening in the library, which were currently represented only (and not very well) in blog posts. In terms of the home page, it was clear that LibGuides, collections, news, and most importantly our open hours needed to be represented.

Treejack & Analytics

Once we had enough information to build out the site’s architecture, I organized our content into a few major categories. But there were still several questions on my mind: would users understand terms like “special collections”? Would they understand where to look for LibGuides? Would they know how to find the right contact for various questions? To answer some of these questions, I turned to Optimal Workshop’s “Treejack” tool. Treejack tests a site’s information architecture by having users navigate a text-only version of its hierarchy to complete set tasks. We created a few tasks aimed at answering our questions and recruited students to perform them. While we’re only using the free tier of Optimal Workshop, and only recruited student stakeholders, the data was still informative.

For one, Optimal Workshop’s results data is rich and visualized well. It shows the exact routes each user took through our site’s content, the time it took to complete a task, and whether a task was completed directly, completed indirectly, or failed. Completed directly means the user took an ideal route through our content, with no bouncing up and down the site’s hierarchy. Indirect completion means they eventually got to the right place but didn’t take a perfect path there, while failure means they ended up in the wrong place. The graphs that demonstrate each task’s outcomes are wonderful:

[Image: the data & charts Treejack shows for a moderately successful task.]

"Pie tree" visualizing users' paths

A “pie tree” showing users’ paths while attempting a task.

We can see here that most of our users found their way to LibGuides (named “study guides” here). But a few people expected to find them under our “Collections” category and bounced around in there, clearly lost. This tells us we should represent our guides under Collections alongside items like databases, print collections, and course reserves. While you could build and run your own Treejack-style tests easily enough, I definitely recommend Optimal Workshop as a great product that provides a lot of insight.

There’s much work to be done in terms of testing—ideally we would adjust our architecture to address the difficulties that users had, recruit different sets of users (faculty & staff), and attempt to answer more questions. That’ll be difficult during the summer while there are fewer people on campus but we know enough now to start adjusting our site and moving along in the redesign process.

Another piece of our redesign philosophy is using analytics about the current site to inform our decisions about the new one. For instance, I track interactions with our home page search box using Google Analytics events 1. The search box has three tabs corresponding to our discovery layer, catalog, and LibGuides. Despite thousands of searches and interactions with the search box, LibGuides search is seeing only trace usage. The tab was clicked on a mere 181 times this year; what’s worse, only 51 times did a user actually search afterwards. This trace amount of usage, plus the fact that users are clearly clicking onto the tab and then not finding what they want there, indicates it’s just not worth any real estate on the home page. When you add in that our LibGuides now appear in our discovery layer, their search tab is clearly disposable.

What’s Next

Data, tests, and conceptual frameworks aside, our next stage will involve building something much closer to an actual, functional website. Tools like Optimal Workshop are wonderful for providing high-level views on how to structure our information, but watching a user interact with a prototype site is so much richer. We can see their hesitation, hear them discuss the meanings of our terms, get their opinions on our stylistic choices. Prototype testing has been a struggle for me in the past; users tend to fixate on the unfinished or unrefined nature of the prototype, providing feedback that tells me what I already know (yes, we need to replace the placeholder images; yes, “Lorem ipsum dolor sit amet” is written on every page) rather than something new. I hope to counter that by setting appropriate expectations and building a small but fairly robust prototype.

We’re also building our site in an entirely new piece of software, Wagtail. Wagtail is exciting for a number of reasons, and will probably have to be the subject of future posts, but it does help address some of the existing issues I noted earlier. We’re excited by the innovative StreamField approach to content—a replacement for large, rich text fields which are unstructured and often let users override a site’s base styles. We’ve also heard whispers of new workflow features which would let us send reminders to the owners of different content pages to revisit them periodically. While I could do something like this myself with an ad hoc mess of calendar events and spreadsheets, having it built right into the CMS bodes well for our future maintenance plans. Obviously, the concepts underlying Wagtail and the tools it offers will influence how we implement our information architecture. But we also started gathering data long before we knew what software we’d use, so exactly how it will all fit together remains to be figured out.

Has your library done a website redesign or information architecture test recently? What tools or approaches did you find useful? Let us know in the comments!

Notes

  1. I described Google Analytics events before in a previous Tech Connect post

Representing Online Journal Holdings in the Library Catalog

The Problem

It isn’t easy to communicate to patrons which serials they have access to and in what form (print, online). They can find these details, sure, but they’re scattered across our library’s web presence. What’s most frustrating is that we clearly have all the necessary information, but the systems offer no built-in way to produce a clear display of it. My fellow librarians noted that “it’d be nice if the catalog showed our exact online holdings” and my initial response was to sigh and say “yes, that would be nice”.

To illustrate the scope of the problem, a user can search for journals in a few of our disparate systems:

  • we use a knowledgebase to track database subscriptions and which journals are included in each subscription package
  • the public catalog for our Koha ILS has records for our print journals, sometimes with a MARC 856$u 1 link to our online holdings in the knowledgebase
  • our discovery layer has both article-level results for the journals in our knowledgebase and journal-level search results for the ones in our catalog

While these systems overlap, they also serve distinct purposes, so it’s not so awful. However, there are a few downsides to our triad of serials information systems. First of all, if a patron searches the knowledgebase looking for a journal which we only have in print, the database holdings there wouldn’t show that we have access to print issues. To work around this, we track our print issues both in our ILS and in the knowledgebase, which duplicates work and introduces possible inconsistencies.

Secondly, someone might start their research in the discovery layer, finding a journal-level record that links out to our catalog. But it’s too much to ask a user to search the discovery layer, click into the catalog, click a link out to the knowledgebase, and only then discover our online holdings don’t include the particular volume they’re looking for. Possessing three interconnected systems creates labyrinthine search patterns and confusion amongst patrons. Simply describing the systems and their nuanced areas of overlap in this post feels like a challenge, and the audience is librarians. I can imagine how our users must feel when we try to outline the differences.

The 360 XML API

Our knowledgebase is Serials Solutions 360KB. I went looking in the vendor’s help documentation for answers, which refers to an API for the product but apparently provides no information on using said API. Luckily, a quick search through GitHub projects yielded several using the API and I was able to determine its URL structure: http://{{your Serials Solution ID}}.openurl.xml.serialssolutions.com/openurlxml?version=1.0&url_ver=Z39.88-2004&issn={{the journal’s ISSN}}

It’s probably possible to search by other parameters as well, but for my purposes ISSN was ideal so I didn’t bother investigating further. If you send a request to the address above, you receive XML in response:

<ssopenurl:openURLResponse xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ssdiag="http://xml.serialssolutions.com/ns/diagnostics/v1.0" xmlns:ssopenurl="http://xml.serialssolutions.com/ns/openurl/v1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://xml.serialssolutions.com/ns/openurl/v1.0 http://xml.serialssolutions.com/ns/openurl/v1.0/ssopenurl.xsd http://xml.serialssolutions.com/ns/diagnostics/v1.0 http://xml.serialssolutions.com/ns/diagnostics/v1.0/diagnostics.xsd">
    <ssopenurl:version>1.0</ssopenurl:version>
    <ssopenurl:results dbDate="2017-02-15">
        <ssopenurl:result format="journal">
            <ssopenurl:citation>
                <dc:source>Croquis</dc:source>
                <ssopenurl:issn type="print">0212-5633</ssopenurl:issn>
            </ssopenurl:citation>
            <ssopenurl:linkGroups>
                <ssopenurl:linkGroup type="holding">
                    <ssopenurl:holdingData>
                        <ssopenurl:startDate>1989</ssopenurl:startDate>
                        <ssopenurl:providerId>PRVLSH</ssopenurl:providerId>
                        <ssopenurl:providerName>Library Specific Holdings</ssopenurl:providerName>
                        <ssopenurl:databaseId>ZYW</ssopenurl:databaseId>
                        <ssopenurl:databaseName>CCA Print Holdings</ssopenurl:databaseName>
                        <ssopenurl:normalizedData>
                            <ssopenurl:startDate>1989-01-01</ssopenurl:startDate>
                        </ssopenurl:normalizedData>
                    </ssopenurl:holdingData>
                    <ssopenurl:url type="source">https://library.cca.edu/</ssopenurl:url>
                    <ssopenurl:url type="journal">
                    https://library.cca.edu/cgi-bin/koha/opac-search.pl?idx=ns&q=0212-5633
                    </ssopenurl:url>
                </ssopenurl:linkGroup>
            </ssopenurl:linkGroups>
        </ssopenurl:result>
    </ssopenurl:results>
    <ssopenurl:echoedQuery timeStamp="2017-02-15T16:14:12">
        <ssopenurl:library id="EY7MR5FU9X">
            <ssopenurl:name>California College of the Arts</ssopenurl:name>
        </ssopenurl:library>
        <ssopenurl:queryString>version=1.0&url_ver=Z39.88-2004&issn=0212-5633</ssopenurl:queryString>
    </ssopenurl:echoedQuery>
</ssopenurl:openURLResponse>

If you’ve read XML before, then it’s apparent how useful the above data is. It contains a list of our “holdings” for the periodical, with the start and end dates of the subscription (the end date is absent here, which implies the holdings run to the present), which database they’re in, and what URL they can be accessed at. Perfect! The XML contains precisely the information we want to display in our catalog.
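If you were prototyping against this response in Python rather than PHP (a sketch of one option, not our production code; the function and variable names below are my own), the standard library’s ElementTree can pull those fields out once it is told about the namespaces:

import xml.etree.ElementTree as ET

# Namespace URIs taken from the response above; ElementTree needs them spelled out.
NAMESPACES = {
    "ss": "http://xml.serialssolutions.com/ns/openurl/v1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def holdings_from_response(xml_string):
    """Return a list of dicts, one per holding, from a 360 XML API response."""
    root = ET.fromstring(xml_string)
    holdings = []
    for group in root.findall(".//ss:linkGroup[@type='holding']", NAMESPACES):
        data = group.find("ss:holdingData", NAMESPACES)
        holdings.append({
            "start": data.findtext("ss:normalizedData/ss:startDate", None, NAMESPACES),
            "database": data.findtext("ss:databaseName", None, NAMESPACES),
            "url": group.findtext("ss:url[@type='journal']", None, NAMESPACES),
        })
    return holdings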

Unfortunately, our catalog’s JavaScript doesn’t have permission to access the 360 XML API. Due to a browser security policy, resources must explicitly state that other domains or pages are allowed to request their data. Under this policy, called Cross-Origin Resource Sharing (CORS), a response needs to include the Access-Control-Allow-Origin HTTP header, and the 360 API does not send one.

We can work around this limitation, but it requires extra code on our part. While JavaScript from a web page cannot request data directly from 360, we can write a server-side script to pull the data, and that script can add its own CORS header which lets our catalog use it. So, in essence, we set up a proxy service that acts as a go-between for our catalog and the API that the catalog cannot use directly. Typically, this takes little code; the server-side script takes a parameter passed to it in the URL, sends it in an HTTP request to another server, and serves back up whatever response it receives.
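My proxy is a PHP script, but the pattern is the same in any language. As a rough sketch of the idea—not our actual code—here is what a comparable pass-through proxy could look like in Python using Flask and the requests library (the route name and the placeholder client ID are invented):

import requests
from flask import Flask, Response, request

app = Flask(__name__)

# Placeholder: your Serials Solutions client ID forms the start of the hostname.
API_BASE = "http://YOUR_CLIENT_ID.openurl.xml.serialssolutions.com/openurlxml"

@app.route("/holdings")
def holdings_proxy():
    # Take the ISSN passed to us in the URL...
    issn = request.args.get("issn", "")
    # ...forward it to the 360 XML API from the server side...
    upstream = requests.get(API_BASE, params={
        "version": "1.0",
        "url_ver": "Z39.88-2004",
        "issn": issn,
    })
    # ...and serve the response back with a CORS header so that the
    # catalog's client-side JavaScript is allowed to read it.
    return Response(
        upstream.content,
        content_type=upstream.headers.get("Content-Type", "text/xml"),
        headers={"Access-Control-Allow-Origin": "*"},
    )

In practice, as described later on, our script goes one step further and converts the XML into JSON before handing it back to the catalog.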

Of course, it didn’t turn out to be that simple in practice. As I experimented with my scripts, I could tell that the 360 data was being received, but I couldn’t parse meaningful pieces of information out of it. It was clearly there; I could see the full XML structure with holdings details. But neither my server-side PHP nor my client-side JavaScript could “find” XML elements like <ssopenurl:linkGroup> and <ssopenurl:normalizedData>. The text before the colon in those tag names is a namespace prefix, and simple jQuery code like $('ssopenurl:linkGroup', xml), which can typically parse XML data, wasn’t working with these namespaced elements.

Finally, I discovered the solution by reading the PHP manual’s entry for the simplexml_load_string function: I can tell PHP how to parse namespaced XML by passing a namespace parameter to the parser function. So my function call turned into:

// parameters:
// 1) the XML string pulled from the 360 API ($url is the API address built earlier)
// 2) the class of object the function should return (this is the default)
// 3) Libxml options (also the default, no special options)
// 4) (finally!) the namespace we care about, "ssopenurl"
// 5) true means the previous parameter is a prefix rather than a namespace URI
$xml = simplexml_load_string( file_get_contents($url), 'SimpleXMLElement', 0, 'ssopenurl', true );

As you can see, two of those parameters don’t even differ from the function’s defaults, but I still need to provide them to get to the “ssopenurl” namespace later. As an aside, technical digressions like these are some of the best and worst parts of my job. It’s rewarding to encounter a problem, perform research, test different approaches, and eventually solve it. But it’d also be nice, and a lot quicker, if code would just work as expected the first time around.

The Catalog

We’re lucky that Koha’s catalog both allows for JavaScript customization and has a well-structured, easy-to-modify record display. Now that I’m able to grab online holdings data from our knowledgebase, inserting it into the catalog is trivial. If you wanted to do the same with a different library catalog, the only changes would come in the JavaScript that finds ISSN information in a record and then inserts the retrieved holdings information into the display. The complete outline of the data flow from catalog to knowledgebase and back looks like:

  • my JavaScript looks for an ISSN on the record’s display page
  • if there’s an ISSN, it sends the ISSN to my proxy script
  • the proxy script adds a few parameters & asks for information from the 360 XML API
  • the 360 XML API returns XML, which my proxy script parses into JSON and sends to the catalog
  • the catalog JavaScript receives the JSON and parses holdings information into formatted HTML like “Online resources: 1992 to present in DOAJ”
  • the JS inserts the formatted text into the record’s “online resources” section, creating that section if it doesn’t already exist

Is there a better way to do this? Almost certainly. The six steps above should give you a sense of how convoluted the process is, hacking around a few limitations. Still, the outcome is positive: we stopped updating our print holdings in our knowledgebase and our users have more information at their fingertips. It obviates the final step in the protracted “discovery layer to catalog” search described in the opening of this post.

Our next steps are obvious, too: we should aim to get this information into the discovery layer’s search results for our journals. The general frame of this project would be the same; we already know how to get the data from the API. Much like working with a different library catalog, the only edits are in parsing ISSNs from discovery layer search results and finding a spot in the HTML to insert the holdings data. Finally, we can also remove the redundant and less useful 856$u links from our periodical MARC records now.

The Scripts

These are highly specific to our catalog, but may be of general use to others who want to see how the pieces work together:

Notes

  1. For those unfamiliar, 856 is the MARC field for URLs, whether the URL represents the actual resource being described or something supplementary. It’s pretty common for print journals to also have 856 fields for their online counterparts.

#1Lib1Ref Edit (2017)

I participated in the “#1Lib1Ref” campaign again this year, recording my experience and talking through why I think it’s important.


Online Privacy in Post-Election America

A commitment to protecting the privacy of our patrons is enshrined in the ALA Code of Ethics. While that has always been an important aspect of librarianship, it’s become even more pivotal in an information age where privacy is far more nuanced and difficult to achieve. Given the rhetoric of the election season, and statements made by our President-Elect as well as his Cabinet nominees 1, the American surveillance state has become even more disconcerting. As librarians, we have an obligation to empower our communities with the knowledge they need to secure their own personal information. This post will cover, at a high level, a few areas where librarians of various types can assist patrons.

The Tools

Given that so much information is exchanged online these days, librarians are in a unique position to educate patrons about the Internet. We spend so much time either building web services or utilizing them that it’s highly likely a librarian knows more about the web than your average citizen. As such, we can share some of the powerful pieces of software and services that aid in protecting one’s online presence. To name just a handful that almost everyone could benefit from knowing:

DuckDuckGo is a privacy-aware search engine which explicitly does not track individual users. While it is a for-profit endeavor earning money through ad revenue, its policies set it apart from major competitors such as Google and Bing.

TorBrowser is a web browser utilizing The Onion Router protocol which obfuscates the user’s IP address, essentially masking their online activities behind a web of redirects. The Tor network is run by volunteers and TorBrowser is open source software developed by a non-profit organization.

HTTPS is the encrypted version of HTTP, the data transfer protocol that powers the web. HTTPS sites are less likely to have their traffic intercepted or surveilled. Tools like HTTPS Everywhere help one to find HTTPS versions of sites without too much trouble.

Two-factor authentication is available for many apps and web services. It decreases the possibility that a third-party can access your account by providing an additional layer of protection beyond your password, e.g. through a code sent to your phone.

Signal is an open source private messaging app which uses end-to-end encryption; think of it as HTTPS for your text messages. Signal is made by Open Whisper Systems which, like the Tor Project, is a non-profit.

These are just a few major tools in different areas, all of which are worth knowing about. Many have usability trade-offs but switching to just one or two is enough to substantially improve an individual’s privacy.

Privacy Workshops

Merely knowing about particular pieces of software is not enough to secure one’s communications. Tor perhaps says it best in their “Tips on Staying Anonymous”:

Tor is NOT all you need to browse anonymously! You may need to change some of your browsing habits to ensure your identity stays safe.

A laundry list of web browsers, extensions, and apps doesn’t do much by itself. A person’s behavior is still the largest factor in how private their information is. One can visit a secure HTTPS site but still use a password that’s trivial to crack; one can use the “incognito” or “privacy” mode of a browser but still be tracked by their IP address. Online privacy is an immensely complicated and difficult subject which requires knowledge of practices as well as tools. As such, libraries can offer workshops that teach both at once. Most libraries teach skills-based workshops, whether they’re on using a citation manager or how to evaluate information sources for credibility. Adding privacy skills is a natural extension of work we already do. Workshops can fit into particular classes—whether they’re history, computer science, or ethics—or be extra-curricular. Look for sympathetic partners on campus, such as student groups or concerned faculty, to see if you can collaborate or at least find an avenue for advertising your events.

Does your library not have anyone qualified or willing to teach a privacy workshop? Consider contacting an outside expert. The Library Freedom Project immediately comes to mind as a wonderful resource offering: a privacy toolkit for librarians, an online class, “train the trainers” type events, and community-focused workshops.2 Academic librarians may also have access to local computer security experts, whether they’re computer science instructors or particularly savvy students, who would be willing to lend their expertise. My one caution would be that just because someone is a subject expert doesn’t mean they’re equipped to effectively lead a workshop, and that working with an expert to ensure an event is tailored to your community will be more successful than simply outsourcing the entire task.

Patron Data

Depending on your position at your library, this final section might either be the most or least obvious thing to be done: control access to data about your patrons. If you’re an instruction or reference librarian, I imagine workshops were the first thing on your mind. If you’re a systems librarian such as myself, you may have thought of technologies like HTTPS or considered data security measures. This section will be longer not because it’s more important, but because these are topics I think about often as they directly relate to my job responsibilities.

Patron data is tricky. I’ll be the first to admit that my library collects quite a bit of data about patrons, a rather small amount of which contains personally identifying information. Data is extremely useful both in fine-tuning our services to meet community needs as well as in demonstrating our value to stakeholders like the college administration. Still, there is good reason to review data practices and web services to see if anything can be improved. Here’s a brief list of heuristics to use:

Are your websites using HTTPS? Secure sites, especially ones with patron accounts that hold sensitive information, help prevent data from being intercepted by third parties. I fully realize this is actually more difficult than it appears; our previous ILS offered HTTPS but only as a paid add-on which we couldn’t afford. If a vendor is the holdup here, pester them relentlessly until progress is made. I’ve found that most vendors understand that HTTPS is important; it’s just further down in their development priorities. Making a fuss can change that.

Is personal information being unnecessarily collected? What’s “necessary” is subjective, certainly. A good measure is looking at when the last time personal information was actually used in any substantive manner. If you’re tracking the names of students who ask reference questions, have you ever actually needed them for follow-ups? Could an anonymized ID be used instead? Could names be deleted after a certain amount of time has passed? Which brings us to…

Where personal information is collected, do retention policies exist? E.g. if you’re doing website user studies that record someone’s name, likeness, or voice, do you eventually delete the files? This goes for paper files as well, which can be reviewed and then shredded if deemed unnecessary. Retention policies are beneficial in a few ways. They not only prevent old data from leaking into the wrong hands, they often help with organization and “spring cleaning” tasks. I try to review my hard drive periodically for random files I’ve been sent by faculty or students which can be cleaned out.

Can patrons be empowered with options regarding their own data? Opt-in policies regarding data retention are desirable because they allow a library to collect information that might prove valuable while also giving people the ability to limit their vulnerabilities. Catalog reading lists are the quintessential example: some patrons find these helpful as a tool to review what they’ve read, while others would prefer to obscure their checkout history. It should go without saying that these options are rather useless without any surrounding education. Patrons need to know what’s at stake and how to use the systems at their disposal; the setting does nothing by itself. While optional workshops typically reach only a fragment of the overall student population, perhaps in-browser tips and suggestions can prompt our users to consider the ramifications of their account’s configuration.

Relevance Ranking

Every so often, an event will happen which foregrounds the continued relevance of our profession. The most recent American election was an unmitigated disaster in terms of information literacy 3, but it also presents an opportunity for us to redouble our efforts where they are needed. As with the terrifying revelations of Edward Snowden, we are reminded that we serve communities that are constantly at risk of oppression, surveillance, and strife. As information professionals, we should strive to take on the challenge of protecting our patrons, and much of that protection occurs online. We can choose to be paralyzed by distress when faced with the state of affairs in our country, or to be challenged to rise to the occasion.

Notes

  1. To name a few examples, incoming CIA chief Mike Pompeo supports NSA bulk data collection and President-Elect Trump has been ambiguous as to whether he supports the idea of a registry or database for Muslim Americans.
  2. Library Freedom Director Alison Macrina has an excellent running Twitter thread on privacy topics which is worth consulting whether you’re an expert or novice.
  3. To note but two examples, the President-Elect persistently made false statements during his campaign and “fake news” appeared as a distinct phenomenon shortly after the election.

A High-Level Look at an ILS Migration

My library recently performed that most miraculous of feats—a full transition from one integrated library system to another, specifically Innovative’s Millennium to the open source Koha (supported by ByWater Solutions). We were prompted to migrate by Millennium’s approaching end-of-life and a desire to move to a more open system where we feel in greater control of our data. I’m sure many librarians have been through ILS migrations, and plenty has been written about them, but as this was my first I wanted to reflect upon the process. If you’re considering changing your ILS, or if you work in another area of librarianship & wonder how a migration looks from the systems end, I hope this post holds some value for you.

Challenges

No migration is without its problems. For starters, certain pieces of data in our old ILS weren’t accessible in any meaningful format. While Millennium has a robust “Create Lists” feature for querying & exporting different types of records (patron, bibliographic, vendor, etc.), it does not expose certain types of information. We couldn’t find a way to export detailed fines information, only a lump sum for each patron. To help with this post-migration, we saved an email listing of all itemized fines that we can refer to later. The email is saved as a shared Google Doc which allows circulation staff to comment on it as fines are resolved.

We also discovered that patron checkout history couldn’t be exported in bulk. While each patron can opt-in to a reading history & view it in the catalog, there’s no way for an administrator to download everyone’s history at once. As a solution, we kept our self-hosted Millennium instance running & can login to patrons’ accounts to retrieve their reading history upon request. Luckily, this feature wasn’t heavily used, so access to it hasn’t come up many times. We plan to keep our old, self-hosted ILS running for a year and then re-evaluate whether it’s prudent to shut it down, losing the data.

While some types of data simply couldn’t be exported, many more couldn’t emigrate in their exact same form. An ILS is a complicated piece of software, with many interdependent parts, and no two are going to represent concepts in the exact same way. To provide a concrete example: Millennium’s loan rules are based upon patron type & the item’s location, so a rule definition might resemble

  • a FACULTY patron can keep items from the MAIN SHELVES for four weeks & renew them once
  • a STUDENT patron can keep items from the MAIN SHELVES for two weeks & renew them two times

Koha, however, uses patron category & item type to determine loan rules, eschewing location as the pivotal attribute of an item. Neither implementation is wrong in any way; they both make sense, but are suited to slightly different situations. This difference necessitated completely reevaluating our item types, which didn’t previously affect loan rules. We had many, many item types because they were meant to represent the different media in our collection, not act as a hook for particular ILS functionality. Under the new system, our Associate Director of Libraries put copious work into reconfiguring & simplifying our types such that they would be compatible with our loan rules. This was a time-consuming process & it’s just one example of how a straightforward migration from one system to the next was impossible.
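Purely to illustrate the structural difference—the rule values below echo the example above, but the layout is invented and is not either system’s actual configuration—Millennium keys a loan rule on the patron type and the item’s location, while Koha keys it on the patron category and the item type:

# Millennium-style lookup: the item's *location* is part of the key.
millennium_rules = {
    ("FACULTY", "MAIN SHELVES"): {"loan_weeks": 4, "renewals": 1},
    ("STUDENT", "MAIN SHELVES"): {"loan_weeks": 2, "renewals": 2},
}

# Koha-style lookup: the item's *type* is part of the key instead, which is
# why our item types suddenly had to do double duty and needed rethinking.
koha_rules = {
    ("FACULTY", "BOOK"): {"loan_weeks": 4, "renewals": 1},
    ("STUDENT", "BOOK"): {"loan_weeks": 2, "renewals": 2},
}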

While some data couldn’t be exported, and others needed extensive rethinking in the new ILS, there was also information that could only be migrated after much massaging. Our patron records were a good example: under Millennium, users logged in on an insecure HTTP page with their barcode & last name. Yikes. I know, I felt terrible about it, but integration with our campus authentication & upgrading to HTTPS were both additional costs that we couldn’t afford. Now, under Koha, we can use the campus CAS (a central authentication system) & HTTPS (yay!), but wait…we don’t have the usernames for any of our patrons. So I spent a while writing Python scripts to parse our patron data, attempting to extract usernames from institutional email addresses. A system administrator also helped use unique identifying information (like phone number) to find potential patron matches in another campus database.
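To give a sense of what those scripts did (the export columns and the email pattern here are hypothetical stand-ins, not our actual data layout):

import csv
import re

# Hypothetical assumption: campus usernames are the local part of an
# institutional email address like jdoe@cca.edu.
INSTITUTIONAL_EMAIL = re.compile(r"^([A-Za-z0-9._-]+)@cca\.edu$")

def guess_username(email):
    """Return the likely campus username for an institutional address, else None."""
    match = INSTITUTIONAL_EMAIL.match(email.strip())
    return match.group(1).lower() if match else None

with open("millennium_patrons.csv", newline="") as export:
    for row in csv.DictReader(export):
        username = guess_username(row.get("EMAIL", ""))
        if username is None:
            # No institutional address: flag for manual matching against
            # another campus database (e.g. by phone number).
            print("needs review:", row.get("BARCODE", "unknown"))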

A more amusing example of weird Millennium data was active holds, which are stored in a single field on item records & look like this:

P#=12312312,H#=1331,I#=999909,NNB=12/12/2016,DP=09/01/2016

Can you tell what’s going on here? With a little poking around in the system, it became apparent that letters like “NNB” stood for “date not needed by” & that other fields were identifiers connecting to patron & item records. So, once again, I wrote scripts to extract meaningful details from this silly format.
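The parsing itself is straightforward once the codes are decoded. Here is a small sketch along the lines of what those scripts did—the meanings of the identifier codes other than NNB are my own guesses, included only for illustration:

from datetime import datetime

# NNB was "date not needed by"; the # codes are identifiers linking the hold
# to patron & item records. "DP" is assumed here to be the date placed.
FIELD_NAMES = {"P#": "patron_id", "H#": "hold_id", "I#": "item_id",
               "NNB": "not_needed_by", "DP": "date_placed"}
DATE_FIELDS = {"not_needed_by", "date_placed"}

def parse_hold(raw):
    """Turn 'P#=123,...,NNB=12/12/2016' into a dict with real date objects."""
    hold = {}
    for pair in raw.split(","):
        code, value = pair.split("=", 1)
        name = FIELD_NAMES.get(code, code)
        if name in DATE_FIELDS:
            value = datetime.strptime(value, "%m/%d/%Y").date()
        hold[name] = value
    return hold

print(parse_hold("P#=12312312,H#=1331,I#=999909,NNB=12/12/2016,DP=09/01/2016"))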

I won’t lie, the data munging was some of the most enjoyable work of the migration. Maybe I’m weird, but it was both challenging & interesting as we were suddenly forced to dive deeper into our old system and understand more of its hideous internal organs, just as we were leaving it behind. The problem-solving & sleuthing were fun & distracted me from some of the more frustrating challenges detailed above.

Finally, while we had a migration server where we tested our data & staff played around for almost a month’s time, when it came to the final leap things didn’t quite work as expected. The CAS integration, which I had so anticipated, didn’t work immediately. We started bumping into errors we hadn’t seen on the migration server. Much of this is inevitable; it’s simply unrealistic to create a perfect replica of our live catalog. We cannot, for instance, host the migration server on the exact same domain, and while that seems like a trivial difference it does affect a few things. Luckily, we had few summer classes so there was time to suffer a few setbacks & now that our fall semester is about to begin, we’re in great shape.

Difference & Repetition

Koha is primarily used by public libraries, and as such we’ve run into a few areas where common academic library functions aren’t implemented in a familiar way or are unavailable. Often, it’s that our perspective is so heavily rooted in Millennium that we need to think differently to achieve the same effect in Koha. But sometimes it’s clear that what’s a concern to us isn’t to other libraries.

For instance, bib records for serials with large numbers of issues are an ongoing struggle for us. We have many print periodicals with extensive holdings, including bound editions of past issues. The holdings display in the catalog is more oriented towards recent periodicals & showing whether the latest few issues have arrived yet. That’s fine for materials like newspapers or popular magazines with few back issues, and I’ve seen a few public libraries using Koha that have minimalistic periodical records intended only to point the patron to a certain shelf. However, we have complex holdings like “issues 1 through 10 are bound together, issue 11 is missing, issues 12 through 18 are held in a separate location…” Parsing the catalog record to determine if we have a certain issue, and where it might be, is quite challenging.

Another example of the public versus academic divide: there’s no “recall” feature per se in Koha, wherein a faculty member could retrieve an item they want to place on course reserve from a student. Instead, we have tried to simulate this feature with a mixture of adjustments to our loan rules & internal reports which show the status of contested items. Recall isn’t a huge feature & isn’t used all the time—it’s not something we thought to research when selecting our new ILS—but it’s a great example of a minute difference that ended up creating a headache as we adapted to a new piece of software.

Moving from Millennium to Koha also meant we were shifting from a closed source system where we had to pay additional fees for limited API access to an open source system which boasts full read access to the database via its reporting feature. Koha’s open source nature has been perhaps the biggest boon for me during our migration. It’s very simple to look at the actual server-side code generating particular pages, or pull up specific rows in database tables, to see exactly what’s happening. In a black box ILS, everything we do is based on a vague adumbration of how we think the system operates. We can provide an input & record the output, but we’re never sure about edge cases or whether strange behavior is a bug or somehow intentional.

Koha has its share of bugs, I’ve discovered, but thankfully I’m able to jump right into the source code itself to determine what’s occurring. I’ve been able to diagnose problems by looking at open bug reports on Koha’s bugzilla tracker, pondering over perl code, and applying snippets of code from the Koha wiki or git repository. I’ve already submitted two bug patches, one of which has been pulled into the project. It’s empowering to be able to trace exactly what’s happening when troubleshooting & submit one’s own solution, or just a detailed bug report, for it. Whether or not a patch is the best way to fix an issue, being able to see precisely how the system works is deeply satisfying. It also makes it much easier for me to design JavaScript hacks that smooth over issues on the client side, be it in the staff-facing administrative functions or the public catalog.

What I Would Do Differently

Set clearer expectations.

We had Millennium for more than a decade. We invested substantial resources, both monetary & temporal, in customizing it to suit our tastes & unique collections. As we began testing the new ILS, the most common feedback from staff fell along the lines of “this isn’t like it was in Millennium”. I think that would have been a less common observation, or perhaps phrased more productively, if I’d made it clear that a) it’ll take time to customize our new ILS to the degree of the old one, and b) not everything will be or needs to be the same.

Most of the customization decisions were made years ago & were never revisited. We need to return to the reason why things were set up a certain way, then determine if that reason is still legitimate, and finally find a way to achieve the best possible result in the new system. Instead, it’s felt like the process was framed more as “how do we simulate our old ILS in the new one” which sets us up for disappointment & failure from the start. I think there’s a feeling that a new system should automatically be better, and it’s true that we’re gaining several new & useful features, but we’re also losing substantial Millennium-specific customization. It’s important to realize that just because everything is not optimal out of the box doesn’t mean we cannot discover even better solutions if we approach our problems in a new light.

Encourage experimentation, deny expertise.

Because I’m the Systems Librarian, staff naturally turn to me with their systems questions. Here’s a secret: I know very little about the ILS. Like them, I’m still learning, and what’s more I’m often unfamiliar with the particular quarters of the system where they spend large amounts of time. I don’t know what it’s like to check in books & process holds all day, but our circulation staff do. It’s been tough at times when staff seek my guidance & I’m far from able to help them. Instead, we all need to approach the ongoing migration as an exploration. If we’re not sure how something works, the best way is to research & test, then test again. While Koha’s manual is long & quite detailed, it cannot (& arguably should not, lest it grow to unreasonable lengths) specify every edge case that can possibly occur. The only way to know is to test & document, which we should have emphasized & encouraged more towards the start of the process.

To be fair, many staff had reasonable expectations & performed a lot of experiments. Still, I did not do a great job of facilitating either of those as a leader. That’s truly my job as Systems Librarian during this process; I’m not here merely to mold our data so it fits perfectly in the new system, I’m here to oversee the entire transition as a process that involves data, workflows, staff, and technology.

Take more time.

Initially, the ILS migration was such an enormous amount of work that it was not clear where to start. It felt as if, for a few months before our on-site training, we did little but sit around & await a whirlwind of busyness. I wish we had had a better sense of the work we could have front-loaded so that we could focus our efforts on other tasks later on. For example, we ended up deleting thousands of patron, item, and bibliographic records in an effort to “clean house” & not spend effort migrating data that was unneeded in the first place. We should have attacked that much earlier, and it might have obviated the need for some later work. For instance, if we had deleted invalid MARC records or eliminated obscure item types while cleaning up Millennium, those would have been fewer problems to encounter later in the migration process.

Finished?

As we start our fall semester, I feel accomplished. We raced through this migration, beginning the initial stages only in April for a go-live date that would occur in June. I learned a lot & appreciated the challenge but also had one horrible epiphany: I’m still relatively young, and I hope to be in librarianship for a long time, so this is likely not the last ILS migration I’ll participate in. While that very thought gives me chills, I hope the lessons I’ve taken from this one will serve me well in the future.


A Reflection on Code4Lib 2016

See also: Margaret’s reflections on Code4Lib 2013 and recap of the 2012 keynote.

 


About a month ago was the 2016 Code4Lib conference in sunny Philadelphia. I’ve only been to a few Code4Lib conferences, starting with Raleigh in 2014, but it’s quickly become my favorite libraryland conference. This won’t be a comprehensive recap but a little taste of what makes the event so special.

Appetizers: Preconferences

One of the best things about Code4Lib is the affordable preconferences. It’s often a pittance to add on a preconference or two, extending your conference for a whole day. Not only that, there’s typically a wealth of options: the 2015 conference boasted fifteen preconferences to choose from, and Philadelphia somehow managed to top that with an astonishing twenty-four choices. Not only are they numerous, the preconferences vary widely in their topics and goals. There are always intensely practical ones focused on bootstrapping people new to a particular framework, programming language, or piece of software (e.g. Railsbridge, workshops focused on Blacklight or Hydra). But there are also events for practicing your presentation or the aptly named “Getting Ready for Workshops” Workshop. One of my personal favorite ideas—though I must admit I’ve never attended—is the perennial “Fail4Lib” sessions where attendees examine their projects that haven’t succeeded and discuss what they’ve learned.

This year, I wanted to run a preconference of my own. I enjoy teaching, but I rarely get to do it in my current position. Previously, in a more generalist technologist position, I would teach information literacy alongside the other librarians. But as a Systems Librarian, it can sometimes feel like I rarely get out from behind my terminal. A preconference was an appealing chance to teach information professionals about a topic I’ve accumulated some expertise in. So I worked with Coral Sheldon-Hess to put together a workshop focused on the fundamentals of the command line: what it is, how to use it, and some of the pivotal concepts. I won’t say too much more about the workshop because Coral wrote an excellent, detailed blog post right after we were done. The experience was great, and the feedback we received, including a couple of kind emails from our participants, was very positive. Perhaps we, or someone else, can repeat the workshop in the future, as we put all our materials online.

Main Course: Presentations

Thankfully I don’t have to detail the conference talks too much, because they’re all available on YouTube. If a talk looks intriguing, I strongly encourage you to check out the recording. I’m not too ashamed to admit that a few went way over my head, so seeing the original will certainly be more informative than any summary I could offer.

One thing that was striking was how the two keynotes centered on themes of privacy and surveillance. Kate Krauss, Director of Communications of the Tor Project, led the conference off. Naturally, Tor being privacy software, Krauss focused on stories of government surveillance. She noted how surveillance focuses on the most marginalized people, citing #BlackLivesMatter and the transgender community as examples. Krauss’ talk provided concrete steps that librarians could take, for instance examining our own data collection practices, ensuring our services are secure, hosting privacy workshops, and running a Tor relay. She even mentioned The Library Freedom Project as a positive example of librarians fighting online surveillance, which she posited as one of the premier civil rights issues of our time.

On the final day, Gabriel Weinberg of the search engine DuckDuckGo spoke on similar themes, except he concentrated on how his company’s lack of personalization and tracking differentiated it from companies like Google and Apple. To me, Weinberg’s talk bookended well with Krauss’ because he highlighted the dangers of corporate surveillance. While the government certainly has abused its access to certain fundamental pieces of our country’s infrastructure—obtaining records from major telecom companies without a warrant comes to mind—tech companies are also culpable in enabling the unparalleled degree of surveillance possible in the modern era, simply by collecting such massive quantities of data linked to individuals (and, all too often, by failing to secure their applications properly).

While the pair of keynotes were excellent and thematic, my favorite moments of the conference were the talks by librarians. Becky Yoose gave perhaps the most rousing, emotional talk I’ve ever heard at a conference on the subject of burnout. Burnout is all too real in our profession, but not often spoken of, particularly in such a public venue. Becky forced us all to confront the healthiness and sustainability of our work/life balance, stressing the importance not only of strong organizational policies to prevent burnout but also personal practices. Finally, Andreas Orphanides gave a thoughtful presentation on the political implications of design choices. Dre’s well-chosen, alternatingly brutal and funny examples—from sidewalk spikes that prevent homeless people from lying in doorways, to an airline website labelling as “lowest” a price clearly higher than others on the very same page—outlined how our design choices reflect our values, and how we can better align our values with those of our users.

I don’t mean to discredit anyone else’s talks—there were many more excellent ones, on a variety of topics. Dinah Handel captured my feelings best in this enthusiastic tweet:

Dessert: Community

My main enjoyment from Code4Lib is the sense of community. You’ll hear a lot of people at conferences state things like “I feel like these are my people.” And we are lucky as a profession to have plenty of strong conference options, depending on our locality, specialization, and interests. At Code4Lib, I feel like I can strike up a conversation with anyone I meet about an impending ILS migration, my favorite command-line tool, or the vagaries of mapping between metadata schemas. While I love my present position, I’m mostly a solo systems person surrounded by a few other librarians all with a different expertise. As much as I want to discuss how ludicrous the webpub.def syntax is, or why reading XSLT makes me faintly ill, I know it’d bore my colleagues to death. At Code4Lib, people can at least tolerate such subjects of conversation, if not revel in them.

Code4Lib is great not solely because of its focus on technology and code, which a few other library organizations share, but because of the efforts of community members to make it a pleasurable experience for all. To name just a couple of the new things Code4Lib introduced this year: while previous years have had Duty Officers to whom attendees could safely report harassment, they were announced & much more visible this year; sponsored child care was available for conference goers with small children; and a service provided live transcription of all the talks.1 This is in addition to a number of community-building measures that previous Code4Lib conferences featured, such as a series of newcomers dinners on the first night, a “share and play” game night, and diversity scholarships. Overall, it’s evident that the Code4Lib community is committed to being positive and welcoming. Not that other library organizations aren’t, but it should be evident that our profession isn’t immune from problems. Being proactive and putting in place measures to prevent issues like harassment is a shining example of what makes Code4Lib great.

All this said, the community does have its issues. While a 40% female attendance rate is fair for a technology conference, it’s clear that the intersection of coding and librarianship is more male-dominated than the rest of the profession at large. Notably, Code4Lib has done an incredible job of democratically selecting keynote speakers over the past few years—five female and one male for the past three conferences—but the conference has also been largely white, so much so that the 2016 conference’s Program Committee gave a lightning talk addressing the lack of speaker diversity. Hopefully, measures like the diversity scholarships and conscious efforts on the part of the community can make progress here. But the unbearable whiteness of librarianship remains a very large issue.

Finally, it’s worth noting that Code4Lib is entirely volunteer-run. Since it’s not an official professional organization with membership dues and full-time staff members, everything is done by people willing to spare their own time to make the occasion a great one. A huge thanks to the local planning committee and all the volunteers who made such a great event possible. It’s pretty stunning to me that Code4Lib manages to put together some of the nicest benefits of any conference—the live streaming and transcribed talks come to mind—without a huge backing organization, and while charging pretty reasonable registration prices.

Night Cap

I’d recommend Code4Lib to anyone in the library community who deals with technology, whether you’re a manager, cataloger, systems person, or developer. There’s a wide breadth of material suitable for anyone and a great, supportive community. If that’s not enough, the proportion of presentations featuring pictures of cats and/or animated gifs is higher than your average conference.

Notes

  1. Aside: Matt Miller made a fun “Overheard at Code4Lib 2016” app using the transcripts

#1Lib1Ref

A few of us at Tech Connect participated in the #1Lib1Ref campaign that’s running from January 15th to the 23rd. What’s #1Lib1Ref? It’s a campaign to encourage librarians to get involved with improving Wikipedia, specifically by citation chasing (one of my favorite pastimes!). From the project’s description:

Imagine a World where Every Librarian Added One More Reference to Wikipedia.
Wikipedia is a first stop for researchers: let’s make it better! Your goal today is to add one reference to Wikipedia! Any citation to a reliable source is a benefit to Wikipedia readers worldwide. When you add the reference to the article, make sure to include the hashtag #1Lib1Ref in the edit summary so that we can track participation.

Below, we each describe our experiences editing Wikipedia. Did you participate in #1Lib1Ref, too? Let us know in the comments or join the conversation on Twitter!


 

I recorded a short screencast of me adding a citation to the Darbhanga article.

— Eric Phetteplace


 

I used the Citation Hunt tool to find an article that needed a citation. I selected the second one I found, which was about urinary tract infections in space missions. That is very much up my alley. I discovered after a quick Google search that the paragraph in question was plagiarized from a book on Google Books! After a hunt through the Wikipedia policy on quotations, I decided to rewrite the paragraph to paraphrase the quote, and then added my citation. As is usual with plagiarism, the flow was wrong, since there was a reference to a theme in the previous paragraph of the book that wasn’t present in the Wikipedia article, so I chose to remove that entirely. The Wikipedia Citation Tool for Google Books was very helpful in automatically generating an acceptable citation for the appropriate page. Here’s my shiny new paragraph, complete with citation: https://en.wikipedia.org/wiki/Astronautical_hygiene#Microbial_hazards_in_space.

— Margaret Heller



I edited the “Library Facilities” section of the “University of Maryland Baltimore” article in Wikipedia. There was an outdated link in the existing citation, and I also wanted to add two additional sentences and citations. You can see how I went about doing this in my screen recording below. I used the “edit source” option to copy the article’s source into a text editor, made all the changes I wanted in advance, and then copied and pasted them back into the Wikipedia page I was editing. Then I previewed and saved the page. You can see that I also had a typo in my text and had to fix it so the citation displayed correctly, which meant editing the article more than once. After my recording, I noticed another typo, which I fixed using the “edit” option. The “edit” option is much easier to use than “edit source” for those who are not familiar with editing wiki pages; it offers a menu bar at the top with several convenient options.

The menu bar for the “edit” option in Wikipedia

The recording of editing a Wikipedia article:

— Bohyun Kim



It has been so long since I’ve edited anything on Wikipedia that I had to make a new account and follow the “how to add a reference” link, which is to say: if I could do it in 30 minutes while on vacation, anyone can. There is a WYSIWYG option for the editing interface, but I learned to do all this in plain text and it’s still the easiest way for me to edit. See the screenshot below for a view of the source editor.

I wondered what entry I would find to add a citation to…there have been so many that I’d come across, but now I was drawing a total blank. Happily, the 1Lib1Ref campaign gave some suggestions, including “Provinces of Afghanistan.” Since this is my fatherland, I thought it would be a good service to dive into. Many of Afghanistan’s citations are hard to provide for a multitude of reasons. A lot of our history has been an oral tradition. Also, not insignificantly, Afghanistan has been in conflict for a very long time, with much of its history captured through the lens of Great Game participants like England or Russia. Primary sources from the 20th century are difficult to come by because of the state of war from 1979 onwards, and there are not many digitization efforts underway to capture what is available (shout out to NYU and the Afghanistan Digital Library project).

Once I found a source that I thought would be an appropriate reference for a statement on the topography of Uruzgan Province, I did need to edit the sentence to remove the numeric values that had been written there, since I could not find a source that quantified the area. It’s not a precise entry, to be honest, but it does give the opportunity to link to a good map with other opportunities to find additional information related to Afghanistan’s agriculture. I also wanted to choose something relatively uncontroversial, like geographical features rather than something historical or person-based, for this particular campaign.

— Yasmeen Shorish

Edited area delineated by red box.


Creating a Flipbook Reader for the Web

At my library, we have a wonderful collection of artists’ books. These titles are not mere text, or even concrete poetry shaping letters across a page, but works of art that use the book as a form in the same way a sculptor might choose clay or marble. They’re all highly inventive and typically employ an interplay between language, symbol, and image that challenges our understanding of what a print volume is. As an art library, collecting these unique and inspiring works seems natural. But how can we share our artists’ books with the world? For a while, our work study students have been scanning sets of pages from the books using a typical flatbed scanner, but we weren’t sure how to present these images. Below, I’ll detail how we came to fork the Internet Archive’s Bookreader to publish our images to the web.

Choosing a Book Display

Our Digital Scholarship Librarian, Lisa Conrad, led the artists’ book project. She investigated a number of potential options for creating interactive “flipbooks” out of our series of images. We had a few requirements:

  • mobile friendly — ideally, our books could be read on any device, not just desktops with Adobe Flash
  • easy to use — our work study staff shouldn’t need sophisticated skills, like editing code or markup, to publish a work
  • visually appealing — obviously we’re dealing with works of art; if our presentation is hideous or obscures the elegance of the original, then it’s doing more harm than good
  • works with our repository — ultimately, the books would “live” in our institutional repository, so we needed software that would emit something like static HTML or documents that could easily be stored there

These are not especially stringent requirements, in my mind. Still, we struggled to find decent options. The mobile limitation restricted things quite a bit; a surprising number of apps used Flash or published in desktop-first formats. We felt our options to customize and simplify the user interfaces of other apps were too limited. Even if an app spat out a nice bundle of HTML and assets that we could upload, the HTML might call out to sketchy external services or present a number of options we didn’t want or need. While I was at first hesitant to implement an open source piece of software that I knew would require much of my time, ultimately the Internet Archive’s Bookreader became the frontrunner. We felt more comfortable using an established, free piece of software that doesn’t require any server-side component (it’s all JavaScript!). My primary concern was that the last commit to the bookreader code base was over half a year ago.

Implementing the Internet Archive Bookreader

The Bookreader proved startlingly easy to implement. We simply copied the code from an example given by the Internet Archive, edited several lines of JavaScript, and were ready to go. Everything else was refinement. Still, we made a few nice decisions in working on the Bookreader that I’d like to share. If you’re not a coder, feel free to skip the rest of this section as it may be unnecessarily detailed.

First off, our code is all up on GitHub. But it won’t be useful to anyone else! It contains specific logic based on how our IR serves up links to the bookreader app itself. But one of the smarter moves, if I may indulge in some self-congratulation, was using submodules in git. If you know git, then you know it’s easily one of the best ways to version code and manage projects. But what if your app is just an implementation of an existing code base? There’s a huge portion of code you’re borrowing from somewhere else, and you want to be able to segment that off and incorporate changes to it separately.

As I noted before, the bookreader isn’t exactly actively developed. But it still benefits us to separate our local modifications from the stable, external code. A submodule lets us incorporate another repository into our own. It’s the clean way of copying someone else’s files into our own version. We can reference files in the other repository as if they’re sitting right beside our own, but also pull in updates in a controlled manner without causing tons of extra commits. Elegant, handy, sublime.
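
If you haven’t used submodules before, the workflow looks roughly like the sketch below. The repository URL and directory name here are placeholders rather than the exact ones in our project:

# pull the upstream Bookreader code into a subdirectory of our repository
git submodule add https://github.com/internetarchive/bookreader.git bookreader
git commit -m "Add the Internet Archive Bookreader as a submodule"

# later, pull in upstream changes deliberately rather than automatically
git submodule update --remote bookreader
git commit -m "Update the bookreader submodule"

Anyone cloning our repository afterwards just runs git submodule update --init to fetch the Bookreader code alongside our own files.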

Most of our work happened in a single app.js file. This file was copied from the Internet Archive’s provided example and only modified slightly. To give you a sense of how one modifies the original script, here’s a portion:

// Create the BookReader object
br = new BookReader();
// Total number of leafs (pages)
br.numLeafs = vaultItem.pages;
// Book title & URL used for the book title link
br.bookTitle = vaultItem.title;
br.bookUrl = vaultItem.root + 'items/' + vaultItem.id + '/' + vaultItem.version + '/';
// how does the bookreader know what images to retrieve given a page number (index)?
br.getPageURI = function(index, reduce, rotate) {
    // reduce and rotate are ignored in this simple implementation, but we
    // could e.g. look at reduce and load images from a different directory
    var url = vaultItem.root + 'file/' + vaultItem.id + '/' + vaultItem.version + '/' + vaultItem.filenames + (index + 1) + '.JPG';
    return url;
};

The Internet Archive’s code provides us with a BookReader class. All we do is instantiate it once and override certain properties and methods. The code above is all that’s necessary to display a particular image for a given page number; the vaultItem object (VAULT is the name of our IR) consists of information about a single artists’ book, like the number of pages, its title, and its ID and version within the IR. The bookreader app cobbles together these pieces of info to figure out which images it should display for a given page or pair of pages. The getPageURI function mostly works with a single index argument, while the group of concatenated strings for the URL is related to how our repository stores files. It’s highly specific to the IR we’re using, but not terribly complicated.

The Bookreader itself sits on a web server outside the IR. Since we cannot publicly share almost all of these books (more on this below), we restrict access to the images to users who are authenticated with the IR. So how can the external reader app serve up images of books within the repository? We expect people to discover the books via our library catalog, which links to the IR, or within the IR itself. Once they sign in, our IR’s display templates present specially crafted URLs that pass information about the item in our repository to the bookreader via their query string. Here’s a shortened example of one query string:

?title=Crystals%20to%20Aden&id=17c06cc5-c419-4e77-9bdb-43c69e94b4cd&pages=26

From there, the app parses the query string to figure out that the book’s title is Crystals to Aden, its ID within the IR is “17c06cc5-c419-4e77-9bdb-43c69e94b4cd”, and it has twenty-six pages. We store those values in the vaultItem object referenced in the script above. That object contains enough information for the bookreader to determine how to retrieve images from the IR for each page of the book. Since the user has already authenticated with the IR when they discovered the URL earlier, the IR happily serves up the images.
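
To make that flow concrete, here’s a minimal sketch of how the query string might be turned into the vaultItem object the script above relies on. It uses the browser’s URLSearchParams API; the root URL, the default version, and the filenames handling are invented placeholders, and our actual app.js reads a few more parameters and handles edge cases this sketch skips:

// parse the query string the IR's display template appended to the bookreader URL
var params = new URLSearchParams(window.location.search);

var vaultItem = {
    root: 'https://vault.example.edu/',         // base URL of the IR (placeholder)
    id: params.get('id'),                       // e.g. "17c06cc5-c419-4e77-9bdb-43c69e94b4cd"
    version: params.get('version') || '1',      // assumed default when the IR omits it
    title: params.get('title'),                 // URLSearchParams decodes "Crystals%20to%20Aden" for us
    pages: parseInt(params.get('pages'), 10),   // total number of scanned page images
    filenames: params.get('filenames') || ''    // common filename prefix for the image files
};

// vaultItem is then used to configure the BookReader exactly as in the snippet above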

Ch-ch-ch-challenges

Easily the most difficult part of our artists’ books project has been spreading awareness. It’s a cool project that’s required tons of work all around, from the Digital Scholarship Librarian managing the project, to our diligent work study student scanning images, to our circulation staff cataloging them in our IR, to me solving problems in JavaScript while my web design student worker polished the CSS. We’re proud of the result, but also struggling to share it outside of our walls. Our IR is great at some things but not particularly intuitive to use, so we cannot count on patrons stumbling across the artists’ books on their own very often. Furthermore, putting anything behind a login is obviously going to increase the number of people who give up before accessing it. Not being on the campus Central Authentication Service only exacerbates this for us.

The second challenge is—what else!—copyright. We don’t own the rights to any of these titles. We’ve been guerrilla digitizing them without permission from publishers. For internal sharing, that’s fine and well under Fair Use; we’re only showing excerpts and security settings in our IR ensure they’re only visible to our constituents. But I pine to share these gorgeous creations with people outside the college, with my social networks, with other librarians. Right now, we only have permission to share Crystals to Aden by Michael Bulteau (thanks to duration press for letting us!). Which is great! Except Crystals is a relatively text-heavy work; it’s fine poetry, yes, but not the earth-shattering reinterpretation of the codex I promised you in my opening paragraph.

Hopefully, we can secure permissions to share further works. Internally, we’ve pushed out notices on social media, the LMS, and email lists to inform people of the artists’ books. They’re also available for checkout, so hopefully our digital teasers bring people in to see the real deal.

Lasting Problems

And that’s something that must be mentioned: our digital simulations are definitely not the real deal. Never has a set of titles been so resistant to digitization. Our work study student has even asked us “how could I possibly separate this work into distinct pairs of pages” about a work that used partially-overlapping leaves that look like vinyl record sleeves.1 Artists deliberately chose the printed book form, and there’s a deep sacrilege in trying to digitally represent that. We can only ever offer a vague adumbration of what was truly intended. I still see value in our bookreader—and the works often look brilliant even as scanned images—but there are fundamental impediments to its execution.

Secondly, scanned images are not text. Many works do have text on the page, but because we’re displaying images in the browser, users cannot select the text to copy it. Alongside each set of images in our IR, we also have a PDF copy of all the pages with OCR’d text. But a decision was made not to make the PDFs visible to non-library staff, since they would be easy to download and disseminate (unlike the many discrete images in our bookreader). This all adds up to make the bookreader very inaccessible: it’s all images, there’s no feasible way for us to associate alt text with each one, and the PDF copy that might be of interest to visually-impaired users is hidden.

I’ve ended on a sour note, but digitizing our artists’ books and fiddling with the Internet Archive’s Bookreader was a great project. Fun, a bit challenging, with some splendid results. We have our work cut out for us if we want to draw more attention to the books and have them be compelling for everyone. But other libraries in similar situations may find the Bookreader to be a very viable, easy-to-implement solution. If you don’t have permissions issues and are dealing with more traditional works, it’s built to be customizable yet offers a pleasant reading experience.

Notes

  1. Paradoxic Mutations by Margot Lovejoy

Wikipedia, Libraries, & Neutrality

This piece is substantially based on a column I wrote for RUSQ that will appear early next year.


Broadly construed, there are two camps of opinions surrounding Wikipedia in librarianship. The first is that Wikipedia is not an academic-quality source; it is something students need to be warned away from. The community authorship of Wikipedia is typically the source of this criticism: since “anyone can edit” the online encyclopedia, there’s no way to tell whether you’re receiving high-quality information written by experts or the ill-informed opinions of internet trolls.

The other camp is far more positive regarding Wikipedia. After all, the values of the Wikipedian community clearly align with our own. The ultimate goal of the project is not to give a venue for poor research or fringe theories, but to enable free and open access to information. While Wikipedia isn’t flawless, many of its articles compete with more established reference sources. Famously, Nature performed a comparison of scientific articles and found Wikipedia to be comparable to Encyclopedia Britannica. Following the study, a community project to correct the identified errors in Wikipedia sprang up, fixing them all in a little over a month.1 With this common ground, it’s no wonder that libraries have found Wikipedia to be a valuable partner in publicizing our content. Articles like Using Wikipedia to Extend Digital Collections, Putting the Library in Wikipedia, and Wikipedia Lover, Not a Hater: Harnessing Wikipedia to Increase the Discoverability of Library Resources all discuss the value of working with Wikipedia, not against it, to highlight library digital collections and metadata.

A good example of Wikipedia driving traffic to library special collections was brought to my attention by fellow Tech Connect author Margaret Heller, who pointed me towards the Google Analytics Usage Reports for CARLI Digital Collections. CARLI regularly sees Wikipedia among its top external traffic sources; the last few quarterly reports note Wikipedia as a traffic source leading “to home pages or images from multiple CARLI Collections”.

The Problem with Neutral

I must admit that I side with the pro-Wikipedia folks. I love using Wikipedia as a resource—as a procrastination-enabler, I have several articles on roguelikes open that I’m reading while I write this—and I love contributing to it in whatever small ways I can, from fixing broken markup to citation hunting. But Wikipedia is imperfect in ways more insidious than its anonymous authorship or occasionally inaccurate details. Rather, one problem lies within one of the five pillars that define the philosophy of Wikipedia: “Wikipedia is written from a neutral point of view”.

Neutrality has been under fire from #critlib lately, a group of librarians emphasizing critical theory especially with respect to information literacy instruction. ALA Annual featured a well-attended presentation entitled “But We’re Neutral!” And Other Librarian Fictions Confronted by #critlib. A seminal article appearing earlier this year in Code4Lib by Bess Sadler and Chris Bourg begins with a rousing section labelled Libraries are not Neutral:

In spite of the pride many libraries take in their neutrality, libraries have never been neutral repositories of knowledge. Research libraries in particular have always reflected the inequalities, biases, ethnocentrism, and power imbalances that exist throughout the academic enterprise through collection policies and hiring practices that reflect the biases of those in power at a given institution. In addition, theoretically neutral library activities like cataloging have often re-created societal patterns of exclusion and inequality.

Wikipedia shares this problem; while it appears neutral on the surface, its topical coverage and treatment of subjects reflect the skewed power relations of our society. A 2011 study found that 91% of editors were men. The same study shows that few editors come from the Global South and that the English Wikipedia receives far more focus than other languages. Another research paper from 2011 goes a bit further in demonstrating that “male articles are significantly longer than female articles”.2 The editorial gender gap has real effects on the encyclopedia’s content; it’s not just that having editors of all genders is good in its own right, it’s that Wikipedia’s claims to objectivity and neutrality are jeopardized by the imbalance.

Art+Feminism

So what’s a concerned librarian to do? I think the wrong thing would be to denounce Wikipedia. For one, the alternative sources we or our patrons would turn to are no less problematic. Encyclopaedia Britannica has a well-documented history of supposedly scientific articles written from the dominant viewpoint (cf. the racist description of black people in the 1911 edition). What’s more, Wikipedia is going to be used. It’s massively popular. Sticking our heads in the sand because it doesn’t live up to its own standards of neutrality improves nothing.

Luckily, there are several Wikipedia projects focused on recruiting editors from underrepresented groups and addressing lackluster coverage of particular topics which we can support. One such project is Art + Feminism. In its own words:

Art+Feminism is a rhizomatic campaign to improve coverage of women and the arts on Wikipedia, and to encourage female editorship…Content is skewed by the lack of female participation. Many articles on notable women in history and art are absent on Wikipedia. This represents an alarming aporia in an increasingly important repository of shared knowledge.

Art+Feminism started out by hosting an edit-a-thon out of the Eyebeam Art and Technology Center in New York City in 2014, with more than 30 other locations worldwide joining in. An edit-a-thon is an event where people gather to perform Wikipedia edits, often centered around a particular theme or project. Higher education institutions and libraries make perfect partners for such occasions. We typically have useful materials to be cited, and our students form a large body of potential participants who can be easily incentivized to join in, whether with extra credit or simply some food. So when my library heard of the upcoming 2nd annual edit-a-thon, we immediately began planning to host it. Below, I’ll briefly outline what we did in the hopes it’ll encourage other institutions to join us during this year’s edit-a-thon on March 5th.

Hosting an Edit-a-thon

First, we set up a meetup page on Wikipedia. If you’re unfamiliar with Wikipedia, creating a page like this isn’t a struggle. For one, you can simply copy the entire source markup of someone else’s meetup, then edit your specific details into that skeleton. For two, you can enable the experimental Visual Editor to make Wikipedia even easier to edit without learning wiki markup. The meetup page is an important place for putting up information like timing and directions, but is also a place for us to talk about the impact we made by showing how many editors attended and what articles we improved or created.

While we were putting initial details on our meetup page, we set about securing a location on the date of the edit-a-thon. We discovered that a gallery associated with our school had hosted the edit-a-thon the prior year, but they were unable to repeat it. Our school has campuses in both San Francisco and Oakland, but since Oakland had no other edit-a-thon locations we decided to host it there with the idea that SF locals already had an event nearby.

Our next steps aimed to make the event as easy for newcomers as possible. A staff member pulled relevant materials from our collection, so that researching would be simple and our rarer, more valuable resources might lend some of their information to Wikipedia. We also wanted experienced editors to be on hand during the event. I looked at a local WikiProject, asking for help on its Talk page and then corresponding directly with a couple of editors who had expressed interest. Finally, we managed to secure a visit from someone who actually works for the Wikimedia Foundation, which runs the encyclopedia.

During the day, we set out snacks and name badges for everyone. Similar to the color-coded name badges at Ada Camp and other tech conferences, Art+Feminism recommended giving out name badges which signify one’s willingness to be photographed: green meant feel free to take a photo, orange meant please ask permission first, and red meant absolutely not. These steps ensure that everyone is comfortable at the event and not being exposed on the internet against their will. At the start of the day, we explained the coloring system and the Wikimedia staff person gave a short talk on how to write articles that endure, standing up to scrutiny over time. The results of the global event are listed on Wikipedia.

While I feel we accomplished much at our local event, there was one negative experience. An image uploaded to Wikimedia Commons for use on a page was flagged for deletion. The editor flagging it said something along the lines of “this isn’t your personal photo album” as the image was a headshot of a female artist. In the ensuing discussion around the proposed deletion, I noted that the image was about to be used on an article. It was never removed from Commons. Still, the incident underscores cultural problems in Wikipedia. The confrontational style of the discussion lacked good faith. Further, I heard a gendered undertone in the editor’s response; how many pictures of white men are derided as personal photos? While our library staff person was undeterred, moments of hostility like these drive away newcomers.

In Which I Admit I’m Missing the Point

To be fair, Wikipedia itself acknowledges that it fails to live up to neutral status. That the encyclopedia strives towards neutrality is the more vital point. But it’s not as if reaching a supposedly perfect neutrality resolves the issues that folks like #critlib are highlighting. Neutral as a positive value is precisely the problem, because there is no neutral stance that can be taken from outside of society’s power relations and history of inequity. So should we instead be agitating for Wikipedia to become less neutral and take more active stances on social issues? Or are edit-a-thons like Art+Feminism the most viable route towards ensuring topics are covered in a way that surfaces marginalized peoples and their experiences? I have no answers, just food for thought.

  1. See External peer review/Nature December 2005 for Wikipedia’s internal take on the correction process. The original Nature article is paywalled here and is doi:10.1038/438900a. A similar but open access study is doi:10.1371/journal.pone.0106930, Accuracy and Completeness of Drug Information in Wikipedia: A Comparison with Standard Textbooks of Pharmacology.
  2. WP:Clubhouse? An Exploration of Wikipedia’s Gender Imbalance. I’m skeptical of how “male” and “female” articles were defined, but the paper itself is thorough in its argumentation and statistical analysis. It’s also worth noting that this paper, and much of the other literature around the gender gap, ignores genders outside the female-male binary. It’s most useful to conceive of the gap as a disproportionate majority of males than a minority of females, while leaving all other genders out of the picture.