Representing Online Journal Holdings in the Library Catalog

The Problem

It isn’t easy to communicate to patrons what serials they have access to and in what form (print, online). They can find these details, sure, but it’s scattered across our library’s web presence. What’s most frustrating is that we clearly have all the necessary information but the systems offer no built-in way to produce a clear display of it. My fellow librarians noted that “it’d be nice if the catalog showed our exact online holdings” and my initial response was to sigh and say “yes, that would be nice”.

To illustrate the scope of the problem, a user can search for journals in a few of our disparate systems:

  • we use a knowledgebase to track database subscriptions and which journals are included in each subscription package
  • the public catalog for our Koha ILS has records for our print journals, sometimes with a MARC 856$u 1 link to our online holdings in the knowledgebase
  • our discovery layer has both article-level results for the journals in our knowledgebase and journal-level search results for the ones in our catalog

While these systems overlap, they also serve distinct purposes, so it’s not so awful. However, there are a few downsides to our triad of serials information systems. First of all, if a patron searches the knowledgebase looking for a journal which we only have in print, our database holdings wouldn’t show that they have access to print issues. To work around this, we track our print issues both in our ILS and the knowledgebase, which duplicates work and introduces possible inconsistencies.

Secondly, someone might start their research in the discovery layer, finding a journal-level record that links out to our catalog. But it’s too much to ask a user to search the discovery layer, click into the catalog, click a link out to the knowledgebase, and only then discover our online holdings don’t include the particular volume they’re looking for. Possessing three interconnected systems creates labyrinthine search patterns and confusion amongst patrons. Simply describing the systems and their nuanced areas of overlap in this post feels like challenge, and the audience is librarians. I can imagine how our users must feel when we try to outline the differences.

The 360 XML API

Our knowledgebase is Serials Solutions 360KB. I went looking in the vendor’s help documentation for answers, which refers to an API for the product but apparently provides no information on using said API. Luckily, a quick search through GitHub projects yielded several using the API and I was able to determine its URL structure: http://{{your Serials Solution ID}}.openurl.xml.serialssolutions.com/openurlxml?version=1.0&url_ver=Z39.88-2004&issn={{the journal’s ISSN}}

It’s probably possible to search by other parameters as well, but for my purposes ISSN was ideal so I didn’t bother investigating further. If you send a request to the address above, you receive XML in response:

<ssopenurl:openURLResponse xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ssdiag="http://xml.serialssolutions.com/ns/diagnostics/v1.0" xmlns:ssopenurl="http://xml.serialssolutions.com/ns/openurl/v1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://xml.serialssolutions.com/ns/openurl/v1.0 http://xml.serialssolutions.com/ns/openurl/v1.0/ssopenurl.xsd http://xml.serialssolutions.com/ns/diagnostics/v1.0 http://xml.serialssolutions.com/ns/diagnostics/v1.0/diagnostics.xsd">
    <ssopenurl:version>1.0</ssopenurl:version>
    <ssopenurl:results dbDate="2017-02-15">
        <ssopenurl:result format="journal">
            <ssopenurl:citation>
                <dc:source>Croquis</dc:source>
                <ssopenurl:issn type="print">0212-5633</ssopenurl:issn>
            </ssopenurl:citation>
            <ssopenurl:linkGroups>
                <ssopenurl:linkGroup type="holding">
                    <ssopenurl:holdingData>
                        <ssopenurl:startDate>1989</ssopenurl:startDate>
                        <ssopenurl:providerId>PRVLSH</ssopenurl:providerId>
                        <ssopenurl:providerName>Library Specific Holdings</ssopenurl:providerName>
                        <ssopenurl:databaseId>ZYW</ssopenurl:databaseId>
                        <ssopenurl:databaseName>CCA Print Holdings</ssopenurl:databaseName>
                        <ssopenurl:normalizedData>
                            <ssopenurl:startDate>1989-01-01</ssopenurl:startDate>
                        </ssopenurl:normalizedData>
                    </ssopenurl:holdingData>
                    <ssopenurl:url type="source">https://library.cca.edu/</ssopenurl:url>
                    <ssopenurl:url type="journal">
                    https://library.cca.edu/cgi-bin/koha/opac-search.pl?idx=ns&q=0212-5633
                    </ssopenurl:url>
                </ssopenurl:linkGroup>
            </ssopenurl:linkGroups>
        </ssopenurl:result>
    </ssopenurl:results>
    <ssopenurl:echoedQuery timeStamp="2017-02-15T16:14:12">
        <ssopenurl:library id="EY7MR5FU9X">
            <ssopenurl:name>California College of the Arts</ssopenurl:name>
        </ssopenurl:library>
        <ssopenurl:queryString>version=1.0&url_ver=Z39.88-2004&issn=0212-5633</ssopenurl:queryString>
    </ssopenurl:echoedQuery>
</ssopenurl:openURLResponse>

If you’ve read XML before, then it’s apparent how useful the above data is. It contains a list of our “holdings” for the periodical with information about the start and end (absent here, which implies the holdings run to the present date) dates of the subscription, which database they’re in, and what URL they can be accessed at. Perfect! The XML contains precisely the information we want to display in our catalog.

Unfortunately, our catalog’s JavaScript doesn’t have permission to access the 360 XML API. Due to a browser security policy resources must explicitly say that other domains or pages are allowed to request their data. A page needs to include the Access-Control-Allow-Origin HTTP header to abide by this policy, called Cross-Origin Resource Sharing (CORS), and the 360 API does not.

We can work around this limitation but it requires extra code on our part. While JavaScript from a web page cannot request data directly from 360, we can write a server-side script to pull data. That server-side script can then add its own CORS header which lets our catalog use it. So, in essence, we set up a proxy service that acts as a go-between for our catalog and the API that the catalog cannot use. Typically, this takes little code; the server-side script takes a parameter passed to it in the URL, sends it in a HTTP request to another server, and serves back up whatever response it receives.

Of course, it didn’t turn out to be that simple in practice. As I experimented with my scripts, I could tell that the 360 data was being received, but I couldn’t parse meaningful pieces of information out of it. It’s clearly there; I could see the full XML structure with holdings details. But neither my server-side PHP nor my client-side JavaScript could “find” XML elements like <ssopenurl:linkGroup> and <ssopenurl:normalizedData>. The text before the colon in the tag names is the namespace. Simple jQuery code like $('ssopenurl:linkGroup', xml), which can typically parse XML data, wasn’t working with these namespaced elements.

Finally, I discovered the solution by reading the PHP manual’s entry for the simplexml_load_string function: I can tell PHP how to parse namespaced XML by passing a namespace parameter to the parser function. So my function call turned into:

// parameters: 1) serials solution data since $url is the API we want to pull from
// 2) the type of object that the function should return (this is the default)
// 3) Libxml options (also the default, no special options)
// 4) (finally!) ns, the XML namespace
// 5) "True" here means ns is a prefix and not a URI
$xml= simplexml_load_string( file_get_contents($url), 'SimpleXMLElement', 0, 'ssopenurl', True );

As you can see, two of those parameters don’t even differ from the function’s defaults, but I still need to provide them to get to the “ssopenurl” namespace later. As an aside, technical digressions like these are some of the best and worst parts of my job. It’s rewarding to encounter a problem, perform research, test different approaches, and eventually solve it. But it’d also be nice, and a lot quicker, if code would just work as expected the first time around.

The Catalog

We’re lucky that Koha’s catalog both allows for JavaScript customization and has a well-structured, easy-to-modify record display. Now that I’m able to grab online holdings data from our knowledgebase, inserting into the catalog is trivial. If you wanted to do the same with a different library catalog, the only changes come in the JavaScript that finds ISSN information in a record and then inserts the retrieved holdings information into the display. The complete outline of the data flow from catalog to KB and back looks like:

  • my JavaScript looks for an ISSN on the record’s display page
  • if there’s an ISSN, it sends the ISSN to my proxy script
  • the proxy script adds a few parameters & asks for information from the 360 XML API
  • the 360 XML API returns XML, which my proxy script parses into JSON and sends to the catalog
  • the catalog JavaScript receives the JSON and parses holdings information into formatted HTML like “Online resources: 1992 to present in DOAJ
  • the JS inserts the formatted text into the record’s “online resources” section, creating that section if it doesn’t already exist

Is there a better way to do this? Almost certainly. The six steps above should give you a sense of how convoluted the process is, hacking around a few limitations. Still, the outcome is positive: we stopped updating our print holdings in our knowledgebase and our users have more information at their fingertips. It obviates the final step in the protracted “discovery layer to catalog” search described in the opening of this post.

Our next steps are obvious, too: we should aim to get this information into the discovery layer’s search results for our journals. The general frame of this project would be the same; we already know how to get the data from the API. Much like working with a different library catalog, the only edits are in parsing ISSNs from discovery layer search results and finding a spot in the HTML to insert the holdings data. Finally, we can also remove the redundant and less useful 856$u links from our periodical MARC records now.

The Scripts

These are highly specific to our catalog, but may be of general use to others who want to see how the pieces work together:

Notes

  1. For those unfamiliar, 856 is the MARC field for URLs, whether they URL represents the actual resource being described or something supplementary. It’s pretty common for print journals to also have 856 fields for their online counterparts.

#1Lib1Ref Edit (2017)

I participated in the “#1Lib1Ref” campaign again this year, recording my experience and talking through why I think it’s important.


Online Privacy in Post-Election America

A commitment to protecting the privacy of our patrons is enshrined in the ALA Code of Ethics. While that has always been an important aspect of librarianship, it’s become even more pivotal in an information age where privacy is far more nuanced and difficult to achieve. Given the rhetoric of the election season, and statements made by our President-Elect as well as his Cabinet nominees 1, the American surveillance state has become even more disconcerting. As librarians, we have an obligation to empower our communities with the knowledge they need to secure their own personal information. This post will cover, at a high level, a few areas where librarians of various types can assist patrons.

The Tools

Given that so much information is exchanged online these days, librarians are in a unique position to educate patrons about the Internet. We spend so much time either building web services or utilizing them, it’s highly likely that a librarian knows more about the web than your average citizen. As such, we can relate some of the powerful pieces of software and services that aid in protecting one’s online presence. To name just a handful that almost everyone could benefit from knowing:

DuckDuckGo is a privacy-aware search engine which explicitly does not track individual users. While it is a for-profit endeavor earning money through ad revenue, its policies set it apart from major competitors such as Google and Bing.

TorBrowser is a web browser utilizing The Onion Router protocol which obfuscates the user’s IP address, essentially masking their online activities behind a web of redirects. The Tor network is run by volunteers and TorBrowser is open source software developed by a non-profit organization.

HTTPS is the encrypted version of HTTP, the data transfer protocol that powers the internet. HTTPS sites are less likely to have their traffic intercepted or surveilled. Tools like HTTPS Everywhere help one to find HTTPS versions of sites without too much trouble.

Two-factor authentication is available for many apps and web services. It decreases the possibility that a third-party can access your account by providing an additional layer of protection beyond your password, e.g. through a code sent to your phone.

Signal is an open source private messaging app which uses end-to-end encryption, think of it as HTTPS for your text messages. Signal is made by Open Whisper Systems which, like the Tor Foundation, is a non-profit.

These are just a few major tools in different areas, all of which are worth knowing about. Many have usability trade-offs but switching to just one or two is enough to substantially improve an individual’s privacy.

Privacy Workshops

Merely knowing about particular pieces of software is not enough to secure one’s communications. Tor perhaps says it best in their “Tips on Staying Anonymous“:

Tor is NOT all you need to browse anonymously! You may need to change some of your browsing habits to ensure your identity stays safe.

A laundry list of web browsers, extensions, and apps doesn’t do much by itself. A person’s behavior is still the largest factor in how private their information is. One can visit a secure HTTPS site but still use a password that’s trivial to crack; one can use the “incognito” or “privacy” mode of a browser but still be tracked by their IP address. Online privacy is an immensely complicated and difficult subject which requires knowledge of practices as well tools. As such, libraries can offer workshops that teach both at once. Most libraries teach skills-based workshops, whether they’re on using a citation manager or how to evaluate information sources for credibility. Adding privacy skills is a natural extension of work we already do. Workshops can fit into particular classes—whether they’re history, computer science, or ethics—or be extra-curricular. Look for sympathetic partners on campus, such as student groups or concerned faculty, to see if you can collaborate or at least find an avenue for advertising your events.

Does your library not have anyone qualified or willing to teach a privacy workshop? Consider contacting an outside expert. The Library Freedom Project immediately comes to mind as a wonderful resource offering: a privacy toolkit for librarians, an online class, “train the trainers” type events, and community-focused workshops.2 Academic librarians may also have access to local computer security experts, whether they’re computer science instructors or particularly savvy students, who would be willing to lend their expertise. My one caution would be that just because someone is a subject expert doesn’t mean they’re equipped to effectively lead a workshop, and that working with an expert to ensure an event is tailored to your community will be more successful than simply outsourcing the entire task.

Patron Data

Depending on your position at your library, this final section might either be the most or least obvious thing to be done: control access to data about your patrons. If you’re an instruction or reference librarian, I imagine workshops were the first thing on your mind. If you’re a systems librarian such as myself, you may have thought of technologies like HTTPS or considered data security measures. This section will be longer not because it’s more important, but because these are topics I think about often as they directly relate to my job responsibilities.

Patron data is tricky. I’ll be the first to admit that my library collects quite a bit of data about patrons, a rather small amount of which contains personally identifying information. Data is extremely useful both in fine-tuning our services to meet community needs as well as in demonstrating our value to stakeholders like the college administration. Still, there is good reason to review data practices and web services to see if anything can be improved. Here’s a brief list of heuristics to use:

Are your websites using HTTPS? Secure sites, especially for one’s with patron accounts that hold sensitive information, help prevent data from being intercepted by third parties. I fully realize this is actually more difficult than it appears; our previous ILS offered HTTPS but only as a paid add-on which we couldn’t afford. If a vendor is the holdup here, pester them relentlessly until progress is made. I’ve found that most vendors understand that HTTPS is important, it’s just further down in their development priorities. Making a fuss can change that.

Is personal information being unnecessarily collected? What’s “necessary” is subjective, certainly. A good measure is looking at when the last time personal information was actually used in any substantive manner. If you’re tracking the names of students who ask reference questions, have you ever actually needed them for follow-ups? Could an anonymized ID be used instead? Could names be deleted after a certain amount of time has passed? Which brings us to…

Where personal information is collected, do retention policies exist? E.g. if you’re doing website user studies that record someone’s name, likeness, or voice, do you eventually delete the files? This goes for paper files as well, which can be reviewed and then shredded if deemed unnecessary. Retention policies are beneficial in a few ways. They not only prevent old data from leaking into the wrong hands, they often help with organization and “spring cleaning” tasks. I try to review my hard drive periodically for random files I’ve been sent by faculty or students which can be cleaned out.

Can patrons be empowered with options regarding their own data? Opt-in policies regarding data retention are desirable because they allow a library to collect information that might prove valuable while also giving people the ability to limit their vulnerabilities. Catalog reading lists are the quintessential example: some patrons find these helpful as a tool to review what they’ve read, while others would prefer to obscure their checkout history. It should go without saying that these options existing without any surrounding education is rather useless. Patrons need to know what’s at stake and how to use the systems at their disposal; the setting does nothing by itself. While optional workshops typically only touch a fragment of the overall student population, perhaps in-browser tips and suggestions can be presented to prompt our users to consider about the ramifications of their account’s configuration.

Relevance Ranking

Every so often, an event will happen which foregrounds the continued relevance of our profession. The most recent American election was an unmitigated disaster in terms of information literacy 3, but it also presents an opportunity for us to redouble our efforts where they are needed. Like the terrifying revelations of Edward Snowden, we are reminded that we serve communities that are constantly at risk of oppression, surveillance, and strife. As information professionals, we should strive to take on the challenge of protecting our patrons, and much of that protection occurs online. We can choose to be paralyzed by distress when faced with the state of affairs in our country, or to be challenged to rise to the occasion.

Notes

  1. To name a few examples, incoming CIA chief Mike Pompeo supports NSA bulk data collection and President-Elect Trump has been ambiguous as to whether he supports the idea of a registry or database for Muslim Americans.
  2. Library Freedom Director Alison Macrina has an excellent running Twitter thread on privacy topics which is worth consulting whether you’re an expert or novice.
  3. To note but two examples, the President-Elect persistently made false statements during his campaign and “fake news” appeared as a distinct phenomenon shortly after the election.

A High-Level Look at an ILS Migration

My library recently performed that most miraculous of feats—a full transition from one integrated library system to another, specifically Innovative’s Millennium to the open source Koha (supported by ByWater Solutions). We were prompted to migrate by Millennium’s approaching end-of-life and a desire to move to a more open system where we feel in greater control of our data. I’m sure many librarians have been through ILS migrations, and plenty has been written about them, but as this was my first I wanted to reflect upon the process. If you’re considering changing your ILS, or if you work in another area of librarianship & wonder how a migration looks from the systems end, I hope this post holds some value for you.

Challenges

No migration is without its problems. For starters, certain pieces of data in our old ILS weren’t accessible in any meaningful format. While Millennium has a robust “Create Lists” feature for querying & exporting different types of records (patron, bibliographic, vendor, etc.), it does not expose certain types of information. We couldn’t find a way to export detailed fines information, only a lump sum for each patron. To help with this post-migration, we saved an email listing of all itemized fines that we can refer to later. The email is saved as a shared Google Doc which allows circulation staff to comment on it as fines are resolved.

We also discovered that patron checkout history couldn’t be exported in bulk. While each patron can opt-in to a reading history & view it in the catalog, there’s no way for an administrator to download everyone’s history at once. As a solution, we kept our self-hosted Millennium instance running & can login to patrons’ accounts to retrieve their reading history upon request. Luckily, this feature wasn’t heavily used, so access to it hasn’t come up many times. We plan to keep our old, self-hosted ILS running for a year and then re-evaluate whether it’s prudent to shut it down, losing the data.

While some types of data simply couldn’t be exported, many more couldn’t emigrate in their exact same form. An ILS is a complicated piece of software, with many interdependent parts, and no two are going to represent concepts in the exact same way. To provide a concrete example: Millennium’s loan rules are based upon patron type & the item’s location, so a rule definition might resemble

  • a FACULTY patron can keep items from the MAIN SHELVES for four weeks & renew them once
  • a STUDENT patron can keep items from the MAIN SHELVES for two weeks & renew them two times

Koha, however, uses patron category & item type to determine loan rules, eschewing location as the pivotal attribute of an item. Neither implementation is wrong in any way; they both make sense, but are suited to slightly different situations. This difference necessitated completely reevaluating our item types, which didn’t previously affect loan rules. We had many, many item types because they were meant to represent the different media in our collection, not act as a hook for particular ILS functionality. Under the new system, our Associate Director of Libraries put copious work into reconfiguring & simplifying our types such that they would be compatible with our loan rules. This was a time-consuming process & it’s just one example of how a straightforward migration from one system to the next was impossible.

While some data couldn’t be exported, and others needed extensive rethinking in the new ILS, there was also information that could only be migrated after much massaging. Our patron records were a good example: under Millennium, users logged in on an insecure HTTP page with their barcode & last name. Yikes. I know, I felt terrible about it, but integration with our campus authentication & upgrading to HTTPS were both additional costs that we couldn’t afford. Now, under Koha, we can use the campus CAS (a central authentication system) & HTTPS (yay!), but wait…we don’t have the usernames for any of our patrons. So I spent a while writing Python scripts to parse our patron data, attempting to extract usernames from institutional email addresses. A system administrator also helped use unique identifying information (like phone number) to find potential patron matches in another campus database.

A more amusing example of weird Millennium data was active holds, which are stored in a single field on item records & looks like this:

P#=12312312,H#=1331,I#=999909,NNB=12/12/2016,DP=09/01/2016

Can you tell what’s going on here? With a little poking around in the system, it became apparent that letters like “NNB” stood for “date not needed by” & that other fields were identifiers connecting to patron & item records. So, once again, I wrote scripts to extract meaningful details from this silly format.

I won’t lie, the data munging was some of the most enjoyable work of the migration. Maybe I’m weird, but it was both challenging & interesting as we were suddenly forced to dive deeper into our old system and understand more of its hideous internal organs, just as we were leaving it behind. The problem-solving & sleuthing were fun & distracted me from some of the more frustrating challenges detailed above.

Finally, while we had a migration server where we tested our data & staff played around for almost a month’s time, when it came to the final leap things didn’t quite work as expected. The CAS integration, which I had so anticipated, didn’t work immediately. We started bumping into errors we hadn’t seen on the migration server. Much of this is inevitable; it’s simply unrealistic to create a perfect replica of our live catalog. We cannot, for instance, host the migration server on the exact same domain, and while that seems like a trivial difference it does affect a few things. Luckily, we had few summer classes so there was time to suffer a few setbacks & now that our fall semester is about to begin, we’re in great shape.

Difference & Repetition

Koha is primarily used by public libraries, and as such we’ve run into a few areas where common academic library functions aren’t implemented in a familiar way or are unavailable. Often, it’s that our perspective is so heavily rooted in Millennium that we need to think differently to achieve the same effect in Koha. But sometimes it’s clear that what’s a concern to us isn’t to other libraries.

For instance, bib records for serials with large numbers of issues is an ongoing struggle for us. We have many print periodicals where we have extensive holdings, including bound editions of past issues. The holdings display in the catalog is more oriented towards recent periodicals & displaying whether the latest few issues have arrived yet. That’s fine for materials like newspapers or popular magazines with few back issues, and I’ve seen a few public libraries using Koha that have minimalistic periodical records intended only to point the patron to a certain shelf. However, we have complex holdings like “issues 1 through 10 are bound together, issue 11 is missing, issues 12 through 18 are held in a separate location…” Parsing the catalog record to determine if we have a certain issue, and where it might be, is quite challenging.

Another example of the public versus academic functions: there’s no “recall” feature per se in Koha, wherein a faculty member could retrieve an item they want to place on course reserve from a student. Instead, we have tried to simulate this feature with a mixture of adjustments to our loan rules & internal reports which show the status of contested items. Recall isn’t a huge feature & isn’t used all the time, it’s not something we thought to research when selecting our new ILS, but it’s a great example of a minute difference that ended up creating a headache as we adapted to a new piece of software.

Moving from Millennium to Koha also meant we were shifting from a closed source system where we had to pay additional fees for limited API access to an open source system which boasts full read access to the database via its reporting feature. Koha’s open source nature has been perhaps the biggest boon for me during our migration. It’s very simple to look at the actual server-side code generating particular pages, or pull up specific rows in database tables, to see exactly what’s happening. In a black box ILS, everything we do is based on a vague adumbration of how we think the system operates. We can provide an input & record the output, but we’re never sure about edge cases or whether strange behavior is a bug or somehow intentional.

Koha has its share of bugs, I’ve discovered, but thankfully I’m able to jump right into the source code itself to determine what’s occurring. I’ve been able to diagnose problems by looking at open bug reports on Koha’s bugzilla tracker, pondering over perl code, and applying snippets of code from the Koha wiki or git repository. I’ve already submitted two bug patches, one of which has been pulled into the project. It’s empowering to be able to trace exactly what’s happening when troubleshooting & submit one’s own solution, or just a detailed bug report, for it. Whether or not a patch is the best way to fix an issue, being able to see precisely how the system works is deeply satisfying. It also makes it much easier to me to design JavaScript hacks that smooth over issues on the client side, be it in the staff-facing administrative functions or the public catalog.

What I Would Do Differently

Set clearer expectations.

We had Millennium for more than a decade. We invested substantial resources, both monetary & temporal, in customizing it to suit our tastes & unique collections. As we began testing the new ILS, the most common feedback from staff fell along the lines “this isn’t like it was in Millennium”. I think that would have been a less common observation, or perhaps phrased more productively, if I’d made it clear that a) it’ll take time to customize our new ILS to the degree of the old one, and b) not everything will be or needs to be the same.

Most of the customization decisions were made years ago & were never revisited. We need to return to the reason why things were set up a certain way, then determine if that reason is still legitimate, and finally find a way to achieve the best possible result in the new system. Instead, it’s felt like the process was framed more as “how do we simulate our old ILS in the new one” which sets us up for disappointment & failure from the start. I think there’s a feeling that a new system should automatically be better, and it’s true that we’re gaining several new & useful features, but we’re also losing substantial Millennium-specific customization. It’s important to realize that just because everything is not optimal out of the box doesn’t mean we cannot discover even better solutions if we approach our problems in a new light.

Encourage experimentation, deny expertise.

Because I’m the Systems Librarian, staff naturally turn to me with their systems questions. Here’s a secret: I know very little about the ILS. Like them, I’m still learning, and what’s more I’m often unfamiliar with the particular quarters of the system where they spend large amounts of time. I don’t know what it’s like to check in books & process holds all day, but our circulation staff do. It’s been tough at times when staff seek my guidance & I’m far from able to help them. Instead, we all need to approach the ongoing migration as an exploration. If we’re not sure how something works, the best way is to research & test, then test again. While Koha’s manual is long & quite detailed, it cannot (& arguably should not, lest it grow to unreasonable lengths) specify every edge case that can possibly occur. The only way to know is to test & document, which we should have emphasized & encouraged more towards the start of the process.

To be fair, many staff had reasonable expectations & performed a lot of experiments. Still, I did not do a great job of facilitating either of those as a leader. That’s truly my job as Systems Librarian during this process; I’m not here merely to mold our data so it fits perfectly in the new system, I’m here to oversee the entire transition as a process that involves data, workflows, staff, and technology.

Take more time.

Initially, the ILS migration was such an enormous amount of work that it was not clear where to start. It felt as if, for a few months before our on-site training, we did little but sit around & await a whirlwind of busyness. I wish we had a better sense of the work we could have front-loaded such that we could focus efforts on other tasks later on. For example, we ended up deleting thousands of patron, item, and bibliographic records in an effort to “clean house” & not spend effort migrating data that was unneeded in the first place. We should have attacked that much earlier, and it might have obviated the need for some work. For instance, if in the course of cleaning up Millennium we delete invalid MARC records or eliminate obscure item types, those represent fewer problems encountered later in the migration process.

Finished?

As we start our fall semester, I feel accomplished. We raced through this migration, beginning the initial stages only in April for a go-live date that would occur in June. I learned a lot & appreciated the challenge but also had one horrible epiphany: I’m still relatively young, and I hope to be in librarianship for a long time, so this is likely not the last ILS migration I’ll participate in. While that very thought gives me chills, I hope the lessons I’ve taken from this one will serve me well in the future.


A Reflection on Code4Lib 2016

See also: Margaret’s reflections on Code4Lib 2013 and recap of the 2012 keynote.

 


About a month ago was the 2016 Code4Lib conference in sunny Philadelphia. I’ve only been to a few Code4Lib conferences, starting with Raleigh in 2014, but it’s quickly become my favorite libraryland conference. This won’t be a comprehensive recap but a little taste of what makes the event so special.

Appetizers: Preconferences

One of the best things about Code4Lib is the affordable preconferences. It’s often a pittance to add on a preconference or two, extending your conference for a whole day. Not only that, there’s typically a wealth of options: the 2015 conference boasted fifteen preconferences to choose from, and Philadelphia somehow managed to top that with an astonishing twenty-four choices. Not only are they numerous, the preconferences vary widely in their topics and goals. There’s always intensely practical ones focused on bootstrapping people new to a particular framework, programming language, or piece of software (e.g. Railsbridge, workshops focused on Blacklight or Hydra). But there are also events for practicing your presentation or the aptly named “Getting Ready for Workshops” Workshop. One of my personal favorite ideas—though I must admit I’ve never attended—is the perennial “Fail4Lib” sessions where attendees examine their projects that haven’t succeeded and discuss what they’ve learned.

This year, I wanted to run a preconference of my own. I enjoy teaching, but I rarely get to do it in my current position. Previously, in a more generalist technologist position, I would teach information literacy alongside the other librarians. But as a Systems Librarian, it can sometimes feel like I rarely get out from behind my terminal. A preconference was an appealing chance to teach information professionals on a topic that I’ve accumulated some expertise in. So I worked with Coral Sheldon-Hess to put together a workshop focused on the fundamentals of the command line: what it is, how to use it, and some of the pivotal concepts. I won’t say too much more about the workshop because Coral wrote an excellent, detailed blog post right after we were done. The experience was great and feedback we received, including a couple kind emails from our participants, was very positive. Perhaps we, or someone else, can repeat the workshop in the future, as we put all our materials online.

Main Course: Presentations

Thankfully I don’t have to detail the conference talks too much, because they’re all available on YouTube. If a talk looks intriguing, I strongly encourage you to check out the recording. I’m not too ashamed to admit that a few went way over my head, so seeing the original will certainly be more informative than any summary I could offer.

One thing that was striking was how the two keynotes centered on themes of privacy and surveillance. Kate Krauss, Director of Communications of the Tor Project, lead the conference off. Naturally, Tor being privacy software, Krauss focused on stories of government surveillance. She noted how surveillance focuses on the most marginalized people, citing #BlackLivesMatter and the transgender community as examples. Krauss’ talk provided concrete steps that librarians could take, for instance examining our own data collection practices, ensuring our services are secure, hosting privacy workshops, and running a Tor relay. She even mentioned The Library Freedom Project as a positive example of librarians fighting online surveillance, which she posited as one of the premier civil rights issues of our time.

On the final day, Gabriel Weinberg of the search engine DuckDuckGo spoke on similar themes, except he concentrated on how his company’s lack of personalization and tracking differentiated it from companies like Google and Apple. To me, Weinberg’s talk bookended well with Krauss’ because he highlighted the dangers of corporate surveillance. While the government certainly has abused its access to certain fundamental pieces of our country’s infrastructure—obtaining records from major telecom companies without a warrant comes to mind—tech companies are also culpable in enabling the unparalleled degree of surveillance possible in the modern era, simply by collecting such massive quantities of data linked to individuals (and, all too often, by failing to secure their applications properly).

While the pair of keynotes were excellent and thematic, my favorite moments of the conference were the talks by librarians. Becky Yoose gave perhaps the most rousing, emotional talk I’ve ever heard at a conference on the subject of burnout. Burnout is all too real in our profession, but not often spoken of, particularly in such a public venue. Becky forced us all to confront the healthiness and sustainability of our work/life balance, stressing the importance not only of strong organizational policies to prevent burnout but also personal practices. Finally, Andreas Orphanides gave a thoughtful presentation on the political implications of design choices. Dre’s well-chosen, alternatingly brutal and funny examples—from sidewalk spikes that prevent homeless people from lying in doorways, to an airline website labelling as “lowest” a price clearly higher than others on the very same page—outlined how our design choices reflect our values, and how we can better align our values with those of our users.

I don’t mean to discredit anyone else’s talks—there were many more excellent ones, on a variety of topics. Dinah Handel captured my feelings best in this enthusiastic tweet:

Dessert: Community

My main enjoyment from Code4Lib is the sense of community. You’ll hear a lot of people at conferences state things like “I feel like these are my people.” And we are lucky as a profession to have plenty of strong conference options, depending on our locality, specialization, and interests. At Code4Lib, I feel like I can strike up a conversation with anyone I meet about an impending ILS migration, my favorite command-line tool, or the vagaries of mapping between metadata schemas. While I love my present position, I’m mostly a solo systems person surrounded by a few other librarians all with a different expertise. As much as I want to discuss how ludicrous the webpub.def syntax is, or why reading XSLT makes me faintly ill, I know it’d bore my colleagues to death. At Code4Lib, people can at least tolerate such subjects of conversation, if not revel in them.

Code4Lib is great not solely because of it’s focus on technology and code, which a few other library organizations share, but because of the efforts of community members to make it a pleasurable experience for all. To name just a couple of the new things Code4Lib introduced this year: while previous years have had Duty Officers whom attendees could safely report harassment to, they were announced & much more visible this year; sponsored child care was available for conference goers with small children; and a service provided live transcription of all the talks.1 This is in addition to a number of community-building measures that previous Code4Lib conferences featured, such as a series of newcomers dinners on the first night, a “share and play” game night, and diversity scholarships. Overall, it’s evident that the Code4Lib community is committed to being positive and welcoming. Not that other library organizations aren’t, but it should be evident that our profession isn’t immune from problems. Being proactive and putting in place measures to prevent issues like harassment is a shining example of what makes Code4Lib great.

All this said, the community does have its issues. While a 40% female attendance rate is fair for a technology conference, it’s clear that the intersection of coding and librarianship is more male-dominated than the rest of the profession at large. Notably, Code4Lib has done an incredible job of democratically selecting keynote speakers over the past few years—five female and one male for the past three conferences—but the conference has also been largely white, so much so that the 2016 conference’s Program Committee gave a lightning talk addressing the lack of speaker diversity. Hopefully, measures like the diversity scholarships and conscious efforts on the part of the community can make progress here. But the unbearable whiteness of librarianship remains a very large issue.

Finally, it’s worth noting that Code4Lib is entirely volunteer-run. Since it’s not an official professional organization with membership dues and full-time staff members, everything is done by people willing to spare their own time to make the occasion a great one. A huge thanks to the local planning committee and all the volunteers who made such a great event possible. It’s pretty stunning to me that Code4Lib manages to put together some of the nicest benefits of any conference—the live streaming and transcribed talks come to mind—without a huge backing organization, and while charging pretty reasonable registration prices.

Night Cap

I’d recommend Code4Lib to anyone in the library community who deals with technology, whether you’re a manager, cataloger, systems person, or developer. There’s a wide breadth of material suitable for anyone and a great, supportive community. If that’s not enough, the proportion of presentations featuring pictures of cats and/or animated gifs is higher than your average conference.

Notes

  1. Aside: Matt Miller made a fun “Overheard at Code4Lib 2016” app using the transcripts

#1Lib1Ref

A few of us at Tech Connect participated in the #1Lib1Ref campaign that’s running from January 15th to the 23rd . What’s #1Lib1Ref? It’s a campaign to encourage librarians to get involved with improving Wikipedia, specifically by citation chasing (one of my favorite pastimes!). From the project’s description:

Imagine a World where Every Librarian Added One More Reference to Wikipedia.
Wikipedia is a first stop for researchers: let’s make it better! Your goal today is to add one reference to Wikipedia! Any citation to a reliable source is a benefit to Wikipedia readers worldwide. When you add the reference to the article, make sure to include the hashtag #1Lib1Ref in the edit summary so that we can track participation.

Below, we each describe our experiences editing Wikipedia. Did you participate in #1Lib1Ref, too? Let us know in the comments or join the conversation on Twitter!


 

I recorded a short screencast of me adding a citation to the Darbhanga article.

— Eric Phetteplace


 

I used the Citation Hunt tool to find an article that needed a citation. I selected the second one I found, which was about urinary tract infections in space missions. That is very much up my alley. I discovered after a quick Google search that the paragraph in question was plagiarized from a book on Google Books! After a hunt through the Wikipedia policy on quotations, I decided to rewrite the paragraph to paraphrase the quote, and then added my citation. As is usual with plagiarism, the flow was wrong, since there was a reference to a theme in the previous paragraph of the book that wasn’t present in the Wikipedia article, so I chose to remove that entirely. The Wikipedia Citation Tool for Google Books was very helpful in automatically generating an acceptable citation for the appropriate page. Here’s my shiny new paragraph, complete with citation: https://en.wikipedia.org/wiki/Astronautical_hygiene#Microbial_hazards_in_space.

— Margaret Heller


 

I edited the “Library Facilities” section of the “University of Maryland Baltimore” article in Wikipedia.  There was an outdated link in the existing citation, and I also wanted to add two additional sentences and citations. You can see how I went about doing this in my screen recording below. I used the “edit source” option to get the source first in the Text Editor and then made all the changes I wanted in advance. After that, I copy/pasted the changes I wanted from my text file to the Wikipedia page I was editing. Then, I previewed and saved the page. You can see that I also had a typo in my text  and had to fix that again to make the citation display correctly. So I had to edit the article more than once. After my recording, I noticed another typo in there, which I fixed it using the “edit” option. The “edit” option is much easier to use than the “edit source” option for those who are not familiar with editing Wiki pages. It offers a menu bar on the top with several convenient options.

wiki_edit_menu

The menu bar for the “edit” option in Wikipeda

The recording of editing a Wikipedia article:

— Bohyun Kim


 

It has been so long since I’ve edited anything on Wikipedia that I had to make a new account and read the “how to add a reference” link; which is to say, if I could do it in 30 minutes while on vacation, anyone can. There is a WYSIWYG option for the editing interface, but I learned to do all this in plain text and it’s still the easiest way for me to edit. See the screenshot below for a view of the HTML editor.

I wondered what entry I would find to add a citation to…there have been so many that I’d come across but now I was drawing a total blank. Happily, the 1Lib1Ref campaign gave some suggestions, including “Provinces of Afghanistan.” Since this is my fatherland, I thought it would be a good service to dive into. Many of Afghanistan’s citations are hard to provide for a multitude of reasons. A lot of our history has been an oral tradition. Also, not insignificantly, Afghanistan has been in conflict for a very long time, with much of its history captured from the lens of Great Game participants like England or Russia. Primary sources from the 20th century are difficult to come by because of the state of war from 1979 onwards and there are not many digitization efforts underway to capture what there is available (shout out to NYU and the Afghanistan Digital Library project).

Once I found a source that I thought would be an appropriate reference for a statement on the topography of Uruzgan Province, I did need to edit the sentence to remove the numeric values that had been written since I could not find a source that quantified the area. It’s not a precise entry, to be honest, but it does give the opportunity to link to a good map with other opportunities to find additional information related to Afghanistan’s agriculture. I also wanted to chose something relatively uncontroversial, like geographical features rather than historical or person-based, for this particular campaign.

— Yasmeen Shorish

WikiEditScreenshot

Edited area delineated by red box.


Creating a Flipbook Reader for the Web

At my library, we have a wonderful collection of artist’s books. These titles are not mere text, or even concrete poetry shaping letters across a page, but works of art that use the book as a form in the same way a sculptor might choose clay or marble. They’re all highly inventive and typically employ an interplay between language, symbol, and image that challenges our understanding of what a print volume is. As an art library, collecting these unique and inspiring works seems natural. But how can we share our artists’ books with the world? For a while, our work study students have been scanning sets of pages from the books using a typical flatbed scanner but we weren’t sure how to present these images. Below, I’ll detail how we came to fork the Internet Archive’s Bookreader to publish our images to the web.

Choosing a Book Display

Our Digital Scholarship Librarian, Lisa Conrad, led the artists’ book project. She investigated a number of potential options for creating interactive “flipbooks” out of our series of images. We had a few requirements we were looking for:

  • mobile friendly — ideally, our books would be able to be read on any device, not just desktops with Adobe Flash
  • easy to use — our work study staff shouldn’t need sophisticated skills, like editing code or markup, to publish a work
  • visually appealing — obviously we’re dealing with works of art, if our presentation is hideous or obscures the elegance of the original then it’s doing more harm than good
  • works with our repository — ultimately, the books would “live” in our institutional repository, so we needed software that would emit something like static HTML or documents that could be easily retained in it

These are not especially stringent requirements, in my mind. Still, we struggled to find decent options. The mobile limitation restricted things quite a bit; a surprising number of apps used flash or published in desktop-first formats. We felt our options to customize and simplify the user interface of other apps was too small. Even if an app spat out a nice bundle of HTML and assets that we could upload, the HTML might call out to sketchy external services or present a number of options we didn’t want or need. While I was at first hesitant to implement an open source piece of software that I knew would require much of my time, ultimately the Internet Archive’s Bookreader became the frontrunner. We felt more comfortable using an established, free piece of software that doesn’t require any server-side component (it’s all JavaScript!). My primary concern was that the last commit to the bookreader code base was over a half year ago.

Implementing the Internet Archive Bookreader

The Bookreader proved startlingly easy to implement. We simply copied the code from an example given by the Internet Archive, edited several lines of JavaScript, and were ready to go. Everything else was refinement. Still, we made a few nice decisions in working on the Bookreader that I’d like to share. If you’re not a coder, feel free to skip the rest of this section as it may be unnecessarily detailed.

First off, our code is all up on GitHub. But it won’t be useful to anyone else! It contains specific logic based on how our IR serves up links to the bookreader app itself. But one of the smarter moves, if I may indulge in some self-congratulation, was using submodules in git. If you know git then you know it’s easily one of the best ways to version code and manage projects. But what if you app is just an implementation of an existing code base? There’s a huge portion of code you’re borrowing from somewhere else, and you want to be able to segment that off and incorporate changes to it separately.

As I noted before, the bookreader isn’t exactly actively developed. But it still benefits us to separate our local modifications from the stable, external code. A submodule lets us incorporate another repository into our own. It’s the clean way of copying someone else’s files into our own version. We can reference files in the other repository as if they’re sitting right beside our own, but also pull in updates in a controlled manner without causing tons of extra commits. Elegant, handy, sublime.

Most of our work happened in a single app.js file. This file was copied from the Internet Archive’s provided example and only modified slightly. To give you a sense of how one modifies the original script, here’s a portion:

// Create the BookReader object
br = new BookReader();
// Total number of leafs
br.numLeafs = vaultItem.pages;
// Book title & URL used for the book title link
br.bookTitle= vaultItem.title;
br.bookUrl  = vaultItem.root + 'items/' + vaultItem.id + '/' + vaultItem.version + '/';
// how does the bookreader know what images to retrieve given a page number (index)?
br.getPageURI = function(index, reduce, rotate) {
    // reduce and rotate are ignored in this simple implementation, but we
    // could e.g. look at reduce and load images from a different directory
    var url = vaultItem.root + 'file/' + vaultItem.id + '/' + vaultItem.version + '/' + vaultItem.filenames + (index + 1) + '.JPG';
    return url;
}

The Internet Archive’s code provides us with a Bookreader class. All we do is instantiate it once and override certain properties and methods. The code above is all that’s necessary to display a particular image for a given page number; the vaultItem object (VAULT is the name of our IR) consists of information about a single artists’ book, like the number of pages, its title, its ID and version within the IR. The bookreader app cobbles together these pieces of info to figure out it should display images given a page or pair of pages. The getPageURI function is mostly working with a single index argument, while the group of concatenated strings for the URL are related to how our repository stores files. It’s highly specific to the IR we’re using, but not terribly complicated.

The Bookreader itself sits on a web server outside the IR. Since we cannot publicly share almost all of these books (more on this below), we restrict access to the images to users who are authenticated with the IR. So how can the external reader app serve up images of books within the repository? We expect people to discover the books via our library catalog, which links to the IR, or within the IR itself. Once they sign in, our IR’s display templates contain specially crafted URLs that pass the bookreader information about the item in our repository via their query string. Here’s a shortened example of one query string:

?title=Crystals%20to%20Aden&id=17c06cc5-c419-4e77-9bdb-43c69e94b4cd&pages=26

From there, the app parses the query string to figure out that the book’s title is Crystals to Aden, its ID within the IR is “17c06cc5-c419-4e77-9bdb-43c69e94b4cd”, and it has twenty six pages. We store those values in the vaultItem object referenced in the script above. That object contains enough information for the bookreader to determine how to retrieve images from the IR for each page of the book. Since the user has already authenticated with the IR when they discovered the URL earlier, the IR happily serves up the images.

Ch-ch-ch-challenges

Easily the most difficult part of our artists’ books project has been spreading awareness. It’s a cool project that’s required tons of work all around, from the Digital Scholarship Librarian managing the project, to our diligent work study student scanning images, to our circulation staff cataloging them in our IR, to me solving problems in JavaScript while my web design student worker polished the CSS. We’re proud of the result, but also struggling to share it outside of our walls. Our IR is great at some things but not particularly intuitive to use, so we cannot count on patrons stumbling across the artists’ books on their own very often. Furthermore, putting anything behind a login is obviously going to increase the number of people who give up before accessing it. Not being on the campus Central Authentication Service only exacerbates this for us.

The second challenge is—what else!—copyright. We don’t own the rights to any of these titles. We’ve been guerrilla digitizing them without permission from publishers. For internal sharing, that’s fine and well under Fair Use; we’re only showing excerpts and security settings in our IR ensure they’re only visible to our constituents. But I pine to share these gorgeous creations with people outside the college, with my social networks, with other librarians. Right now, we only have permission to share Crystals to Aden by Michael Bulteau (thanks to duration press for letting us!). Which is great! Except Crystals is a relatively text-heavy work; it’s fine poetry, yes, but not the earth-shattering reinterpretation of the codex I promised you in my opening paragraph.

Hopefully, we can secure permissions to share further works. Internally, we’ve pushed out notices on social media, the LMS, and email lists to inform people of the artists’ books. They’re also available for checkout, so hopefully our digital teasers bring people in to see the real deal.

Lasting Problems

And that’s something that must be mentioned; our digital simulations are definitely not the real deal. Never has a set of titles been so resistant to digitization. Our work study has even asked us “how could I possibly separate this work into distinct pairs of pages” about a work which used partially-overlapping leaves that look like vinyl record sleeves.1 Artists deliberately chose the printed book form and there’s a deep sacrilege in trying to digitally represent that. We can only ever offer a vague adumbration of what was truly intended. I still see value in our bookreader—and the works often look brilliant even as scanned images—but there are fundamental impediments to its execution.

Secondly, scanned images are not text. Many works do have text on the page, but because we’re displaying images in the browser users cannot select the text to copy it. Alongside each set of images in our IR, we also have a PDF copy of all the pages with OCR‘d text. But a decision was made not to make the PDFs visible to non-library staff, since they would be easy to download and disseminate (unlike the many discrete images in our bookreader). This all adds up to make the bookreader very inaccessible; its all images, there’s no feasible way for us to associate alt text with each one, and the PDF copy that might be of interest to visually-impaired users is hidden.

I’ve ended on a sour note, but digitizing our artists’ books and fiddling with the Internet Archive’s Bookreader was a great project. Fun, a bit challenging, with some splendid results. We have our work cut out for us if we want to draw more attention to the books and have them be compelling for everyone. But other libraries in similar situations may find the Bookreader to be a very viable, easy-to-implement solution. If you don’t have permissions issues and are dealing with more traditional works, it’s built to be customizable yet offers a pleasant reading experience.

Notes

  1. Paradoxic Mutations by Margot Lovejoy

Wikipedia, Libraries, & Neutrality

This piece is substantially based off of a column I wrote for RUSQ that will appear early next year.


Broadly construed, there are two camps of opinions surrounding Wikipedia in librarianship. The first is that Wikipedia is not an academic quality source. It is something students need to be warned away from. The community authorship of wikipedia is typically the source of this criticism; since “anyone can edit” the online encyclopedia, there’s no way to tell whether you’re receiving high quality information written by experts or the ill-informed opinions of internet trolls.

The other camp is far more positive regarding Wikipedia. After all, the values of the Wikipedian community clearly align with our own. The ultimate goal of the project is not to give a venue for poor research or fringe theories, but to enable free and open access to information. While Wikipedia isn’t flawless, many of its articles compete with more established reference sources. Famously, Nature performed a comparison of scientific articles and found Wikipedia to be comparable to Encyclopedia Britannica. But following the study, a community project to correct the identified errors in Wikipedia sprung up, fixing them all in a little over a month.1 With this common ground, it’s no wonder that libraries have found Wikipedia to be a valuable partner in publicizing our content. Articles like Using Wikipedia to Extend Digital Collections, Putting the Library in Wikipedia, and Wikipedia Lover, Not a Hater: Harnessing Wikipedia to Increase the Discoverability of Library Resources all discuss the value of working with Wikipedia to highlight library digital collections and metadata, not against.

A good example of Wikipedia driving traffic to library special collections was brought to my attention by fellow Tech Connect author Margaret Heller, who pointed me towards the Google Analytics Usage Reports for CARLI Digital Collections. CARLI regularly sees Wikipedia as one of the top external traffic sources, with Wikipedia being noted in last few quarterly reports as a traffic source leading “to home pages or images from multiple CARLI Collections”.

The Problem with Neutral

I must admit that I side with the pro-Wikipedia folks. I love using Wikipedia as a resource—as a procrastination-enabler I have open several articles on roguelikes that I’m reading while I right this—and I love contributing to it in whatever small ways I can, from fixing broken markup to citation hunting. But Wikipedia is imperfect in ways more insidious that its anonymous authorship or occasionally inaccurate details. Rather, one problem lies within one of the five pillars that define the philosophy of Wikipedia; “Wikipedia is written from a neutral point of view”.

Neutrality has been under fire from #critlib lately, a group of librarians emphasizing critical theory especially with respect to information literacy instruction. ALA Annual featured a well-attended presentation entitled “But We’re Neutral!” And Other Librarian Fictions Confronted by #critlib. A seminal article appearing earlier this year in Code4Lib by Bess Sadler and Chris Bourg begins with a rousing section labelled Libraries are not Neutral:

In spite of the pride many libraries take in their neutrality, libraries have never been neutral repositories of knowledge. Research libraries in particular have always reflected the inequalities, biases, ethnocentrism, and power imbalances that exist throughout the academic enterprise through collection policies and hiring practices that reflect the biases of those in power at a given institution. In addition, theoretically neutral library activities like cataloging have often re-created societal patterns of exclusion and inequality.

Wikipedia shares this problem; while it appears neutral on the surface, its topical coverage and treatment of subjects reflect the skewed power relations of our society. A 2011 study found that 91% of editors were men. The same study shows that few editors come from the Global South and that the English Wikipedia receives far more focus than other languages. Another research paper from 2011 goes a bit further in demonstrating that “male articles are significantly longer than female articles”.2 The editorial gender gap has real effects on the encyclopedia’s content; it’s not just that having editors of all genders is good in its own right, it’s that Wikipedia’s claims to objectivity and neutrality are jeopardized by the disbalance.

Art+Feminism

So what’s a concerned librarian to do? I think the wrong thing would be to denounce Wikipedia. For one, the alternative sources we or our patrons would turn to are no less problematic. Encyclopaedia Britannica has a well-documented history of supposedly scientific articles written from the dominant viewpoint (c.f. the racist description of black people in the 1911 edition). What’s more, Wikipedia is going to be used. It’s massively popular. Sticking our heads in the sand because it doesn’t live up to its own standards of neutrality improves nothing.

Luckily, there are several Wikipedia projects focused on recruiting editors from underrepresented groups and addressing lackluster coverage of particular topics which we can support. One such project is Art + Feminism. In its own words:

Art+Feminism is a rhizomatic campaign to improve coverage of women and the arts on Wikipedia, and to encourage female editorship…Content is skewed by the lack of female participation. Many articles on notable women in history and art are absent on Wikipedia. This represents an alarming aporia in an increasingly important repository of shared knowledge.

Art+Feminism started out by hosting an edit-a-thon out of the Eyebeam Art and Technology Center in New York City in 2014, with more than 30 other locations worldwide joining in. An edit-a-thon is an event where people gather to perform Wikipedia edits, often centered around a particular theme or project. Higher education institutions and libraries make perfect partners for such occasions. We typically have useful materials to be cited and our students form a large body of potential participates who can be easily incentivized to join in, whether with extra credit or simply some food. So when my library heard of the upcoming 2nd annual edit-a-thon, we immediately began planning to host it. Below, I’ll briefly outline what we did in the hopes it’ll encourage other institutions to join us during this year’s edit-a-thon on March 5th.

Hosting an Edit-a-thon

First, we set up a meetup page on Wikipedia. If you’re unfamiliar with Wikipedia, creating a page like this isn’t a struggle. For one, you can simply copy the entire source markup of someone else’s meetup, then edit your specific details into that skeleton. For two, you can enable the experimental Visual Editor to make Wikipedia even easier to edit without learning wiki markup. The meetup page is an important place for putting up information like timing and directions, but is also a place for us to talk about the impact we made by showing how many editors attended and what articles we improved or created.

While we were putting initial details on our meetup page, we set about securing a location on the date of the edit-a-thon. We discovered that a gallery associated with our school had hosted the edit-a-thon the prior year, but they were unable to repeat it. Our school has campuses in both San Francisco and Oakland, but since Oakland had no other edit-a-thon locations we decided to host it there with the idea that SF locals already had an event nearby.

Our next steps aimed to make the event as easy for newcomers as possible. A staff member pulled relevant materials from our collection, so that researching would be simple and our rarer, more valuable resources might lend some of their information to Wikipedia. We also wanted experienced editors to be on hand during the event. I looked at a local Wikiproject, asking for help on the Talk page and then corresponding directly with a couple editors who had expressed interest. Finally, we managed to secure a visit from someone who actually works for the Wikimedia Foundation that runs the encyclopedia.

During the day, we set out snacks and name badges for everyone. Similar to the color-coded name badges at Ada Camp and other tech conferences, Art+Feminism recommended giving out name badges which signify one’s willingness to be photographed: green meant feel free to take a photo, orange meant please ask permission first, and red meant absolutely not. These steps ensure that everyone is comfortable at the event and not being exposed on the internet against their will. At the start of the day, we explained the coloring system and the Wikimedia staff person gave a short talk on how to write articles that endure, standing up to scrutiny over time. The results of the global event are listed on Wikipedia.

While I feel we accomplished much at our local event, there was one negative experience. An image uploaded to Wikimedia Commons for use on a page was flagged for deletion. The editor flagging it said something along the lines of “this isn’t your personal photo album” as the image was a headshot of a female artist. In the ensuing discussion around the proposed deletion, I noted that the image was about to be used on an article. It was never removed from Commons. Still, the incident underscores cultural problems in Wikipedia. The confrontational style of the discussion lacked good faith. Further, I heard a gendered undertone in the editor’s response; how many pictures of white men are derided as personal photos? While our library staff person was undeterred, moments of hostility like these drive away newcomers.

In Which I Admit I’m Missing the Point

To be fair, Wikipedia itself acknowledges that it fails to live up to neutral status. That the encyclopedia strives towards neutrality is the more vital point. But it’s not as if reaching a supposedly perfect neutrality resolves the issues that folks like #critlib are highlighting. Neutral as a positive value is precisely the problem, because there is no neutral stance that can be taken from outside of society’s power relations and history of inequity. So should we instead be agitating for Wikipedia to become less neutral and take more active stances on social issues? Or are edit-a-thons like Art+Feminism the most viable route towards ensuring topics are covered in a way that surfaces marginalized peoples and their experiences? I have no answers, just food for thought.

  1. See External peer review/Nature December 2005 for Wikipedia’s internal take on the correction process. The original Nature article is paywalled here and is doi:10.1038/438900a. A similar but open access study is doi:10.1371/journal.pone.0106930, Accuracy and Completeness of Drug Information in Wikipedia: A Comparison with Standard Textbooks of Pharmacology.
  2. WP:Clubhouse? An Exploration of Wikipedia’s Gender Imbalance. I’m skeptical of how “male” and “female” articles were defined, but the paper itself is thorough in its argumentation and statistical analysis. It’s also worth noting that this paper, and much of the other literature around the gender gap, ignores genders outside the female-male binary. It’s most useful to conceive of the gap as a disproportionate majority of males than a minority of females, while leaving all other genders out of the picture.

A Forray into Publishing Open Data on GitHub

While we’ve written about using GitHub for publishing before, in this post I will explore publishing data on GitHub, as opposed to a presentation or academic paper. There are a few services where one can publish research data—FigShare comes to mind—but I wanted to try GitHub because I’m already familiar with the service, it seems suitable for publishing data alongside the scripts used to obtain and process it, and its focus on version control makes it particularly apt for publishing a work in progress. However, even with free services like GitHub available, open data still has hurdles to overcome. How can I, a lowly librarian with no grant funding or experience in this area, publish an open data set such that others can locate and reuse it? Let’s find out.

Background

As Lauren introduced in her last post, we here at ACRL Tech Connect are performing research into coding in libraries; how people learn to code, what learning resources they use, what languages they use. As part of this research, I wanted to compare what our survey respondents reported with a bulk analysis of GitHub repositories under library organizations. The Code4Lib wiki has an excellent page listing many library GitHub accounts, and GitHub has a nice API that reports, among many other things, the various languages used in a project. Those two sources of information seemed like a perfect match, so I wrote a few scripts to mash them together.

Publishing scripts that extract and analyze data is important. One cannot trust the results of a single scientific experiment or a poor sample set. Providing the programs used to collect data aims to allow reproducibility so future researchers can verify or build upon prior results. While we perhaps think of science as being quite established by now, data and reproducibility are major issues in most fields. Ask any data librarian and they’ll tell you; managing the preservation and distribution of research data is not a solved or simple problem. Furthermore, every so often another meta-research study will show that only some dismal percentage of experiments can be replicated.1

My own data is not so valuable. No cure for a debilitating disease rests on the number of bytes of Standard ML in your university’s GitHub account. But on principle I want to let my results be repeatable and, what’s more, if someone does find an error in my scripts or data I want it corrected. Even if my initial conclusions are off, someone might be able to construct a stronger study from their basis.

Step #1: Obtain a DOI

As the first step of publishing my data, I wanted to obtain a Digital Object Identifier. Sure, putting my work up on GitHub gives me a URL I can reference, but leaving it at that adds a lot of uncertainty. What if I change my GitHub username, which is contained in the URL? What if I transfer the repository to a new owner? What if GitHub goes out of business? While none of these are likely scenarios, they’re still worth guarding against. DOI providers essentially stick to a pact that their identifiers will continue to work for perpetuity. While that’s not always the case, I feel like grabbing a DOI is still The Right Thing To Do for pubishers at the present moment.

We can use Zenodo to secure a DOI. GitHub already has a fine guide named Making Your Code Citable, but I’ll lightly outline the process here.

First, we create a Zenodo account reusing our GitHub credentials. Zenodo will list out our repositories and we can click the On button next to one to ready it for publishing. This button establishes a “web hook” between events happening in that GitHub repository and Zenodo; when we go to publish a release, Zenodo will be aware of it.

This was the only step that tripped me up a bit. GitHub’s “releases” are not a part of the git version control system, they’re an added feature of the hosting environment. But in my mind they’re identical to git’s “tags” that one uses to label particular points in a repository’s history. Indeed, when we push a tag up to GitHub, it’ll show up on our repository’s releases page. But it appears tags are not technically releases, or don’t trigger the right web hook, because when I pushed a typical “v1.0.0” tag to GitHub, Zenodo didn’t notice. Instead, I had to go to my releases page, Draft a new release, and then Choose an existing tag to associate the version tag with a GitHub release. The title and description entered at this stage are available later in Zenodo.

The final step is back in Zenodo, where we can mint a DOI and describe our project further. We have a powerful set of fields for describing our project in Zenodo, including type (e.g. data set, software, presentation, publication), publication date, list of authors, open-ended description, list of keywords, access rights, license, funding agency, alternative identifiers (e.g. PubMed ID), and more. Zenodo also has a “communities” feature where we can deposit our research in a collection with a disciplinary focus; I put my data in the “Library and Information Science” group.

Step #2: Document the Data’s Schema

Obtaining a DOI is fine and all, but I also wanted to document my data more thoroughly. While it’s not a complicated data set, I’m familiar with the challenges that an unknown data schema presents for end users. All too often at work, I’m forced to revise data processing routines because a new outlier appears. There’ll be a string of text where I’m expecting only integers, a blank entry in what I thought was a required field, or an ID that doesn’t conform to the anticipated pattern (punctuation appears in a barcode! a random letter prefixes an otherwise numeric ID!).

To make our data’s structure clear, we can use the Data Package standard from the Open Knowledge Foundation (OKFN), specifically the Tabular Data Package subset which was designed for the CSV (comma-separated values) format.2 Documenting our data is straightforward with these standards; we place a “datapackage.json” file alongside our data files and fill in a few fields. Here’s an example:

{
    "name": "libs-github-api", // must be URL-friendly, e.g. no spaces 
    "description": "library GitHub projects",
    "license": "CC0 Public Domain",
    "keywords": ["libraries", "programming languages"],
    "resources": [ // list of files 
        {
            "name": "summary",
            "path": "data/summary.csv", // UNIX-style path relative to datapackage.json 
            "format": "csv",
            "mediatype": "text/csv",
            "schema": { // outline of fields within this file 
                "fields": [
                    {
                        "name": "language",
                        "type": "string", /* from a controlled list of data types
                        could also be integer, number, date, etc. */
                        "description": "name of the programming language",
                        "constraints": {
                            "required": true,
                            "unique": true
                        }
                    }
                    // our schema would list a few more fields here… 
                ]
            }
        }
    ]
}

Note that the comments above aren’t valid in JSON, I include them simply to provide some inline explanation.

While it requires a little reading to figure out how to fill out datapackage.json fields, many are self-evident. The appeal of the standard becomes evident in the schema section; we can tell consumers what types to expect from our data and other particularities of a given CSV column. Does a column contain empty values? Then required will be absent or explicitly set to false. Does a column contain both integers and text? Then the type “string” warns consumers not to anticipate only numeric values. What’s more, we can provide a regular expression in the pattern constraint which specifies exactly how a field may be formatted. Even strange barcodes with surprise punctuation could be documented precisely.

I would say there’s a lot more to the Data Package standards, but the truth is they’re elegant and concise. One can read all three (Data Package, Tabular Data Package, and JSON Table Schema) in a matter of minutes, look at an instructive example or two, and be ready to reveal their data’s structure in a standardized way. There is great depth available in the way one describes individual resources and their data schema.

Why spend all this time with a data package when we’ve already done something similar with Zenodo? The data package documentation solves a couple problems. First of all, packaging up our CSV alongside structured data about its nature addresses findability. There’s tons of open data out there, the issue is it can be scattered and difficult to find. If someone is looking for statistics on programming language usage, how would they go about finding my data? Searching GitHub will be challenging; the keywords one uses (“programming language”, the ambiguous “libraries”, etc.) will likely retrieve many repositories which don’t contain open data, and GitHub, while it does have a decent advanced search form, doesn’t have the facets to make retrieving a particular data set straightforward. One cannot, for instance, filter search results by a repository’s license or the format (CSV, JSON) of the data contained therein.

Data packages address the issue of findability by providing for the possibility of a registry that aggregates all the data sets it knows about. Once a datapackage.json appears, suddenly information like whether the format is CSV or JSON, what the license is, who created it, and what subject keywords are related to a repository become clear. The Open Knowledge Foundation already has a strong proof-of-concept registry, albeit one that lists only around a hundred data sets.

Since Data Package is an open standard, any third-party can easily parse its metadata and provide search facets based on the fields that are present. This is how the standard addresses a second issue; machine readability. Documenting data sets is good, necessary even, but it often only helps humans. I can write a five-page paper meticulously detailing my data’s collection methodology and structure, but that’s asking researchers to do a lot of reading. Now consider that their research might be on a grand scale; imagine if they needed to read a hundred five page papers describing ad hoc data schemas!

Instead, creating a machine readable description lets my data be processed quickly by a specially designed program. As a somewhat trivial example, I already used the OKFN’s Data Package Validator to ensure my schema documentation met their standard. As a more interesting use case, the OKFN also defines an optional “views” section of the data package standard which allows applications to automatically create charts from our data.

Reflection

While I’m glad that tools like Zenodo and standards like Data Package exist for publishing data, there’s still a lot of work to be done in this arena. Every time I make a new release on GitHub, which arguably should happen with even minor changes to my data or scripts, I have to refill the extensive Zenodo form. Zenodo also doesn’t detect the GitHub repository’s license, which is hardly blameworthy given that that information isn’t present as structured data but mere text in a readme file. However, when publishing a new version of the same underlying data, it doesn’t fill in the license or other information from its own previous items.

There’s a ton of efficiency left on the table in the data publishing process this post describes. Specifically, an integration between Zenodo and the datapackage.json metadata would alleviate a number of problems. Rather than repeatedly filling out a form in Zenodo, one could simply ensure changes were reflected in the datapackage.json and publish a new version on GitHub. Many fields between the two are redundant, though each also has its unique value; Zenodo asks for typical academic publishing information (e.g. publication type, links to prior versions) while Data Package asks for a data schema.

As a final area of concern, the open-ended “license” field is going to eventually limit the utility of the machine readable information in a Data Package.3 Perhaps this is my inner librarian unnecessarily freaking out here, but uncontrolled fields which affect resource reuse are bad news. Defaulting to authors specifying an arbitrary string of text as a license is precisely the problem that the Digital Public Library of America and other large digital libraries are facing, as their corpuses contain thousands of different rights statements.4 Zenodo provides a substantial list of licenses to choose from, but then does a poor job of automatically detecting one even if hints are available via GitHub or a previous incarnation of the publication. GitHub itself should probably make licenses for repositories required and controlled as I could see that being a vital facet in their advanced search as well as interesting data to expose to researchers via their API.

  1. I read something about this a month or two ago, but wasn’t able to relocate the source. Scouring the web, there’s a Washington Post article from January on the phenomenon of irreproducible research, which in turn points to a PLoS Med article from 2005 “Why Most Published Research Findings Are False“. Other studies along these lines are “A Survey on Data Reproducibility in Cancer Research Provides Insights into Our Limited Ability to Translate Findings from the Laboratory to the Clinic” in PLoS ONE and “Drug development: Raise standards for preclinical cancer research” in Nature.
  2. I might have been able to explore another intriguing project, Research Objects, which has the apt tagline “enabling reproducible, transparent research.” However, the Data Package standards were so easy to find and follow conceptually that I chose them.
  3. And to be fair, I did see one example where the licenses JSON property was specified as an array of objects containing a license name and URL, which might be easier to consume in script depending on what’s available at the URL.
  4. Aside: I don’t mean to argue that arbitrary license strings should be prohibited, because no controlled vocabulary is going to enumerate all possible choices. But there’s a lot of good work being done to make licenses easier to specify—think of Creative Commons with their composable, versioned licenses which can be referred to by URL. Defaulting to a controlled list of license types or at least pointing to a preferred vocabulary would help here.