Reflections on Code4Lib 2013

Disclaimer: I was on the planning committee for Code4Lib 2013, but this is my own opinion and does not reflect other organizers of the conference.

We have mentioned Code4Lib before on this blog, but for those who are unfamiliar, it is a loose collective of programmers working in libraries, librarians, and others interested in code and libraries. (You can read more about it on the website.) The Code4Lib conference has emerged as a venue to share very new technology and have discussions with a wide variety of people who might not attend conferences more geared to librarians. Presentations at the conference are decided by the votes of anyone interested in selecting the program, and additionally lightning talks and breakout sessions allow wide participation and exposure to extremely new projects that have not made it into the literature or to conferences with a longer lead time. The Code4Lib 2013 conference ran February 11-14 at University of Illinois Chicago. You can see a list of all programs here, which includes links to the video archive of the conference.

While there were many types of projects presented, I want to focus on the talks that illustrated what I saw as a thread running through the conference–care and emotion. This is perhaps unexpected for a technical conference. Yet those themes underlie a great deal of the work that takes place in academic library technology and the types of projects presented at Code4Lib. We tend to work in academic libraries because we care about the collections and the people using those collections. That intrinsic motivation focuses our work.

Caring about the best way to display collections is central to successful projects. Most (though not all) of the presenters and topics came out of academic libraries, and many of the presentations dealt with creating platforms for library and archival metadata and collections. To highlight a few: Penn State University has developed its own institutional repository application called ScholarSphere that provides a better user experience for researchers and managers of the repository. The libraries and archives of the Rock and Roll Hall of Fame dealt with the increasingly common problem of wanting to present digital content alongside more traditional finding aids, and so developed a system for doing so. Corey Harper from New York University presented an extremely interesting and still experimental project to use linked data to enrich interfaces for interacting with library collections. Note that all these projects combined various pieces of open source software and library/web standards to create solutions to a problem facing academic or research libraries in a particular setting. I think an important lesson for most academic librarians looking at descriptions of projects like this is that it takes more than development staff to make them happen. It takes purpose, vision, and dedication to collecting and preserving content–in other words, emotion and care. A great example of this was the presentation about DIYHistory from the University of Iowa. This project started out as an extremely low-tech solution for crowdsourcing archival transcription, but got so popular that it required a more robust solution. They were able to adapt open source tools to meet their needs, keeping the project well within the means of most libraries (the code is here).

Another view of emotion and care came from Mark Matienzo, who did a lightning talk (his blog post gives a longer version with more details). His talk discussed the difficulties of acknowledging and dealing with the emotional content of archives, even though emotion drives interactions with materials and collections. The records provided are emotionless and affectless, despite the fact that they represent important moments in history and lives. The type of sharing of what someone “likes” on Facebook does not satisfactorily answer the question of what they care about, or represent the emotion in their lives. Mark suggested that a tool like Twine, which allows writing interactive stories, could approach the difficult question of bringing together the real with the emotional narrative that makes up experience.

One of the ways we express care for our work and for our colleagues is by taking time to be organized and consistent in code. Naomi Dushay of Stanford University Library presented best practices for code handoffs, describing some excellent practices for documenting and clarifying code and processes. One of the major takeaways is that being clear, concise, and straightforward is always preferable, however much we want to create cute names for our servers and classes. To preserve a spirit of fun, you can use the cute name and attach a description of what the item actually does.

Originally Bess Sadler, also from Stanford, was going to present with Naomi, but ended up presenting a different talk, the last one of the conference, on Creating a Commons (the full text is available here). This was a very moving look at what motivates her to create open source software and how to create better open source software projects. She used the framework of the Creative Commons licenses to discuss open source software–that it needs to be “[m]achine readable, human readable, and lawyer readable.” Machine readable means that code needs to be properly structured and allow for contributions from multiple people without breaking; lawyer readable means that the project should have the correct structure and licensing to allow collaboration across institutions. Bess focused particularly on the “human readable” aspect of creating communities and understanding the “hacker epistemology,” as she so eloquently put it, “[t]he truth is what works.” Part of understanding that requires being willing to reshape default expectations–for instance, the Code4Lib community developed a Code of Conduct at Bess’s urging to underline the fact that the community aims at inclusion and creating a safe space. She encouraged everyone to keep working to do better and “file bug reports” about open source communities.

This year’s Code4Lib conference was a reminder of why I do the work I do as an academic librarian in a technical role. Even though I may spend a lot of time sitting in front of a computer looking at code, or workflows, or processes, I know that work makes access to and exploration of the collections better.

Wikipedia, Libraries, & Neutrality

This piece is substantially based on a column I wrote for RUSQ that will appear early next year.

Broadly construed, there are two camps of opinion surrounding Wikipedia in librarianship. The first is that Wikipedia is not an academic-quality source. It is something students need to be warned away from. The community authorship of Wikipedia is typically the source of this criticism; since “anyone can edit” the online encyclopedia, there’s no way to tell whether you’re receiving high-quality information written by experts or the ill-informed opinions of internet trolls.

The other camp is far more positive regarding Wikipedia. After all, the values of the Wikipedian community clearly align with our own. The ultimate goal of the project is not to give a venue for poor research or fringe theories, but to enable free and open access to information. While Wikipedia isn’t flawless, many of its articles compete with more established reference sources. Famously, Nature performed a comparison of scientific articles and found Wikipedia to be comparable to Encyclopaedia Britannica. But following the study, a community project to correct the identified errors in Wikipedia sprang up, fixing them all in a little over a month.1 With this common ground, it’s no wonder that libraries have found Wikipedia to be a valuable partner in publicizing our content. Articles like Using Wikipedia to Extend Digital Collections, Putting the Library in Wikipedia, and Wikipedia Lover, Not a Hater: Harnessing Wikipedia to Increase the Discoverability of Library Resources all discuss the value of working with Wikipedia, not against it, to highlight library digital collections and metadata.

A good example of Wikipedia driving traffic to library special collections was brought to my attention by fellow TechConnect author Margaret Heller, who pointed me towards the Google Analytics Usage Reports for CARLI Digital Collections. CARLI regularly sees Wikipedia as one of the top external traffic sources, with Wikipedia being noted in the last few quarterly reports as a traffic source leading “to home pages or images from multiple CARLI Collections”.

The Problem with Neutral

I must admit that I side with the pro-Wikipedia folks. I love using Wikipedia as a resource—as a procrastination-enabler I have several articles on roguelikes open that I’m reading while I write this—and I love contributing to it in whatever small ways I can, from fixing broken markup to citation hunting. But Wikipedia is imperfect in ways more insidious than its anonymous authorship or occasionally inaccurate details. Rather, one problem lies within one of the five pillars that define the philosophy of Wikipedia: “Wikipedia is written from a neutral point of view”.

Neutrality has been under fire from #critlib lately, a group of librarians emphasizing critical theory especially with respect to information literacy instruction. ALA Annual featured a well-attended presentation entitled “But We’re Neutral!” And Other Librarian Fictions Confronted by #critlib. A seminal article appearing earlier this year in Code4Lib by Bess Sadler and Chris Bourg begins with a rousing section labelled Libraries are not Neutral:

In spite of the pride many libraries take in their neutrality, libraries have never been neutral repositories of knowledge. Research libraries in particular have always reflected the inequalities, biases, ethnocentrism, and power imbalances that exist throughout the academic enterprise through collection policies and hiring practices that reflect the biases of those in power at a given institution. In addition, theoretically neutral library activities like cataloging have often re-created societal patterns of exclusion and inequality.

Wikipedia shares this problem; while it appears neutral on the surface, its topical coverage and treatment of subjects reflect the skewed power relations of our society. A 2011 study found that 91% of editors were men. The same study shows that few editors come from the Global South and that the English Wikipedia receives far more focus than other languages. Another research paper from 2011 goes a bit further in demonstrating that “male articles are significantly longer than female articles”.2 The editorial gender gap has real effects on the encyclopedia’s content; it’s not just that having editors of all genders is good in its own right, it’s that Wikipedia’s claims to objectivity and neutrality are jeopardized by the imbalance.


So what’s a concerned librarian to do? I think the wrong thing would be to denounce Wikipedia. For one, the alternative sources we or our patrons would turn to are no less problematic. Encyclopaedia Britannica has a well-documented history of supposedly scientific articles written from the dominant viewpoint (cf. the racist description of black people in the 1911 edition). What’s more, Wikipedia is going to be used. It’s massively popular. Sticking our heads in the sand because it doesn’t live up to its own standards of neutrality improves nothing.

Luckily, there are several Wikipedia projects focused on recruiting editors from underrepresented groups and addressing lackluster coverage of particular topics which we can support. One such project is Art + Feminism. In its own words:

Art+Feminism is a rhizomatic campaign to improve coverage of women and the arts on Wikipedia, and to encourage female editorship…Content is skewed by the lack of female participation. Many articles on notable women in history and art are absent on Wikipedia. This represents an alarming aporia in an increasingly important repository of shared knowledge.

Art+Feminism started out by hosting an edit-a-thon out of the Eyebeam Art and Technology Center in New York City in 2014, with more than 30 other locations worldwide joining in. An edit-a-thon is an event where people gather to perform Wikipedia edits, often centered around a particular theme or project. Higher education institutions and libraries make perfect partners for such occasions. We typically have useful materials to be cited, and our students form a large body of potential participants who can be easily incentivized to join in, whether with extra credit or simply some food. So when my library heard of the upcoming 2nd annual edit-a-thon, we immediately began planning to host it. Below, I’ll briefly outline what we did in the hopes it’ll encourage other institutions to join us during this year’s edit-a-thon on March 5th.

Hosting an Edit-a-thon

First, we set up a meetup page on Wikipedia. If you’re unfamiliar with Wikipedia, creating a page like this isn’t a struggle. For one, you can simply copy the entire source markup of someone else’s meetup, then edit your specific details into that skeleton. For two, you can enable the experimental Visual Editor to make Wikipedia even easier to edit without learning wiki markup. The meetup page is an important place for putting up information like timing and directions, but is also a place for us to talk about the impact we made by showing how many editors attended and what articles we improved or created.

While we were putting initial details on our meetup page, we set about securing a location on the date of the edit-a-thon. We discovered that a gallery associated with our school had hosted the edit-a-thon the prior year, but they were unable to repeat it. Our school has campuses in both San Francisco and Oakland, but since Oakland had no other edit-a-thon locations we decided to host it there with the idea that SF locals already had an event nearby.

Our next steps aimed to make the event as easy for newcomers as possible. A staff member pulled relevant materials from our collection, so that researching would be simple and our rarer, more valuable resources might lend some of their information to Wikipedia. We also wanted experienced editors to be on hand during the event. I looked at a local Wikiproject, asking for help on the Talk page and then corresponding directly with a couple editors who had expressed interest. Finally, we managed to secure a visit from someone who actually works for the Wikimedia Foundation that runs the encyclopedia.

During the day, we set out snacks and name badges for everyone. Similar to the color-coded name badges at Ada Camp and other tech conferences, Art+Feminism recommended giving out name badges which signify one’s willingness to be photographed: green meant feel free to take a photo, orange meant please ask permission first, and red meant absolutely not. These steps ensure that everyone is comfortable at the event and not being exposed on the internet against their will. At the start of the day, we explained the coloring system and the Wikimedia staff person gave a short talk on how to write articles that endure, standing up to scrutiny over time. The results of the global event are listed on Wikipedia.

While I feel we accomplished much at our local event, there was one negative experience. An image uploaded to Wikimedia Commons for use on a page was flagged for deletion. The editor flagging it said something along the lines of “this isn’t your personal photo album” as the image was a headshot of a female artist. In the ensuing discussion around the proposed deletion, I noted that the image was about to be used on an article. It was never removed from Commons. Still, the incident underscores cultural problems in Wikipedia. The confrontational style of the discussion lacked good faith. Further, I heard a gendered undertone in the editor’s response; how many pictures of white men are derided as personal photos? While our library staff person was undeterred, moments of hostility like these drive away newcomers.

In Which I Admit I’m Missing the Point

To be fair, Wikipedia itself acknowledges that it fails to live up to neutral status. That the encyclopedia strives towards neutrality is the more vital point. But it’s not as if reaching a supposedly perfect neutrality resolves the issues that folks like #critlib are highlighting. Neutral as a positive value is precisely the problem, because there is no neutral stance that can be taken from outside of society’s power relations and history of inequity. So should we instead be agitating for Wikipedia to become less neutral and take more active stances on social issues? Or are edit-a-thons like Art+Feminism the most viable route towards ensuring topics are covered in a way that surfaces marginalized peoples and their experiences? I have no answers, just food for thought.

  1. See External peer review/Nature December 2005 for Wikipedia’s internal take on the correction process. The original Nature article is paywalled here and is doi:10.1038/438900a. A similar but open access study is doi:10.1371/journal.pone.0106930, Accuracy and Completeness of Drug Information in Wikipedia: A Comparison with Standard Textbooks of Pharmacology.
  2. WP:Clubhouse? An Exploration of Wikipedia’s Gender Imbalance. I’m skeptical of how “male” and “female” articles were defined, but the paper itself is thorough in its argumentation and statistical analysis. It’s also worth noting that this paper, and much of the other literature around the gender gap, ignores genders outside the female-male binary. It’s most useful to conceive of the gap as a disproportionate majority of males than a minority of females, while leaving all other genders out of the picture.

A Foray into Publishing Open Data on GitHub

While we’ve written about using GitHub for publishing before, in this post I will explore publishing data on GitHub, as opposed to a presentation or academic paper. There are a few services where one can publish research data—FigShare comes to mind—but I wanted to try GitHub because I’m already familiar with the service, it seems suitable for publishing data alongside the scripts used to obtain and process it, and its focus on version control makes it particularly apt for publishing a work in progress. However, even with free services like GitHub available, open data still has hurdles to overcome. How can I, a lowly librarian with no grant funding or experience in this area, publish an open data set such that others can locate and reuse it? Let’s find out.


As Lauren introduced in her last post, we here at ACRL TechConnect are performing research into coding in libraries: how people learn to code, what learning resources they use, and what languages they use. As part of this research, I wanted to compare what our survey respondents reported with a bulk analysis of GitHub repositories under library organizations. The Code4Lib wiki has an excellent page listing many library GitHub accounts, and GitHub has a nice API that reports, among many other things, the various languages used in a project. Those two sources of information seemed like a perfect match, so I wrote a few scripts to mash them together.
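To give a sense of what that mashing together looks like, here is a minimal sketch (not the actual scripts from this project) that sums up language usage across an organization’s public repositories using two GitHub API endpoints. The organization names are placeholders; the real list came from the Code4Lib wiki page mentioned above.

    import requests

    API = "https://api.github.com"
    ORGS = ["examplelibrary", "anotherlibrary"]  # hypothetical org names

    def org_language_totals(org):
        """Sum bytes of code per language across an organization's public repos."""
        totals = {}
        repos = requests.get("{}/orgs/{}/repos".format(API, org)).json()
        for repo in repos:
            langs = requests.get(
                "{}/repos/{}/{}/languages".format(API, org, repo["name"])).json()
            for language, bytes_of_code in langs.items():
                totals[language] = totals.get(language, 0) + bytes_of_code
        return totals

    for org in ORGS:
        print(org, org_language_totals(org))

Note that unauthenticated requests to the GitHub API are rate limited and the repository listing is paginated, so a real script needs an access token and a loop over result pages.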

Publishing scripts that extract and analyze data is important. One cannot trust the results of a single scientific experiment or a poor sample set. Providing the programs used to collect data aims to allow reproducibility, so future researchers can verify or build upon prior results. While we perhaps think of science as being quite established by now, data and reproducibility are major issues in most fields. Ask any data librarian and they’ll tell you: managing the preservation and distribution of research data is not a solved or simple problem. Furthermore, every so often another meta-research study will show that only some dismal percentage of experiments can be replicated.1

My own data is not so valuable. No cure for a debilitating disease rests on the number of bytes of Standard ML in your university’s GitHub account. But on principle I want my results to be repeatable and, what’s more, if someone does find an error in my scripts or data I want it corrected. Even if my initial conclusions are off, someone might be able to construct a stronger study on their basis.

Step #1: Obtain a DOI

As the first step of publishing my data, I wanted to obtain a Digital Object Identifier. Sure, putting my work up on GitHub gives me a URL I can reference, but leaving it at that adds a lot of uncertainty. What if I change my GitHub username, which is contained in the URL? What if I transfer the repository to a new owner? What if GitHub goes out of business? While none of these are likely scenarios, they’re still worth guarding against. DOI providers essentially stick to a pact that their identifiers will continue to work in perpetuity. While that’s not always the case, I feel like grabbing a DOI is still The Right Thing To Do for publishers at the present moment.

We can use Zenodo to secure a DOI. GitHub already has a fine guide named Making Your Code Citable, but I’ll lightly outline the process here.

First, we create a Zenodo account reusing our GitHub credentials. Zenodo will list out our repositories and we can click the On button next to one to ready it for publishing. This button establishes a “web hook” between events happening in that GitHub repository and Zenodo; when we go to publish a release, Zenodo will be aware of it.

This was the only step that tripped me up a bit. GitHub’s “releases” are not a part of the git version control system; they’re an added feature of the hosting environment. But in my mind they’re identical to git’s “tags” that one uses to label particular points in a repository’s history. Indeed, when we push a tag up to GitHub, it’ll show up on our repository’s releases page. But it appears tags are not technically releases, or don’t trigger the right web hook, because when I pushed a typical “v1.0.0” tag to GitHub, Zenodo didn’t notice. Instead, I had to go to my releases page, Draft a new release, and then Choose an existing tag to associate the version tag with a GitHub release. The title and description entered at this stage are available later in Zenodo.
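If you would rather script that step than click through the web interface, the same “Draft a new release” action can be performed with GitHub’s Releases API, which is what actually fires the webhook Zenodo listens for. This is only a sketch: the owner, repository, tag, and GITHUB_TOKEN environment variable below are placeholders, and the token needs permission to create releases on the repository.

    import json
    import os

    import requests

    owner, repo, tag = "example-user", "example-data-repo", "v1.0.0"
    response = requests.post(
        "https://api.github.com/repos/{}/{}/releases".format(owner, repo),
        headers={"Authorization": "token " + os.environ["GITHUB_TOKEN"]},
        data=json.dumps({
            "tag_name": tag,          # must already exist as a pushed git tag
            "name": "Version 1.0.0",  # release title, later visible in Zenodo
            "body": "First public release of the data and scripts.",
        }),
    )
    print(response.status_code)  # 201 means the release was created and the webhook fired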

The final step is back in Zenodo, where we can mint a DOI and describe our project further. We have a powerful set of fields for describing our project in Zenodo, including type (e.g. data set, software, presentation, publication), publication date, list of authors, open-ended description, list of keywords, access rights, license, funding agency, alternative identifiers (e.g. PubMed ID), and more. Zenodo also has a “communities” feature where we can deposit our research in a collection with a disciplinary focus; I put my data in the “Library and Information Science” group.

Step #2: Document the Data’s Schema

Obtaining a DOI is fine and all, but I also wanted to document my data more thoroughly. While it’s not a complicated data set, I’m familiar with the challenges that an unknown data schema presents for end users. All too often at work, I’m forced to revise data processing routines because a new outlier appears. There’ll be a string of text where I’m expecting only integers, a blank entry in what I thought was a required field, or an ID that doesn’t conform to the anticipated pattern (punctuation appears in a barcode! a random letter prefixes an otherwise numeric ID!).

To make our data’s structure clear, we can use the Data Package standard from the Open Knowledge Foundation (OKFN), specifically the Tabular Data Package subset which was designed for the CSV (comma-separated values) format.2 Documenting our data is straightforward with these standards; we place a “datapackage.json” file alongside our data files and fill in a few fields. Here’s an example:

    "name": "libs-github-api", // must be URL-friendly, e.g. no spaces 
    "description": "library GitHub projects",
    "license": "CC0 Public Domain",
    "keywords": ["libraries", "programming languages"],
    "resources": [ // list of files 
            "name": "summary",
            "path": "data/summary.csv", // UNIX-style path relative to datapackage.json 
            "format": "csv",
            "mediatype": "text/csv",
            "schema": { // outline of fields within this file 
                "fields": [
                        "name": "language",
                        "type": "string", /* from a controlled list of data types
                        could also be integer, number, date, etc. */
                        "description": "name of the programming language",
                        "constraints": {
                            "required": true,
                            "unique": true
                    // our schema would list a few more fields here… 

Note that the comments above aren’t valid JSON; I include them simply to provide some inline explanation.

While it requires a little reading to figure out how to fill out datapackage.json fields, many are self-evident. The appeal of the standard becomes evident in the schema section; we can tell consumers what types to expect from our data and other particularities of a given CSV column. Does a column contain empty values? Then required will be absent or explicitly set to false. Does a column contain both integers and text? Then the type “string” warns consumers not to anticipate only numeric values. What’s more, we can provide a regular expression in the pattern constraint which specifies exactly how a field may be formatted. Even strange barcodes with surprise punctuation could be documented precisely.
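As an illustration of how a consumer might act on those constraints, here is a short sketch that reads a datapackage.json, compiles the pattern constraint for a hypothetical “barcode” field (not one of the fields in the sample above), and flags CSV rows that don’t match. The file layout and field names are assumptions, not part of any official Data Package tooling.

    import csv
    import json
    import re

    with open("datapackage.json") as fh:
        package = json.load(fh)

    resource = package["resources"][0]
    fields = {field["name"]: field for field in resource["schema"]["fields"]}
    # hypothetical field carrying a "pattern" constraint, e.g. "^[0-9]{14}$"
    barcode_pattern = re.compile(fields["barcode"]["constraints"]["pattern"])

    with open(resource["path"]) as data_file:
        for row in csv.DictReader(data_file):
            if not barcode_pattern.match(row["barcode"]):
                print("Unexpected barcode format: {}".format(row["barcode"]))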

I would say there’s a lot more to the Data Package standards, but the truth is they’re elegant and concise. One can read all three (Data Package, Tabular Data Package, and JSON Table Schema) in a matter of minutes, look at an instructive example or two, and be ready to reveal their data’s structure in a standardized way. There is great depth available in the way one describes individual resources and their data schema.

Why spend all this time with a data package when we’ve already done something similar with Zenodo? The data package documentation solves a couple of problems. First of all, packaging up our CSV alongside structured data about its nature addresses findability. There’s tons of open data out there; the issue is that it can be scattered and difficult to find. If someone is looking for statistics on programming language usage, how would they go about finding my data? Searching GitHub will be challenging; the keywords one uses (“programming language”, the ambiguous “libraries”, etc.) will likely retrieve many repositories which don’t contain open data, and GitHub, while it does have a decent advanced search form, doesn’t have the facets to make retrieving a particular data set straightforward. One cannot, for instance, filter search results by a repository’s license or the format (CSV, JSON) of the data contained therein.

Data packages address the issue of findability by providing for the possibility of a registry that aggregates all the data sets it knows about. Once a datapackage.json appears, suddenly information like whether the format is CSV or JSON, what the license is, who created it, and what subject keywords are related to a repository become clear. The Open Knowledge Foundation already has a strong proof-of-concept registry, albeit one that lists only around a hundred data sets.

Since Data Package is an open standard, any third party can easily parse its metadata and provide search facets based on the fields that are present. This is how the standard addresses a second issue: machine readability. Documenting data sets is good, necessary even, but it often only helps humans. I can write a five-page paper meticulously detailing my data’s collection methodology and structure, but that’s asking researchers to do a lot of reading. Now consider that their research might be on a grand scale; imagine if they needed to read a hundred five-page papers describing ad hoc data schemas!

Instead, creating a machine readable description lets my data be processed quickly by a specially designed program. As a somewhat trivial example, I already used the OKFN’s Data Package Validator to ensure my schema documentation met their standard. As a more interesting use case, the OKFN also defines an optional “views” section of the data package standard which allows applications to automatically create charts from our data.


While I’m glad that tools like Zenodo and standards like Data Package exist for publishing data, there’s still a lot of work to be done in this arena. Every time I make a new release on GitHub, which arguably should happen with even minor changes to my data or scripts, I have to refill the extensive Zenodo form. Zenodo also doesn’t detect the GitHub repository’s license, which is hardly blameworthy given that that information isn’t present as structured data but mere text in a readme file. However, when publishing a new version of the same underlying data, it doesn’t fill in the license or other information from its own previous items.

There’s a ton of efficiency left on the table in the data publishing process this post describes. Specifically, an integration between Zenodo and the datapackage.json metadata would alleviate a number of problems. Rather than repeatedly filling out a form in Zenodo, one could simply ensure changes were reflected in the datapackage.json and publish a new version on GitHub. Many fields between the two are redundant, though each also has its unique value; Zenodo asks for typical academic publishing information (e.g. publication type, links to prior versions) while Data Package asks for a data schema.

As a final area of concern, the open-ended “license” field is going to eventually limit the utility of the machine readable information in a Data Package.3 Perhaps this is my inner librarian unnecessarily freaking out here, but uncontrolled fields which affect resource reuse are bad news. Defaulting to authors specifying an arbitrary string of text as a license is precisely the problem that the Digital Public Library of America and other large digital libraries are facing, as their corpuses contain thousands of different rights statements.4 Zenodo provides a substantial list of licenses to choose from, but then does a poor job of automatically detecting one even if hints are available via GitHub or a previous incarnation of the publication. GitHub itself should probably make licenses for repositories required and controlled as I could see that being a vital facet in their advanced search as well as interesting data to expose to researchers via their API.

  1. I read something about this a month or two ago, but wasn’t able to relocate the source. Scouring the web, there’s a Washington Post article from January on the phenomenon of irreproducible research, which in turn points to a PLoS Med article from 2005 “Why Most Published Research Findings Are False“. Other studies along these lines are “A Survey on Data Reproducibility in Cancer Research Provides Insights into Our Limited Ability to Translate Findings from the Laboratory to the Clinic” in PLoS ONE and “Drug development: Raise standards for preclinical cancer research” in Nature.
  2. I might have been able to explore another intriguing project, Research Objects, which has the apt tagline “enabling reproducible, transparent research.” However, the Data Package standards were so easy to find and follow conceptually that I chose them.
  3. And to be fair, I did see one example where the licenses JSON property was specified as an array of objects containing a license name and URL, which might be easier to consume in script depending on what’s available at the URL.
  4. Aside: I don’t mean to argue that arbitrary license strings should be prohibited, because no controlled vocabulary is going to enumerate all possible choices. But there’s a lot of good work being done to make licenses easier to specify—think of Creative Commons with their composable, versioned licenses which can be referred to by URL. Defaulting to a controlled list of license types or at least pointing to a preferred vocabulary would help here.

Where do Library Staff Learn About Programming? Some Preliminary Survey Results

[Editor’s Note:  This post is part of a series of posts related to ACRL TechConnect’s 2015 survey on Programming Languages, Frameworks, and Web Content Management Systems in Libraries.  The survey was distributed between January and March 2015 and received 265 responses.  A longer journal article with additional analysis is also forthcoming.  For a quick summary of the article below, check out this infographic.]

Our survey on programming languages in libraries has resulted in a mountain of fascinating data.  One of the goals of our survey was to better understand how staff in libraries learn about programming and develop their coding skills.  Based upon anecdotal evidence, we hypothesized that library staff members are often self-taught, learning through a combination of on-the-job learning and online tutorials.  Our findings indicate that respondents use a wide variety of sources to learn about programming, including MOOCs, online tutorials, Google searches, and colleagues.

Are programming skills gained by formal coursework, or in Library Science Master’s Programs?

We were interested in identifying sources of programming learning, whether that involved formal coursework as part of a degree or continuing education program, or Massive Open Online Courses (MOOCs).  Nearly two-thirds of respondents indicated they had an MLS or were working on one:

When asked about coursework taken in programming, application, or software development, results were mixed, with the most popular choice being 1-2 classes:

However, of those respondents who have taken a course in programming (about 80% of all respondents) AND indicated that they either had an MLS or were attending an MLS program, only about a third had taken any of those courses as part of a Master’s in Library Science program:

Resources for learning about programming

The final question of the survey asked respondents, in an open-ended way, to describe resources they use to learn about programming.  It was a pretty complex question:

Please list or describe any learning resources, discussion boards or forums, or other methods you use to learn about or develop your skills in programming, application development, or scripting. Please include links to online resources if available. Examples of resources include, but are not limited to: MOOC courses, local community/college/university course on programming, Books, Code4Lib listserv, Stack Overflow, etc.

Respondents gave, in many cases, incredibly detailed responses – and most respondents indicated a list of resources used.  After we coded the responses into 10 categories, some trends emerged.  The most popular resources for learning about programming, by far, were courses, whether those courses were taken formally in a classroom environment or online in a MOOC environment:

To better illustrate what each category entails, here are the top five resources in each category:

By far, the most commonly cited learning resource was Stack Overflow, followed by the Code4Lib listserv and books/ebooks (unspecified).  Results may skew a little toward these resources because they were mentioned as examples in the question, priming respondents to include them in their responses.  Since links to the survey were distributed, among other places, on the Code4Lib listserv, its prominence may also be influenced by response bias. One area that was a little surprising was the number of respondents who included social networks (including in-person networks like co-workers) as resources – indeed, respondents who mentioned colleagues as learning resources were particularly enthusiastic. As one respondent put it:

…co-workers are always very important learning resources, perhaps the most important!

Preliminary Analysis

While the data isn’t sufficient to draw any strong conclusions yet, a few thoughts come to mind:

  • About 3/4 of respondents indicated that programming was either part of their job description, or that they use programming or scripting as part of their work, even if it’s not expressly part of their job.  And yet, only about a third of respondents with an MLS (or in the process of getting one) took a programming class as part of their MLS program.  Programming is increasingly an essential skill for library work, and this survey seems to support the view that there should be more programming courses in library school curricula.
  • Obviously programming work is not monolithic – there’s lots of variation among those who do programming work that isn’t reflected in our survey, and this survey may have unintentionally excluded those who are hobby coders.  Most questions focused on programming used when performing work-related tasks, so additional research would be needed to identify learning strategies of enthusiast programmers who don’t have the opportunity to program as part of their job.
  • Respondents indicated that learning on the job is an important aspect of their work; they may not have time or institutional support for formal training or courses, and figure things out as they go along using forums like Stack Overflow and Code4Lib’s listserv.  As one respondent put it:

Codecademy got me started. Stack Overflow saves me hours of time and effort, on a regular basis, as it helps me with answers to specific, time-of-need questions, helping me do problem-based learning.

TL;DR?  Here’s an infographic:

In the next post, I’ll discuss some of the findings related to ways administration and supervisors support (or don’t support) programming work in libraries.

Removing the Truthiness from Google

A decade ago, Stephen Colbert introduced the concept of “truthiness”, or a fact that was so because it felt right “from the gut.” When we search for information online, we are always up against the risk that the creator of a page is someone who, like Stephen Colbert’s character, doesn’t trust books because “they’re all fact, no heart.”1 Since sites with questionable or outright false facts that “feel right” often end up at the top of Google search results, librarians teach students how to evaluate online sources for accuracy, relevancy, and so on rather than just trusting the top result. But what if there were a way to ensure that truthiness was removed, and only sites with true information appeared at the top of the results?

This idea is what underlies a new Google algorithm called Knowledge-Based Trust (KBT).2 Google’s original founding principles and the PageRank algorithm were based on academic citation practices–loosely summarized, pages linked to by a number of other pages are more likely to be useful than those with fewer links. The content of the page, while it needs to match the search query, is less crucial to its ranking than outside factors, which is otherwise known as an exogenous model. The KBT, by contrast, is an endogenous model relying on the actual content of the page. Ranking is based on the probability that the page is accurate, and therefore more trustworthy. This is designed to address the problem of sites with high PageRank scores that aren’t accurate, either because their truthiness quotient is high, or because they have gamed the system by scraping content and applying misleading SEO. On the other side, pages with great information that aren’t very popular may be buried.

“Wait a second,” you are now asking yourself, “Google now determines what is true?” The answer is: sort of, but of course it’s not as simple as that. Let’s look at the paper in detail, and then come back to the philosophical questions.

Digging Into the KBT

First, this paper is technical, but the basic information is fairly straightforward. The model is based on extracting facts from a web source, evaluating whether those facts are true, and then estimating whether the source is accurate; the two estimates inform each other in an iterative process. Of course, verifying that determination is essential to ensuring that all the algorithms are working correctly, and the paper describes ways of checking the extracted facts for accuracy.

The extractors are described more fully in an earlier version of this work, Knowledge Vault (KV), which was designed to fill in large-scale knowledge bases such as Freebase by extracting facts from web sources using techniques like Natural Language Processing of text files followed by machine learning, HTML DOM trees, HTML tables, and human-processed pages with metadata. The extractors themselves can perform poorly in creating these triples, however, and extraction errors are more common than the facts themselves being wrong, so sites may be unfairly flagged as inaccurate. The KBT project aims to introduce an algorithm for determining what type of error is present, as well as how to judge sites with many or few facts accurately, and lastly to test their assumptions using real-world data against known facts.

The specific example given in the paper is the birthplace of President Barack Obama. The extractor would determine a subject, predicate, object triple from a web source and match these strings to Freebase (for example). This can lead to a number of errors–there is a huge problem in computationally determining the truth even when the semantics are straightforward (which we all know they rarely are). For this example, it’s possible to check data from the web against the known value in Freebase, so if the extractor pulls the matching value we can record a 1 (for yes) and otherwise a 0 (for no). These results can then be charted in a two- or three-dimensional matrix that helps show the probability of a given extractor working, as well as whether the value pulled by the extractor was true or not.
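To make that bookkeeping concrete, here is a deliberately simplified sketch, not the KBT model itself, of the matching step described above: extracted triples are compared against a gold-standard store (standing in for Freebase) and each source gets a naive agreement rate. The sources and triples are invented, and the real model goes much further by also estimating whether a mismatch is the extractor’s fault rather than the source’s.

    # Toy illustration only (not the KBT algorithm): compare triples extracted
    # from each web source against known values and tally simple agreement rates.
    known_facts = {("Barack Obama", "born in"): "Honolulu"}  # stand-in for Freebase

    extractions = [  # (source, subject, predicate, object) -- invented examples
        ("site-a.example.com", "Barack Obama", "born in", "Honolulu"),
        ("site-b.example.com", "Barack Obama", "born in", "Kenya"),
        ("site-a.example.com", "Barack Obama", "born in", "Hawaii"),
    ]

    scores = {}
    for source, subject, predicate, obj in extractions:
        correct = known_facts.get((subject, predicate)) == obj  # 1 if it matches, 0 if not
        right, total = scores.get(source, (0, 0))
        scores[source] = (right + int(correct), total + 1)

    for source, (right, total) in scores.items():
        print(source, right / float(total))  # naive per-source agreement rate

The “Hawaii” row hints at why exact string matching is not enough, which is part of what makes the real problem so hard.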

They go on to examine two models for computing the data, single-layer and multi-layer. The single-layer model, which looks at each web source and its facts separately, is easier to work with using standard techniques but is limited because it doesn’t take into account extraction errors. The multi-layer model is more complex to analyze, but takes the extraction errors into account along with the truth errors. I am not qualified to comment on the algorithm math in detail, but essentially it computes probability of accuracy for each variable in turn, ultimately arriving at an equation that estimates how accurate a source is, weighted by the likelihood that source contains those facts. There are additional considerations for precision and recall, as well as confidence levels returned by extractors.

Lastly, they consider how to split up large sources to avoid computational bottlenecks, as well as to merge sources with few facts in order to not penalize them but not accidentally combine unrelated sources. Their experimental results determined that generally PageRank and KBT are orthogonal, but with a few outliers. In some cases, the site has a low PageRank but a high KBT. They manually verified the top three predicates with high extraction accuracy scores for web sources with a high KBT to check what was happening. 85% of these sources were trustworthy without extraction errors and with predicates related to the topic of the page, but only 23% of these sources had PageRank scores over 0.5. In other cases, sources had a low KBT but high PageRank, which included sites such as celebrity gossip sites and forums such as Yahoo Answers. Yes, indeed, Google computer scientists finally have definitive proof that Yahoo Answers tends to be inaccurate.

The conclusion of the article, with its suggested future improvements, reads like the learning outcomes for any basic information literacy workshop. First, the algorithm would need to be able to tell the main topic of a website and filter out unrelated facts, to understand which triples are trivial, to have better comprehension of what is a fact, and to correctly remove sites with data scraped from other sources. That said, for what it does, this is a much more sophisticated model than anything else out there, and it at least proves that there is a possibility of computationally determining the accuracy of a web source.

What is Truth, Anyway?

Despite the promise of this model there are clearly many potential problems, of which I’ll mention just a few. The source for this exercise, Freebase, is currently in read-only mode as its data migrates to Wikidata. Google is apparently dropping Freebase to focus on their Open Knowledge Graph, which is partially Freebase/Wikidata content and partially data 3. One interesting wrinkle is that much of Freebase content cites Wikipedia as a source, which means there are currently recursive citations that must be properly cited before they will be accepted as facts. We already know that Wikipedia suffers from a lack of diversity in contributors and topic coverage, so a focus on content from Wikipedia has the danger of reducing the sources of information from which the KBT could check triples.

That said, most human knowledge and understanding is difficult to fit into triples. While surely no one would search Google for “What is love?” or similar and expect to get a factual answer, there are plenty of less extreme examples that are unclear. For instance, how does this account for controversial topics, e.g. “anthropogenic global warming is real” vs. “global warming is real, but it’s not anthropogenic”? 97% of scientists agree with the former, but what if you are looking for what the 3% are saying?

And we might question whether it’s a good idea to trust an algorithm’s definition of what is true. As Bess Sadler and Chris Bourg remind us, algorithms are not neutral, and may ignore large parts of human experience, particularly from groups underrepresented in computer science and technology. Librarians should have a role in reducing that ignorance by supporting “inclusion, plurality, participation and transparency.”4 Given the limitations of what is available to the KBT, it seems unlikely that this algorithm would markedly reduce this inequity, though I could see how it could be possible if Wikidata could be seeded with more information about diverse groups.

Librarians, take note: this algorithm is still under development, and most likely won’t be appearing in our Google results any time in the near future. And even once it does, we need to ensure that we are still paying attention to nuance, edge cases, and our own sense of truthiness–and more importantly, truth–as we evaluate web sources.

  2. Dong, X. et al. “Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources”. Proceedings of the VLDB Endowment, 2015. Retrieved from
  4. Sadler, Bess and Chris Bourg, “Feminism and the Future of Library Discovery.” Code4Lib Journal 28, April 2015.

A Video on Browser Extensions

I thought we’d try something new on ACRL TechConnect, so I recorded a fifteen-minute video discussing general use cases for browser extensions and some specifics of Google Chrome extensions.

The video mentions my WikipeDPLA post on this blog and walks through some slides I presented at a Code4Lib Northern California event.

If you’re looking for another good extension example in libraryland, Stephen Schor of New York Public Library recently wrote extensions for Chrome and Firefox that improve the appearance and utility of the Library of Congress’ EAD documentation. The Chrome extension uses the same content script approach as my silly example in the video. It’s a good demonstration of how you can customize a site you don’t control using the power of browser add-ons.

Have you found a use case for browser add-ons at your library? Let us know in the comments!

This Is How I (Attempt To) Work

Editor’s Note: ACRL TechConnect blog will run a series of posts by our regular and guest authors about The Setup of our work. The first post is by TechConnect alum Becky Yoose.

Ever wondered how several of your beloved TechConnect authors and alumni manage to Get Stuff Done? In conjunction with The Setup, this is the first post in a series in which TechConnect authors, past and present, show off the tools, tips, and tricks they use for work.

I have been tagged by @nnschiller in his “This is how I work” post. Normally, I just hide when these chain-letter-type events come along, but this time I’ll indulge everyone and dust off my blogging skills. I’m Becky Yoose, Discovery and Integrated Systems Librarian, and this is how I work.

Location: Grinnell, Iowa, United States

Current Gig: Assistant Professor, Discovery and Integrated Systems Librarian; Grinnell College

Current Mobile Device: Samsung Galaxy Note 3, outfitted with an OtterBox Defender cover. I still mourn the discontinuation of the Droid sliding keyboard models, but the oversized screen and stylus make up for the lack of tactile typing.

Current Computer:

Work: HP EliteBook 8460p (due to be replaced in 2015); boots Windows 7

Home: Betty, my first build; dual boots Windows 7 and Ubuntu 14.04 LTS

eeepc 901, currently b0rked due to misjudgement on my part about appropriate xubuntu distros.

Current Tablet: iPad 2, supplied by work.

One word that best describes how you work:


Don’t panic. Nothing to see here. Move along.

What apps/software/tools can’t you live without?

Essential work computer software and tools, in no particular order:

  • Outlook – email and meetings make up the majority of my daily interactions with people at work and since campus is a Microsoft shop…
  • Notepad++ – my Swiss army knife for text-based duties: scripts, notes, and everything in between.
  • PuTTY – Great SSH/Telnet client for Windows.
  • Marcedit – I work with library metadata, so Marcedit is essential on any of my work machines.
  • MacroExpress and AutoIt – Two different Windows automation apps: MacroExpress handles more simple automation (opening programs, templating/constant data, simple workflows involving multiple programs) while AutoIt gives you more flexibility and control in the automation process, including programming local functions and more complex decision-making processes.
  • Rainmeter and Rainlendar – These two provide customized desktop skins that give you direct or quicker access to specific system information, functions, or in Rainlendar’s case, application data.
  • Pidgin – MPOW uses both LibraryH3lp and AIM for instant messaging services, and I use IRC to keep in touch with #libtechwomen and #code4lib channels. Being able to do all three in one app saves time and effort.
  • Jing – while the Snipping Tool in Windows 7 is great for taking screenshots for emails, Jing has proven to be useful for both basic screenshots and screencasts for troubleshooting systems issues with staff and library users. The ability to save screencasts on is also valuable when working with vendors in troubleshooting problems.
  • CCleaner – Not only does it empty your recycling bin and temporary files/caches, the various features available in one spot (program lists, registry fixes, startup program lists, etc.) make CCleaner an efficient way to do housekeeping on my machines.
  • Janetter (modified code for custom display of Twitter lists) – Twitter is my main information source for the library and technology fields. One feature I use extensively is the List feature, and Janetter’s plugin-friendly set up allows me to highly customize not only the display but what is displayed in the list feeds.
  • Firefox, including these plugins (not an exhaustive list):

For server apps, the main app (beyond PuTTY or vSphere) that I need is Nagios to monitor the library’s virtual Linux server farm. I also am partial to nano, vim, and apt.

As one of the very few tech people on staff, I need a reliable system to track and communicate technical issues with both library users and staff. Currently the Libraries is piggybacking on ITS’ ticketing system KBOX. Despite being fit into a somewhat inflexible existing structure, it has worked well for us, and since we don’t have to maintain the system, all the better!

Web services: The Old Reader, Gmail, Google Drive, Skype, Twitter. I still mourn the loss of Google Reader.

For physical items, my tea mug. And my hat.

What’s your workspace like?

Take a concrete box, place it in the dead center of the library, cut out a door in one side, place the door opening three feet from the elevator door, cool it to a consistent 63-65 degrees F., and you have my office. Spending 10+ hours a day during the week in this office means a bit of modding is in order:

  • Computer workstation set up: two HP LA2205wg 22 inch monitors (set to appropriate ergonomic distances on desk), laptop docking station, ergonomic keyboard/mouse stand, ergonomic chair. Key word is “ergonomic”. I can’t stress this enough with folks; I’ve seen friends develop RSIs on the job years ago and they still struggle with them today. Don’t go down that path if you can help it; it’s not pretty.
  • Light source: four lamps of varying size, all with GE Daylight 6500K 15 watt light bulbs. I can’t do the overhead lights due to headaches and migraines, so these lamps and bulbs help make an otherwise dark concrete box a little brighter.
  • Three cephalopods, a starfish, a duck, a moomin, and cats of various materials and sizes
  • Well stocked snack/emergency meal/tea corner to fuel said 10+ hour work days
  • Blankets, cardigans, shawls, and heating pads to deal with the cold

When I work at home during weekends, I end up in the kitchen with the laptop on the island, giving me the option to sit on the high chair or stand. Either way, I have a window to look at when I need a few seconds to think. (If my boss is reading this – I want my office window back.)

What’s your best time-saving trick?

Do it right the first time. If you can’t do it right the first time, then make the path to getting it right as efficient and painless as you possibly can. Alternatively, build a time machine to prevent those disastrous metadata and systems decisions made in the past that you’re dealing with now.

What’s your favorite to-do list manager?

Post it notes on a wall

The Big Picture from 2012

I have tried online to-do list managers, such as Trello; however, I have found that physical managers work best for me. In my office I have a to-do management system that comprises three types of lists:

  • The Big Picture List (2012 list pictured above)- four big post it sheets on my wall, labeled by season, divided by months in each sheet. Smaller post it notes are used to indicate which projects are going on in which months. This is a great way to get a quick visual as to what needs to be completed, what can be delayed, etc.
  • The Medium Picture List – a mounted whiteboard on the wall in front of my desk. Here specific projects are listed with one to three action items that need to be completed within a certain time, usually within one to two months.
  • The Small Picture List – written on discarded Choice review cards, the perfect size to quickly jot down things that need to be done either today or in the next few days.

Besides your phone and computer, what gadget can’t you live without?

My wrist watch, set five minutes fast. I feel self-conscious if I go out of the house without it.

What everyday thing are you better at than everyone else?

I’d like to think that I’m pretty good with adhering to Inbox Zero.

What are you currently reading?

The Practice of System and Network Administration, 2nd edition. Part curiosity, part wanting to improve my sysadmin responsibilities, part wanting to be able to communicate better with my IT colleagues.

What do you listen to while you work?

It depends on what I am working on. I have various stations on Pandora One and a selection of iTunes playlists to choose from depending on the task on hand. The choices range from medieval chant (for long form writing) to thrash metal (XML troubleshooting).

Realistically, though, the sounds I hear most are email notifications, the operation of the elevator that is three feet from my door, and the occasional TMI conversation between students who think the hallway where my office and the elevator are located is deserted.

Are you more of an introvert or an extrovert?

An introvert blessed/cursed with her parents’ social skills.

What’s your sleep routine like?

I turn into a pumpkin at around 8:30 pm, sometimes earlier. I wake up around 4:30 am most days, though I do cheat and not get out of bed until around 5:15 am, checking email, news feeds, and looking at my calendar to prepare for the coming day.

Fill in the blank: I’d love to see _________ answer these same questions.

You. Also, my cats.

What’s the best advice you’ve ever received?

Not advice per se, but life experience. There are many things one learns living on a farm, including responsibility, work ethic, and realistic optimism. You learn to integrate work and life since, on the farm, work is life. You work long hours, but you also have to rest whenever you can catch a moment. If nothing else, living on a farm teaches you that no matter how long you put off doing something, it has to be done. The earlier, the better, especially when it comes to shoveling manure.

Hacking in Python with PyMARC

marc icon with an arrow pointing to a spreadsheet icon

The pymarc Python library is an extremely handy tool for writing scripts to read, write, edit, and parse MARC records. I was first introduced to pymarc through Heidi Frank’s excellent Code4Lib Journal article on using pymarc in conjunction with MarcEdit.1 Here at ACRL TechConnect, Becky Yoose has also written about some cool applications of pymarc.

In this post, I’m going to walk through a couple of applications of pymarc: creating delimited files (.csv or tab-delimited .txt) for batch loading into DSpace and for the KBART format (e.g., for OCLC Knowledge Base purposes). I know what you’re thinking – why write a script when I can use MarcEdit for that? Among MarcEdit’s many indispensable features is its Export Tab-Delimited Data feature.2 But there might be a couple of use cases where scripts come in handy:

  • Your MARC records lack data that you need in the tab-delimited or .csv output – for example, a field used for processing in whatever system you’re loading the data into that isn’t found in the MARC record source.
  • Your delimited file output requires things like empty fields or parsed, modified fields. MarcEdit is great for exporting whole MARC fields and subfields into delimited files, but you may need to pull apart a MARC element or parse out a MARC element with some additional data. pymarc is great for this.
  • You have to process a lot of records frequently and just want something tailor-made for your use case. While you can certainly use MarcEdit to save a frequently used export template, editing that saved template when you need to change the output parameters of your delimited file isn’t easy. With a Python script, you can easily edit your script if your delimited file requirements change.


Some General Notes about Python

Python is a useful scripting language to know for a lot of reasons, and for small scripting projects it can have a fairly shallow learning curve.  In order to start using Python, you’ll need to set up your development environment.  Macs already come with Python installed; to double-check which version you have, launch Terminal and type python --version.  In a Windows environment, you’ll need to do a few more steps, detailed here.  Full Python documentation for 2.x can be found here (though personally, I find it a little dense at times!), and Codecademy features some pretty extensive tutorials to help you learn the syntax quickly.  Google also has some extensive tutorials on Python.  Personally, I find it easiest to learn when I have a project that actually means something to me, so if you’re familiar with MARC records, just downloading the Python scripts mentioned below and learning how to run them can be a good start.

Spacing and indentation in Python are very important, so if you’re getting errors running scripts, make sure that your conditional statements are indented properly.3 Also, the version of Python on your machine makes a really big difference and will determine whether your code runs successfully.  These examples have all been tested with Python 2.6 and 2.7, but not with Python 3 or above.  I find that Python 2.x has more help resources out on the web, which is nice to be able to take advantage of when you’re first starting out.
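
For example, here is a purely illustrative snippet (not taken from either script below) showing how indentation defines a block in Python 2:

# hypothetical example: the indented lines only run when their condition is true;
# a misaligned or inconsistent indent will raise an IndentationError
record_count = 0
if record_count == 0:
    print 'No records processed yet'
else:
    print 'Processed %d records' % record_count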

Use Case 1:  MARC to KBART

The complete script described below, along with some sample MARC records, is on GitHub.

The KBART format is a NISO/United Kingdom Serials Group (UKSG) initiative designed to standardize information for use with knowledge base systems.  It consists of a series of standardized fields that can be used to identify serial coverage (e.g., start date and end date), URLs, publisher information, and more.4  Notably, it’s the required format for adding and modifying customized collections in OCLC’s Knowledge Base.5

In this use case, I’m using OCLC’s modified KBART file – most of the fields are KBART standard, but a few fields are specific to OCLC’s Knowledge Base, e.g., oclc_collection_name.6  Typically, these KBART files would be created either manually or generated with some VLOOKUP Excel magic from an existing KBART file.7  In my case, I needed to batch-migrate a bunch of data stored in MARC records to the OCLC Knowledge Base, and these particular records didn’t always correspond neatly to existing OCLC Knowledge Base collections.  For example, one collection we wanted to manage in the OCLC Knowledge Base was the library’s print holdings, so that these titles would be displayed in OCLC’s A-Z journal lookup tool.

First, I’ll need a few helper libraries for my script:

import csv
from pymarc import MARCReader
from os import listdir
from re import search

Then, I’ll declare the source directory where my MARC records are (in this case, a folder called marc in the same directory where the script lives) and instruct the script to go find all the .mrc files in that directory.  Note that this enables the script to process either a single file of multiple MARC records or multiple distinct files, all at once.

# change the source directory to whatever directory your .mrc files are in
SRC_DIR = 'marc/'

The script identifies MARC records by looking for all files ending in .mrc using the Python filter function.  Within the filter function, lambda creates a one-off anonymous function to define the filter parameters:8

# get a list of all .mrc files in source directory 
file_list = filter(lambda x: search('.mrc', x), listdir(SRC_DIR))

Now I’ll define the output parameters.  I need a tab-delimited file, not a comma-delimited file, but either way I’ll use the csv.writer function to create a .txt file and define the tab delimiter (\t).  I also don’t really want the fields quoted unless there’s a special character or something that might cause a problem reading the file, so I’ll set quoting to minimal:

#create tab delimited text file that quotes if special characters are present
csv_out = csv.writer(open('kbart.txt', 'w'), delimiter = '\t', 
quotechar = '"', quoting = csv.QUOTE_MINIMAL)

And I’ll also create the header row, which includes the KBART fields required by the OCLC Knowledge Base:

#create the header row
csv_out.writerow(['publication_title', 'print_identifier', 
'online_identifier', 'date_first_issue_online', 
'num_first_vol_online', 'num_first_issue_online', 
'date_last_issue_online', 'num_last_vol_online', 
'num_last_issue_online', 'title_url', 'first_author', 
'title_id', 'coverage_depth', 'coverage_notes', 
'publisher_name', 'location', 'title_notes', 'oclc_collection_name', 
'oclc_collection_id', 'oclc_entry_id', 'oclc_linkscheme', 
'oclc_number', 'action'])

Next, I’ll start a loop for writing a row into the tab-delimited text file for every MARC record found.

for item in file_list:
 fd = open(SRC_DIR + item, 'rb')
 reader = MARCReader(fd)

To start, we’ll set each field’s value to an empty string (''), so that errors are avoided if a value is not present in the MARC record.  We can set all the values to blank in one statement:

for record in reader:
    publication_title = print_identifier = online_identifier \
    = date_first_issue_online = num_first_vol_online \
    = num_first_issue_online = date_last_issue_online \
    = num_last_vol_online = num_last_issue_online \
    = title_url = first_author = title_id = coverage_depth \
    = coverage_notes = publisher_name = location = title_notes \
    = oclc_collection_name = oclc_collection_id = oclc_entry_id \
    = oclc_linkscheme = oclc_number = action = ''

Now I can start pulling in data from MARC fields.  At its simplest, for each field defined in my CSV, I can pull data out of MARC fields using a construction like this:

    if record['856'] is not None:
      title_url = record['856']['u']

I can use generic Python string-parsing tools to clean up fields.  For example, if I need to strip the ending slash out of a MARC 245$a field, I can use .rsplit to return everything before that final slash (/):

    # publication_title
    if record['245'] is not None:
      publication_title = record['245']['a'].rsplit('/', 1)[0]
      if record['245']['b'] is not None:
          publication_title = publication_title + " " + record['245']['b']

Also note that once you’ve declared a variable value (like publication_title) you can re-use that variable name to add more onto the string, as I’ve done with the 245$b subtitle above.

As is usually the case for MARC records, the trickiest business has to do with parsing out serial ranges.  The KBART format is really designed to capture, as accurately as possible, thresholds for beginning and ending dates.  MARC makes this…complicated.  Many libraries might use the 866 summary field to establish summary ranges, but there are varying local practices that determine what these might look like.  In my case, the minimal information I was looking for included beginning and ending years, and the records I was processing with the script stored summary information fairly cleanly in the 863 and 866 fields:

    # date_first_issue_online
    if record['863'] is not None:
      date_first_issue_online = record['863']['a'].rsplit('-', 1)[0]
    # date_last_issue_online
    if record['866'] is not None:
      date_last_issue_online = record['866']['a'].rsplit('-', 1)[-1]
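
Though it isn’t reproduced in the excerpts above, each pass through the loop needs to finish by writing the collected values out as a single row.  A minimal sketch of that final step (with the field order matching the header row created earlier) might look something like this:

    # write one row per MARC record, in the same order as the header row
    csv_out.writerow([publication_title, print_identifier, online_identifier,
    date_first_issue_online, num_first_vol_online, num_first_issue_online,
    date_last_issue_online, num_last_vol_online, num_last_issue_online,
    title_url, first_author, title_id, coverage_depth, coverage_notes,
    publisher_name, location, title_notes, oclc_collection_name,
    oclc_collection_id, oclc_entry_id, oclc_linkscheme, oclc_number, action])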

Where further adjustments were necessary, and I couldn’t easily account for the edge cases programmatically, I imported the KBART file into OpenRefine for further cleanup.  A sample KBART file created by this script is available here.

Use Case 2:  MARC to DSpace Batch Upload

Again, the complete script for transforming MARC records for DSpace ingest is on GitHub.

This use case is very similar to the KBART scenario.  We use DSpace as our institutional repository, and we had about 2,000 historical thesis and dissertation MARC records to transform and ingest into DSpace.  DSpace facilitates batch-uploading metadata as CSV files according to the RFC 4180 format, which basically means all the field elements use double quotes.9  While the fields being pulled from are different, the structure of the script is almost exactly the same.

When we first define the CSV output, we need to ensure that quoting is used for all elements – csv.QUOTE_ALL:

csv_out = csv.writer(open('output/theses.txt', 'w'), delimiter = '\t', 
quotechar = '"', quoting = csv.QUOTE_ALL)

The other nice thing about using Python for this transformation is the ability to program in static text that is the same for every line in the CSV file.  For example, I put together a friendlier display of the department the thesis was submitted to like so:

# dc.contributor.department
if record['690'] is not None and record['690']['x'] is not None:
    dccontributordepartment = 'UniversityName.  Department of ' + \
        record['690']['x'].rsplit('.', 1)[0]
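
Here is a rough, hypothetical sketch of how that static text fits into the record loop, assuming a MARCReader loop and the csv_out defined above; the column list and MARC fields are placeholders for illustration, and the real ones are in the script on GitHub:

# hypothetical header row, for illustration only
csv_out.writerow(['dc.title', 'dc.contributor.department'])

for record in reader:
    dctitle = dccontributordepartment = ''
    if record['245'] is not None:
        dctitle = record['245']['a']
    # dc.contributor.department combines static text with a parsed 690$x
    if record['690'] is not None and record['690']['x'] is not None:
        dccontributordepartment = 'UniversityName.  Department of ' + \
            record['690']['x'].rsplit('.', 1)[0]
    # every value is quoted in the output because of csv.QUOTE_ALL above
    csv_out.writerow([dctitle, dccontributordepartment])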

You can view a sample output file created by this script here on GitHub.

Other PyMARC Projects

There are lots of other, potentially more interesting things that can be done with pymarc, although certainly one of its most common applications is converting MARC to more flexible formats, such as MARCXML.10  If you’re interested in learning more, join the pymarc Google Group, where you can often get help from the original pymarc developers (Ed Summers, Mark Matienzo, Gabriel Farrell, and Geoffrey Spear).
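
As a quick, minimal sketch of that conversion, pymarc’s record_to_xml function does the work; the file path here is a hypothetical example:

from pymarc import MARCReader, record_to_xml

# read a file of MARC records and print each one as MARCXML
fd = open('marc/sample.mrc', 'rb')   # hypothetical file name
reader = MARCReader(fd)
for record in reader:
    print record_to_xml(record)
fd.close()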

  1. Frank, Heidi. “Augmenting the Cataloger’s Bag of Tricks: Using MarcEdit, Python, and pymarc for Batch-Processing MARC Records Generated from the Archivists’ Toolkit.” Code4Lib Journal, 20 (2013):
  3. For the specifications on indentation, take a look at
  6. A sample blank KBART Excel sheet for OCLC’s Knowledge Base can be found here: KBART files must be saved as tab-delimited .txt files prior to upload, but are obviously more easily edited manually in Excel.
  7. If you happen to be in need of this for OCLC’s KB, or are in need of comparing two Excel files, here’s a walkthrough of how to use VLOOKUP to select owned titles from a large OCLC KBART file:
  8. A nice tutorial on using lambda and filter together can be found here:

Migrating to LibGuides 2.0

This summer Springshare released LibGuides 2.0, which is a complete revamp of the LibGuides system. Many libraries use LibGuides, either as course/research guides or in some cases as the entire library website, and so this is something that’s been on the mind of many librarians this summer, whichever side of LibGuides they usually see. The process of migrating is not too difficult, but the choices you make in planning the new interface can be challenging. As the librarians responsible for the migration, we will discuss our experience of planning and implementing the new LibGuides platform.

Making the Decision to Migrate

While migrating this summer was optional, Springshare will probably only support LibGuides 1 for another two years, and at Loyola we felt it was better to move sooner rather than later. Over the past few years there had been perpetual LibGuides cleanup projects, and this seemed like a good opportunity to finalize that work. At the same time, we wanted to experiment with new designs for the library’s website that would bring it into closer alignment with the university’s new brand as well as make the site responsive, and LibGuides seemed like the ideal place to experiment with some of those ideas. Several new features, revealed on Springshare’s blog, resonated with subject-area specialists, which was another reason to push for migrating now. We also wanted to have the new platform in place before the first day of classes, which gave us a few months to experiment.

The Reference and Electronic Resources Librarian, Will Kent, the Head of Reference, Niamh McGuigan, and the Digital Services Librarian, Margaret Heller, worked in concert to make decisions, and invited all the other reference and instruction librarians (as well as anyone else who was interested) to participate in the process. There were a few ground rules the core team went by, however: we were definitely migrating, and the process would be iterative, i.e. we weren’t waiting for perfection to launch.

Planning the Migration

During the migration planning process, the small team of three librarians worked together to create a timeline, report to the library staff on progress, solicit feedback on the system, and update the LibGuides policies to reflect the new changes and functions. On the front end, we held large staff-wide meetings, provided updates, polled subject specialists on progress, prepared our 400 databases for conversion to the new A-Z list, demonstrated new features, and flagged changes that staff should be aware of. We relayed updates from Springshare and handled any troubleshooting questions as they came up.

Given the new features – new categories, new ways of searching, the A-Z database list, and more – it was important for us to sit down, discuss standards, and update our content policies. The good news was that most of our content was in good shape for the migration. The process was swift and, barring a few inevitable tiny bugs, went smoothly.

Our original timeline was to present the migration steps at our June monthly joint meeting of collections and reference staff, and then give ourselves one month, until the July meeting, to complete the work. For various reasons this ended up stretching until mid-August, but we still launched the day before classes began. We are constantly in the process of updating guide types, adding new resources, and re-classifying boxes to adhere to our new policies.

Working on the Design

LibGuides 2.0 provides two basic templates: a left navigation menu and a top tabbed menu that looks similar to the original LibGuides (additional templates are available with the LibGuides CMS product). We had originally discussed using the left navigation template and began a design based on it, but ultimately people felt more comfortable with the tabbed navigation.

Whiteboard sketch of the LibGuides UI

For the initial prototype, Margaret worked off a template that we’d used before for Omeka, which mirrors the Loyola University Chicago template very closely. We kept the standard LibGuides layout (i.e., 1-3 columns, with the number of columns and sections within each column determined by the page creator), but added a few additional pieces in the header and footer, as well as making big changes to the tabs.

The first step in planning the design was to understand which customizations happen in the template and which happen in the header and footer, which are entered separately in the admin UI. Margaret sketched out our vision for the site on the whiteboard wall to determine the existing selectors and those that would need to be added, as well as to get a sense of whether we would need to change the content section at all. In the interests of completing the project in a timely fashion, we determined that the bare minimum of customization needed to unify the research guides with the rest of the university websites would be the first priority.

For those still planning a redesign, the Code4Lib community has many suggestions on what to consider. The main thing to know is that LibGuides 2.0 is based on the Bootstrap 3.0 framework, which Michael Schofield recently implored us to use responsibly. Other important considerations are the accessibility of the solution you pick and the use of white space.

The Look & Feel section under ‘Admin’ has several tabs with sections for Header and Footer, Custom CSS/JS, and the layout of pages – Guide Pages Layout is the most relevant for this post.

Just as in the previous version of LibGuides, one can enter custom code for the header and footer (which in this case is almost the same as the regular library website), as well as link to a custom CSS file (we did not include any custom JavaScript here, but did include several Google Fonts and our custom icon). The Guide Pages Layout is new, and this is where one can edit the actual template that creates each page. We didn’t make any large changes here, but were still able to achieve a unique look with custom CSS.

The new LibGuides platform is responsive, but we needed to account for several items we added to the interface. We added a search box that would allow users to search the entire university website, as well as several new logos, so Margaret added a few media queries to adjust these features on a phone or tablet, as well as adjust the spacing of the custom footer.

Improving the Design

Our first design was ready to present to the subject librarians a month after the migration process started. It was based on the principle of matching the university website’s pages closely (example), in which the navigation tabs across the top have unusual cutouts and the section titles are very large. No one was very happy with this result, however, as it made the typical LibGuides layout with multiple sections on a page unusable and left the tabs insufficiently visible. While one approach would have been to change to the left navigation menu and limit the number of sections, the majority of the subject librarians preferred to keep things closer to what they had been, with a view to moving toward a potential new layout in the future.

Once we determined that a literal interpretation of the university website was not usable for our content, we found inspiration for the template body in another section of the university website that presents a lot of dynamic content with multiple sections but keeps the standard header. This allowed us to create a page that was recognizably part of Loyola but presented our LibGuides content in a much more usable form.

Sticky Tabs


The other piece we borrowed from the university website was sticky tabs. This was an attempt to make the tabs more visible and usable based on what we knew from usability testing on the old platform and what users would already know from the university site. Because LibGuides is based on the Bootstrap framework, it was easy to drop this in using the Affix plugin (tutorial on how to use this)1. The tabs are translucent so they don’t obscure content as one scrolls down.

Our final result was much more popular with everyone. It has a subtle background color and border around each box with a section header that stands out but doesn’t overwhelm the content. The tabs are not at all like traditional LibGuides tabs, functioning somewhat more like regular header links.


Final result.

Next Steps

Over the summer we were not able to conduct usability testing on the new interface due to the tight timeline, so the first step this fall is to integrate it into our regular usability testing schedule to make iterative changes based on user feedback. We also need to continue to audit the page to improve accessibility.

The research guides are one of the most used links on our website (anywhere between 10,000 and 20,000 visits per month), so our top priority was to make sure the migration did not interfere with use – both in terms of patron access and content creation by the subject-area librarians. Thanks to our feedback sessions, good communication with Springshare, and a reliable new platform, the migration went smoothly and without interruption.

About our guest author: Will Kent is Reference/Instruction and Electronic Resources Librarian and subject specialist for Nursing and Chemistry at Loyola University Chicago. He received his MSLIS from University of Illinois Urbana-Champaign in 2011 with a certificate in Community Informatics.

  1. You may remember that in the Bootstrap Responsibly post Michael suggested it wasn’t necessary to use this plugin, but it is the most straightforward way to do it in LibGuides 2.0.

Using the Stripe API to Collect Library Fines by Accepting Online Payment

Recently, my library has been considering accepting library fine payments online. Currently, small fines owed by many patrons are hard to collect. Added together, the amount is significant, but each individual fine often does not warrant even the cost of the postage and the staff work that goes into creating and sending out a fine notice letter. Libraries that can collect fines through the bursar’s office of their parent institution may have a better chance at collecting those fines. Others, however, can only expect patrons to show up with, or mail, a check to clear their fines. Offering an online payment option for library fines is one way to make library service more user-friendly for patrons who are too busy to visit the library in person or to mail a check but are willing to pay online with their credit cards.

If you are new to the world of online payment, there are several terms you need to become familiar with. The following information from an article in Six Revisions is very useful for understanding those terms.1

  • ACH (Automated Clearing House) payments: Electronic credit and debit transfers. Most payment solutions use ACH to send money (minus fees) to their customers.
  • Merchant Account: A bank account that allows a customer to receive payments through credit or debit cards. Merchant providers are required to obey regulations established by card associations. Many processors act as both the merchant account as well as the payment gateway.
  • Payment Gateway: The middleman between the merchant and their sponsoring bank. It allows merchants to securely pass credit card information between the customer and the merchant and also between merchant and the payment processor.
  • Payment Processor: A company that a merchant uses to handle credit card transactions. Payment processors implement anti-fraud measures to ensure that both the front-facing customer and the merchant are protected.
  • PCI (the Payment Card Industry) Compliance: A merchant or payment gateway must set up their payment environment in a way that meets the Payment Card Industry Data Security Standard (PCI DSS).

Often, the same company functions as both payment gateway and payment processor, thereby processing the credit card payment securely. Such a product is called an ‘online payment system.’ Meyer’s article cited above also lists 10 popular online payment systems: Stripe, Authorize.Net, PayPal, Google Checkout, Amazon Payments, Dwolla, Braintree, Samurai by FeeFighters, WePay, and 2Checkout. Bear in mind that different payment gateways, merchant accounts, and bank accounts may or may not work together, that your bank may or may not work as a merchant account, and that your library may or may not have a merchant account.2

Also note that there are fees for using online payment systems like these, and different systems have different fee structures. For example, one popular system charges a $99 setup fee and then $20 per month plus a $0.10 per-transaction fee. Stripe charges 2.9% + $0.30 per transaction with no setup or monthly fees. Fees for mobile payment solutions with a physical card reader, such as Square, may go up much higher.

Among the various online payment systems, I picked Stripe because it was recommended on the Code4Lib listserv. One of the advantages of using Stripe is that it acts as both the payment gateway and the merchant account. What this means is that your library does not have to have a merchant account to accept payment online. Another big advantage of using Stripe is that you do not have to worry about the PCI compliance part of your website, because the Stripe API uses a clever way to send the sensitive credit card information over to the Stripe server while keeping your local server, on which your payment form sits, completely blind to that sensitive data. I will explain this in more detail later in this post.

Below I will share some of the code that I have used to set up Stripe as my library’s online payment option for testing. This may be of interest to you if you are thinking about offering online payment as an option for your patrons or if you are simply interested in how an online payment API works. Even if your library doesn’t need to collect fines online, an online payment option can be a handy tool for a small-scale fundraising drive or donations.

The first step to making Stripe work is getting API keys. You do not have to create an account to get API keys for testing, but if you are going to work on your code for more than one day, it’s probably worth getting an account. The Stripe API has excellent documentation. I read the ‘Getting Started’ section and then jumped over to the ‘Examples’ section, which can quickly get you off the ground. I found an example by Daniel Schröter on GitHub from the list in Stripe’s Examples section and decided to test it out. Most of the time, getting example code running requires some probing and tweaking, such as getting all the required libraries downloaded, sorting out the paths in the code, and adding API keys. This one required relatively little work.

Now, let’s take a look at the form that this code creates.


In order to create a form of my own for testing, I decided to change a few things in the code.

  1. Add Patron & Payment Details.
  2. Allow custom amount for payment.
  3. Change the currency from Euro to US dollars.
  4. Configure the validation for new fields.
  5. Hide the payment form once the charge goes through instead of showing the payment form below the payment success message.


Item 4 can be done as follows. The client-side validation is performed by the BootstrapValidator jQuery plugin, so you need to get the syntax correct for the code, which now has new fields, to work properly.

This is the JavaScript that allows you to send the data submitted to your payment form to the Stripe server. First, include the Stripe.js library (line 24). Then include jQuery, Bootstrap, the Bootstrap Form Helpers plugin, and the Bootstrap Validator plugin (lines 25-28). The next block of code includes an event handler for the form, which sends the payment information to Stripe via AJAX when the form is submitted. Stripe will validate the payment information and then return a token that identifies this particular transaction.


When the token is received, this code calls the stripeResponseHandler() function, which checks whether the Stripe server returned any error upon receiving the payment information and, if no error has been returned, attaches the token information to the form and submits the form.


The server-side PHP script then checks whether the Stripe token has been received and, if so, creates a charge and sends it to Stripe, as shown below. I am using PHP here, but the Stripe API supports many languages other than PHP, such as Ruby and Python, so you have many options. The real payment amount appears here as part of the charge array in line 326. If the charge succeeds, a payment success message is stored in a div to be displayed.
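
My example uses PHP, but since Stripe also provides an official Python library, here is a rough, hypothetical sketch of the same server-side charge in Python – the key, amount, and description below are placeholders, not values from my actual form:

import stripe

stripe.api_key = 'sk_test_replace_with_your_test_secret_key'  # placeholder key

def charge_fine(token, amount_in_cents):
    # 'token' is the one-time token that Stripe.js returned to the browser
    # and that the payment form posted back to this server
    try:
        charge = stripe.Charge.create(
            amount=amount_in_cents,       # e.g. 500 for a $5.00 fine
            currency='usd',
            source=token,
            description='Library fine payment'  # placeholder description
        )
        return charge
    except stripe.error.CardError:
        # the card was declined; display an error instead of the success message
        return None

Either way, the server only ever handles the token, never the card number itself, which is what keeps the PCI situation simple, as explained next.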


The reason why you do not have to worry about PCI compliance with Stripe is that the Stripe API receives the payment information via AJAX, and the input fields for sensitive information do not have name attributes or values. (See below for the Card Holder Name and Card Number fields as an example.) By omitting the name attribute and value, the local server where the online form sits has no means of retrieving the information submitted in those input fields. Since the sensitive information never touches the local server at all, PCI compliance for the local server becomes a non-issue. To clarify, not all fields in the payment form need to be deprived of the name attribute – only the sensitive fields that you do not want your web server to have access to need to be protected this way. Here, for example, I am assigning the name attribute and value to fields such as name and e-mail in order to use them later to send an e-mail receipt.



Now, the modified form has ‘Fee Category’, custom ‘Payment Amount,’ and some other information relevant to the billing purpose of my library.


When the payment succeeds, the page changes to display the following message.


Stripe provides a number of fake card numbers for testing, so you can test various cases of failure. The Stripe website also displays all payments and the tokens and charges associated with those payments, which greatly helps with troubleshooting. One thing I noticed while troubleshooting is that the Stripe logs sometimes lag behind: when a payment succeeds, the associated token and charge may not appear under the “Logs” section immediately. The payment itself does show up in the log, however, so you know the associated token and charge will eventually appear there as well.


Once you are ready to process real payment transactions, you need to flip the switch from TEST to LIVE located in the top left corner. You will also need to replace your ‘TESTING’ API keys (both secret and public) with those for ‘LIVE’ transactions. One more thing that is needed before your library starts getting paid with real money online is setting up SSL (Secure Sockets Layer) for your live online payment page. This is not required for testing but is necessary for processing live payment transactions. It is not very complicated work, so don’t be discouraged at this point. You just have to buy a security certificate and put it on your web server. Speak to your system administrator about how to get SSL set up for your payment page. More information about setting up SSL can be found in the Stripe documentation linked above.

My library has not yet gone live with this online payment option. Before we do, I may make some more modifications to the code to better fit the staff workflow, which is still being mapped out. I am also planning to place the online payment page behind the university’s Shibboleth authentication, both to cut down on spam and to save library patrons some tedious data entry by pulling information such as name, university email, and student/faculty/staff ID number directly from the campus directory exposed through Shibboleth and automatically inserting it into the payment form fields.

In this post, I have described my experience of testing out the Stripe API as an online payment solution. As I mentioned above, however, there are many other online payment systems out there. Depending on your library’s environment and financial setup, different solutions may work better than others. To me, not having to worry about PCI compliance by using Stripe was a big plus. If your library accepts online payment, please share what solution you chose and what factors led you to that particular online payment system in the comments.

* This post has been based upon my recent presentation, “Accepting Online Payment for Your Library and ‘Stripe’ as an Example”, given at the Code4Lib DC Unconference. Slides are available at the link above.

  1. Meyer, Rosston. “10 Excellent Online Payment Systems.” Six Revisions, May 15, 2012.
  2. Ullman, Larry. “Introduction to Stripe.” Larry Ullman, October 10, 2012.