A Reflection on Code4Lib 2016

See also: Margaret’s reflections on Code4Lib 2013 and recap of the 2012 keynote.

 


About a month ago was the 2016 Code4Lib conference in sunny Philadelphia. I’ve only been to a few Code4Lib conferences, starting with Raleigh in 2014, but it’s quickly become my favorite libraryland conference. This won’t be a comprehensive recap but a little taste of what makes the event so special.

Appetizers: Preconferences

One of the best things about Code4Lib is the affordable preconferences. It’s often a pittance to add on a preconference or two, extending your conference for a whole day. Not only that, there’s typically a wealth of options: the 2015 conference boasted fifteen preconferences to choose from, and Philadelphia somehow managed to top that with an astonishing twenty-four choices. Not only are they numerous, the preconferences vary widely in their topics and goals. There’s always intensely practical ones focused on bootstrapping people new to a particular framework, programming language, or piece of software (e.g. Railsbridge, workshops focused on Blacklight or Hydra). But there are also events for practicing your presentation or the aptly named “Getting Ready for Workshops” Workshop. One of my personal favorite ideas—though I must admit I’ve never attended—is the perennial “Fail4Lib” sessions where attendees examine their projects that haven’t succeeded and discuss what they’ve learned.

This year, I wanted to run a preconference of my own. I enjoy teaching, but I rarely get to do it in my current position. Previously, in a more generalist technologist position, I would teach information literacy alongside the other librarians. But as a Systems Librarian, it can sometimes feel like I rarely get out from behind my terminal. A preconference was an appealing chance to teach information professionals on a topic that I’ve accumulated some expertise in. So I worked with Coral Sheldon-Hess to put together a workshop focused on the fundamentals of the command line: what it is, how to use it, and some of the pivotal concepts. I won’t say too much more about the workshop because Coral wrote an excellent, detailed blog post right after we were done. The experience was great and feedback we received, including a couple kind emails from our participants, was very positive. Perhaps we, or someone else, can repeat the workshop in the future, as we put all our materials online.

Main Course: Presentations

Thankfully I don’t have to detail the conference talks too much, because they’re all available on YouTube. If a talk looks intriguing, I strongly encourage you to check out the recording. I’m not too ashamed to admit that a few went way over my head, so seeing the original will certainly be more informative than any summary I could offer.

One thing that was striking was how the two keynotes centered on themes of privacy and surveillance. Kate Krauss, Director of Communications of the Tor Project, lead the conference off. Naturally, Tor being privacy software, Krauss focused on stories of government surveillance. She noted how surveillance focuses on the most marginalized people, citing #BlackLivesMatter and the transgender community as examples. Krauss’ talk provided concrete steps that librarians could take, for instance examining our own data collection practices, ensuring our services are secure, hosting privacy workshops, and running a Tor relay. She even mentioned The Library Freedom Project as a positive example of librarians fighting online surveillance, which she posited as one of the premier civil rights issues of our time.

On the final day, Gabriel Weinberg of the search engine DuckDuckGo spoke on similar themes, except he concentrated on how his company’s lack of personalization and tracking differentiated it from companies like Google and Apple. To me, Weinberg’s talk bookended well with Krauss’ because he highlighted the dangers of corporate surveillance. While the government certainly has abused its access to certain fundamental pieces of our country’s infrastructure—obtaining records from major telecom companies without a warrant comes to mind—tech companies are also culpable in enabling the unparalleled degree of surveillance possible in the modern era, simply by collecting such massive quantities of data linked to individuals (and, all too often, by failing to secure their applications properly).

While the pair of keynotes were excellent and thematic, my favorite moments of the conference were the talks by librarians. Becky Yoose gave perhaps the most rousing, emotional talk I’ve ever heard at a conference on the subject of burnout. Burnout is all too real in our profession, but not often spoken of, particularly in such a public venue. Becky forced us all to confront the healthiness and sustainability of our work/life balance, stressing the importance not only of strong organizational policies to prevent burnout but also personal practices. Finally, Andreas Orphanides gave a thoughtful presentation on the political implications of design choices. Dre’s well-chosen, alternatingly brutal and funny examples—from sidewalk spikes that prevent homeless people from lying in doorways, to an airline website labelling as “lowest” a price clearly higher than others on the very same page—outlined how our design choices reflect our values, and how we can better align our values with those of our users.

I don’t mean to discredit anyone else’s talks—there were many more excellent ones, on a variety of topics. Dinah Handel captured my feelings best in this enthusiastic tweet:

Dessert: Community

My main enjoyment from Code4Lib is the sense of community. You’ll hear a lot of people at conferences state things like “I feel like these are my people.” And we are lucky as a profession to have plenty of strong conference options, depending on our locality, specialization, and interests. At Code4Lib, I feel like I can strike up a conversation with anyone I meet about an impending ILS migration, my favorite command-line tool, or the vagaries of mapping between metadata schemas. While I love my present position, I’m mostly a solo systems person surrounded by a few other librarians all with a different expertise. As much as I want to discuss how ludicrous the webpub.def syntax is, or why reading XSLT makes me faintly ill, I know it’d bore my colleagues to death. At Code4Lib, people can at least tolerate such subjects of conversation, if not revel in them.

Code4Lib is great not solely because of it’s focus on technology and code, which a few other library organizations share, but because of the efforts of community members to make it a pleasurable experience for all. To name just a couple of the new things Code4Lib introduced this year: while previous years have had Duty Officers whom attendees could safely report harassment to, they were announced & much more visible this year; sponsored child care was available for conference goers with small children; and a service provided live transcription of all the talks.1 This is in addition to a number of community-building measures that previous Code4Lib conferences featured, such as a series of newcomers dinners on the first night, a “share and play” game night, and diversity scholarships. Overall, it’s evident that the Code4Lib community is committed to being positive and welcoming. Not that other library organizations aren’t, but it should be evident that our profession isn’t immune from problems. Being proactive and putting in place measures to prevent issues like harassment is a shining example of what makes Code4Lib great.

All this said, the community does have its issues. While a 40% female attendance rate is fair for a technology conference, it’s clear that the intersection of coding and librarianship is more male-dominated than the rest of the profession at large. Notably, Code4Lib has done an incredible job of democratically selecting keynote speakers over the past few years—five female and one male for the past three conferences—but the conference has also been largely white, so much so that the 2016 conference’s Program Committee gave a lightning talk addressing the lack of speaker diversity. Hopefully, measures like the diversity scholarships and conscious efforts on the part of the community can make progress here. But the unbearable whiteness of librarianship remains a very large issue.

Finally, it’s worth noting that Code4Lib is entirely volunteer-run. Since it’s not an official professional organization with membership dues and full-time staff members, everything is done by people willing to spare their own time to make the occasion a great one. A huge thanks to the local planning committee and all the volunteers who made such a great event possible. It’s pretty stunning to me that Code4Lib manages to put together some of the nicest benefits of any conference—the live streaming and transcribed talks come to mind—without a huge backing organization, and while charging pretty reasonable registration prices.

Night Cap

I’d recommend Code4Lib to anyone in the library community who deals with technology, whether you’re a manager, cataloger, systems person, or developer. There’s a wide breadth of material suitable for anyone and a great, supportive community. If that’s not enough, the proportion of presentations featuring pictures of cats and/or animated gifs is higher than your average conference.

Notes

  1. Aside: Matt Miller made a fun “Overheard at Code4Lib 2016” app using the transcripts

Reflections on Code4Lib 2013

Disclaimer: I was on the planning committee for Code4Lib 2013, but this is my own opinion and does not reflect other organizers of the conference.

We have mentioned Code4Lib before on this blog, but for those who are unfamiliar, it is a loose collective of programmers working in libraries, librarians, and others interested in code and libraries. (You can read more about it on the website.) The Code4Lib conference has emerged as a venue to share very new technology and have discussions with a wide variety of people who might not attend conferences more geared to librarians. Presentations at the conference are decided by the votes of anyone interested in selecting the program, and additionally lightning talks and breakout sessions allow wide participation and exposure to extremely new projects that have not made it into the literature or to conferences with a longer lead time. The Code4Lib 2013 conference ran February 11-14 at University of Illinois Chicago. You can see a list of all programs here, which includes links to the video archive of the conference.

While there were many types of projects presented, I want to focus on those talks which illustrated what I saw as thread running through the conference–care and emotion. This is perhaps unexpected for a technical conference. Yet those themes underlie a great deal of the work that takes place in academic library technology and the types of projects presented at Code4Lib. We tend to work in academic libraries because we care about the collections and the people using those collections. That intrinsic motivation focuses our work.

Caring about the best way to display collections is central to successful projects. Most (though not all) the presenters and topics came out of academic libraries, and many of the presentations dealt with creating platforms for library and archival metadata and collections. To highlight a few: Penn State University has developed their own institutional repository application called ScholarSphere that provides a better user experience for researchers and managers of the repository. The libraries and archives of the Rock and Roll Hall of Fame dealt with the increasingly common problem of wanting to present digital content alongside more traditional finding aids, and so developed a system for doing so. Corey Harper from New York University presented an extremely interesting and still experimental project to use linked data to enrich interfaces for interacting with library collections. Note that all these projects combined various pieces of open source software and library/web standards to create solutions that solve a problem facing academic or research libraries for a particular setting. I think an important lesson for most academic librarians looking at descriptions of projects like this is that it takes more than development staff to make projects like this. It takes purpose, vision, and dedication to collecting and preserving content–in other words, emotion and care. A great example of this was the presentation about DIYHistory from the University of Iowa. This project started out initially as an extremely low-tech solution for crowdsourcing archival transcription, but got so popular that it required a more robust solution. They were able to adapt open source tools to meet their needs, still keeping the project very within the means of most libraries (the code is here).

Another view of emotion and care came from Mark Matienzo, who did a lightning talk (his blog post gives a longer version with more details). His talk discussed the difficulties of acknowledging and dealing with the emotional content of archives, even though emotion drives interactions with materials and collections. The records provided are emotionless and affectless, despite the fact that they represent important moments in history and lives. The type of sharing of what someone “likes” on Facebook does not satisfactorily answer the question of what they care about,or represent the emotion in their lives. Mark suggested that a tool like Twine, which allows writing interactive stories could approach the difficult question of bringing together the real with the emotional narrative that makes up experience.

One of the ways we express care for our work and for our colleagues is by taking time to be organized and consistent in code. Naomi Dushay of Stanford University Library presented best practices for code handoffs, which described some excellent practices for documenting and clarifying code and processes. One of the major takeaways is that being clear, concise, and straightforward is always preferable, even as much as we want to create cute names for our servers and classes. To preserve a spirit of fun, you can use the cute name and attach a description of what the item actually does.

Originally Bess Sadler, also from Stanford, was going to present with Naomi, but ended up presenting a different talk and the last one of the conference on Creating a Commons (the full text is available here). This was a very moving look at what motivates her to create open source software and how to create better open source software projects. She used the framework of the Creative Commons licenses to discuss open source software–that it needs to be “[m]achine readable, human readable, and lawyer readable.” Machine readable means that code needs to be properly structured and allow for contributions from multiple people without breaking, lawyer readable means that the project should have the correct structure and licensing to collaborate across institutions. Bess focused particularly on the “human readable” aspect of creating communities and understanding the “hacker epistemology,” as she so eloquently put it, “[t]he truth is what works.” Part of understanding that requires being willing to reshape default expectations–for instance, the Code4Lib community developed a Code of Conduct at Bess’s urging to underline the fact that the community aims at inclusion and creating a safe space. She encouraged everyone to keep working to do better and “file bug reports” about open source communities.

This year’s Code4Lib conference was a reminder to me about why I do the work I do as an academic librarian working in a technical role. Even though I may spend a lot of time sitting in front of a computer looking at code, or workflows, or processes, I know it makes access to the collections and exploration of those collections better.


Decentralizing Library IT

I’ve always gravitated toward library jobs in library systems and technology, but I recently took on a new position as head of a tech services department in a smaller academic library.  Some of my colleagues expressed surprised that I’m moving out of a traditional library IT or systems role, but my former position was as a systems librarian within a technical services department, and for the past few years, a significant amount of my time recently has involved developing collection and metadata-related system integrations for acquisitions and cataloging. A few trends have made me think that I’m not alone in branching out and applying systems skills to diverse functional areas of the library.  It has become relatively commonplace for the work of technology innovation to occur, at least in part, outside of traditional library IT departments; for example, reference and instruction librarians playing a tightly integrated role in the optimization of discovery interfaces, tech services staff using Python and linked data technologies to clean up and enhance metadata, and instruction librarians and access services staff creating and managing high-tech MakerSpaces.

More personnel across the library are embracing and developing high tech skills traditionally housed in library systems or IT departments.  The following are six general trends I’ve observed that are influencing the spread of technology development outside of traditional library IT.

Increasingly high technical skills are required for most library areas

  • Job advertisements for almost every functional area of the library emphasize advanced technical knowledge (beyond typical office application knowledge), especially with regard to ILS systems management.  In a 2016 study of library job advertisements, the authors found a wide range of job titles that require knowledge and skills in information technology, including Metadata Librarian, Digital Archivist, Information Literacy Librarian, and Research Data Librarian (Shahbazi, Fahimnia, & Khoshemehr, 2016).   Scholarly communications, data services, e-resource management, reference, and other library staff positions may all be positioned outside of traditional library IT, but are all deeply involved in the utilization and development of library technologies.

Optimization of cloud-based systems can be distributed

  • With cloud-based application hosting, managing physical servers and backups may become less burdensome, but the need for knowledgeable personnel to configure and optimize often complex cloud-based systems is as essential as it has always been.  Scholarly communications, e-resource, and access services library staff may play highly integrated roles in the development and optimizing of library systems.  Beyond acting as consultants for the management of these systems, knowledge in scripting and interoperability mechanisms enables staff outside of library IT to holistically contribute to the development of cloud-based library applications.

More opportunities to build integrations

  • New library services platforms often enable a number of integrations with third party systems via APIs and other web services.1 Many of these integrations require a deep knowledge of workflows and data structures between multiple systems, so including involvement from multiple functional areas is usually required. The combination of knowledge of a functional area with knowledge of how the library system works can result in some seriously powerful and useful applications.

The increasing importance of data wrangling

  • Sources of metadata are increasingly varied, requiring data wrangling, work that is often made more efficient by developing coding or scripting methods to automate routine tasks.  Metadata specialists are often experts in developing macros in OCLC Connexion (for example), and increasingly require access to a computing environment that enables writing, testing, and using code to further automate metadata cross-walking and cleanup, including use of Python, OpenRefine, and other tools for dealing with enormous amounts of messy data.

Customization of discovery and library e-content

  • Discovery platforms are complex and often highly customizable. Many reference and instruction librarians have a robust understanding of user behavior and information literacy goals that are essential for development of usable interfaces, as well as skills in user experience (UX) testing and interface design.  Reference and instruction librarians are often experts in course management systems and LibGuides, and know good tricks and hacks for optimizing digital learning content.  Library collection development, scholarly communications and technical services librarians deeply understand content and how to make it findable, and increasingly play pivotal roles in configuring harvesting and transformation of metadata into discovery systems.

Systems beyond the ILS

  • Libraries are engaging with a much wider variety of technologies than just an ILS – libraries support institutional repository and digital library software, data management software, authority systems such as VIAF, open publishing, etc. While working with these systems does not necessarily require a background or emphasis on systems administration, it is definitely helpful to have an understanding of the architecture of such applications and how applications might interact with each other.
  • Systems knowledge is applicable to more than just technology. Thinking like a programmer can often be useful when performing workflow analysis and optimization, as well as problem-solving even in non-technical areas.

Are “true” library systems administrators still needed? (Yes, obviously)

When researching for this post, I came across an amusing article by Roy Tennant from back in 2011 titled “If You are a Library SysAdmin, you are TOAST”.  The article presents a (seemingly not satirical?) argument that movement to cloud-based systems in libraries will make library system administrators obsolete:

When I, as just a moderately savvy librarian, can learn maybe five to ten very specific steps and be able to deploy any application I would likely want to deploy, why do I need to talk to my system administrator ever again?

Obviously, six years after this article was written, with many libraries firmly embedded in the cloud with a variety of library applications, the role of the system administrator in library is not at all diminished.  While work involving physical servers and backups may be less common for many applications, system administrators and those with IT skills in libraries are still in huge demand to be on hand to evaluate, optimize, and provide integrations for cloud-based library systems.

I think it’s safe to say that the more people in any organization with technical knowledge, the better.  Managing decentralized technology projects, however, does require leadership and coordination.  When learning to develop applications, coding is often much more fun than worrying about server administration and security – but of course, someone has to be concerned about security and help those who may just be learning about technology adopt secure development practices.  Library technology projects don’t have to come out of library IT departments, but leadership from library IT departments should be open and supportive of library technology initiatives coming out of non-library IT areas, while facilitating secure practices.  Coordination on the part of library IT is also essential to avoid duplication of effort and ensure that projects being developed are sustainable and supported by the technology environment of the larger organization.  Encouraging the open exchange of technology-related ideas across the library prevents tech savvy staff feeling they need to hide their pet projects lest they get ‘in trouble’ with a restrictive library IT department.

In my view, there’s simply too much technology change happening in library to keep all technology development centralized in a single unit within the library.  Adding tech-savvy positions within non-technology departments is not a bad strategy – it can help support innovation out of library departments that haven’t traditionally been expected to drive technological change.  However, continually raising expectations with regard to technical knowledge can be stressful, so ensuring that strong support for professional development is in place is also important.  In my own new position, I’m excited to channel my knowledge of APIs and interest in data visualization technologies into creating some cool collection management and assessment tools, and I’m not at all concerned that I won’t have an opportunity to apply my technical knowledge in the rapidly changing landscape of library technical services and collection development.  Working outside of library IT means that I need to communicate closely with the head of library IT about the projects I’m working on, and also be sure to closely follow other technology-related projects across the library and be proactive about offering my skills where they might be helpful.  It also means that I need to work to support the technical expertise of staff in my department, particularly as related to library system management in acquisitions and cataloging.    No matter your role in the library, there’s plenty of technology-related work to go around.

  1.  See, for example, the Ex Libris and OCLC Developer Networks, both of which provide great documentation and example applications to novice developers.

Voice, Natural Language Processing, and the Future of Library Experiences

Is the future of research voice controlled? It might be, because when I originally had the idea for this post my first instinct was to grab my phone and dictate my half-formed ideas into a note, rather than typing it out. Writing things down often makes them seem wrong and not at all what we are trying to say in our heads. (Maybe it’s not so new, since as you may remember Socrates had a similar instinct.) The idea came out of a few different talks at the national Code4Lib conference held in Los Angeles in March of 2017 and a talk given by Chris Bourg. Among these presentations the themes of machine learning, artificial intelligence, natural language processing, voice search, and virtual assistants intersect to give us a vision for what is coming. The future might look like a system that can parse imprecise human language and turn it into an appropriately structured search query in a database or variety of databases, bearing in mind other variables, and return the correct results. Pieces of this exist already, of course, but I suspect over the next few years we will be building or adapting tools to perform these functions. As we do this, we should think about how we can incorporate our values and skills as librarians into these tools along the way.

Natural Language Processing

I will not attempt to summarize natural language processing (NLP) here, except to say that speaking to a computer requires that the computer be able to understand what we are saying. Human—or natural—language is messy, full of nuance and context that requires years for people to master, and even then often leads to misunderstandings that can range from funny to deadly. Using a machine to understand and parse natural language requires complex techniques, but luckily there are a lot of tools that can make the job easier. For more details, you should review the NLP talks by Corey Harper and Nathan Lomeli at Code4Lib. Both these talks showed that there is a great deal of complexity involved in NLP, and that its usefulness is still relatively confined. Nathan Lomeli puts it like this. NLP can “cut strings, count beans, classify things, and correlate everything”. 1 Given a corpus, you can use NLP tools to figure out what certain words might be, how many of those words there are, and how they might connect to each other.

Processing language to understand a textual corpus has a long history but is now relatively easy for anyone to do with the tools out there. The easiest is Voyant Tools, which is a project by Sinclair, Stéfan Sinclair and Geoffrey Rockwell. It is a portal to a variety of tools for NLP. You can feed it a corpus and get back all kind of counts and correlations. For example, Franny Gaede and I used VoyantTools to analyze social justice research websites to develop a social justice term corpus for a research project. While a certain level of human review is required for any such project, it’s possible to see that this technology can replace a lot of human-created language. This is already happening, in fact. A tool called Wordsmith can create convincing articles about finance, sports, and technology, or really any field with a standard set of inputs and outputs in writing. If computers are writing stories, they can also find stories.

Talking to the Voice in the Machine

Finding those stories, and in turn, finding the data with which to tell more stories, is where machine learning and artificial intelligence enter. In libraries we have a lot of words, and while we have various projects that are parsing those words and doing things with them, we have only begun to see where this can go. There are two sides to this. Chris Bourg’s talk at Harvard Library Leadership in a Digital Age, asks the question “What happens to libraries and librarians when machines can read all the books?” One suggestion she makes is that:

we would be wise to start thinking now about machines and algorithms as a new kind of patron  — a patron that doesn’t replace human patrons, but has some different needs and might require a different set of skills and a different way of thinking about how our resources could be used. 2

One way in which we can start to address the needs of machines as patrons is by creating searches that work with them, which is for now ultimately to serve the needs of humans, but in the future could be for their own artificial intelligence purposes. Most people are familiar with virtual assistants that have popped up on all platforms over the past few years. As an iOS and a Windows user, I am now constantly invited to speak to Siri or Cortana to search for answers to my questions or fix something in my schedule. While I’m perfectly happy to ask Siri to remind me to bring my laptop to work at 7:45 AM or to wake me up in 20 minutes, I find mixed results when I try to ask a more complex question. 3 Sometimes when I ask the temperature on the surface of Jupiter I get the answer, other times I get today’s weather in a town called Jupiter. This is not too surprising, as asking “What is the temperature of Jupiter?” could mean a number of things. It’s on the human to specify to the computer to which domain of knowledge you are referring, which requires knowing exactly how to ask the question. Computers cannot yet do a reference interview, since they cannot pick up on the subtle hidden meanings or helping with the struggle for the right words that librarians do so well. But they can help with certain types of research tasks quite well, if you know how to ask the question.  Eric Frierson (PPT) gave a demonstration of his project working on voice powered search in EBSCO using Alexa. In the presentation he demonstrates the Alexa “skills” he set up for people to ask Alexa for help. They are “do you have”, “the book”, “information about”, “an overview of”, “what I should read after”, or “books like”. There is a demonstration of what this looks like on YouTube. The results are useful when you say the correct thing in the correct order, and for an active user it would be fairly quick to learn what to say, just as we learn how best to type in a search query in various services.

Some Considerations

People in the Sixties: The government will wiretap your home. People now: Hey wire tap, can cats eat pancakes?

Alexa meme

Why ask a question of a computer rather than type in a question to a computer? For the reason I started this piece with, certainly–voice is there, and it’s often easier to say what you mean than write it. This can be taken pragmatically as well. If you find typing difficult, being able to speak makes life easier. When I was home with a newborn baby I really appreciated being able to dictate and ask Siri about the weather forecast and what time the doctor’s appointment was. Herein lies one of the many potential pitfalls of voice: who is listening to what you are saying? One recent news story puts this in perspective, as Amazon agreed to turn over data from Alexa to police in a murder investigation after the suspect gave the ok. They refused to do at first, but it is an open question as to the legal nature of the conversation with a virtual assistant. Nor is it entirely clear when you speak to a device where the data is being processed. So before we all rush out and write voice search tools for all our systems, it is useful to think about where that data lives what the purpose of it is.

If we would protect a user’s search query by ensuring that our catalogs are encrypted (and let’s be honest, we aren’t there yet), how do we do the same for virtual search assistants in the library catalog? For Alexa, that’s built into creating an Alexa skill, since a basic requirement for the web service used is that it meet Amazon’s security requirements. But if this data is subject to subpoena, we would have to think about it in the same way we would any other search data on a third party system. And we also have to recognize that these tools are created by these companies for commercial purposes, and part of that is to gather data about people and sell things to them based on that data. Machine learning could eventually build on that to learn a lot more about people than they think, which the Amazon Echo Look recently brought up as a subject of debate. There are likely to be other services popping up in addition to those offered by Amazon, Google, Apple, and Microsoft. Before long, we might expect our vendors to be offering voice search in their interfaces, and we need to be aware of the transmission of that data and where it is being processed. A recent alliance formed called The Voice Privacy Alliance, which is developing some standards for this.

The invisibility of the result processing has another dark side. The biases inherent in the algorithms become even more hidden, as the first result becomes the “right” one. If Siri tells me the weather in Jupiter, that’s a minor inconvenience, but if Siri tells me that “Black girls” are something hypersexualized, as Safiya Noble has found that Google does, do I (or let’s say, a kid) necessarily know something has gone wrong? 4 Without human intervention and understanding, machines can perpetuate the worst side of humanity.

This comes back to Chris Bourg’s question. What happens to librarians when machines can read all the books, and have a conversation with patrons about those books? Luckily for us, it is unlikely that artificial intelligence will ever be truly self-aware with desires, metacognition, love, and need for growth and adventure. Those qualities will continue to make librarians useful to creating vibrant and unique collections and communities. But we will need to fit that in a world where we are having conversations with our computers about those collections and communities.

 

  1. Lomeli, Nathan. “Natural Language Processing: Parsing Through The Hype”. Code4Lib. Los Angeles, CA. March 7, 2017.
  2. “What Happens to Libraries and Librarians When Machines Can Read All the Books?” Feral Librarian, March 17, 2017. https://chrisbourg.wordpress.com/2017/03/16/what-happens-to-libraries-and-librarians-when-machines-can-read-all-the-books/.
  3. As a side issue, I don’t have a private office and I feel weird speaking to my computer when there are people around.
  4. Noble, Safiya Umoja. “Google Search: Hyper-Visibility as a Means of Rendering Black Women and Girls Invisible – InVisible Culture.” InVisible Culture: An Electronic Journal for Visual Culture, no. 19 (2013). http://ivc.lib.rochester.edu/google-search-hyper-visibility-as-a-means-of-rendering-black-women-and-girls-invisible/.

Looking Across the Digital Preservation Landscape

When it comes to digital preservation, everyone agrees that a little bit is better than nothing. Look no further than these two excellent presentations from Code4Lib 2016, “Can’t Wait for Perfect: Implementing “Good Enough” Digital Preservation” by Shira Peltzman and Alice Sara Prael, and “Digital Preservation 101, or, How to Keep Bits for Centuries” by Julie Swierczek. I highly suggest you go check those out before reading more of this post if you are new to digital preservation, since they get into some technical details that I won’t.

The takeaway from these for me was twofold. First, digital preservation doesn’t have to be hard, but it does have to be intentional, and secondly, it does require institutional commitment. If you’re new to the world of digital preservation, understanding all the basic issues and what your options are can be daunting. I’ve been fortunate enough to lead a group at my institution that has spent the last few years working through some of these issues, and so in this post I want to give a brief overview of the work we’ve done, as well as the current landscape for digital preservation systems. This won’t be an in-depth exploration, more like a key to the map. Note that ACRL TechConnect has covered a variety of digital preservation issues before, including data management and preservation in “The Library as Research Partner” and using bash scripts to automate digital preservation workflow tasks in “Bash Scripting: automating repetitive command line tasks”.

The committee I chair started examining born digital materials, but expanded focus to all digital materials, since our digitized materials were an easier test case for a lot of our ideas. The committee spent a long time understanding the basic tenets of digital preservation–and in truth, we’re still working on this. For this process, we found working through the NDSA Levels of Digital Preservation an extremely helpful exercise–you can find a helpfully annotated version with tools by Shira Peltzman and Alice Sara Prael, as well as an additional explanation by Shira Peltman. We also relied on the Library of Congress Signal blog and the work of Brad Houston, among other resources. A few of the tasks we accomplished were to create a rough inventory of digital materials, a workflow manual, and to acquire many terabytes (currently around 8) of secure networked storage space for files to replace all removable hard drives being used for backups. While backups aren’t exactly digital preservation, we wanted to at the very least secure the backups we did have. An inventory and workflow manual may sound impressive, but I want to emphasize that these are living and somewhat messy documents. The major advantage of having these is not so much for what we do have, but for identifying gaps in our processes. Through this process, we were able to develop a lengthy (but prioritized) list of tasks that need to be completed before we’ll be satisfied with our processes. An example of this is that one of the major workflow gaps we discovered is that we have many items on obsolete digital media formats, such as floppy disks, that needs to be imaged before it can even be inventoried. We identified the tool we wanted to use for that, but time and staffing pressures have left the completion of this project in limbo. We’re now working on hiring a graduate student who can help work on this and similar projects.

The other piece of our work has been trying to understand what systems are available for digital preservation. I’ll summarize my understanding of this below, with several major caveats. This is a world that is currently undergoing a huge amount of change as many companies and people work on developing new systems or improving existing systems, so there is a lot missing from what I will say. Second, none of these solutions are necessarily mutually exclusive. Some by design require various pieces to be used together, some may not require it, but your circumstances may dictate a different solution. For instance, you may not like the access layer built into one system, and so will choose something else. The dream that you can just throw money at the problem and it will go away is, at present, still just a dream–as are so many library technology problems.

The closest to such a dream is the end-to-end system. This is something where at one end you load in a file or set of files you want to preserve (for example, a large set of donated digital photographs in TIFF format), and at the other end have a processed archival package (which might include the TIFF files, some metadata about the processing, and a way to check for bit rot in your files), as well as an access copy (for example, a smaller sized JPG appropriate for display to the public) if you so desire–not all digital files should be available to the public, but still need to be preserved.

Examples of such systems include Preservica, ArchivesDirect, and Rosetta. All of these are hosted vended products, but ArchivesDirect is based on open source Archivematica so it is possible to get some idea of the experience of using it if you are able to install the tools on which it based. The issues with end-t0-end systems are similar to any other choice you make in library systems. First, they come at a high price–Preservica and ArchivesDirect are open about their pricing, and for a plan that will meet the needs of medium-sized libraries you will be looking at $10,000-$14,000 annual cost. You are pretty much stuck with the options offered in the product, though you still have many decisions to make within that framework. Migrating from one system to another if you change your mind may involve some very difficult processes, and so inertia dictates that you will be using that system for the long haul, which a short trial period or demos may not be enough to really tell you that it’s a good idea. But you do have the potential for more simplicity and therefore a stronger likelihood that you will actually use them, as well as being much more manageable for smaller staffs that lack dedicated positions for digital preservation work–or even room in the current positions for digital preservation work.  A hosted product is ideal if you don’t have the staff or servers to install anything yourself, and helps you get your long-term archival files onto Amazon Glacier. Amazon Glacier is, by the way, where pretty much all the services we’re discussing store everything you are submitting for long-term storage. It’s dirt cheap to store on Amazon Glacier and if you can restore slowly, not too expensive to restore–only expensive if you need to restore a lot quickly. But using it is somewhat technically challenging since you only interact with it through APIs–there’s no way to log in and upload files or download files as with a cloud storage service like Dropbox. For that reason, when you’re paying a service hundreds of dollars a terabyte that ultimately stores all your material on Amazon Glacier which costs pennies per gigabye, you’re paying for the technical infrastructure to get your stuff on and off of there as much as anything else. In another way you’re paying an insurance policy for accessing materials in a catastrophic situation where you do need to recover all your files–theoretically, you don’t have to pay extra for such a situation.

A related option to an end-to-end system that has some attractive features is to join a preservation network. Examples of these include Digital Preservation Network (DPN) or APTrust. In this model, you pay an annual membership fee (right now $20,000 annually, though this could change soon) to join the consortium. This gives you access to a network of preservation nodes (either Amazon Glacier or nodes at other institutions), access to tools, and a right (and requirement) to participate in the governance of the network. Another larger preservation goal of such networks is to ensure long-term access to material even if the owning institution disappears. Of course, $20,000 plus travel to meetings and work time to participate in governance may be out of reach of many, but it appears that both DPN and APTrust are investigating new pricing models that may meet the needs of smaller institutions who would like to participate but can’t contribute as much in money or time. This a world that I would recommend watching closely.

Up until recently, the way that many institutions were achieving digital preservation was through some kind of repository that they created themselves, either with open source repository software such as Fedora Repository or DSpace or some other type of DIY system. With open source Archivematica, and a few other tools, you can build your own end-to-end system that will allow you to process files, store the files and preservation metadata, and provide access as is appropriate for the collection. This is theoretically a great plan. You can make all the choices yourself about your workflows, storage, and access layer. You can do as much or as little as you need to do. But in practice for most of us, this just isn’t going to happen without a strong institutional commitment of staff and servers to maintain this long term, at possibly a higher cost than any of the other solutions. That realization is one of the driving forces behind Hydra-in-a-Box, which is an exciting initiative that is currently in development. The idea is to make it possible for many different sizes of institutions to take advantage of the robust feature sets for preservation in Fedora and workflow management/access in Hydra, but without the overhead of installing and maintaining them. You can follow the project on Twitter and by joining the mailing list.

After going through all this, I am reminded of one of my favorite slides from Julie Swierczek’s Code4Lib presentation. She works through the Open Archival Initiative System model graph to explain it in depth, and comes to a point in the workflow that calls for “Sustainable Financing”, and then zooms in on this. For many, this is the crux of the digital preservation problem. It’s possible to do a sort of ok job with digital preservation for nothing or very cheap, but to ensure long term preservation requires institutional commitment for the long haul, just as any library collection requires. Given how much attention digital preservation is starting to receive, we can hope that more libraries will see this as a priority and start to participate. This may lead to even more options, tools, and knowledge, but it will still require making it a priority and putting in the work.


Wikipedia, Libraries, & Neutrality

This piece is substantially based off of a column I wrote for RUSQ that will appear early next year.


Broadly construed, there are two camps of opinions surrounding Wikipedia in librarianship. The first is that Wikipedia is not an academic quality source. It is something students need to be warned away from. The community authorship of wikipedia is typically the source of this criticism; since “anyone can edit” the online encyclopedia, there’s no way to tell whether you’re receiving high quality information written by experts or the ill-informed opinions of internet trolls.

The other camp is far more positive regarding Wikipedia. After all, the values of the Wikipedian community clearly align with our own. The ultimate goal of the project is not to give a venue for poor research or fringe theories, but to enable free and open access to information. While Wikipedia isn’t flawless, many of its articles compete with more established reference sources. Famously, Nature performed a comparison of scientific articles and found Wikipedia to be comparable to Encyclopedia Britannica. But following the study, a community project to correct the identified errors in Wikipedia sprung up, fixing them all in a little over a month.1 With this common ground, it’s no wonder that libraries have found Wikipedia to be a valuable partner in publicizing our content. Articles like Using Wikipedia to Extend Digital Collections, Putting the Library in Wikipedia, and Wikipedia Lover, Not a Hater: Harnessing Wikipedia to Increase the Discoverability of Library Resources all discuss the value of working with Wikipedia to highlight library digital collections and metadata, not against.

A good example of Wikipedia driving traffic to library special collections was brought to my attention by fellow Tech Connect author Margaret Heller, who pointed me towards the Google Analytics Usage Reports for CARLI Digital Collections. CARLI regularly sees Wikipedia as one of the top external traffic sources, with Wikipedia being noted in last few quarterly reports as a traffic source leading “to home pages or images from multiple CARLI Collections”.

The Problem with Neutral

I must admit that I side with the pro-Wikipedia folks. I love using Wikipedia as a resource—as a procrastination-enabler I have open several articles on roguelikes that I’m reading while I right this—and I love contributing to it in whatever small ways I can, from fixing broken markup to citation hunting. But Wikipedia is imperfect in ways more insidious that its anonymous authorship or occasionally inaccurate details. Rather, one problem lies within one of the five pillars that define the philosophy of Wikipedia; “Wikipedia is written from a neutral point of view”.

Neutrality has been under fire from #critlib lately, a group of librarians emphasizing critical theory especially with respect to information literacy instruction. ALA Annual featured a well-attended presentation entitled “But We’re Neutral!” And Other Librarian Fictions Confronted by #critlib. A seminal article appearing earlier this year in Code4Lib by Bess Sadler and Chris Bourg begins with a rousing section labelled Libraries are not Neutral:

In spite of the pride many libraries take in their neutrality, libraries have never been neutral repositories of knowledge. Research libraries in particular have always reflected the inequalities, biases, ethnocentrism, and power imbalances that exist throughout the academic enterprise through collection policies and hiring practices that reflect the biases of those in power at a given institution. In addition, theoretically neutral library activities like cataloging have often re-created societal patterns of exclusion and inequality.

Wikipedia shares this problem; while it appears neutral on the surface, its topical coverage and treatment of subjects reflect the skewed power relations of our society. A 2011 study found that 91% of editors were men. The same study shows that few editors come from the Global South and that the English Wikipedia receives far more focus than other languages. Another research paper from 2011 goes a bit further in demonstrating that “male articles are significantly longer than female articles”.2 The editorial gender gap has real effects on the encyclopedia’s content; it’s not just that having editors of all genders is good in its own right, it’s that Wikipedia’s claims to objectivity and neutrality are jeopardized by the disbalance.

Art+Feminism

So what’s a concerned librarian to do? I think the wrong thing would be to denounce Wikipedia. For one, the alternative sources we or our patrons would turn to are no less problematic. Encyclopaedia Britannica has a well-documented history of supposedly scientific articles written from the dominant viewpoint (c.f. the racist description of black people in the 1911 edition). What’s more, Wikipedia is going to be used. It’s massively popular. Sticking our heads in the sand because it doesn’t live up to its own standards of neutrality improves nothing.

Luckily, there are several Wikipedia projects focused on recruiting editors from underrepresented groups and addressing lackluster coverage of particular topics which we can support. One such project is Art + Feminism. In its own words:

Art+Feminism is a rhizomatic campaign to improve coverage of women and the arts on Wikipedia, and to encourage female editorship…Content is skewed by the lack of female participation. Many articles on notable women in history and art are absent on Wikipedia. This represents an alarming aporia in an increasingly important repository of shared knowledge.

Art+Feminism started out by hosting an edit-a-thon out of the Eyebeam Art and Technology Center in New York City in 2014, with more than 30 other locations worldwide joining in. An edit-a-thon is an event where people gather to perform Wikipedia edits, often centered around a particular theme or project. Higher education institutions and libraries make perfect partners for such occasions. We typically have useful materials to be cited and our students form a large body of potential participates who can be easily incentivized to join in, whether with extra credit or simply some food. So when my library heard of the upcoming 2nd annual edit-a-thon, we immediately began planning to host it. Below, I’ll briefly outline what we did in the hopes it’ll encourage other institutions to join us during this year’s edit-a-thon on March 5th.

Hosting an Edit-a-thon

First, we set up a meetup page on Wikipedia. If you’re unfamiliar with Wikipedia, creating a page like this isn’t a struggle. For one, you can simply copy the entire source markup of someone else’s meetup, then edit your specific details into that skeleton. For two, you can enable the experimental Visual Editor to make Wikipedia even easier to edit without learning wiki markup. The meetup page is an important place for putting up information like timing and directions, but is also a place for us to talk about the impact we made by showing how many editors attended and what articles we improved or created.

While we were putting initial details on our meetup page, we set about securing a location on the date of the edit-a-thon. We discovered that a gallery associated with our school had hosted the edit-a-thon the prior year, but they were unable to repeat it. Our school has campuses in both San Francisco and Oakland, but since Oakland had no other edit-a-thon locations we decided to host it there with the idea that SF locals already had an event nearby.

Our next steps aimed to make the event as easy for newcomers as possible. A staff member pulled relevant materials from our collection, so that researching would be simple and our rarer, more valuable resources might lend some of their information to Wikipedia. We also wanted experienced editors to be on hand during the event. I looked at a local Wikiproject, asking for help on the Talk page and then corresponding directly with a couple editors who had expressed interest. Finally, we managed to secure a visit from someone who actually works for the Wikimedia Foundation that runs the encyclopedia.

During the day, we set out snacks and name badges for everyone. Similar to the color-coded name badges at Ada Camp and other tech conferences, Art+Feminism recommended giving out name badges which signify one’s willingness to be photographed: green meant feel free to take a photo, orange meant please ask permission first, and red meant absolutely not. These steps ensure that everyone is comfortable at the event and not being exposed on the internet against their will. At the start of the day, we explained the coloring system and the Wikimedia staff person gave a short talk on how to write articles that endure, standing up to scrutiny over time. The results of the global event are listed on Wikipedia.

While I feel we accomplished much at our local event, there was one negative experience. An image uploaded to Wikimedia Commons for use on a page was flagged for deletion. The editor flagging it said something along the lines of “this isn’t your personal photo album” as the image was a headshot of a female artist. In the ensuing discussion around the proposed deletion, I noted that the image was about to be used on an article. It was never removed from Commons. Still, the incident underscores cultural problems in Wikipedia. The confrontational style of the discussion lacked good faith. Further, I heard a gendered undertone in the editor’s response; how many pictures of white men are derided as personal photos? While our library staff person was undeterred, moments of hostility like these drive away newcomers.

In Which I Admit I’m Missing the Point

To be fair, Wikipedia itself acknowledges that it fails to live up to neutral status. That the encyclopedia strives towards neutrality is the more vital point. But it’s not as if reaching a supposedly perfect neutrality resolves the issues that folks like #critlib are highlighting. Neutral as a positive value is precisely the problem, because there is no neutral stance that can be taken from outside of society’s power relations and history of inequity. So should we instead be agitating for Wikipedia to become less neutral and take more active stances on social issues? Or are edit-a-thons like Art+Feminism the most viable route towards ensuring topics are covered in a way that surfaces marginalized peoples and their experiences? I have no answers, just food for thought.

  1. See External peer review/Nature December 2005 for Wikipedia’s internal take on the correction process. The original Nature article is paywalled here and is doi:10.1038/438900a. A similar but open access study is doi:10.1371/journal.pone.0106930, Accuracy and Completeness of Drug Information in Wikipedia: A Comparison with Standard Textbooks of Pharmacology.
  2. WP:Clubhouse? An Exploration of Wikipedia’s Gender Imbalance. I’m skeptical of how “male” and “female” articles were defined, but the paper itself is thorough in its argumentation and statistical analysis. It’s also worth noting that this paper, and much of the other literature around the gender gap, ignores genders outside the female-male binary. It’s most useful to conceive of the gap as a disproportionate majority of males than a minority of females, while leaving all other genders out of the picture.

A Forray into Publishing Open Data on GitHub

While we’ve written about using GitHub for publishing before, in this post I will explore publishing data on GitHub, as opposed to a presentation or academic paper. There are a few services where one can publish research data—FigShare comes to mind—but I wanted to try GitHub because I’m already familiar with the service, it seems suitable for publishing data alongside the scripts used to obtain and process it, and its focus on version control makes it particularly apt for publishing a work in progress. However, even with free services like GitHub available, open data still has hurdles to overcome. How can I, a lowly librarian with no grant funding or experience in this area, publish an open data set such that others can locate and reuse it? Let’s find out.

Background

As Lauren introduced in her last post, we here at ACRL Tech Connect are performing research into coding in libraries; how people learn to code, what learning resources they use, what languages they use. As part of this research, I wanted to compare what our survey respondents reported with a bulk analysis of GitHub repositories under library organizations. The Code4Lib wiki has an excellent page listing many library GitHub accounts, and GitHub has a nice API that reports, among many other things, the various languages used in a project. Those two sources of information seemed like a perfect match, so I wrote a few scripts to mash them together.

Publishing scripts that extract and analyze data is important. One cannot trust the results of a single scientific experiment or a poor sample set. Providing the programs used to collect data aims to allow reproducibility so future researchers can verify or build upon prior results. While we perhaps think of science as being quite established by now, data and reproducibility are major issues in most fields. Ask any data librarian and they’ll tell you; managing the preservation and distribution of research data is not a solved or simple problem. Furthermore, every so often another meta-research study will show that only some dismal percentage of experiments can be replicated.1

My own data is not so valuable. No cure for a debilitating disease rests on the number of bytes of Standard ML in your university’s GitHub account. But on principle I want to let my results be repeatable and, what’s more, if someone does find an error in my scripts or data I want it corrected. Even if my initial conclusions are off, someone might be able to construct a stronger study from their basis.

Step #1: Obtain a DOI

As the first step of publishing my data, I wanted to obtain a Digital Object Identifier. Sure, putting my work up on GitHub gives me a URL I can reference, but leaving it at that adds a lot of uncertainty. What if I change my GitHub username, which is contained in the URL? What if I transfer the repository to a new owner? What if GitHub goes out of business? While none of these are likely scenarios, they’re still worth guarding against. DOI providers essentially stick to a pact that their identifiers will continue to work for perpetuity. While that’s not always the case, I feel like grabbing a DOI is still The Right Thing To Do for pubishers at the present moment.

We can use Zenodo to secure a DOI. GitHub already has a fine guide named Making Your Code Citable, but I’ll lightly outline the process here.

First, we create a Zenodo account reusing our GitHub credentials. Zenodo will list out our repositories and we can click the On button next to one to ready it for publishing. This button establishes a “web hook” between events happening in that GitHub repository and Zenodo; when we go to publish a release, Zenodo will be aware of it.

This was the only step that tripped me up a bit. GitHub’s “releases” are not a part of the git version control system, they’re an added feature of the hosting environment. But in my mind they’re identical to git’s “tags” that one uses to label particular points in a repository’s history. Indeed, when we push a tag up to GitHub, it’ll show up on our repository’s releases page. But it appears tags are not technically releases, or don’t trigger the right web hook, because when I pushed a typical “v1.0.0” tag to GitHub, Zenodo didn’t notice. Instead, I had to go to my releases page, Draft a new release, and then Choose an existing tag to associate the version tag with a GitHub release. The title and description entered at this stage are available later in Zenodo.

The final step is back in Zenodo, where we can mint a DOI and describe our project further. We have a powerful set of fields for describing our project in Zenodo, including type (e.g. data set, software, presentation, publication), publication date, list of authors, open-ended description, list of keywords, access rights, license, funding agency, alternative identifiers (e.g. PubMed ID), and more. Zenodo also has a “communities” feature where we can deposit our research in a collection with a disciplinary focus; I put my data in the “Library and Information Science” group.

Step #2: Document the Data’s Schema

Obtaining a DOI is fine and all, but I also wanted to document my data more thoroughly. While it’s not a complicated data set, I’m familiar with the challenges that an unknown data schema presents for end users. All too often at work, I’m forced to revise data processing routines because a new outlier appears. There’ll be a string of text where I’m expecting only integers, a blank entry in what I thought was a required field, or an ID that doesn’t conform to the anticipated pattern (punctuation appears in a barcode! a random letter prefixes an otherwise numeric ID!).

To make our data’s structure clear, we can use the Data Package standard from the Open Knowledge Foundation (OKFN), specifically the Tabular Data Package subset which was designed for the CSV (comma-separated values) format.2 Documenting our data is straightforward with these standards; we place a “datapackage.json” file alongside our data files and fill in a few fields. Here’s an example:

{
    "name": "libs-github-api", // must be URL-friendly, e.g. no spaces 
    "description": "library GitHub projects",
    "license": "CC0 Public Domain",
    "keywords": ["libraries", "programming languages"],
    "resources": [ // list of files 
        {
            "name": "summary",
            "path": "data/summary.csv", // UNIX-style path relative to datapackage.json 
            "format": "csv",
            "mediatype": "text/csv",
            "schema": { // outline of fields within this file 
                "fields": [
                    {
                        "name": "language",
                        "type": "string", /* from a controlled list of data types
                        could also be integer, number, date, etc. */
                        "description": "name of the programming language",
                        "constraints": {
                            "required": true,
                            "unique": true
                        }
                    }
                    // our schema would list a few more fields here… 
                ]
            }
        }
    ]
}

Note that the comments above aren’t valid in JSON, I include them simply to provide some inline explanation.

While it requires a little reading to figure out how to fill out datapackage.json fields, many are self-evident. The appeal of the standard becomes evident in the schema section; we can tell consumers what types to expect from our data and other particularities of a given CSV column. Does a column contain empty values? Then required will be absent or explicitly set to false. Does a column contain both integers and text? Then the type “string” warns consumers not to anticipate only numeric values. What’s more, we can provide a regular expression in the pattern constraint which specifies exactly how a field may be formatted. Even strange barcodes with surprise punctuation could be documented precisely.

I would say there’s a lot more to the Data Package standards, but the truth is they’re elegant and concise. One can read all three (Data Package, Tabular Data Package, and JSON Table Schema) in a matter of minutes, look at an instructive example or two, and be ready to reveal their data’s structure in a standardized way. There is great depth available in the way one describes individual resources and their data schema.

Why spend all this time with a data package when we’ve already done something similar with Zenodo? The data package documentation solves a couple problems. First of all, packaging up our CSV alongside structured data about its nature addresses findability. There’s tons of open data out there, the issue is it can be scattered and difficult to find. If someone is looking for statistics on programming language usage, how would they go about finding my data? Searching GitHub will be challenging; the keywords one uses (“programming language”, the ambiguous “libraries”, etc.) will likely retrieve many repositories which don’t contain open data, and GitHub, while it does have a decent advanced search form, doesn’t have the facets to make retrieving a particular data set straightforward. One cannot, for instance, filter search results by a repository’s license or the format (CSV, JSON) of the data contained therein.

Data packages address the issue of findability by providing for the possibility of a registry that aggregates all the data sets it knows about. Once a datapackage.json appears, suddenly information like whether the format is CSV or JSON, what the license is, who created it, and what subject keywords are related to a repository become clear. The Open Knowledge Foundation already has a strong proof-of-concept registry, albeit one that lists only around a hundred data sets.

Since Data Package is an open standard, any third-party can easily parse its metadata and provide search facets based on the fields that are present. This is how the standard addresses a second issue; machine readability. Documenting data sets is good, necessary even, but it often only helps humans. I can write a five-page paper meticulously detailing my data’s collection methodology and structure, but that’s asking researchers to do a lot of reading. Now consider that their research might be on a grand scale; imagine if they needed to read a hundred five page papers describing ad hoc data schemas!

Instead, creating a machine readable description lets my data be processed quickly by a specially designed program. As a somewhat trivial example, I already used the OKFN’s Data Package Validator to ensure my schema documentation met their standard. As a more interesting use case, the OKFN also defines an optional “views” section of the data package standard which allows applications to automatically create charts from our data.

Reflection

While I’m glad that tools like Zenodo and standards like Data Package exist for publishing data, there’s still a lot of work to be done in this arena. Every time I make a new release on GitHub, which arguably should happen with even minor changes to my data or scripts, I have to refill the extensive Zenodo form. Zenodo also doesn’t detect the GitHub repository’s license, which is hardly blameworthy given that that information isn’t present as structured data but mere text in a readme file. However, when publishing a new version of the same underlying data, it doesn’t fill in the license or other information from its own previous items.

There’s a ton of efficiency left on the table in the data publishing process this post describes. Specifically, an integration between Zenodo and the datapackage.json metadata would alleviate a number of problems. Rather than repeatedly filling out a form in Zenodo, one could simply ensure changes were reflected in the datapackage.json and publish a new version on GitHub. Many fields between the two are redundant, though each also has its unique value; Zenodo asks for typical academic publishing information (e.g. publication type, links to prior versions) while Data Package asks for a data schema.

As a final area of concern, the open-ended “license” field is going to eventually limit the utility of the machine readable information in a Data Package.3 Perhaps this is my inner librarian unnecessarily freaking out here, but uncontrolled fields which affect resource reuse are bad news. Defaulting to authors specifying an arbitrary string of text as a license is precisely the problem that the Digital Public Library of America and other large digital libraries are facing, as their corpuses contain thousands of different rights statements.4 Zenodo provides a substantial list of licenses to choose from, but then does a poor job of automatically detecting one even if hints are available via GitHub or a previous incarnation of the publication. GitHub itself should probably make licenses for repositories required and controlled as I could see that being a vital facet in their advanced search as well as interesting data to expose to researchers via their API.

  1. I read something about this a month or two ago, but wasn’t able to relocate the source. Scouring the web, there’s a Washington Post article from January on the phenomenon of irreproducible research, which in turn points to a PLoS Med article from 2005 “Why Most Published Research Findings Are False“. Other studies along these lines are “A Survey on Data Reproducibility in Cancer Research Provides Insights into Our Limited Ability to Translate Findings from the Laboratory to the Clinic” in PLoS ONE and “Drug development: Raise standards for preclinical cancer research” in Nature.
  2. I might have been able to explore another intriguing project, Research Objects, which has the apt tagline “enabling reproducible, transparent research.” However, the Data Package standards were so easy to find and follow conceptually that I chose them.
  3. And to be fair, I did see one example where the licenses JSON property was specified as an array of objects containing a license name and URL, which might be easier to consume in script depending on what’s available at the URL.
  4. Aside: I don’t mean to argue that arbitrary license strings should be prohibited, because no controlled vocabulary is going to enumerate all possible choices. But there’s a lot of good work being done to make licenses easier to specify—think of Creative Commons with their composable, versioned licenses which can be referred to by URL. Defaulting to a controlled list of license types or at least pointing to a preferred vocabulary would help here.

Where do Library Staff Learn About Programming? Some Preliminary Survey Results

[Editor’s Note:  This post is part of a series of posts related to ACRL TechConnect’s 2015 survey on Programming Languages, Frameworks, and Web Content Management Systems in Libraries.  The survey was distributed between January and March 2015 and received 265 responses.  A longer journal article with additional analysis is also forthcoming.  For a quick summary of the article below, check out this infographic.]

Our survey on programming languages in libraries has resulted in a mountain of fascinating data.  One of the goals of our survey was to better understand how staff in libraries learn about programming and develop their coding skills.  Based upon anecdotal evidence, we hypothesized that library staff members are often self-taught, learning through a combination of on-the-job learning and online tutorials.  Our findings indicate that respondents use a wide variety of sources to learn about programming, including MOOCs, online tutorials, Google searches, and colleagues.

Are programming skills gained by formal coursework, or in Library Science Master’s Programs?

We were interested in identifying sources of programming learning, whether that involved course work (either formal coursework as part of a degree or continuing education program, or through Massive Online Open Courseware (MOOCs)).  Nearly two-thirds of respondents indicated they had an MLS or were working on one:

When asked about coursework taken in programming, application, or software development, results were mixed, with the most popular choice being 1-2 classes:

However, of those respondents who have taken a course in programming (about 80% of all respondents) AND indicated that they either had an MLS or were attending an MLS program, only about a third had taken any of those courses as part of a Master’s in Library Science program:

Resources for learning about programming

The final question of the survey asked respondents, in an open-ended way, to describe resources they use to learn about programming.  It was a pretty complex question:

Please list or describe any learning resources, discussion boards or forums, or other methods you use to learn about or develop your skills in programming, application development, or scripting. Please includes links to online resources if available. Examples of resources include, but are not limited to: Lynda.com, MOOC courses, local community/college/university course on programming, Books, Code4Lib listserv, Stack Overflow, etc.).

Respondents gave, in many cases, incredibly detailed responses – and most respondents indicated a list of resources used.  After coding the responses into 10 categories, some trends emerged.  The most popular resources for learning about programming, by far, were courses (whether those courses were taken formally in a classroom environment, or online in a MOOC environment):

To better illustrate what each category entails, here are the top five resources in each category:

By far, the most commonly cited learning resource was Stack Overflow, followed by the Code4Lib Listserv, Books/ebooks (unspecified) and Lynda.com.  Results may skew a little toward these resources because they were mentioned as examples in the question, priming respondents to include them in their responses.  Since links to the survey were distributed, among other places, on the Code4Lib listserv, its prominence may also be influenced by response bias. One area that was a little surprising was the number of respondents that included social networks (including in-person networks like co-workers) as resources – indeed, respondents who mentioned colleagues as learning resources were particularly enthusiastic, as one respondent put it:

…co-workers are always very important learning resources, perhaps the most important!

Preliminary Analysis

While the data isn’t conclusive enough to draw any strong conclusions yet, a few thoughts come to mind:

  • About 3/4 of respondents indicated that programming was either part of their job description, or that they use programming or scripting as part of their work, even if it’s not expressly part of their job.  And yet, only about a third of respondents with an MLS (or in the process of getting one) took a programming class as part of their MLS program.  Programming is increasingly an essential skill for library work, and this survey seems to support the view that there should be more programming courses in library school curriculum.
  • Obviously programming work is not monolithic – there’s lots of variation among those who do programming work that isn’t reflected in our survey, and this survey may have unintentionally excluded those who are hobby coders.  Most questions focused on programming used when performing work-related tasks, so additional research would be needed to identify learning strategies of enthusiast programmers who don’t have the opportunity to program as part of their job.
  • Respondents indicated that learning on the job is an important aspect of their work; they may not have time or institutional support for formal training or courses, and figure things out as they go along using forums like Stack Overflow and Code4Lib’s listserv.  As one respondent put it:

Codecademy got me started. Stack Overflow saves me hours of time and effort, on a regular basis, as it helps me with answers to specific, time-of-need questions, helping me do problem-based learning.

TL;DR?  Here’s an infographic:



In the next post, I’ll discuss some of the findings related to ways administration and supervisors support (or don’t support) programming work in libraries.


Removing the Truthiness from Google

A decade ago, Stephen Colbert introduced the concept of “truthiness”, or a fact that was so because it felt right “from the gut.” When we search for information online, we are always up against the risk that the creator of a page is someone who, like Stephen Colbert’s character doesn’t trust books, because “they’re all fact, no heart.”1 Since sites with questionable or outright false facts that “feel right” often end up at the top of Google search results, librarians teach students how to evaluate online sources for accuracy, relevancy, and so on rather than just trusting the top result. But what if there were a way to ensure that truthiness was removed, and only sites with true information appeared at the top of the results?

This idea is what underlies a new Google algorithm called Knowledge-Based Trust(KBT)2. Google’s original founding principles and the PageRank algorithm were based on academic citation practices–loosely summarized, pages linked to by a number of other pages are more likely to be useful than those with fewer links. The content of the page, while it needs to match the search query, is less crucial to its ranking than outside factors, which is otherwise known as an exogenous model. The KBT, by contrast, is an endogenous model relying on the actual content of the page. Ranking is based on the probability that the page is accurate, and therefore more trustworthy. This is designed to address the problem of sites with high PageRank scores that aren’t accurate, either because their truthiness quotient is high, or because they have gamed the system by scraping content and applying misleading SEO. On the other side, pages with great information that aren’t very popular may be buried.

“Wait a second,” you are now asking yourself, “Google now determines what is true?” The answer is: sort of, but of course it’s not as simple as that. Let’s look at the paper in detail, and then come back to the philosophical questions.

Digging Into the KBT

First, this paper is technical, but the basic information is fairly straightforward. This model is based on extracting facts from a web source, evaluating whether those facts are true or not, and then whether a source is accurate or not. This leads to a determination that the facts are correct in an iterative process. Of course, verifying that determination is essential to ensuring that all the algorithms are working correctly, and this paper describes ways of checking the extracted facts for accuracy.

The extractors are described more fully in an earlier version of this work, Knowledge Vault (KV), which was designed to fill in large-scale knowledge bases such as Freebase by extracting facts from a web source using techniques like Natural Language Processing of text files followed by machine learning, HTML DOM trees, HTML tables, and human processed pages with schema.org metadata. The extractors themselves can perform poorly in creating these triples, however, and this is more common than the facts being wrong, and so sites may be unfairly flagged as inaccurate. The KBT project aims to introduce an algorithm for determining what type of error is present, as well as how to judge sites with many or few facts accurately, and lastly to test their assumptions using real world data against known facts.

The specific example given in the paper is the birthplace of President Barack Obama. The extractor would determine a predicate, subject, object triple from a web source and match these strings to Freebase (for example). This can lead to a number of errors–there is a huge problem in computationally determining the truth even when the semantics are straightforward (which we all know it rarely is). For this example, it’s possible to check data from the web against the known value in Freebase, and so if that extractor works set an option to 1 (for yes) and 0 (for no). Then this can be charted in a two-dimensional or three-dimensional matrix that helps show the probability of a given extractor working, as well as whether the value pulled by the extractor was true or not.

They go on to examine two models for computing the data, single-layer and multi-layer. The single-layer model, which looks at each web source and its facts separately, is easier to work with using standard techniques but is limited because it doesn’t take into account extraction errors. The multi-layer model is more complex to analyze, but takes the extraction errors into account along with the truth errors. I am not qualified to comment on the algorithm math in detail, but essentially it computes probability of accuracy for each variable in turn, ultimately arriving at an equation that estimates how accurate a source is, weighted by the likelihood that source contains those facts. There are additional considerations for precision and recall, as well as confidence levels returned by extractors.

Lastly, they consider how to split up large sources to avoid computational bottlenecks, as well as to merge sources with few facts in order to not penalize them but not accidentally combine unrelated sources. Their experimental results determined that generally PageRank and KBT are orthogonal, but with a few outliers. In some cases, the site has a low PageRank but a high KBT. They manually verified the top three predicates with high extraction accuracy scores for web sources with a high KBT to check what was happening. 85% of these sources were trustworthy without extraction errors and with predicates related to the topic of the page, but only 23% of these sources had PageRank scores over 0.5. In other cases, sources had a low KBT but high PageRank, which included sites such as celebrity gossip sites and forums such as Yahoo Answers. Yes, indeed, Google computer scientists finally have definitive proof that Yahoo Answers tends to be inaccurate.

The conclusion of the article with future improvements reads like the learning outcomes for any basic information literacy workshop. First, the algorithm would need to be able to tell the main topic of the website and filter out unrelated facts, to understand which triples are trivial, to have better comprehension of what is a fact, and to correctly remove sites with data scraped from other sources. That said, for what it does, this is a much more sophisticated model than anything else out there, and at least proves that there is a possibility to computationally determine the accuracy of a web source.

What is Truth, Anyway?

Despite the promise of this model there are clearly many potential problems, of which I’ll mention just a few. The source for this exercise, Freebase, is currently in read-only mode as its data migrates to Wikidata. Google is apparently dropping Freebase to focus on their Open Knowledge Graph, which is partially Freebase/Wikidata content and partially schema.org data 3. One interesting wrinkle is that much of Freebase content cites Wikipedia as a source, which means there are currently recursive citations that must be properly cited before they will be accepted as facts. We already know that Wikipedia suffers from a lack of diversity in contributors and topic coverage, so a focus on content from Wikipedia has the danger of reducing the sources of information from which the KBT could check triples.

That said, most of human knowledge and understanding is difficult to fit into triples. While surely no one would search Google for “What is love?” or similar and expect to get a factual answer, there are plenty of less extreme examples that are unclear. For instance, how does this account for controversial topics? I.e. “anthropogenic global warming is real” vs. “global warming is real, but it’s not anthropogenic.” 97% of scientists agree to the former, but what if you are looking for what the 3% are saying?

And we might question whether it’s a good idea to trust an algorithm’s definition of what is true. As Bess Sadler and Chris Bourg remind us, algorithms are not neutral, and may ignore large parts of human experience, particularly from groups underrepresented in computer science and technology. Librarians should have a role in reducing that ignorance by supporting “inclusion, plurality, participation and transparency.” 4 Given the limitations of what is available to the KBT it seems unlikely that this algorithm would markedly reduce this inequity, though I could see how it could be possible if Wikidata could be seeded with more information about diverse groups.

Librarians take note, this algorithm is still under development, and most likely won’t be appearing in our Google results any time in the near future. And even once it does, we need to ensure that we are still paying attention to nuance, edge cases, and our own sense of truthiness–and more importantly, truth–as we evaluate web sources.

  1. http://thecolbertreport.cc.com/videos/63ite2/the-word—truthiness.
  2. Dong, X. et al. “Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources”. Proceedings of the VLDB Endowment, 2015. Retrieved from http://arxiv.org/abs/1502.03519
  3. https://www.wikidata.org/wiki/Help:FAQ/Freebase
  4. Sadler, Bess and Chris Bourg, “Feminism and the Future of Library Discovery.” Code4Lib Journal 28, April 2015.

A Video on Browser Extensions

I thought we’d try something new on ACRL TechConnect, so I recorded a fifteen-minute video discussing general use cases for browser extensions and some specifics of Google Chrome extensions.

The video mentions my WikipeDPLA post on this blog and walks through some slides I presented at a Code4Lib Northern California event.

If you’re looking for another good extension example in libraryland, Stephen Schor of New York Public Library recently wrote extensions for Chrome and Firefox that improve the appearance and utility of the Library of Congress’ EAD documentation. The Chrome extension uses the same content script approach as my silly example in the video. It’s a good demonstration of how you can customize a site you don’t control using the power of browser add-ons.

Have you found a use case for browser add-ons at your library? Let us know in the comments!