Reflections on Code4Lib 2013

Disclaimer: I was on the planning committee for Code4Lib 2013, but this is my own opinion and does not reflect other organizers of the conference.

We have mentioned Code4Lib before on this blog, but for those who are unfamiliar, it is a loose collective of programmers working in libraries, librarians, and others interested in code and libraries. (You can read more about it on the website.) The Code4Lib conference has emerged as a venue to share very new technology and have discussions with a wide variety of people who might not attend conferences more geared to librarians. Presentations at the conference are decided by the votes of anyone interested in selecting the program, while lightning talks and breakout sessions allow wide participation and exposure to extremely new projects that have not made it into the literature or to conferences with a longer lead time. The Code4Lib 2013 conference ran February 11-14 at the University of Illinois at Chicago. You can see a list of all programs here, which includes links to the video archive of the conference.

While there were many types of projects presented, I want to focus on the talks that illustrated what I saw as a thread running through the conference: care and emotion. This is perhaps unexpected at a technical conference. Yet those themes underlie a great deal of the work that takes place in academic library technology and the types of projects presented at Code4Lib. We tend to work in academic libraries because we care about the collections and the people using those collections. That intrinsic motivation focuses our work.

Caring about the best way to display collections is central to successful projects. Most (though not all) of the presenters and topics came out of academic libraries, and many of the presentations dealt with creating platforms for library and archival metadata and collections. To highlight a few: Penn State University has developed its own institutional repository application called ScholarSphere that provides a better user experience for researchers and managers of the repository. The libraries and archives of the Rock and Roll Hall of Fame dealt with the increasingly common problem of wanting to present digital content alongside more traditional finding aids, and so developed a system for doing so. Corey Harper from New York University presented an extremely interesting and still experimental project to use linked data to enrich interfaces for interacting with library collections. Note that all these projects combined various pieces of open source software and library/web standards to create solutions to problems facing academic or research libraries in their particular settings. I think an important lesson for most academic librarians looking at descriptions of projects like this is that it takes more than development staff to make them happen. It takes purpose, vision, and dedication to collecting and preserving content–in other words, emotion and care. A great example of this was the presentation about DIYHistory from the University of Iowa. This project started out as an extremely low-tech solution for crowdsourcing archival transcription, but it got so popular that it required a more robust solution. They were able to adapt open source tools to meet their needs, keeping the project well within the means of most libraries (the code is here).

Another view of emotion and care came from Mark Matienzo, who did a lightning talk (his blog post gives a longer version with more details). His talk discussed the difficulties of acknowledging and dealing with the emotional content of archives, even though emotion drives interactions with materials and collections. The records provided are emotionless and affectless, despite the fact that they represent important moments in history and lives. The type of sharing of what someone “likes” on Facebook does not satisfactorily answer the question of what they care about, or represent the emotion in their lives. Mark suggested that a tool like Twine, which allows writing interactive stories, could approach the difficult question of bringing together the real with the emotional narrative that makes up experience.

One of the ways we express care for our work and for our colleagues is by taking time to be organized and consistent in code. Naomi Dushay of Stanford University Library presented best practices for code handoffs, describing excellent techniques for documenting and clarifying code and processes. One of the major takeaways is that being clear, concise, and straightforward is always preferable, however much we want to create cute names for our servers and classes. To preserve a spirit of fun, you can use the cute name and attach a description of what the item actually does.

Originally Bess Sadler, also from Stanford, was going to present with Naomi, but ended up presenting a different talk, the last one of the conference, on Creating a Commons (the full text is available here). This was a very moving look at what motivates her to create open source software and how to create better open source software projects. She used the framework of the Creative Commons licenses to discuss open source software–that it needs to be “[m]achine readable, human readable, and lawyer readable.” Machine readable means that code needs to be properly structured to allow contributions from multiple people without breaking; lawyer readable means that the project should have the correct structure and licensing to support collaboration across institutions. Bess focused particularly on the “human readable” aspect of creating communities and understanding the “hacker epistemology,” as she so eloquently put it: “[t]he truth is what works.” Part of understanding that requires being willing to reshape default expectations–for instance, the Code4Lib community developed a Code of Conduct at Bess’s urging to underline the fact that the community aims at inclusion and creating a safe space. She encouraged everyone to keep working to do better and “file bug reports” about open source communities.

This year’s Code4Lib conference was a reminder to me about why I do the work I do as an academic librarian working in a technical role. Even though I may spend a lot of time sitting in front of a computer looking at code, or workflows, or processes, I know it makes access to the collections and exploration of those collections better.


What is Node.js & why do I care?

At its simplest, Node.js is server-side JavaScript. JavaScript is a popular programming language, but it almost always runs inside a web browser. So JavaScript can, for instance, manipulate the contents of this page by being included inside <script> tags, but it doesn’t get to play around with the files on our computer or tell our server what HTML to send like the PHP that runs this WordPress blog.

Node is interesting for more than just being on the server side. It provides a new way of writing web servers while using an old UNIX philosophy. Hopefully, by the end of this post, you’ll see its potential and how it differentiates itself from other programming environments and web frameworks.

Hello, World

To start, let’s do some basic Node programming. Head over to nodejs.org and click Install.[1] Once you’ve run the installer, a node executable will be available for you on the command line. Any script you pass to node will be interpreted and the results displayed. Let’s do the classic “hello world” example. Create a new file in a text editor, name it hello.js, and put the following on its only line:

console.log('Hello, world!');

If you’ve written JavaScript before, you may recognize this already. console.log is a common debugging method which prints strings to your browser’s JavaScript console. In Node, console.log will output to your terminal. To see that, open up a terminal (on Mac, you can use Terminal.app while on Windows both cmd.exe and PowerShell will work) and navigate to the folder where you put hello.js. Your terminal will likely open in your user’s home folder; you can change directories by typing cd followed by a space and the subdirectory you want to go inside. For instance, if I started at “C:\users\me” I could run cd Documents to enter “C:\users\me\Documents”. Below, we open a terminal, cd into the Documents folder, and run our script to see its results.

$ cd Documents
$ node hello.js
Hello, world!

That’s great and all, but it leaves a lot to be desired. Let’s do something a little more sophisticated; let’s write a web server which responds “Hello!” to any request sent to it. Open a new file up, name it server.js, and write this inside:

var http = require('http');
http.createServer(handleRequest).listen(8888);
function handleRequest (request, response) {
  response.end( 'Hello!' );
}

In our terminal, we can run node server.js and…nothing happens. Our prompt seems to hang, not outputting anything but also not letting us type another command. What gives? Well, Node is running a web server and it’s waiting for requests. Open up your web browser and navigate to “localhost:8888”; the exclamation “Hello!” should appear. In four lines of code, we just wrote an HTTP server. Sure, it’s the world’s dumbest server that only says “Hello!” over and over no matter what we request from it, but it’s still an achievement. If you’re the sort of person who gets giddy at how easy this was, then Node.js is for you.

Let’s walk through server.js line-by-line. First, we import the core HTTP library that comes with Node. The “require” function is a way of loading external modules into your script, similar to require in Ruby or import in Python. The HTTP library gives us a handy “createServer” method which receives HTTP requests and passes them along to a callback function. On our 2nd line, we call createServer, pass it the function we want to handle incoming requests, and set it to listen for requests sent to port 8888. The choice of 8888 is arbitrary; we could choose any number over 1024, since operating systems restrict the lower ports, many of which are already in use by specific protocols. Finally, we define our handleRequest callback, which receives a request and a response object for each HTTP request. Those objects have many useful properties and methods, but here we simply call the response object’s end method, which sends the response and optionally accepts some data to include in it.
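
To get a feel for those objects, here is a slightly expanded handler (a sketch of my own, not part of the original example) that reports back the method and URL of whatever request it receives and sets an explicit Content-Type header:

var http = require('http');

http.createServer(function (request, response) {
  // request.method and request.url describe the incoming request.
  response.writeHead(200, { 'Content-Type': 'text/plain' });
  response.end('You sent a ' + request.method + ' request for ' + request.url);
}).listen(8888);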

The use of callback functions is very common in Node. If you’ve written JavaScript for a web browser you may recognize this style of programming; it’s the same as when you define an event listener which responds to mouse clicks, or assign a function to process the result of an AJAX request. The callback function doesn’t execute synchronously in the order you wrote it in your code; instead, it waits for some “event” to occur, whether that event is a click or an AJAX request returning data.
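
To make the pattern concrete, here is a small sketch of my own using Node’s core fs module; the file name hello.txt is just a stand-in. The callback passed to fs.readFile runs only once the file’s contents are available, while the rest of the script keeps going:

var fs = require('fs');

fs.readFile('hello.txt', 'utf8', function (err, data) {
  // This callback fires later, when the read finishes (or fails).
  if (err) {
    console.log('Something went wrong: ' + err.message);
    return;
  }
  console.log('File contents: ' + data);
});

console.log('This line prints first, while the file is still being read.');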

In our HTTP server example, we also see a bit of what makes Node different from other server-side languages like PHP, Perl, Python, and Ruby. Those languages typically work with a web server, such as Apache, which passes certain requests over to the languages and serves up whatever they return. Node is itself the server; it gives you low-level access to the inner workings of protocols like HTTP and TCP. You don’t need to run Apache and pass requests along to Node: it handles them on its own.

Who cares?

Some of you are no doubt wondering: what exactly is the big deal? Why am I reading about this? Surely, the world has enough programming languages, and JavaScript is nothing new, even server-side JavaScript isn’t that new.[2] There are already plenty of web servers out there. What need does Node.js fill?

To answer that, we must revisit the origins of Node. The best way to understand is to watch Ryan Dahl present on the impetus for creating Node. He says, essentially, that other programming frameworks are doing IO (input/output) wrong. IO comes in many forms: when you’re reading or writing to a file, when you’re querying databases, and when you’re receiving and sending HTTP requests. In all of these situations, your code asks for data…waits…and waits…and then, once it has the data, it manipulates it or performs some calculation, and then sends it somewhere else…and waits…and waits. Basically, because the code is constantly waiting for some IO operation, it spends most of its time sitting around rather than crunching digits like it wants to. IO operations are commonly the bottlenecks in programs, so we shouldn’t let our code just stop every time it performs one.
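
To illustrate the difference, compare Node’s blocking and non-blocking file reads in this sketch of my own (report.csv is a hypothetical file):

var fs = require('fs');

// Blocking: the whole program waits until the entire file has been read.
var report = fs.readFileSync('report.csv', 'utf8');
console.log('Read ' + report.length + ' characters.');

// Non-blocking: Node starts the read and immediately moves on; the
// callback runs whenever the data is ready.
fs.readFile('report.csv', 'utf8', function (err, data) {
  if (!err) {
    console.log('Read ' + data.length + ' characters.');
  }
});
console.log('Free to do other work while the file is being read.');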

Node not only has a beneficial asynchronous programming model but it has developed other advantages as well. Because lots of people already know how to write JavaScript, Node has caught on much more quickly than languages that are entirely new to developers. It reuses Google Chrome’s V8 as a JavaScript interpreter, giving it a big speed boost. Node’s package manager, NPM, is growing at a tremendous rate, far faster than its sibling package managers for Java, Ruby, and Python. NPM itself is a strong point of Node; it’s learned from other package managers and has many excellent features. Finally, other programming languages were developed to be all-purpose tools. Node, while it does share the same all-purpose utility, is really intended for the web. It’s meant for writing web servers and handling HTTP intelligently.

Node also follows many UNIX principles. Doug McIlroy succinctly summarized the UNIX philosophy as “Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.” NPM does a great job letting authors write small modules which work well together. This has been tough previously in JavaScript because web browsers have no “require” function; there’s no native way for modules to define and load their dependencies, which resulted in the popularity of large, complicated libraries.[3] jQuery is a good example; it’s tremendously popular and it includes hundreds of functions in its API, while most sites that use it really only need a few. Large, complicated programs are more difficult to test, debug, and reason about, which is why UNIX avoided them.
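
To make that concrete, here is a minimal sketch of my own showing how Node modules declare and load their dependencies; the file names are hypothetical:

// greet.js: a tiny module that does exactly one thing
module.exports = function (name) {
  return 'Hello, ' + name + '!';
};

// app.js: load the module (from the same folder) and use it
var greet = require('./greet');
console.log(greet('world'));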

Many Node modules also support streams that allow you to pipe data through a series of programs. This is analogous to how BASH and other shells let you pipe text from one command to another, with each command taking the output of the last as its input. To visualize this, see Stream Playground written by John Resig, creator of jQuery. Streams allow you to plug in different functionality when needed. This pseudocode shows how one might read a CSV from a server’s file system (the core “fs” library stands for “file system”), filter out certain rows, and send it over HTTP:

fs.createReadStream('spreadsheet.csv').pipe(filter).pipe(http);
// Want to compress the response? Just add another pipe.
fs.createReadStream('spreadsheet.csv').pipe(filter).pipe(compressor).pipe(http);
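
For the curious, here is a runnable variation of that idea (a sketch of my own which skips the hypothetical filter step and simply compresses the file before sending it):

var http = require('http');
var fs = require('fs');
var zlib = require('zlib');

http.createServer(function (request, response) {
  response.writeHead(200, { 'Content-Encoding': 'gzip' });
  // Read the CSV in chunks, gzip each chunk, and stream it to the browser.
  fs.createReadStream('spreadsheet.csv')
    .pipe(zlib.createGzip())
    .pipe(response);
}).listen(8888);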

Streams have the advantage of limiting how much memory a program uses because only small portions of data are being operated on at once. Think of the difference between copying a million-line spreadsheet all at once or line-by-line; the second is less likely to crash or run into the limit of how much data the system clipboard can hold.

Libraryland Examples

Node is still very new and there aren’t a lot of prominent examples of library usage. I’ll try to present a few, but I think Node is worth knowing about mainly as a major trend in web development.

Most amusingly, Ed Summers of the Library of Congress and Sean Hannan of Johns Hopkins University made a Cataloging Highscores page that presents original cataloging performed in WorldCat in a retro arcade-style display. This app uses the popular socket.io module that establishes a real-time connection between your browser and the server, a strength of Node. Any web service that needs to be continually updated is a prime candidate for Node.js: current news articles, social media streams, auto-complete suggestions as a user types in search terms, and chat reference all come to mind. In fact, SpringShare’s LibChat uses socket.io as well, though I can’t tell if it’s using Node on the server or PHP. A similar example of real-time updating, also by Ed Summers, is Wikistream which streams the dizzying number of edits happening on various Wikipedias through your browser.[4]
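
To give a sense of what that looks like in code, here is a minimal socket.io sketch of my own (assuming a reasonably recent version of the module; the ‘update’ event name and the message are made up for illustration):

var http = require('http');
var server = http.createServer();
var io = require('socket.io')(server);

io.on('connection', function (socket) {
  // Push a message to the browser the moment it connects; a real app
  // would emit events as new data arrives on the server.
  socket.emit('update', { message: 'Something just happened!' });
});

server.listen(8888);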

There was a lightning talk on Node at Code4Lib 2010 which mentions writing a connector to the popular Apache Solr search platform. Aaron Coburn’s proposed talk for Code4Lib 2014 mentions that Amherst is using Node to build the web front-end to their Fedora-based digital library.

Tools You Can Use

With the explosive growth of NPM, there are already tons of useful tools written in Node. While many of these are tools for writing web servers, like Express, some are command line programs you can use to accomplish a variety of tasks.

Yeoman is a scaffolding application that makes it easy to produce various web apps by giving you expert templates. You can install separate generators that produce templates for things like a Twitter Bootstrap site, a JavaScript bookmarklet, a mobile site, or a project using the Angular JavaScript MVC framework. Running yo angular to invoke the Angular generator gives you a lot more than just a base HTML file and some JavaScript libraries; it also provides a series of Grunt tasks for testing, running a development server, and building a site optimized for production. Grunt is another incredibly useful Node project, dubbed “the JavaScript task runner.” It lets you pick from hundreds of community plugins to automate tedious tasks like minifying and concatenating your scripts before deploying a website.
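
As a small taste of Grunt, here is a minimal Gruntfile sketch of my own using the grunt-contrib-uglify plugin to minify a script; the file paths are hypothetical:

// Gruntfile.js
module.exports = function (grunt) {
  grunt.initConfig({
    uglify: {
      build: {
        src: 'src/app.js',        // hypothetical source file
        dest: 'build/app.min.js'  // where the minified copy goes
      }
    }
  });

  grunt.loadNpmTasks('grunt-contrib-uglify');
  // Running "grunt" with no arguments now runs the uglify task.
  grunt.registerTask('default', ['uglify']);
};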

Finally, another tool that I like is phantomas, a Node project that works with PhantomJS to run a suite of performance tests on a site. It provides more detailed reports than any other performance tool I’ve used, telling you things like how many DOM queries ran and the median latency of HTTP requests.

Learn More

Nodeschool.io features a growing number of lessons on using Node. Better yet, the lessons are actually written in Node, so you install them with NPM and verify your results on the command line. There are several topics, from basics to using streams to working with databases.

Nettuts+, always a good place for coding tutorials, has an introduction to Node which takes you from installation to coding a real-time server. If you want to learn about writing a real-time chat application with socket.io, they have a tutorial for that, too.

If you want a broad and thorough overview, there are a few introductory books on Node, with The Node Beginner Book offering several free chapters. O’Reilly’s Node for Front-End Developers is also a good starting point.

How to Node is a popular blog with articles on various topics, though some are too in-depth for beginners. It’s a good place to head if you want to learn more about a specific topic, such as streams or working with particular databases like MongoDB.

Finally, the Node API docs are a good place to go when you get stuck using a particular core module.

Notes

  1. If you use a package manager, such as Homebrew on Mac OS X or APT on Linux, Node is likely available within it. One caveat I have noticed is that the stock Debian/Ubuntu apt-get install nodejs is a few major versions behind; you may want to add Chris Lea’s PPA to get a current version. If you’re subject to the whims of your IT department, you may need to convince them to install Node for you, or talk to your sysadmin to get it on your server. Since it’s a rather new technology, don’t be surprised if you have to explain what it is and why you want to try it out.
  2. Previous projects, including Rhino from Mozilla and Narwhal, have let people use JavaScript outside the browser. Node, however, has caught on far more than either of these projects, for some of the reasons outlined in this post.
  3. RequireJS is one project that’s trying to address this need. Native modules are also being added to the ECMAScript standard that defines JavaScript, but they’re still in draft form and it’ll be a long time before all browsers support them.
  4. If you’re curious, the code for both Cataloging Highscores and Wikistream is open source and available on GitHub.

An Experiment with Publishing on GitHub

Scholarly publishing, if you haven’t noticed, is nearing a crisis. Authors are questioning the value added by publishers. Open Access publications are growing in number and popularity. Peer review is being criticized and re-invented. Libraries are unable to pay price increases for subscription journals. Traditional measures of scholarly impact and journal rankings are being questioned while new ones are developed. Fresh business models or publishing platforms appear to spring up daily.[1]

I personally am a little frustrated with scholarly publishing, albeit for reasons not entirely related to the above. I find that most journals haven’t adapted to the digital age yet and thus are still employing editorial workflows and yielding final products suited to print.

How come I have yet to see a journal article PDF with clickable hyperlinks? For that matter, why is PDF still the dominant file format? What advantage does a fixed-width format hold over flexible, fluid-width HTML?[2] Why are raw data not published alongside research papers? Why are software tools not published alongside research papers? How come I’m still submitting black-and-white charts to publications which are primarily read online? Why are digital-only publications still bound to regular publication schedules when they could publish like blogs, as soon as the material is ready? To be fair, some journals have answered some of these questions, but the issues are still all too frequent.

So, as a bit of an experiment, I recently published a short research study entirely on GitHub.[3] I included the scripts used to generate data, the data, and an article-like summary of the whole process.

What makes it possible

Unfortunately, I wouldn’t recommend my little experiment for most scholars, except perhaps for pre- or post-prints of work published elsewhere. Why? The primary reason people publish research is for tenure review, for enhancing a CV. I won’t list my study—though, arguably, I should be able to—simply because it didn’t go through the usual scholarly publishing gauntlet. It wasn’t peer-reviewed, it didn’t appear in a journal, and it wouldn’t count for much in the eyes of traditional faculty members.

However, I’m at a community college. Research and publication are not among my position’s requirements. I’m judged on my teaching and various library responsibilities, while publications are an unnecessary bonus. Would it help to have another journal article on my CV? Yes, probably. But there’s little pressure and personally I’m more interested in experimentation than in lengthening my list of publications.

Other researchers might also worry about someone stealing their ideas or data if they begin publishing an incomplete project. For me, again, publication isn’t really a competitive field. I would be happy to see someone reuse my project, even if they didn’t give proper attribution back. Openness is an advantage, not a vulnerability.

It’s ironic that being at a non-research institution frees me up to do research. It’s done mostly in my free-time, which isn’t great, but the lack of pressure means I can play with modes of publication, or not worry about the popularity of journals I submit to. To some degree, this is indicative of structural problems with scholarly publishing: there’s inertia in that, in order to stay in the game and make a name for yourself, you can’t do anything too wild. You need to publish, and publish in the recognized titles. Only tenured faculty, who after all owe at least some of their success to the current system, can risk dabbling with new publishing models and systems of peer-review.

What’s really good

GitHub, and the web more generally, are great mediums for scholarship. They address several of my prior questions.

For one, the web is just as suited to publishing data as text. There’s no limit on file format or (practically) size. Even if I were analyzing millions of data points, I could make a compressed archive available for others to download, verify, and reuse in their own research. For my project, I used a Google Spreadsheet which allows others to download the data or simply view it on the web. The article itself can be published on GitHub Pages, which provides free hosting for static websites.

Here’s how the final study looks when published on GitHub Pages.

While my study didn’t undergo any peer review, it is open for feedback via a pull request or the “issues” queue on GitHub. Typically, peer review is a closed process. It’s not apparent what criticisms were leveled at an article, or what the authors did to address them. Having peer review out in the open not only illuminates the history of a particular article but also makes it easier to see the value being added. Luckily, there are more and more journals with open peer review, such as PeerJ which we’ve written about previously. When I explain peer review to students, I often open up the “Peer Review history” section of a PeerJ article. Students can see that even articles written by professional researchers have flaws which the reviewing process is designed to identify and mitigate.

Another benefit of open peer review, present in publishing on GitHub too, is the ability to link to specific versions of an article. This has at least two uses. First of all, it has historical value in that one can trace the thought process of the researcher. Much like original manuscripts are a source of insight for literary analyses, merely being able to trace the evolution of a journal article enables new research projects in and of itself.

Secondly, as web content can be a moving target as it is revised over time, being able to link to specific versions aids those referencing a work. Linking to a git “commit” (think a particular point in time), possibly using perma.cc or the Internet Archive to store a copy of the project as it existed then, is an elegant way of solving this problem. For instance, at one point I manually removed some data points which were inappropriate for the study I was performing. One can inspect the very commit where I did this, seeing which lines of text were deleted and possibly identifying any mistakes which were made.

I’ve also grown tired of typical academic writing. The tendency to value erudition over straightforward language, lengthy titles with the snarky half separated from the actually descriptive half by a colon, the anxiety about the particularities of citations and style manuals; all of these I could do without. Let’s write compelling, truthful content without fetishizing consistency and losing the uniqueness of our voice. I’m not saying my little study achieves much in this regard, but it was a relief to be free to write in whatever manner I found most suitable.

Finally, and most encouraging in my mind, the time to publication of a research project can be greatly reduced with new web-based means. I wrote a paper in graduate school which took almost two years to appear in a peer-reviewed journal; by the time I was given the pre-prints to review, I’d entirely forgotten about it. On GitHub, all delays were solely my fault. While it’s true (you can see so in the project’s history) that the seeds of this project were planted nearly a year ago, I started working in earnest just a few months ago and finished the writing in early October.

What’s really bad

GitHub, while a great company which has reduced the effort needed to use version control with its clean web interface and graphical applications, is not the most universally understood platform. I have little doubt that if I were to publish a study on my blog, I would receive more commentary. For one, GitHub requires an account which only coders or technologists would be likely to have already, while many comment platforms (like Disqus) build off of common social media accounts like Twitter and Facebook. Secondly, while GitHub’s “pull requests” are more powerful than comments in that they can propose changes to the actual content of a project, they’re doubtless less understood as well. Expecting scholarly publishing to suddenly embrace software development methodologies is naive at best.

As a corollary to GitHub’s rather niche appeal, my article hasn’t undergone any semblance of peer review. I put it out there; if someone spots an inaccuracy, I’ll make note of and address it, but no relevant parties will necessarily critique the work. While peer review has its problems—many intimate with the problems of scholarly publishing at large—I still believe in the value of the process. It’s hard to argue a publication has reached an objective conclusion when only a single pair of eyes has scrutinized it.

Researchers who are afraid of having their work stolen, or of publishing incomplete work which may contain errors, will struggle to accept open publishing models using tools like GitHub. Prof Hacker, in an excellent post on “Forking the Academy”, notes many cultural challenges to moving scholarly publishing towards an open source software model. Scholars may worry that forking a repository feels like plagiarism or goes against the tradition of valuing original work. To some extent, these fears may come more from misunderstandings than genuine problems. Using version control, it’s perfectly feasible to withhold publishing a project until it’s complete and to remove erroneous missteps taken in the middle of a work. Theft is just as possible under the current scholarly publishing model; increasing the transparency and speed of one’s publishing does not give license to others to take credit for it. Unless, of course, one uses a permissive license like the Public Domain.

Convincing academics that the fears above are unwarranted or can be overcome is a challenge that cannot be overstated. In all likelihood, GitHub as a platform will never be a major player in scholarly publishing. The learning curve, both technical and cultural, is simply too great. Rather, a good starting point would be to let the appealing aspects of GitHub—versioning, pull requests, issues, granular attribution of authorship at the commit level—inform the development of new, user-friendly platforms with final products that more closely resemble traditional journals. Prof Hacker, again, goes a long way towards developing this with a wish list for a powerful collaborative writing platform.

What about the IR?

The discoverability of web publications is problematic. While I’d like to think my research holds value for others’ literature reviews, it’s never going to show up while searching in a subscription database. It seems unreasonable to ask researchers, who already look in many places to compile complete bibliographies, to add GitHub to their list of commonly consulted sources. Further fracturing the scholarly publishing environment not only inconveniences researchers but it goes against the trend of discovery layers and aggregators (e.g. Google Scholar) which aim to provide a single search across multiple databases.

On the other hand, an increasing amount of research—from faculty and students alike—is conducted through Google, where GitHub projects will appear alongside pre-prints in institutional repositories. Simply being able to tweet out a link to my study, which is readable on a smartphone and easily saved to any read-it-later service, likely increases its readership over stodgy PDFs sitting in subscription databases.

Institutional repositories solve some, but not all, of the deficiencies of publishing on GitHub. Discoverability is increased because researchers at your institution may search the IR just like they do subscription databases. Furthermore, thanks to the Open Archives Initiative and the OAI-PMH standard, content can be aggregated from multiple IRs into larger search engines like OCLC’s OAIster. However, none of the major IR software players support versioned publication. Showing work-in-progress, linking to specific points in time of a work, and allowing for easy reuse are all lost in the IR.

Every publication in its place

As I’ve stated, publishing independently on GitHub isn’t for everyone. It’s not going to show up on your CV and it’s not necessarily going to benefit from the peer review process. But plenty of librarians are already doing something similar, albeit a bit less formal: we’re writing blog posts with original research or performing quick studies at our respective institutions. It’s not a great leap to put these investigations under version control and then publish them on the web. GitHub could be a valuable complement to more traditional venues, reducing the delay between when data is collected and when it’s available for public consumption. Furthermore, it’s not at all mutually exclusive with article submissions. One could gain the immediate benefit of getting one’s conclusions out there while also producing a draft of a journal article.

As scholarly publishing continues to evolve, I hope we’ll see a plethora of publishing models rather than one monolithic process replacing traditional print-based journals. Publications hosted on GitHub, or a similar platform, would sit nicely alongside open, web-based publications like PeerJ, scholarly blog/journal hybrids like In The Library with the Lead Pipe, deposits in Institutional Repositories, and numerous other sources of quality content.

Notes

  1. I think a lot of these statements are fairly well-recognized in the library community, but here’s some evidence: the recent Open Access “sting” operation (which we’ll cover more in-depth in a forthcoming post) that exposed flaws in some journals’ peer review process, altmetrics, PeerJ, other experiments with open peer review (e.g. by Shakespeare Quarterly), the serials crisis (which is well-known enough to have a Wikipedia entry), predictions that all scholarship will be OA in a decade or two, and increasing demands that scholarly journals allow text mining access all come to mind.
  2. I’m totally prejudiced in this matter because I read primarily through InstaPaper. A journal like Code4Lib, which publishes in HTML, is easy to send to read-it-later services, while PDFs aren’t. PDFs also are hard to read on smartphones, but they can preserve details like layout, tables, images, and font choices better than HTML. A nice solution is services which offer a variety of formats for the same content, such as Open Journal Systems with its ability to provide HTML, PDF, and ePub versions of articles.
  3. For non-code uses of GitHub, see our prior Tech Connect post.

Demystifying Programming

We talk quite a bit about code here at Tech Connect and it’s not unusual to see snippets of it pasted into a post. But most of us, indeed most librarians, aren’t professional programmers or full-time developers; we had to learn like everyone else. Depending on your background, some parts of coding will be easy to pick up while others won’t make sense for years. Here’s an attempt to explain the fundamental building blocks of programming languages.

The Languages

There are a number of popular programming languages: C, C#, C++, Java, JavaScript, Objective C, Perl, PHP, Python, and Ruby. There are numerous others, but this semi-arbitrary selection covers the ones most commonly in use. It’s important to know that each programming language requires its own software to run. You can write Python code into a text file on a machine that doesn’t have the Python interpreter installed, but you can’t execute it and see the results.

A lot of learners stress over which language to learn first unnecessarily. Once you’ve picked up one language, you’ll understand all of the foundational pieces listed below. Then you’ll be able to transition quickly to another language by understanding a few syntax changes: Oh, in JavaScript I write function myFunction(x) to define a function, while in Python I write def myFunction(x). Programming languages differ in other ways too, but knowing the basics of one provides a huge head start on learning the basics of any other.

Finally, it’s worth briefly distinguishing compiled versus interpreted languages. Code written in a compiled language, such as all the capital C languages and Java, must first be passed to a compiler program which then spits out an executable—think a file ending in .exe if you’re on Windows—that will run the code. Interpreted languages, like Perl, PHP, Python, and Ruby, are quicker to program in because you just pass your code along to an interpreter program which immediately executes it. There’s one fewer step: for a compiled language you need to write code, generate an executable, and then run the executable, while interpreted languages sort of skip that middle step.

Compiled languages tend to run faster (i.e. perform more actions or computations in a given amount of time) than interpreted ones, while interpreted ones tend to be easier to learn and more lenient towards the programmer. Again, it doesn’t matter too much which you start out with.

Variables

Variables are just like variables in algebra; they’re names which stand in for some value. In algebra, you might write:

x = 10 + 3

which is also valid code in many programming languages. Later on, if you used the value of x, it would be 13.

The biggest difference between variables in math and in programming is that programming variables can be all sort of things, not just numbers. They can be strings of text, for instance. Below, we combine two pieces of text which were stored in variables:

name = 'cat'
mood = ' is laughing'
both = name + mood

In the above code, both would have a value of ‘cat is laughing’. Note that text strings have to be wrapped in quotes—often either double or single quotes is acceptable—in order to distinguish them from the rest of the code. We also see above that variables can be the product of other variables.

Comments

Comments are pieces of text inside a program which are not interpreted as code. Why would you want to do that? Well, comments are very useful for documenting what’s going on in your code. Even if your code is never going to be seen by anyone else, writing comments helps you understand what’s going on if you return to a project after not thinking about it for a while.

// This is a comment in JavaScript; code is below.
number = 5;
// And a second comment!

As seen above, comments typically work by having some special character(s) at the beginning of the line which tells the programming language that the rest of the line can be ignored. Common characters that indicate a line is a comment are # (Python, Ruby), // (C languages, Java, JavaScript, PHP), and /* (CSS, multi-line blocks of comments in many other languages).

Functions

As with variables, functions are akin to those in math: they take an input, perform some calculations with it, and return an output. In math, we might see:

f(x) = (x * 3)/4

f(8) = 6

Here, the first line is a function definition. It defines how many parameters can be passed to the function and what it will do with them. The second line is more akin to a function execution. It shows that the function returns the value 6 when passed the parameter 8. This is really, really close to programming already. Here’s the math above written in Python:

def f(x):
  return (x * 3)/4

f(8)
# which returns the number 6

Programming functions differ from mathematical ones in much the same way variables do: they’re not limited to accepting and producing numbers. They can take all sorts of data—including text—process it, and then return another sort of data. For instance, virtually all programming languages allow you to find the length of a text string using a function. This function takes text input and outputs a number. The combinations are endless! Here’s how that looks in Python:

len('how long?')
# returns the number 9

Python abbreviates the word “length” to simply “len” here, and we pass the text “how long?” to the function instead of a number.

Combining variables and functions, we might store the result of running a function in a variable, e.g. y = f(8) would store the value 6 in the variable y if f(x) is the same as above. This may seem silly—why don’t you just write y = 6 if that’s what you want!—but functions help by abstracting out blocks of code so you can reuse them over and over again.

Consider a program you’re writing to manage the e-resource URLs in your catalog, which are stored in MARC field 856 subfield U. You might have a variable named num_URLs (variable names can’t have spaces, thus the underscore) which represents the number of 856 $u subfields a record has. But as you work on records, that value is going to change; rather than manually calculate it each time and set num_URLs = 3 or num_URLs = 2 you can write a function to do this for you. Each time you pass the function a bibliographic record, it will return the number of 856 $u fields, substantially reducing how much repetitive code you have to write.
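
Here is a rough sketch, in JavaScript, of what such a function might look like. The record structure is entirely hypothetical (a real program would lean on a MARC parsing library), and the sketch uses arrays and loops, which are covered below:

// A hypothetical record object whose fields are stored as a list of
// { tag, code, value } entries; real MARC data is more complex than this.
var record = {
  fields: [
    { tag: '245', code: 'a', value: "A Room of One's Own" },
    { tag: '856', code: 'u', value: 'http://example.com/ebook' },
    { tag: '856', code: 'u', value: 'http://example.com/mirror' }
  ]
};

// Count how many 856 $u subfields the record contains.
function countURLs(record) {
  var count = 0;
  for (var i = 0; i < record.fields.length; i++) {
    if (record.fields[i].tag === '856' && record.fields[i].code === 'u') {
      count = count + 1;
    }
  }
  return count;
}

num_URLs = countURLs(record);
// num_URLs is now 2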

Conditionals

Many readers are probably familiar with IFTTT, the “IF This Then That” web service which can glue together various accounts, for instance “If I post a new photo to Instagram, then save it to my Dropbox backup folder.” These sorts of logical connections are essential to programming, because often whether or not you perform a particular action varies depending on some other condition.

Consider a program which counts the number of books by Virginia Woolf in your catalog. You want to count a book only if the author is Virginia Woolf. You can use Ruby code like this:

if author == 'Virginia Woolf'
  total = total + 1
end

There are three parts here: first we specify a condition, then there’s some code which runs only if the condition is true, and then we end the condition. Without some kind of indication that the block of code inside the condition has ended, the entire rest of our program would run only if the variable author were set to the right string of text.

The == is definitely weird to see for the first time. Why two equals? Many programming languages use a variety of double-character comparisons because the single equals already has a meaning: single equals assigns a value to a variable (see the second line of the example above) while double-equals compares two values. There are other common comparisons:

  • != often means “is not equal to”
  • > and < are the typical greater or lesser than
  • >= and <= often mean “greater/lesser than or equal to”

Those can look weird at first, and indeed one of the more common mistakes (made by professionals and newbies alike!) is accidentally putting a single equals instead of a double.[1] While we’re on the topic of strange double-character equals signs, it’s worth pointing out that += and -= are also commonly seen in programming languages. These pairs of symbols respectively add or subtract a given number from a variable, so they do assign a value, but one based on the variable’s current value. For instance, above I could have written total += 1, which is identical in outcome to total = total + 1.
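
To see why the single-versus-double equals mistake is so easy to make and so troublesome, here is a small JavaScript fragment of my own:

var author = 'Ralph Ellison';

// Comparison: is author equal to 'Virginia Woolf'? False here, so nothing prints.
if (author == 'Virginia Woolf') {
  console.log('Found one!');
}

// Oops: a single equals assigns 'Virginia Woolf' to author, and the assignment
// itself counts as true, so this block always runs no matter who the author was.
if (author = 'Virginia Woolf') {
  console.log('Found one!');
}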

Lastly, conditional statements can be far more sophisticated than a mere “if this do that.” You can write code that says “if blah do this, but if bleh do that, and if neither do something else.” Here’s a Ruby script that would count books by Virginia Woolf, books by Ralph Ellison, and books by someone other than those two.

total_vw = 0
total_re = 0
total_others = 0
if author == 'Virginia Woolf'
  total_vw += 1
elsif author == 'Ralph Ellison'
  total_re += 1
else
  total_others += 1
end

Here, we set all three of our totals to zero first, then check to see what the current value of author is, adding one to the appropriate total using a three-part conditional statement. The elsif is short for “else if” and that condition is only tested if the first if wasn’t true. If neither of the first two conditions is true, our else section serves as a kind of fallback.

Arrays

An array is simply a list of values; in fact, the Python language has an array-like data type named “list.” They’re commonly denoted with square brackets, e.g. in Python a list looks like

stuff = [ "dog", "cat", "tree"]

Later, if I want to retrieve a single piece of the array, I just access it using its index wrapped in square brackets, starting from the number zero. Extending the Python example above:

stuff[0]
# returns "dog"
stuff[2]
# returns "tree"

Many programming languages also support associative arrays, in which the index values are strings instead of numbers. For instance, here’s an associative array in PHP:

$stuff = array(
  "awesome" => "sauce",
  "moderate" => "spice",
  "mediocre" => "condiment",
);
echo $stuff["mediocre"];
// prints out "condiment"

Arrays are useful for storing large groups of like items: instead of having three variables, which requires more typing and remembering names, we just have one array containing everything. While our three strings aren’t a lot to keep track of, imagine a program which deals with all the records in a library catalog, or all the search results returned from a query: having an array to store that large list of items suddenly becomes essential.

Loops

Loops repeat an action a set number of times or until a condition is met. Arrays are commonly combined with loops, since loops make it easy to repeat the same operation on each item in an array. Here’s a concise example in Python which prints every entry in the “names” array to the screen:

names = ['Joebob', 'Suebob', 'Bobob']
for name in names:
  print name

Without arrays and loops, we’d have to write:

name1 = 'Joebob'
name2 = 'Suebob'
name3 = 'Bobob'
print name1
print name2
print name3

You see how useful arrays are? As we’ve seen with both functions and arrays, programming languages like to expose tools that help you repeat lots of operations without typing too much text.

There are a few types of loops, including “for” loops and “while” loops. Our “for” loop earlier went through a whole array, printing each item out, but a “while” loop only keeps repeating while some condition is true. Here is a bit of PHP that prints out the first four natural numbers:

$counter = 1;
while ( $counter < 5 ) {
  echo $counter;
  $counter = $counter + 1;
}

Each time we go through the loop, the counter is increased by one. When it hits five, the loop stops. But be careful! If we left off the $counter = $counter + 1 line then the loop would never finish because the while condition would never be false. Infinite loops are another potential bug in a program.

Objects & Object-Oriented Programming

Object-oriented programming (oft-abbreviated OOP) is probably the toughest item in this post to explain, which is why I’d rather people see it in action by trying out Codecademy than read about it. Unfortunately, it’s not until the end of the JavaScript track that you really get to work with OOP, but it gives you a good sense of what it looks like in practice.

In general, objects are simply a means of organizing code. You can group related variables and functions under an object. You can make an object inherit properties from another one if it needs to use all the same variables and functions but also add some of its own.

For example, let’s say we have a program that deals with a series of people, each of which has a few properties like their name and age but also the ability to say hi. We can create a Person class which is kind of like a template; it helps us stamp out new copies of objects without rewriting the same code over and over. Here’s an example in JavaScript:

function Person(name, age) {
  this.name = name;
  this.age = age;
  this.sayHi = function() {
    console.log("Hi, I'm " + name + ".");
  };
}

Joebob = new Person('Joebob', 39);
Suebob = new Person('Suebob', 40);
Bobob = new Person('Bobob', 3);
Bobob.sayHi();
// prints "Hi, I'm Bobob."
Suebob.sayHi();
// prints "Hi, I'm Suebob."

Our Person function is essentially a class here; it allows us to quickly create three people who are all objects with the same structure, yet they have unique values for their name and age.[2] The code is a bit complicated and JavaScript isn’t a great example, but basically think of this: if we wanted to do this without objects, we’d end up repeating the content of the Person block of code three times over.

The efficiency gained with objects is similar to how functions save us from writing lots of redundant code; identifying common structures and grouping them together under an object makes our code more concise and easier to maintain as we add new features. For instance, if we wanted to add a myAgeIs function that prints out the person’s age, we could just add it to the Person class and then all our people objects would be able to use it.
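
For example, here is the Person class from above with a myAgeIs function added (a sketch of my own; the wording of the printed message is made up):

function Person(name, age) {
  this.name = name;
  this.age = age;
  this.sayHi = function() {
    console.log("Hi, I'm " + name + ".");
  };
  // The new function: every person created from this class gets it.
  this.myAgeIs = function() {
    console.log(name + ' is ' + age + ' years old.');
  };
}

Bobob = new Person('Bobob', 3);
Bobob.myAgeIs();
// prints "Bobob is 3 years old."

Because the change lives in one place, the Person class, any person object we create afterwards picks up the new function automatically.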

Modules & Libraries

Lest you worry that every little detail in your programs must be written from scratch, I should mention that all popular programming languages have mechanisms which allow you to reuse others’ code. Practically, this means that most projects start out by identifying a few fundamental building blocks which already exist. For instance, parsing MARC data is a non-trivial task which takes some serious knowledge both of the data structure and the programming language you’re using. Luckily, we don’t need to write a MARC parsing program on our own, because several exist already, such as pymarc for Python, ruby-marc for Ruby, and MARC::Record for Perl.

The Code4Lib wiki has an even more extensive list of options.

In general, it’s best to reuse as much prior work as possible rather than spend time working on problems that have already been solved. Complicated tasks like writing a full-fledged web application take a lot of time and expertise, but code libraries already exist for this. Particularly when you’re learning, it can be rewarding to use a major, well-developed project at first to get a sense of what’s possible with programming.

Attention to Detail

The biggest hangup for new programmers often isn’t conceptual: variables, functions, and these other constructs are all rather intuitive, especially once you’ve tried them a few times. Instead, many newcomers find out that programming languages are very literal and unyielding. They can’t read your mind and are happy to simply give up and spit out errors if they can’t understand what you’re trying to do.

For instance, earlier I mentioned that text variables are usually wrapped in quotes. What happens if I forget an end quote? Depending on the language, the program may either just tell you there’s an error or it might badly misinterpret your code, treating everything from your open quote down to the next instance of a quote mark as one big chunk of variable text. Similarly, accidentally misusing double equals or single equals or any of the other arcane combinations of mathematical symbols can have disastrous results.

Once you’ve worked with code a little, you’ll start to pick up tools that ease a lot of minor issues. Most code editors use syntax highlighting to distinguish different constructs, which aids error recognition. This very post uses a syntax highlighter for WordPress to color keywords like “function” and distinguish variable names. Other tools can “lint” your code for mistakes or code which, while technically valid, can easily lead to trouble. The text editor I commonly use does wonderful little things like provide closing quotes and parens, highlight lines which don’t pass linting tests, and enable me to test-run selected snippets of code.

There’s lots more…

Code isn’t magic; coders aren’t wizards. Yes, there’s a lot to programming and one can devote a lifetime to its study and practice. There are also thousands of resources available for learning, from MOOCs to books to workshops for beginners. With just a few building blocks like the ones described in this post, you can write useful code which helps you in your work.

Footnotes

[1] True story: while writing the very next example, I made this mistake.

[2] Functions which create objects are called constructor functions, which is another bit of jargon you probably don’t need to know if you’re just getting started.


Advice on Being a Solo Library Technologist

I am an Emerging Technologies Librarian at a small library in the middle of a cornfield. There are three librarians on staff. The vast majority of our books fit on one floor of open stacks. Being so small can pose challenges to a technologist. When I’m banging my head trying to figure out what the heck “this” refers to in a particular JavaScript function, to whom do I turn? That’s but an example of a wide-ranging set of problems:

  • Lack of colleagues with similar skill sets. This has wide-ranging ill effects, from leaving me with no one to ask questions of or bounce ideas off, to making it more difficult to sell my ideas.
  • Broad responsibilities that limit time spent on technology
  • Difficulty creating endurable projects that can be easily maintained
  • Difficulty determining which projects are appropriate to our scale

Though listservs and online sources alleviate some of these concerns, there’s a certain knack to being a library technologist at a small institution.[1] While I still have a lot to learn, I want to share some strategies that have helped me thus far.

Know Thy Allies

At my current position, it took me a long time to figure out how the college was structured. Who is responsible for managing the library’s public computers? Who develops the website? If I want some assessment data, where do I go? Knowing the responsibilities of your coworkers is vital and effective collaboration is a necessary element of being a technologist. I’ve been very fortunate to work with coworkers who are immensely helpful.

IT Support can help with both your personal workstation and the library’s setup. Remember that IT’s priorities are necessarily in tension with yours: they want to keep everything up and running, while you want to experiment and kick the tires. When IT denies a request or takes ages to fix something that seems trivial to you, remember that they’re just as overburdened as you are. Their assistance in installing and troubleshooting software is invaluable. This is a two-way street: you often have valuable insight into how users behave and what setups are most beneficial. Try to give and take, asking for favors at the same time that you volunteer your services.

Institutional Research probably goes by a dozen different names at a dozen different institutions. These names may include “Assessment Office,” “Institutional Computing,” or even the fearsome “Institutional Review Board” of research universities. These are your data collection and management people and—whether you know it or not—they have some great stuff for you. It took me far too long to browse the IR folder on our shared drive, which contains insightful survey data from the CCSSE and in-house reports. There’s a post-graduate survey which essentially says “the library here is awesome,” good to have when arguing for funding. But they also help the library work with the assessment data that our college gathers; we hope to identify struggling courses and offer our assistance.

The web designer should be an obvious contact point. Most technology is administered through the web these days—shocking, I know. The webmaster will not only be able to get you access to institutional servers but may also have learned valuable lessons from their own position. They, too, struggle to complete a wide range of tasks. They have to negotiate with many stakeholders who all want a slice of the vaunted homepage, often the subject of territorial battles. They may have a folder of good PR images or a style guide sitting around somewhere; at the very least, some O’Reilly books you’ll want to borrow.

The Learning Management System administrator is similar to the webmaster. They probably have some coding skills and carry an immense, important burden. At my college, we have a slew of educational technologists who work in the “Faculty Development Center” and preside over the LMS. They’re not only technologically savvy, often introducing me to new tools or techniques, but they know how faculty structure their courses and have a handle on pedagogical theory. Their input can not only generate new ideas but help you ground your initiatives in a solid theoretical basis.

Finally, my list of allies is obviously biased towards academic libraries. But public librarians have similar resources available; they just go by different names. Your local government has many of these same positions: data management, web developer, technology guru. Find out who they are and reach out to them. Anyone can look for local hacker/makerspaces or meetups, which can be a great way not only to develop your skills but to meet people who may have brilliant ideas and insight.

Build Sustainably

Building projects that will last is my greatest struggle. It’s not so hard to produce an intricate, beautiful project if I pour months of work into it, but what happens the month after it’s “complete”? A shortage of ideas has never been my problem; it’s finding ones that are doable. Too often, I’ll get halfway into a project and realize there’s simply no way I can handle the upkeep on top of my usual responsibilities, which stubbornly refuse to diminish. I have to staff a reference desk, teach information literacy, and make purchases for our collection. Those are important responsibilities and they often provide a platform for experimentation, but they’re also stable obligations that cannot be shirked.

One of the best ways to determine if a project is feasible is to look around at what other libraries are doing. Is there an established project—for instance, a piece of open source software with a broad community base—which you can reuse? Or are other libraries devoting teams of librarians to similar tasks? If you’re seeing larger institutions struggle to perfect something, then maybe it’s best to wait until the technology is more mature. On the other hand, dipping your toe in the water can quickly give you a sense of how much time you’ll need to invest. Creating a prototype or bringing coworkers on board at early stages lets you see how much traction you have. If others are resistant or if your initial design is shown to have gaping flaws, perhaps another project is more worthy of your time. It’s an art, but saying no, dropping a difficult initiative, or recognizing that an experiment has failed is often the right thing to do.

Documentation, Documentation, Documentation

One of the first items I accomplished on arrival at my current position was setting up a staff-side wiki on PBworks. While I’m still working on getting other staff members to contribute to it (approximately 90% of the edits are mine), it’s been an invaluable information-sharing resource. Part-time staff members in particular have noted how it’s nice to have one consistent place to look for updates and insider information.

How does this relate to technology? In the last couple of years, my institution has added or redesigned dozens of major services. I was going to write a ludicrously long list but…just trust me, we’ve changed a lot of stuff. A new technology or service cannot succeed without buy-in, and you don’t get buy-in if no one knows how to use it. You need documentation: well-written, illustrative documentation. I try to keep things short and sweet, providing screencasts and annotated images to highlight important nuances. Beyond helping others, it’s been invaluable to me as well. Remember when I said I wasn’t so great at building sustainably? Well, I’ll admit that there are some workflows or code snippets that are Greek to me each time I revisit them. Without my own instructions or blocks of comments, I would have to reverse engineer the whole process before I could complete it again.

Furthermore, not all of my fellow staff share my technical skills. I’m comfortable logging into servers, running Drush commands, and analyzing the statistics I collect. And that’s not an indictment of my coworkers; they shouldn’t need to do any of this stuff. But some of my projects rely on arcane data schemas or esoteric commands. If I were to win the lottery and promptly retire, sophisticated projects lacking documentation would grind to a halt. Instead, I try to write instructions such that anyone could log in to Drupal and apply module updates, for instance, even if they were previously unfamiliar with the CMS. I feel a lot better knowing that our bus factor is a little higher and that I can perhaps even take a vacation without checking email some day.
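
As a rough sketch, here is the kind of step-by-step note such instructions might contain; the site path is hypothetical and the exact commands depend on your Drush version, but the idea is that anyone could follow it line by line without knowing Drupal well.

$ cd /var/www/library-site   # hypothetical path to the Drupal site root
$ drush pm-update            # download available module updates and run the database updates
$ drush cc all               # clear all caches so the changes take effect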

Choose Wisely

The honest truth is that smaller institutions cannot afford to invest in every new and shiny object that crosses their path. I see numerous awesome innovations at other libraries which simply are not wise investments for a college of our size. We don’t have the scale, skills, and budget for much of the technology out there. Even open source solutions are a challenge because they require skill to configure and maintain. Everything I wrote about sustainability and allies is trying to mitigate this lack of scale, but the truth is some things are just not right for us. It isn’t helpful to build projects that only you can continue, or develop ones which require so much attention that other fundamental responsibilities (doubtless less sexy but no less important) fall through the cracks.

I record my personal activities in Remember the Milk, tagging tasks according to topic. What do you think was the tag I used most last year? Makerspace? Linked data? APIs? Node.js? Nope, it was infolit. That is hardly an “emerging” field but it’s a vital aspect of my position nonetheless.

I find that the best way to select amongst initiatives is to work backwards: what is crucial to your library? What are the major challenges, obvious issues that you’re facing? While I would not abandon pet projects entirely, because sometimes they can have surprisingly wide-ranging effects, it helps to ground your priorities properly.[2] Working on a major issue virtually guarantees that your work will attract more support from your institution. You may find more allies willing to help, or at least coworkers who are sympathetic when you plead with them to cover a reference shift or swap an instruction session because you’re overwhelmed. The big issues themselves are easy to find: user experience, ebooks, discovery, digital preservation, {{insert library school course title here}}. At my college, developmental education and information literacy are huge. It’s not hard to align my priorities with the institution’s.

Enjoy Yourself

No doubt working on your own or with relatively little support is challenging and stressful. It can be disappointing to pass up new technologies because they’re too tough to implement, or when a project fails due to one of the bullet points listed above. But being a technologist should always be fun and bring feelings of accomplishment. Try to inject a little levity and experimentation into the places where it’s least expected; who knows, maybe you’ll strike a chord.

There are also at least a couple advantages to being at a smaller institution. For one, you often have greater freedom and less bureaucracy. What a single individual does on your campus may be done by a committee (or even—the horror—multiple committees) elsewhere. As such, building consensus or acquiring approval can be a much simplified process. A few informal conversations can substitute for mountains of policies, forms, meetings, and regulations.

Secondly, workers at smaller places are more likely to be jack-of-all-trades librarians. While I’m a technologist, I wear plenty of more traditional librarian hats as well. On the one hand, that certainly means I have less time to devote to each responsibility than a specialist would; on the other, it gives me a uniquely holistic view of the library’s operations. I not only understand how the pieces fit together, but am better able to identify high-level problems affecting multiple areas of service.

I’m still working through a lot of these issues, on my own. How do you survive as a library technologist? Is it just as tough at a large institution? I’m all eyes.

Footnotes

[1]^ Here are a few of my favorite sources for being a technology librarian:

  • Listservs, particularly Code4Lib and Drupal4Lib. Drupal4Lib is a great place to be if you’re using Drupal and running into issues; there are a lot of “why won’t this work” and “how do you do X at your library” threads, and several helpful experts hang around the list.
  • For professional journals, once again Code4Lib is very helpful. ITAL is also open access and periodically good tech tips appear in C&RL News or C&RL. Part of being at a small institution is being limited to open access journals; these are the ones I read most often.
  • Google. Google is great. For answering factual questions or figuring out what the most common tool is for a particular task, a quick search can almost always turn up the answer. I’d be remiss if I didn’t mention that Google usually leads me to one of a couple excellent sources, like Stack Overflow or the Mozilla Developer Network.
  • Twitter. Twitter is great, too. I follow many innovative librarians but also leading figures in other fields.
  • GitHub. GitHub can help you find reusable code, but there’s also a librarian community and you can watch as they “star” projects and produce new repositories. I find GitHub useful as a set of instructive code; if I’m wondering how to accomplish a task, I can visit a repo that does something similar and learn from how better developers do it.

[2]^ We’ve covered managing side projects and work priorities previously in “From Cool to Useful: Incorporating hobby projects into library work.”


Coding & Collaboration on GitHub

Previously on Tech Connect we wrote about the Git version control system, walking you through “cloning” a project onto your computer, making some small changes, and committing them to the project’s history. But that post concluded on a sad note: all we could do was work by ourselves, fiddling with Git on our own computer and gaining nothing from the software’s ability to manage multiple contributors. Well, here we will return to Git to specifically cover GitHub, one of the most popular code-sharing websites around.

Git vs. GitHub

Git is open source version control software. You don’t need to rely on any third-party service to use it and you can benefit from many of its features even if you’re working on your own.

GitHub, on the other hand, is a company that hosts Git repositories on their website. If you allow your code to be publicly viewable, then you can host your repository for free. If you want to have a private repository, then you have to pay for a subscription.

GitHub layers some unique features on top of Git. There’s an Issues queue where bug reports and feature requests can be tracked and assigned to contributors. Every project has a Graphs section where interesting information, such as number of lines added and deleted over time, is charted (see the graphs for jQuery, for instance). You can create gists which are mini-repositories, great for sharing or storing snippets of useful code. There’s even a Wiki feature where a project can publish editable documentation and examples. All of these nice features build upon, but ultimately have little to do with, Git.

Collaboration

GitHub is so successful because of how well it facilitates collaboration. Hosted version control repositories are nothing new; SourceForge has been doing this since 1999, almost a decade prior to GitHub’s founding in 2008. But something about GitHub has struck a chord and it has spread like wildfire. Depending on how you count, it’s the most popular collection of open source code, surpassing SourceForge and Google Code.[1] The New York Times profiled co-founder Tom Preston-Werner. It’s inspired spin-offs, like Pixelapse which has been called “GitHub for Photoshop” and Docracy which TechCrunch called “GitHub for legal documents.” In fact, just like the phrase “It’s Facebook for {{insert obscure user group}}” became a common descriptor for up-and-coming social networks, “It’s GitHub for {{insert non-code document}}” has become commonplace. There are many inventive projects which use GitHub as more than just a collection of code (more on this later).

Perhaps GitHub’s popularity is due to Git’s own popularity, though similar sites host Git repositories too.[2] Perhaps the GitHub website simply implements better features than its competitors. Whatever the reason, it’s certain that GitHub does a marvelous job of allowing multiple people to manage and work on a project.

Fork It, Bop It, Pull It

Let’s focus on two nice features of GitHub—Forking and the Pull Request [3]—to see exactly why GitHub is so great for collaboration.

If you recall our prior post on Git, we cloned a public repository from GitHub and made some minor changes. Then, when reviewing the results of git log, we could see that our changes were present in the project’s history. That’s great, but how would we go about getting our changes back into the original project?

For the actual step-by-step process, see the LibCodeYear GitHub Project’s instructions. There are basically only two changes from our previous process, one at the very beginning and one at the end.

GitHub's Fork Button

First, start by forking the repository you want to work on. To do so, set up a GitHub account, sign in, visit the repository, and click the Fork button in the upper right. After a pretty sweet animation of a book being scanned, a new project (identical to the original in both name and files) will appear on your GitHub account. You can then clone this forked repository onto your local computer by running git clone on the command line and supplying the URL listed on GitHub.

Now you can do your editing. This part is the same as using Git without GitHub. As you change files and commit changes to the repository, the history of your cloned version and the one on your GitHub account diverge. By running git push you “push” your local changes up to GitHub’s remote server. Git will prompt you for your GitHub password, which can get annoying after a while so you may want to set up an SSH key on GitHub so that you don’t need to type it in each time. Once you’ve pushed, if you visit the repository on GitHub and click the “commits” tab right above the file browser, you can see that your local changes have been published to GitHub. However, they’re still not in the original repository, which is underneath someone else’s account. How do you add your changes to the original account?
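
Before answering that, here is a rough sketch of the fork-edit-push steps so far on the command line; the account and repository names are placeholders, not a real project.

$ git clone https://github.com/yourname/some-project.git   # clone your fork, not the original repository
$ cd some-project
$ # ...edit a file or two...
$ git add README.md                                        # stage the file you changed
$ git commit -m "fix a typo in the README"                 # record the change in your local history
$ git push origin master                                   # publish the commit to your fork on GitHub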

GitHub's Pull Request Button

In your forked repository on GitHub, something is different: there’s a Pull Request button in the same upper right area where the Fork one is. Click that button to initiate a pull request. After you click it, you can choose which branches on your GitHub repository to push to the original GitHub repository, as well as write a note explaining your changes. When you submit the request, a message is sent to the project’s owners. Part of the beauty of GitHub is in how pull requests are implemented. When you send one, an issue is automatically opened in the receiving project’s Issues queue. Anyone with a GitHub account can comment on public pull requests, connecting them to open issues (e.g. “this fixes bug #43”) or calling upon other contributors to review the request. Then, when the request is approved, its changes are merged into the original repository.

diagram of forking & pulling on GitHub

“Pull Request” might seem like a strange term. “Push” is the name of the command that takes commits from your local computer and adds them to some remote server, such as your GitHub account. So shouldn’t it be called a “push request” since you’re essentially pushing from your GitHub account to another one? Think of it this way: you are requesting that your changes be pulled (e.g. the git pull command) into the original project. Honestly, “push request” might be just as descriptive, but for whatever reason GitHub went with “pull request.”
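
One way to make the name feel natural: a maintainer who wants to review your changes before merging them can literally pull them, treating your fork as just another remote. A minimal sketch, again with placeholder names:

$ git remote add contributor https://github.com/yourname/some-project.git   # register the fork as a remote
$ git pull contributor master                                               # fetch and merge its changes, a literal pull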

GitHub Applications

While hopefully we’ve convinced you that the command line is a fine way to do things, GitHub also offers Mac and Windows applications. These apps are well-designed and turn the entire process of creating and publishing a Git repository into a point-and-click affair. For instance, here is the fork-edit-pull request workflow from earlier except done entirely through a GitHub app:

  • Visit the original repository’s page, click Fork
  • On your repository’s page, select “Clone in Mac” or “Clone in Windows” depending on which OS you’re using. The repository will be cloned onto your computer
  • Make your changes and then, when you’re ready to commit, open up the GitHub app, selecting the repository from the list of your local ones
  • Type in a commit message and press Commit
    writing a commit message in GitHub for Windows
  • To sync changes with GitHub, click Sync
  • Return to the repository on GitHub, where you can click the Pull Request button and continue from there

GitHub without the command line, amazing! You can even work with local Git repositories, using the app to do commits and view previous changes, without ever pushing to GitHub. This is particularly useful on Windows, where installing Git can involve a few more hurdles. Since the GitHub for Windows app comes bundled with Git, a simple installation and login can get you up-and-running. The apps also make the process of pushing a local repository to GitHub incredibly easy, whereas there are a few steps otherwise. The apps’ visual display of “diffs” (differences in a file between versions, with added and deleted lines highlighted) and handy shortcuts to revert to particular commits can appeal even to those of us who love the command line.

viewing a diff in GitHub for Windows

More than Code

In my previous post on Git, I noted that version control has applications far beyond coding. GitHub hosts a number of inventive projects that demonstrate this.

  • The Code4Lib community hosts an Antiharassment Policy on GitHub. Those in support can simply fork the repository and add their name to a text file, while the policy’s entire revision history is present online as well
  • The city of Philadelphia experimented with using GitHub for procurements with successful results
  • ProfHacker just wrapped up a series on GitHub, ending by discussing what it would mean to “fork the academy” and combine scholarly publishing with forking and pull requests
  • The Jekyll static-site generator makes it possible to generate a blog on GitHub
  • The Homebrew package manager for Mac makes extensive use of Git to manage the various formulae for its software packages. For instance, if you want to roll back to a previous version of an installed package, you run brew versions $PACKAGE where $PACKAGE is the name of the package. That command prints a list of Git commits associated with older versions of the package, so you can enter the Homebrew repository and run a Git command like git checkout 0476235 /usr/local/Library/Formula/gettext.rb to get the installation formula for version 0.17 of the gettext package.

These wonderful examples aside, GitHub is not a panacea for coding, collaboration, or any of the problems facing libraries. GitHub can be an impediment to those who are intimidated or simply not sold on the value of learning what’s traditionally been a software development tool. On the Code4Lib listserv, it was noted that the small number of signatories on the Antiharassment Policy might actually be due to its being hosted on GitHub. I struggle to sell people on my campus on the value of Google Docs and its collaborative editing features. So, as much as I’d like the Strategic Plan the college is producing to be on GitHub where everyone could submit pull requests and comment on commits, it’s not necessarily the best platform. It is important, however, not to think of it as limited purely to versioning code written by professional developers. GitHub has uses for amateurs and non-coders alike.

Footnotes

[1]^ GitHub Has Passed SourceForge, (June 2, 2011), ReadWrite.

[2]^ Previously-mentioned SourceForge also supports Git, as does Bitbucket.

[3]^ I think this would make an excellent band name, by the way.


A Librarian’s Guide to OpenRefine

Academic librarians working in technical roles may rarely see stacks of books, but they doubtless see messy digital data on a daily basis. OpenRefine is an extremely useful tool for dealing with this data without sophisticated scripting skills and with a very low learning curve. Once you learn a few tricks with it, you may never need to force a student worker to copy and paste items into Excel spreadsheets.

As this comparison by the creator of OpenRefine shows, the best use for the tool is to explore and transform data, and it allows you to make edits to many cells and rows at once while still seeing your data. This allows you to experiment and undo mistakes easily, which is a great advantage over databases or scripting where you can’t always see what’s happening or undo the typo you made. It’s also a lot faster than editing cell by cell like you would do with a spreadsheet.

Here’s an example of a project that took me hours in a spreadsheet but far less time when I redid it in Google Refine. One of the quickest things to do with OpenRefine is spot words or phrases that are almost the same, and possibly are the same thing. Recently I needed to turn a large export of data from the catalog into data that I could load into my institutional repository. There were only certain allowed values that could be used in the controlled vocabulary in the repository, so I had to modify the bibliographic data from the catalog (which was of course in more or less proper AACR2 style) to match the vocabularies available in the repository. The problem was that the data I had wasn’t consistent–there were multiple types of abbreviations, extra spaces, extra punctuation, and outright misspellings. An example is the History Department. I can look at “Department of History”, “Dep. of History”, “Dep of Hist.” and tell these are probably all referring to the same thing, but it’s difficult to predict those potential spellings. While I could deal with much of this with regular expressions in a text editor and find and replace in Excel, I kept running into additional problems that I couldn’t spot until I got an error. It took several attempts at loading the data before I cleared out all the errors.

In OpenRefine this is a much simpler task, since you can use it to find everything that is probably the same thing despite slight differences in spelling, punctuation, and abbreviation. So rather than trying to write a regular expression that accounts for all the differences between “Department of History”, “Dep. of History”, and “Dep of Hist.”, you can find all the clusters of text that include those elements and change them all in one shot to “History”. I will have more detailed instructions on how to do this below.

Installation and Basics

OpenRefine was called, until last October, Google Refine, and while the content from the Google Refine page is being moved to the Open Refine page you should plan to look at both sites. Documentation and video tutorials refer interchangeably to Google Refine and OpenRefine. The official and current documentation is on the OpenRefine GitHub wiki. For specific questions you will probably want to use the OpenRefine Custom Search Engine, which brings together all the mix of documentation and tutorials on the web. OpenRefine is a web app that runs on your computer, so you don’t need an internet connection to run it. You can get the installation instructions on this page.

While you can jump in right away and get started playing around, it is well worth your time to watch the tutorial videos, which will cover the basic actions you need to take to start working with data. As I said, the learning curve is low, but not all of the commands will make sense until you see them in action. These videos will also give you an idea of what you might be able to do with a data set you have lying around. You may also want to browse the “recipes” on the OpenRefine site, as well as search online for additional interesting things people have done. You will probably think of more ideas about what to try. The most important thing to know about OpenRefine is that you can undo anything, and go back to the beginning of the project before you messed up.

A basic understanding of the Google Refine Expression Language, or GREL, will improve your ability to work with data. There isn’t a whole lot of detailed documentation, so you should feel free to experiment and see what happens when you try different functions. You will see from the tutorial videos the basics you need to know. Another essential tool is regular expressions. So much of the data you will be starting with is structured data (even if it’s not perfectly structured) that you will need to turn into something else. Regular expressions help you find patterns which you can use to break apart strings into something else. Spending a few minutes understanding regular expression syntax will save hours of inefficient find and replace. There are many tutorials–my go-to source is this one. The good news for librarians is that if you can construct a Dewey Decimal call number, you can construct a regular expression!

Some ideas for librarians


(A) Typos

Above I described how you would use OpenRefine to clean up messy and inconsistent catalog data. Here’s how to do it. Load in the data, and select “Text Facet” on the column in question. OpenRefine will show clusters of text that are similar and probably represent the same thing.

AcademicDept Text Facet


Click on Cluster to get a menu for working with multiple values. You can click on the “Merge” check box and then edit the text to whatever you need it to be. You can also edit each text cluster to be the correct text.

Cluster and Edit


You can merge and re-cluster until you have fixed all the typos. Back on the first Text Facet, you can hover over any value to edit it. That way, even if the automatic clustering misses some values, you can edit the errors, or change anything that is correct but needs to look different: for instance, change “Dept. of English” to just “English”.

(B) Bibliographies

The main thing that I have used OpenRefine for in my daily work is to change a bibliography in plain text into columns in a spreadsheet that I can run against an API. This was inspired by this article in the Code4Lib Journal: “Using XSLT and Google Scripts to Streamline Populating an Institutional Repository” by Stephen X. Flynn, Catalina Oyler, and Marsha Miles. I wanted to find a way to turn a text CV into something that would work with the SHERPA/RoMEO API, so that I could find out which past faculty publications could be posted in the institutional repository. Since CVs are lists of data presented in a structured format but with some inconsistencies, OpenRefine makes it very easy to present the data in a certain way as well as remove the inconsistencies, and then to extend the data with a web service. What follows is a very basic set of instructions for accomplishing this.

The main thing to accomplish is to put the journal title in its own column. Here’s an example citation in APA format, in which I’ve colored all the “separator” punctuation in red:

Heller, M. (2011). A Review of “Strategic Planning for Social Media in Libraries”. Journal of Electronic Resources Librarianship, 24 (4), 339-240)

From the drop-down menu at the top of the column, choose “Edit column” and then “Split into several columns…”. You will get a menu like the one below. This example finds the opening parenthesis and removes it in creating a new column. The author’s name is its own column, and the rest of the text is in another column.

Split into columns

The rest of the columns work the same way: find the next text, punctuation, or spacing that indicates a separation. You can then rename each column to something that makes sense. In the end, you will end up with something like this:

Split columns

When you have the journal titles separate, you may want to cluster the text to make sure that the journals have consistent titles, or do anything else needed to clean up the titles. Now you are ready to build on this data by fetching data from a web service. The third video tutorial posted above will explain the basic idea, and this tutorial is also helpful. Use the pull-down menu at the top of the journal column to select “Edit column” and then “Add column by fetching URLs…”. You will get a box that will help you construct the right URL. You need to format your URL in the way required by SHERPA/RoMEO, and will need a free API key. For the purposes of this example, you can use 'http://www.sherpa.ac.uk/romeo/api29.php?ak=[YOUR API KEY HERE]&qtype=starts&jtitle=' + escape(value,'url'). Note that it will give you a preview to see if the URL is formatted in the way you expect. Give your column a name, and set the Throttle delay, which will keep the service from rejecting too many requests in a short time. I found 1000 milliseconds worked fine.

refine7
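
If you want to see what the service returns before pointing OpenRefine at it, you can request a single journal title from the command line. This is just a rough sanity check using the same URL format and API key placeholder as above, with the journal from the sample citation filled in:

$ curl -g 'http://www.sherpa.ac.uk/romeo/api29.php?ak=[YOUR API KEY HERE]&qtype=starts&jtitle=Journal%20of%20Electronic%20Resources%20Librarianship'   # -g stops curl from treating the square brackets in the placeholder as a globbing pattern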

After this runs, you will get a new column with the XML returned by SHERPA/RoMEO. You can use this to pull out anything you need, but for this example I want to get pre-archiving and post-archiving policies, as well as the conditions. A quick way to do this is to use the Google Refine Expression Language parseHtml() function. To use this, click on “Add column based on this column” from the “Edit Column” menu, and you will get a menu to fill in an expression.

refine91

In this example I use the code value.parseHtml().select("prearchiving")[0].htmlText(), which selects just the text from within the prearchiving element. Conditions are a little different, since there are multiple conditions for each journal. In that case, you would use the following syntax (after join you can put whatever separator you want): forEach(value.parseHtml().select("condition"),v,v.htmlText()).join(". ")

So in the end, you will end up with a neatly structured spreadsheet from your original CV with all the bibliographic information in its own column and the publisher conditions listed. You can imagine the possibilities for additional APIs to use–for instance, the WorldCat API could help you determine which of the books published by your faculty the library owns.

Once you find a set of actions that gets your desired result, you can save them for the future or to share with others. Click on Undo/Redo and then the Extract option. You will get a description of the actions you took, plus those actions represented in JSON.

refine13

Unselect the checkboxes next to any mistakes you made, and then copy and paste the text somewhere you can find it again. I have the full JSON for the example above in a Gist here. Make sure that if you save your JSON publicly you remove your personal API key! When you want to run the same recipe in the future, click on the Undo/Redo tab and then choose Apply. It will run through the steps for you. Note that if you have a mistake in your data you won’t catch it until it’s all finished, so make sure that you check the formatting of the data before running this script.

Learning More and Giving Back

Hopefully this quick tutorial got you excited about OpenRefine and thinking about what you can do. I encourage you to read through the list of External Resources to get additional ideas, some of which are library related. There is lots more to learn and lots of recipes you can create to share with the library community.

Have you used OpenRefine? Share how you’ve used it, and post your recipes.



Revisiting PeerJ

A few months ago as part of a discussion on open peer review, I described the early stages of planning for a new type of journal, called PeerJ. Last month, on February 12, PeerJ launched with its first 30 articles. By last week, the journal had published 53 articles. There are a number of remarkable attributes of the journal so far, so in this post I want to look at what PeerJ is actually doing, and some lessons that academic libraries can take away–particularly for those who are getting into publishing.

What PeerJ is Doing

On the opening day blog post (since there are no editorials or issues in PeerJ, communication from the editors has to be done via blog post[1]), the PeerJ team outlined their mission under four headings: to make their content open and help to make that standard practice, to practice constant innovation, to “serve academia”, and to make this happen at minimal cost to researchers and no cost to the public. The list of advisory board and academic editors is impressive–it is global and diverse, and includes some big names and Nobel laureates. To someone judging the quality of the work likely to be published, this is a good sign. The members of PeerJ range across disciplines, with the majority in Molecular Biology. To submit and/or publish work requires a fee, but there is a free plan that allows one pre-print to be posted on the forthcoming PeerJ PrePrints.

PeerJ’s publication methods are based on PLoS ONE, which publishes articles based on scientific and methodological soundness, with no emphasis placed on subjective measures of novelty or interest (see more on this). Like all peer-reviewed journals, articles are sent to an academic editor in the field, who then sends the article to peer reviewers. Everything is kept confidential until the article actually is published, but authors are free to talk about their work in other venues like blogs.

Look and Feel
PeerJ on an iPhone size screen


There are several striking differences between PeerJ and standard academic journals. The home page of the journal emphasizes striking visuals and is responsive to devices, so the large image scales to a small screen for easy reading. The “timeline” display emphasizes new and interesting content.[2] The code they used to make this all happen is available openly on the PeerJ GitHub account. The design of the page reflects best practices for non-profit web design, as described by the non-profit social media guide Nonprofit Tech 2.0. The page tells a story, makes it easy to get updates, works on all devices, and integrates social media. The design of the page has changed iteratively even in the first month to reflect the realities of what was actually being published and how people were accessing it.[3] PDFs of articles were designed to be readable on screens, especially tablets, so rather than trying to fit as much text as possible on one page as many journal PDFs do, they have single columns with left margins, fewer words per line, and references hyperlinked in the text.[4]

How Open Peer Review Works

One of the most notable features of PeerJ is open peer review. This is not mandatory, but approximately half the reviewers and authors have chosen to participate.[5] This article is an example of open peer review in practice. You can read the original article, the (in this case anonymous) reviewer’s comments, the editor’s comments, and the author’s rebuttal letter. Anyone who has submitted an article to a peer-reviewed journal before will recognize this structure, but if you have not, this might be an exciting glimpse of something you have never seen before. As a non-scientist, I personally find this more useful as a didactic tool to show the peer review process in action, but I can imagine how helpful it would be to see this process for articles about areas of library science in which I am knowledgeable.

With only 53 articles published and the journal in existence for such a short time, it is difficult to measure what impact open peer review has on articles, or to generalize about which authors and reviewers choose an open process. So far, however, PeerJ reports that several authors have been very positive about their experience publishing with the journal. The speed of review is very fast, and reviewers have been constructive and kind in their language. One author goes into more detail in his original post: “One of the reviewers even signed his real name. Now, I’m not totally sure why they were so nice to me. They were obvious experts in the system that I studied …. But they were nice, which was refreshing and encouraging.” He also points out that the exciting thing about PeerJ for him is that all it requires are projects that were technically well-executed and carefully described, which encourages publication of negative or unexpected results and thus avoids the file drawer effect.[6]

This last point is perhaps the most important to note. We often talk of peer-reviewed articles as being particularly significant and “high-impact.” But in the case of PeerJ, the impact is not necessarily due to the results of the research or the type of research, but to the fact that it was well done. One great example of this is the article “Significant Changes in the Skin Microbiome Mediated by the Sport of Roller Derby”.[7] This was a study about the transfer of bacteria during roller derby matches, and the study was able to prove its hypothesis that contact sports are a good environment in which to study movements of bacteria among people. The (very humorous) review history indicates that the reviewers were positive about the article, and felt that it had promise for setting a research paradigm. (Incidentally, one of the reviewers remained anonymous, since he/she felt that this could “[free] junior researchers to openly and honestly critique works by senior researchers in their field,” and signed the letter “Diligent but human postdoc reviewer”.) This article was published at the beginning of March, already has 2,307 unique visits to the page, and has been shared widely on social media. We can assume that one of the motivations for sharing this article was the potential for roller derby jokes or similar, but will this ultimately make the article’s long-term impact stronger? This will be something to watch.

What Can Academic Libraries Learn?

A recent article in In the Library With the Lead Pipe discussed the open ethos in two library publications, In the Library With the Lead Pipe and Code4Lib Journal.[8] This article concluded that more LIS publications need to open the peer review process, though the publications mentioned are not peer reviewed in the traditional sense. There are very few, if any, open peer reviewed publications of the nature of PeerJ outside the sciences. Could libraries or library-related publications match this process? Would they want to?

I think we can learn a few things from PeerJ. First, the rapid publication cycle means that more work is getting published more quickly. This is partly because they have so many reviewers that no one reviewer is overburdened–and due to their membership model, it is in the best financial interests of potential future authors to be current reviewers. As In the Library With the Lead Pipe points out, a central academic library journal, College & Research Libraries, is now open access and early content is available as pre-prints, but those pre-prints reflect content that in some cases will not be published for well over a year. A year is a long time to wait, particularly for work that looks at current technology. Information Technology and Libraries (ITAL), the LITA journal, is also open access and provides pre-prints as well–but this page appears to be out of date.

Another thing we can learn is making reading easier and more convenient while still maintaining a professional appearance and clean visuals. Blogs like ACRL Tech Connect and In the Library with the Lead Pipe deliver quality content fairly quickly, but look like blogs. Journals like the Journal of Librarianship and Scholarly Communication have a faster turnaround time for review and publication (though it can still take several months), but even this online journal is geared for a print world. Viewing an article requires downloading a PDF with text presented in two columns–hardly the ideal online reading experience. In these cases, the publication is somewhat at the mercy of the platform (WordPress in the former, BePress Digital Commons in the latter), but as libraries become publishers, they will have to develop platforms that meet the needs of modern researchers.

A question put to the ACRL Tech Connect contributors about preferred reading methods for articles suggests that there is no one right answer, and so the safest course is to release content in a variety of formats or make it flexible enough for readers to transform to a preferred format. A new journal to watch is Weave: Journal of Library User Experience, which will use the Digital Commons platform but present content in innovative ways.[9] Any libraries starting new journals or working with their campuses to create new journals should be aware of who their readers are and make sure that the solutions they choose work for those readers.


  1. “The Launch of PeerJ – PeerJ Blog.” Accessed February 19, 2013. http://blog.peerj.com/post/42920112598/launch-of-peerj.
  2. “Some of the Innovations of the PeerJ Publication Platform – PeerJ Blog.” Accessed February 19, 2013. http://blog.peerj.com/post/42920094844/peerj-functionality.
  3. http://blog.peerj.com/post/45264465544/evolution-of-timeline-design-at-peerj
  4. “The Thinking Behind the Design of PeerJ’s PDFs.” Accessed March 18, 2013. http://blog.peerj.com/post/43558508113/the-thinking-behind-the-design-of-peerjs-pdfs.
  5. http://blog.peerj.com/post/43139131280/the-reception-to-peerjs-open-peer-review
  6. “PeerJ Delivers: The Review Process.” Accessed March 18, 2013. http://edaphics.blogspot.co.uk/2013/02/peerj-delivers-review-process.html.
  7. Meadow, James F., Ashley C. Bateman, Keith M. Herkert, Timothy K. O’Connor, and Jessica L. Green. “Significant Changes in the Skin Microbiome Mediated by the Sport of Roller Derby.” PeerJ 1 (March 12, 2013): e53. doi:10.7717/peerj.53.
  8. Ford, Emily, and Carol Bean. “Open Ethos Publishing at Code4Lib Journal and In the Library with the Lead Pipe.” In the Library with the Lead Pipe (December 12, 2012). http://www.inthelibrarywiththeleadpipe.org/2012/open-ethos-publishing/.
  9. Personal communication with Matthew Reidsma, March 19, 2013.

How to Git

We have written about version control before at Tech Connect, most notably John Fink’s excellent overview of modern version control. But getting started with VC (I have to abbreviate it because the phrase comes up entirely too much in this post) is intimidating. If you are generally afraid of anything that reminds you of the DOS Prompt, you’re not alone and you’re also totally capable of learning Git.

DOS prompt madness

By the end of this post, we will still not understand what’s going on here.

But why should you learn Git?

Because Version Control Isn’t Just for Nerds

OK, never mind, it is, it totally is. But VC is for all kinds of nerds, not just l33t programmers lurking in windowless offices.

Are you into digital preservation and/or personal archiving? Then VC is your wildest dream. It records your changes in meaningful chunks, documenting not just the final product but all the steps it took you to get there. VC repositories show who did what, too. If you care about nerdy things like provenance, then you care about VC. If co-authors always used VC for their writing, we’d know all the answers to the truly pressing questions, like whether Gilles Deleuze or Félix Guattari wrote the passage “A concept is a brick. It can be used to build a courthouse of reason. Or it can be thrown through the window.”

Are you a web developer? Then knowing Git can get you on GitHub, and GitHub is an immense warehouse of awesomeness. Sure, you can always just download .zip files of other people’s projects, but GitHub also provides more valuable opportunities: you can showcase your awesome tools, your brilliant tweaks to other people’s projects, and you can give back to the community at whatever level you’re comfortable with, from filing bug reports to submitting actual code fixes.

Are you an instruction librarian? Have you ever shared lesson plans, or edited other people’s lesson plans, or inherited poorly documented lesson plans? Basically, have you been an instruction librarian in the past century? Well, I have good news for you: Git can track any text file, so your lessons can easily be versioned and collaborated upon just like software programs are. Did you forget that fun intro activity you used two years ago? Look through your repository’s previous commits to find it. Want to maintain several similar but slightly different lesson plans for different professors teaching the same class? You’ve just described branching, something that Git happens to be great at. The folks over at ProfHacker have written a series of articles on using Git and GitHub for collaborative writing and syllabus design.
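
A minimal sketch of that branching idea, with made-up branch and file names:

$ git branch prof-smith                     # create a branch for one professor's version of the lesson
$ git checkout prof-smith                   # switch over to that branch
$ # ...edit lesson-plan.mdown for this section...
$ git commit -am "tweak the intro activity for Prof. Smith's section"   # this commit lives only on the new branch
$ git checkout master                       # switch back; the original lesson plan is untouched here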

Are you a cataloger? Versioning bibliographic records makes a lot of sense. A presentation at last year’s Code4Lib conference talked not only about versioning metadata but data in general, concluding that the approach had both strengths and weaknesses. It’s been proposed that putting bibliographic records under VC solves some of the issues with multiple libraries creating and reusing them.

As an added bonus, having a record’s history can enable interesting analyses of how metadata changes over time. There are powerful tools that take a Git repository’s history and create animated visualizations; to see this in action, take a look at the visualization of Penn State’s ScholarSphere application. Files are represented as nodes in a network map while small orbs which represent individual developers fly around shooting lasers at them. If we want to be a small orb that shoots lasers at nodes, and we definitely do, we need to learn Git.

Alright, so now we know Git is great, but how do we learn it?

It’s As Easy As git rebase -i 97c9d7d

Actually, it’s a lot easier. The author doesn’t even know what git rebase does, and yet here he is lecturing to you about Git.

First off, we need to install Git like any other piece of software. Head over to the official Git website’s downloads page and grab the version for your operating system. The process is pretty straightforward but if you get stuck, there’s also a nice “Getting Started – Installing Git” chapter of the excellent Pro Git book which is hosted on the official site.

Alright, now that you’ve got Git installed it’s time to start VCing the heck out of some text files. It’s worth noting that there are software packages that put a graphical interface on top of Git, such as Tower and GitHub’s apps for Windows and Mac. There’s a very comprehensive list of graphical Git software on the official Git website. But the most cross-platform and surefire way to understand Git and be able to access all of its features is with the command line so that’s what we’ll be using.

So enough rambling, let’s pop open a terminal (Mac and Linux both have apps simply called “Terminal” and Windows users can try the Git Bash terminal that comes with the Git installer) and make it happen.

$ git clone https://github.com/LibraryCodeYearIG/Codeyear-IG-Github-Project.git
Cloning into 'Codeyear-IG-Github-Project'...
remote: Counting objects: 115, done.
remote: Compressing objects: 100% (73/73), done.
remote: Total 115 (delta 49), reused 108 (delta 42)
Receiving objects: 100% (115/115), 34.38 KiB, done.
Resolving deltas: 100% (49/49), done.
$ cd Codeyear-IG-Github-Project/


The $ above is meant to indicate our command prompt, so anything beginning with a $ is something we’re typing. Here we “cloned” a project from a Git repository existing on the web (the first command), which caused Git to give us a little information in return. All Git commands begin with git and most provide useful info about their usage or results. With the second command, we’ve moved inside the project’s folder by “changing directory.”

We now have a Git repository on our computer, if you peek inside the folder you’ll see some text (specifically Markdown) files and an image or two. But what’s more: we have the project’s entire history too, pretty much every state that any file has been in since the beginning of time.

OK, since the beginning of the project, but still, is that not awesome? Oh, you’re not convinced? Let’s look at the project’s history.

$ git log
commit b006c1afb9acf78b90452b284a111aed4daee4ca
Author: Eric Phetteplace <phette23@gmail.com>
Date:   Fri Mar 1 15:27:47 2013 -0500

    a couple more links, write Getting Setup section

commit 83d92e4a1be0fdca571012cb39f84d86b21121c6
Author: Eric Phetteplace <phette23@gmail.com>
Date:   Fri Feb 22 01:04:24 2013 -0500

    link up the YouTube video


We can hit Q to exit the log. In the log, we see the author, date, and a brief description of each change. The terrifying random gibberish which follows the word “commit” is a hash, which is computer science speak for terrifying random gibberish. Think of it as a unique ID for each change in the project’s history.
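
The nice thing is that the first several characters of that gibberish are enough to refer to a commit. For example, to see the full details of the first commit in the log above, including its line-by-line changes, we could run something like:

$ git show b006c1a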

OK, so we can see previous changes (“commits” in VC-speak, which is like Newspeak but less user friendly), we can even revert back to previous states, but we won’t do that for now. Instead, let’s add a new change to the project’s history. First, we open up the “List of People.mdown” file in the Getting Started folder and add our name to the list. Now the magic sauce.

$ git status
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#   modified:   Getting Started/List of People.mdown
#
no changes added to commit (use "git add" and/or "git commit -a")
$ git add "Getting Started/List of People.mdown"
$ git status
# On branch master
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#   modified:   Getting Started/List of People.mdown
#
$ git commit -m "adding my name"
$ git status
# On branch master
nothing to commit, working directory clean
$ git log
commit wTf1984doES8th1s3v3Nm34NWtf2666bAaAaAaAa
Author: Awesome Sauce <awesome@sau.ce>
Date:   Wed Mar 13 12:30:35 2013 -0500

    adding my name

commit b006c1afb9acf78b90452b284a111aed4daee4ca
Author: Eric Phetteplace <phette23@gmail.com>
Date:   Fri Mar 1 15:27:47 2013 -0500

    a couple more links, write Getting Setup section


Our change is in the project’s history! Isn’t it better than seeing your name on the Hollywood Walk of Fame? Here’s precisely what we did:

First we asked for the status of the repository, which is an easy way of seeing what changes you’re working on and how far along they are to being added to the history. We’ll run status throughout this procedure to watch how it changes. Then we added our changes; this tells Git “hey, these are a deliberate set of changes and we’re ready to put them in the project’s history.” It may seem like an unnecessary step but adding select sets of files can help you segment your changes into meaningful, isolated chunks that make sense when viewing the log later. Finally, we commit our change and add a short description inside quotes. This finalizes the change, which we can see in the log command’s results.

I’m Lonely, So Lonely

Playing around with Git on our local computer can be fun, but it sure gets lonely. Yeah, we can roll back to previous versions or use branches to keep similar but separate versions of our files, but really we’re missing the best part of VC: collaboration. VC as a class of software was specifically designed to help multiple programmers work on the same project. The power and brilliance of Git shines best when we can selectively “merge” changes from multiple people into one master project.

Fortunately, we will cover this in a future post. For now, we can visit the LITA/ALCTS Library Code Year‘s GitHub Project—it’s the very same Git project we cloned earlier, so we already have a copy on our computer!—to learn more about collaboration and GitHub. GitHub is a website where people can share and cooperate on Git repositories. It’s been described as “the Facebook of code” because of its popularity and slick user interface. If that doesn’t convince you that GitHub is worth checking out, the site also has a sweet mascot that’s a cross between an octopus and a cat (an octocat). And that’s really all you need to know.

Gangnam Octocat

This is an Octocat. It is Awesome.


Aaron Swartz and Too-Comfortable Research Libraries

*** Update: Several references and a video added (thanks to Brett Bonfield) on Feb. 21, 2013. ***

Who was Aaron Swartz?

If you are a librarian and do not know who Aaron Swartz is, that should probably change now. He helped develop the RSS standard, was a co-founder of Reddit, worked on the Open Library project, downloaded and freed 20% (2.7 million documents) of the Public Access to Court Electronic Records (PACER) database, which charges access fees for United States federal court documents (about 1,600 of the freed documents had privacy issues), played a lead role in preventing the Stop Online Piracy Act (SOPA), and wrote the Guerrilla Open Access Manifesto.

Most famously, he was arrested in 2011 for the mass download of journal articles from JSTOR. He returned the documents to JSTOR and apologized. The Massachusetts state court dismissed the charges, and JSTOR decided not to pursue civil litigation. But MIT stayed silent, and federal prosecutors charged Swartz with wire fraud, computer fraud, unlawfully obtaining information from a protected computer, and recklessly damaging a protected computer. If convicted on these charges, Swartz, then 26, could have been sentenced to up to 35 years in prison. After facing the charges for two years, he committed suicide on January 11, 2013.

Information wants to be free; Information wants to be expensive

Now, he was a controversial figure. He advocated Open Access (OA), but went so far as to encourage, in his manifesto, scholars, librarians, and students who have access to copyrighted academic materials to trade passwords and circulate those materials freely as an act of civil disobedience against unjust copyright laws. He was an advocate of the open Internet, government transparency, and open access to scholarly output. But he also physically entered an MIT network wiring closet and attached his laptop to the network to download over 4 million articles from JSTOR. Most people, including librarians, are not going to advocate trading their institutions’ subscription database passwords or breaking into a staff-only computer networking area of an institution. The actual method of OA that Swartz recommended was highly controversial even among the strongest OA advocates.

But in his Guerrilla OA manifesto, Swartz raised one very valid point about the nature of information in the era of the World Wide Web. That is, information is power. (a) As power, information can be spread to and be made useful to as many of us as possible. Or, (b) it can be locked up and the access to it can be restricted to only those who can pay for it or have access privileges some other way. One thing is clear. Those who do not have access to information will be at a significant disadvantage compared to those who do.

And I would like to ask what today’s academic and/or research libraries are doing to realize Scenario (a) rather than Scenario (b). Are academic/research libraries doing enough to make information available to as many as possible?

Too-comfortable Internet, Too-comfortable academic libraries

Among the many articles I read about Aaron Swartz’s sudden death, the one that made me think most was “Aaron Swartz’s suicide shows the risk of a too-comfortable Internet.” The author of this article worries that we may now have a too-comfortable Internet. The Internet is slowly turning into just another platform for those who can afford to purchase information. The Internet as the place where you could freely find, use, modify, create, and share information is disappearing. Instead, paywalls and closed doors are being established. Useful information on the Internet is being quickly monetized, and access is no longer free and open. Even government documents are no longer freely accessible to the public once they are put on the Internet (likely due to digitization and online storage costs), as shown in the case of PACER and Aaron Swartz. We are getting more and more used to giving up our privacy or paying for information. This may be inevitable in a capitalist society, but should the same apply to libraries as well?

The thought about the too-comfortable Internet made me wonder whether academic research libraries were also becoming too comfortable with the status quo of licensing electronic journals and databases for patrons. In the times when the library collection was physical, people who walked into the library were rarely turned away. The resources in the library are collected and preserved because we believe that people have the right to learn and investigate things and to form their own opinions, and that the knowledge of the past should be made available for that purpose. Regardless of age, gender, or social and financial status, libraries have welcomed and encouraged people on a quest for knowledge and information. With the increasing number of electronic resources in the library, however, this has been changing.

Many academic libraries offer computers, which are necessary to access the library’s own electronic resources. But how many academic libraries keep those computers open for use without a log-in? Often they are locked down and require a username and password that only those affiliated with the institution possess. The same often goes for electronic resources. How many academic libraries allow on-site access to electronic resources for walk-in users? How many insist on walk-in user access in the licenses for the resources they pay for? Many academic libraries also participate in the Federal Depository Library Program, which requires them to provide the public with free access to the government documents they receive. But how easy is it for the public to enter those libraries and access that free government information?

I asked on Twitter about guest access to computers and e-resources in academic libraries, and approximately 25 academic librarians generously answered my question. (Thank you!) According to the responses, almost all of the libraries mentioned offer on-site guest access to computers and e-resources, though a few offer guest access to neither. Some libraries limit guests’ computer use to between 30 minutes and 4 hours, thereby restricting access to the library’s electronic resources as well. Only a few libraries offer free wi-fi for guests, and at some libraries guest wi-fi users cannot access the library’s e-resources even on-site because the IP range of the guest wi-fi differs from that of the campus wi-fi.

I am not sure how many academic libraries consciously negotiate walk-in users’ on-site access with e-resource vendors, or whether this happens semi-automatically because many libraries register the library building’s IP range with vendors so that authentication can be turned off inside the building. I surmise that publishers and database vendors will not automatically permit walk-in users’ on-site access in their licenses unless libraries ask for it. Some vendors also explicitly prohibit libraries from using their materials to fill interlibrary loan requests from other libraries. Electronic resource vendors’ and publishers’ pricing has become more and more closely tied to the number of patrons who can access their products. Academic libraries have been dealing with the escalating costs of electronic resources by filtering out library patrons and limiting access to those in specific disciplines. For example, academic medical and health sciences libraries often subscribe to databases and resources with the most up-to-date information about biomedical research, diseases, medications, and treatments. These are almost always inaccessible to the general public and often even to others affiliated with the institution: the use of these prohibitively expensive resources is limited to the small portion of affiliated people in specific disciplines such as medicine and the health sciences. Academic research libraries have been partially responsible for the proliferation of these access limitations by welcoming, and often preferring, them as a cost-saving measure. (By contrast, if those resources were in print, no librarian would think it acceptable to permanently limit their use to those in medical or health science disciplines only.)
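To make the IP-based access mechanism concrete, here is a minimal sketch, in Python using only the standard library, of how a check against registered IP ranges works in principle. The network ranges and the helper function are hypothetical examples of my own, not any institution’s or vendor’s actual configuration; the point is simply that a guest wi-fi subnet left out of the registered ranges fails the check even though the user is physically inside the building.

```python
# Minimal sketch of IP-range-based e-resource authentication.
# All network ranges below are hypothetical (IETF documentation ranges),
# not any institution's or vendor's actual configuration.
import ipaddress

# Ranges a library might register with a vendor so that on-site traffic
# is recognized without a username and password.
REGISTERED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),     # example: campus network
    ipaddress.ip_network("198.51.100.0/24"),  # example: library building computers
]

def has_onsite_access(client_ip: str) -> bool:
    """Return True if the client IP falls inside a registered range."""
    ip = ipaddress.ip_address(client_ip)
    return any(ip in network for network in REGISTERED_RANGES)

# A patron on a registered library workstation is recognized...
print(has_onsite_access("198.51.100.42"))  # True
# ...while a walk-in guest on a separate guest wi-fi subnet is not,
# even though they are sitting inside the same building.
print(has_onsite_access("203.0.113.17"))   # False
```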

Too-comfortable libraries do not ask themselves whether they are serving the public good of providing access to information and knowledge for those who need it but cannot afford it. Too-comfortable libraries see their role as a mediator and broker in the transaction between the information seller and the information buyer. They may act as efficient and successful mediators and brokers, but I don’t believe that is why libraries exist. Ultimately, libraries exist to foster the sharing and dissemination of knowledge more than anything else, not to efficiently mediate information leasing. And this is the dangerous idea: you cannot put a price tag on knowledge; it belongs to the human race. Libraries used to be the institutions that validated and confirmed this idea. But will they continue to be so in the future? Will an academic library be able to remain a sanctuary for all ideas and a place for sharing knowledge in people’s intellectual pursuits regardless of their institutional membership? Or will it be reduced to a branch of an institution that sells knowledge only to its tuition-paying customers? While public libraries are more strongly aligned than academic libraries with this mission of making information and knowledge freely and openly available to the public, they cannot be expected to cover patrons’ research needs as fully as academic libraries can.

I am not denying that libraries are also making efforts to continue preserving and providing access to information and resources through initiatives such as HathiTrust and the DPLA (Digital Public Library of America). My concern is rather whether academic research libraries are becoming too well-adapted to the era of the Internet and online resources, and too comfortable serving only the needs of their most tangible patron base in the most cost-efficient way, assuming that the library’s mission of storing and disseminating knowledge can now be safely and neutrally relegated to the Internet and the market. But it is a fantasy to believe that the Internet will be a sanctuary for all ideas (the Internet is being censored, as shown in the case of Tarek Mehanna), and the market will surely not uphold the ideal of free and open access to knowledge for the public.

If libraries do not fight for and advocate for those who need information and knowledge but cannot afford it, no other institution will. Of course, it costs money to create, format, review, and package content. Authors, as well as those who work in the business of formatting, reviewing, packaging, and producing content, should be compensated for their work, but not to the extent that the content becomes completely inaccessible to those who cannot afford to purchase it yet want access to it for learning, inquiry, and research. This is probably why we are all moved by Swartz’s Guerrilla Open Access Manifesto in spite of the illegal implications of the action he actually recommended in it.

Knowledge and information are not like other products for purchase: sharing increases their value, thereby enabling innovation, further research, and new knowledge. Limiting knowledge and information to those with access privileges or sufficient purchasing power creates a fundamental inequality. In my opinion, the mission of a research institution should never be limited to serving its own members only, and if the institution forgets this, it should be the library that first raises a red flag. The mission of an academic research institution is to promote freedom of inquiry and research and to provide an environment that supports that mission inside and outside its walls, and that is why a library is said to be the center of an academic research institution.

I don’t have any good answers to the inevitable question, “So what can an academic research library do?” Perhaps we can start by broadening guest access to library computers, wi-fi, and electronic resources on-site. Academic research libraries should also start asking themselves: what will libraries have to offer those who seek knowledge for learning and inquiry but cannot afford it? If the answer is nothing, we will have lost libraries.

In his talk about the Internet Archive’s Open Library project at the Code4Lib conference in 2008 (at 11:20), Swartz describes how librarians had argued about which subject headings to use for the books on the Open Library website. He says, “We will use all of them. It’s online. We don’t have to have this kind of argument.” Online information and resources incur no additional cost per use once produced. Many resources, particularly scholarly research outputs, already have established buyers such as research libraries. Do we have to deny access to information and knowledge to those who cannot afford it but are seeking it, just so that we can have a market where information and knowledge resources are bought and sold and authors, along with those who work with the created content, are compensated? No, this is a false question. We can have both, but libraries and librarians will have to make it so.

Videos to Watch

“Code4Lib 2008: Building the Open Library – YouTube.”

“Aaron Swartz on Picking Winners.” American Library Association Midwinter Meeting, January 12, 2008.

“Freedom to Connect: Aaron Swartz (1986-2013) on Victory to Save Open Internet, Fight Online Censors.”

REFERENCES

“Aaron Swartz.” 2013. Accessed February 10. http://www.aaronsw.com/.

“Aaron Swartz – Wikipedia, the Free Encyclopedia.” 2013. Accessed February 10. http://en.wikipedia.org/wiki/Aaron_Swartz#JSTOR.

“Aaron Swartz on Picking Winners – YouTube.” 2008. http://www.youtube.com/watch?feature=player_embedded&v=BvJqXaoO4FI.

“Aaron Swartz’s Suicide Shows the Risk of a Too-comfortable Internet – The Globe and Mail.” 2013. Accessed February 10. http://www.theglobeandmail.com/commentary/aaron-swartzs-suicide-shows-the-risk-of-a-too-comfortable-internet/article7509277/.

“Academics Remember Reddit Co-Founder With #PDFTribute.” 2013. Accessed February 10. http://www.slate.com/blogs/the_slatest/2013/01/14/aaron_swartz_death_pdftribute_hashtag_aggregates_copyrighted_articles_released.html.

“After Aaron, Reputation Metrics Startups Aim To Disrupt The Scientific Journal Industry | TechCrunch.” 2013. Accessed February 10. http://techcrunch.com/2013/02/03/the-future-of-the-scientific-journal-industry/.

American Library Association. 2013. “A Memorial Resolution Honoring Aaron Swartz.” http://connect.ala.org/files/memorial_5_aaron%20swartz.pdf.

“An Effort to Upgrade a Court Archive System to Free and Easy – NYTimes.com.” 2013. Accessed February 10. http://www.nytimes.com/2009/02/13/us/13records.html?_r=1&.

Bonfield, Brett. 2013. “Aaron Swartz.” In the Library with the Lead Pipe (February 20). http://www.inthelibrarywiththeleadpipe.org/2013/aaron-swartz/.

“Code4Lib 2008: Building the Open Library – YouTube.” 2013. Accessed February 10. http://www.youtube.com/watch?v=oV-P2uzzc4s&feature=youtu.be&t=2s.

“Daily Kos: What Aaron Swartz Did at MIT.” 2013. Accessed February 10. http://www.dailykos.com/story/2013/01/13/1178600/-What-Aaron-Swartz-did-at-MIT.

Dupuis, John. 2013a. “Around the Web: Aaron Swartz Chronological Link Roundup – Confessions of a Science Librarian.” Accessed February 10. http://scienceblogs.com/confessions/2013/01/20/around-the-web-aaron-swartz-chronological-link-roundup/.

———. 2013b. “Library Vendors, Politics, Aaron Swartz, #pdftribute – Confessions of a Science Librarian.” Accessed February 10. http://scienceblogs.com/confessions/2013/01/17/library-vendors-politics-aaron-swartz-pdftribute/.

“FDLP for PUBLIC.” 2013. Accessed February 10. http://www.gpo.gov/libraries/public/.

“Freedom to Connect: Aaron Swartz (1986-2013) on Victory to Save Open Internet, Fight Online Censors.” 2013. Accessed February 10. http://www.democracynow.org/2013/1/14/freedom_to_connect_aaron_swartz_1986.

“Full Text of ‘Guerilla Open Access Manifesto’.” 2013. Accessed February 10. http://archive.org/stream/GuerillaOpenAccessManifesto/Goamjuly2008_djvu.txt.

Groover, Myron. 2013. “British Columbia Library Association – News – The Last Days of Aaron Swartz.” Accessed February 21. http://www.bcla.bc.ca/page/news/ezlist_item_9abb44a1-4516-49f9-9e31-57685e9ca5cc.aspx#.USat2-i3pJP.

Hellman, Eric. 2013a. “Go To Hellman: Edward Tufte Was a Proto-Phreaker (#aaronswnyc Part 1).” Accessed February 21. http://go-to-hellman.blogspot.com/2013/01/edward-tufte-was-proto-phreaker.html.

———. 2013b. “Go To Hellman: The Four Crimes of Aaron Swartz (#aaronswnyc Part 2).” Accessed February 21. http://go-to-hellman.blogspot.com/2013/01/the-four-crimes-of-aaron-swartz.html.

“How M.I.T. Ensnared a Hacker, Bucking a Freewheeling Culture – NYTimes.com.” 2013. Accessed February 10. http://www.nytimes.com/2013/01/21/technology/how-mit-ensnared-a-hacker-bucking-a-freewheeling-culture.html?pagewanted=all.

March, Andrew. 2013. “A Dangerous Mind? – NYTimes.com.” Accessed February 10. http://www.nytimes.com/2012/04/22/opinion/sunday/a-dangerous-mind.html?pagewanted=all.

“MediaBerkman » Blog Archive » Aaron Swartz on The Open Library.” 2013. Accessed February 22. http://blogs.law.harvard.edu/mediaberkman/2007/10/25/aaron-swartz-on-the-open-library-2/.

Peters, Justin. 2013. “The Idealist.” Slate, February 7. http://www.slate.com/articles/technology/technology/2013/02/aaron_swartz_he_wanted_to_save_the_world_why_couldn_t_he_save_himself.html.

“Public Access to Court Electronic Records.” 2013. Accessed February 10. http://www.pacer.gov/.

“Publishers and Library Groups Spar in Appeal to Ruling on E-Reserves – Technology – The Chronicle of Higher Education.” 2013. Accessed February 10. http://chronicle.com/article/PublishersLibrary-Groups/136995/?cid=pm&utm_source=pm&utm_medium=en.

“Remember Aaron Swartz.” 2013. Celebrating Aaron Swartz. Accessed February 22. http://www.rememberaaronsw.com.

Rochkind, Jonathan. 2013. “Library Values and the Growing Scholarly Digital Divide: In Memoriam Aaron Swartz | Bibliographic Wilderness.” Accessed February 10. http://bibwild.wordpress.com/2013/01/13/library-values-and-digital-divide-in-memoriam-aaron-swartz/.

Sims, Nancy. 2013. “What Is the Government’s Interest in Copyright? Not That of the Public. – Copyright Librarian.” Accessed February 10. http://blog.lib.umn.edu/copyrightlibn/2013/02/what-is-the-governments-interest-in-copyright.html.

Stamos, Alex. 2013. “The Truth About Aaron Swartz’s ‘Crime’.” Unhandled Exception. Accessed February 22. http://unhandled.com/2013/01/12/the-truth-about-aaron-swartzs-crime/.

Summers, Ed. 2013. “Aaronsw | Inkdroid.” Accessed February 21. http://inkdroid.org/journal/2013/01/19/aaronsw/.

“The Inside Story of Aaron Swartz’s Campaign to Liberate Court Filings | Ars Technica.” 2013. Accessed February 10. http://arstechnica.com/tech-policy/2013/02/the-inside-story-of-aaron-swartzs-campaign-to-liberate-court-filings/.

“Welcome to Open Library (Open Library).” 2013. Accessed February 10. http://openlibrary.org/.

West, Jessamyn. 2013. “Librarian.net » Blog Archive » On Leadership and Remembering Aaron.” Accessed February 21. http://www.librarian.net/stax/3984/on-leadership-and-remembering-aaron/.