Web Scraping: Creating APIs Where There Were None

Websites are human-readable. That’s great for us; we’re humans. It’s not so great for computer programs, which tend to be better at navigating structured data than visuals.

Web scraping is the practice of “scraping” information from a website’s HTML. At its core, web scraping lets programs visit and manipulate a website much like people do. The advantage to this is that, while programs aren’t great at navigating the web on their own, they’re really good at repeating things over and over. Once a web scraping script is set up, it can run an operation thousands of times over without breaking a sweat. Compare that to the time and tedium of clicking through a thousand websites to copy-paste the information you’re interested in and you can see the appeal of automation.

Why web scraping?

Why would anybody use web scraping? There are a few good reasons which are, unfortunately, all too common in libraries.

You need an API where there is none.

Many of the web services we subscribe to don’t expose their inner workings via an API. It’s worth taking a moment to explain the term API, which is used frequently but rarely given a better definition than the uninformative “Application Programming Interface”.

Let’s consider a common type of API, a search API. When you visit Worldcat and search, the site checks an enormous database of millions of metadata records and returns a nice, visually formatted list of ones relevant to your query. Again, this is great for humans. We can read through the results and pick out the ones we’re interested in. But what happens when we want to repurpose this data elsewhere? What if we want to build a bento search box, displaying results from our databases and Worldcat alongside each other?1 The answer is that we can’t easily accomplish this without an API.

For example, the human-readable results of a search engine may look like this:

1. Instant PHP Web Scraping

by Jacob Ward

Publisher: Packt Publishing 2013

2. Scraping by: wage labor, slavery, and survival in early Baltimore

by Seth Rockman

Publisher: Johns Hopkins University Press 2009

That’s fine for human eyes, but for our search application it’s a pain in the butt. Even if we could embed a result like this using an iframe, the styling might not match what we want and the metadata fields might not display in a manner consistent with our other records (e.g. why is the publication year included with publisher?). What an API returns, on the other hand, may look like this:

[
  {
    "title": "Instant PHP Web Scraping",
    "author": "Jacob Ward",
    "publisher": "Packt Publishing",
    "publication_date": "2013"
  },
  {
    "title": "Scraping by: wage labor, slavery, and survival in early Baltimore",
    "author": "Seth Rockman",
    "publisher": "Johns Hopkins University Press",
    "publication_date": "2009"
  }
]

Unless you really love curly braces and quotation marks, that looks awful. But it’s very easy to manipulate in many programming languages. Here’s an incomplete example in Python:

import json

results = json.loads( data )  # parse the JSON string into a list of dicts
for result in results:
  print( result['title'] + ' - ' + result['author'] )

Here “data” is our search results from above and we use the json.loads function to easily parse that data into a variable. The script then loops over each search result and prints out its title and author in a format like “Instant PHP Web Scraping – Jacob Ward”.

An API is hard to use or doesn’t have the data you need.

Sometimes services do expose their data via an API, but the API has limitations that the human interface of the website doesn’t have. Perhaps it doesn’t expose all the metadata which is visible in search results. Fellow Tech Connect author Margaret Heller mentioned that Ulrich’s API doesn’t include subject information, though it’s present in the search results presented to human users.

Some APIs can also be more difficult to use than web scraping. The ILS at my place of work is like this; you have to pay extra to get the API activated and it requires configuration on a shared server I don’t have access to. The API also has strict authentication requirements, even for read-only calls (e.g. when I’m just accessing publicly-viewable data, not making account changes). The boilerplate code the vendor provides doesn’t work, or rather only works for trivial examples. All these hurdles combine to make scraping the catalog appealing.

As a side effect, how you repurpose a site’s data might inspire an API of its own. Are you missing a feature so badly that you need to hack around it? Writing a nice proof-of-concept with web scraping might prove that there’s a use case for a particular API feature.

How?

More or less all web scraping works the same way (a minimal Python sketch follows the list):

  • Use a scripting language to get the HTML of a particular page
  • Find the interesting pieces of a page using CSS, XPath, or DOM traversal—any means of identifying specific HTML elements
  • Manipulate those pieces, extracting the data you need
  • Pipe the data somewhere else, e.g. into another web page, spreadsheet, or script
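
To make those steps concrete, here’s a minimal sketch in Python rather than the PHP we’ll use below. It assumes the third-party Requests and Beautiful Soup libraries and a hypothetical page that lists titles in <h2 class="title"> elements:

import requests                # third-party HTTP library
from bs4 import BeautifulSoup  # third-party HTML parser

# 1. get the HTML of a particular page
html = requests.get( 'http://example.com/search?q=librarianship' ).text

# 2. find the interesting pieces (this selector is hypothetical)
soup = BeautifulSoup( html, 'html.parser' )
titles = soup.select( 'h2.title' )

# 3. extract the data we need
data = [ title.get_text( strip=True ) for title in titles ]

# 4. pipe the data somewhere else, e.g. print to the terminal
for item in data:
  print( item )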

Let’s go through an example using the Directory of Open Access Journals. Now, the DOAJ has an API of sorts; it supports retrieving metadata via the OAI-PMH verbs. This means a request for a URL like http://www.doaj.org/oai?verb=GetRecord&identifier=18343147&metadataPrefix=oai_dc will return XML with information about one of the DOAJ journals. But OAI-PMH has no search verb; we can look up specific articles or publications via standard identifiers, but we can’t do a traditional keyword search.
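
Retrieving a single record this way takes nothing beyond Python’s standard library. Here’s a sketch, assuming the DOAJ endpoint still responds as described above:

from urllib.request import urlopen
import xml.etree.ElementTree as ET

url = ( 'http://www.doaj.org/oai?verb=GetRecord'
        '&identifier=18343147&metadataPrefix=oai_dc' )
record = ET.fromstring( urlopen( url ).read() )

# oai_dc metadata fields live in the Dublin Core namespace
DC = '{http://purl.org/dc/elements/1.1/}'
for title in record.iter( DC + 'title' ):
  print( title.text )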

Libraries, of the code persuasion

Before we get too far, let’s lean on those who came before us. Scraping a website is both a common task and a complex one. Remember last month, when I said that we don’t need to reinvent the wheel in our programming because reusable modules exist for most common tasks? Please let’s not write our own web scraping library from scratch.

Code libraries, which go by different names depending on the language (most amusingly, they’re called “eggs” in Python and “gems” in Ruby), are pre-written chunks of code which help you complete common tasks. Any task which several people have had to do before probably has a library devoted to it. Google searches for “best [insert task] module for [insert language]” typically turn up useful guidance on where to start.

While each language has its own means of incorporating others’ code into your own, they all basically have two steps: 1) download the external library somewhere onto your hard drive or server, often using a command-line tool, and 2) import the code into your script. The external library should have some documentation on how to use its special features once you’ve imported it.

What does this look like in PHP, the language our example will be in? First, we visit the Simple HTML DOM website on Sourceforge to download a single PHP file. Then, we place that file in the same directory that our scraping script will live. In our scraping script, we write a single line up at the top:

<?php
require_once( 'simple_html_dom.php' );
?>

Now it’s as if the whole contents of the simple_html_dom.php file were in our script. We can use functions and classes which were defined in the other file, such as the file_get_html function which is not otherwise available. PHP actually has a few functions which are used to import code in different ways; the documentation page for the include function describes the basic mechanics.

Web scraping a DOAJ search

While the DOAJ doesn’t have a search API, it does have a search bar which we can manipulate in our scraping. Let’s run a test search, view the HTML source of the result, and identify the elements we’re interested in. First, we visit doaj.org and type in a search. Note the URL:

doaj.org/doaj?func=search&template=&uiLanguage=en&query=librarianship

Note the key-value pairs in the URL’s query string. Here our search term was “librarianship”, the value associated with the appropriately-named “query” key. If we change the word “librarianship” to a different search term and visit the new URL, we predictably see results for the new term. With easily hackable URLs like this, it’s easy for us to write a web scraping script. Here’s the first half of our example in PHP:

<?php
// see http://simplehtmldom.sourceforge.net/manual_api.htm for documentation
require_once( 'simple_html_dom.php' );

$base = 'http://www.doaj.org/doaj?func=search&template=&uiLanguage=en&query=';
$query = urlencode( 'librarianship' );

$html = file_get_html( $base . $query );
// to be continued...
?>

So far, everything is straightforward. We include the web scraping library we’re using, then use what we’ve figured out about the DOAJ URL structure: it has a base which won’t change and a query which we want to change according to our interests. You could have the query come from command-line arguments or web form data like the $_GET array in PHP, but let’s just keep it as a simple string.

We urlencode the string because we don’t want spaces or other illegal characters sneaking their way in there; while the script still works with $query = 'new librarianship' for example, using unencoded text in URLs is a bad habit to get into. Other functions, such as file_get_contents, will produce errors if passed a URL with spaces in it. On the other hand, urlencode( 'new librarianship' ) returns the appropriately encoded string “new+librarianship”. If you do take user input, remember to sanitize it before using it elsewhere.

For the second part, we need to investigate the HTML source of DOAJ’s search results page. Here’s a screenshot and a simplified example of what it looks like:

[Screenshot: a couple of search results from DOAJ for the term “librarianship”]
<div id="result">
  <div class="record" id="record1">
    <div class="imageDiv">
      <img src="/doajImages/journal.gif"><br><span><small>Journal</small></span>
    </div><!-- END imageDiv -->
    <div class="data">
      <a href="/doaj?func=further&amp;passMe=http://www.collaborativelibrarianship.org">
        <b>Collaborative Librarianship</b>
      </a>
      <strong>ISSN/EISSN</strong>: 19437528
      <br><strong>Publisher</strong>: Regis University
      <br><strong>Subject</strong>:
      <a href="/doaj?func=subject&amp;cpId=129&amp;uiLanguage=en">Library and Information Science</a>
      <br><b>Country</b>: United States
      <b>Language</b>: English<br>
      <b>Start year</b> 2009<br>
      <b>Publication fee</b>:
    </div> <!-- END data -->
    <!-- ...more markup -->
  </div> <!-- END record -->
  <div class="recordColored" id="record2">
    <div class="imageDiv">
      <img src="/doajImages/article.png"><br><span><small>Article</small></span>
    </div><!-- END imageDiv -->
    <div class="data">
       <b>Mentoring for Emerging Careers in eScience Librarianship: An iSchool – Academic Library Partnership </b>
      <div style="color: #585858">
        <!-- author (s) -->
         <strong>Authors</strong>:
          <a href="/doaj?func=search&amp;query=au:&quot;Gail Steinhart&quot;">Gail Steinhart</a>
          ---
          <a href="/doaj?func=search&amp;query=au:&quot;Jian Qin&quot;">Jian Qin</a><br>
        <strong>Journal</strong>: <a href="/doaj?func=issues&amp;jId=88616">Journal of eScience Librarianship</a>
        <strong>ISSN/EISSN</strong>: 21613974
        <strong>Year</strong>: 2012
        <strong>Volume</strong>: 1
        <strong>Issue</strong>: 3
        <strong>Pages</strong>: 120-133
        <br><strong>Publisher</strong>: University of Massachusetts Medical School
      </div><!-- End color #585858 -->
    </div> <!-- END data -->
    <!-- ...more markup -->
   </div> <!-- END record -->
   <!-- more records -->
</div> <!-- END results list -->

Even with much markup removed, there’s a lot going on here. We need to home in on what’s interesting and find patterns in the markup that help us retrieve it. While it may not be obvious from the example above, the title of each search result is contained in the first <b> tag inside each record’s “data” <div>.

Here’s a sketch of the element hierarchy leading to the title: a <div> with id=”result” > a <div> with a class of either “record” or “recordColored” > a <div> with a class of “data” > possibly an <a> tag (present in the first example, absent in the second) > the <b> tag containing the title. Noting the conditional parts of this hierarchy is important; if we didn’t note that sometimes an <a> tag is present and that the class can be either “record” or “recordColored”, we wouldn’t be getting all the items we want.

Let’s try to return the titles of all search results on the first page. We can use Simple HTML DOM’s find method to extract the content of specific elements using CSS selectors. Now that we know how the results are structured, we can write a more complete example:

<?php
require_once( 'simple_html_dom.php' );

$base = 'http://www.doaj.org/doaj?func=search&template=&uiLanguage=en&query=';
$query = urlencode( 'librarianship' );

$html = file_get_html( $base . $query );

// using our knowledge of the DOAJ results page
$records = $html->find( '.record .data, .recordColored .data' );

foreach( $records as $record ) {
  echo $record->getElementsByTagName( 'b', 0 )->plaintext . PHP_EOL;
}
?>

The beginning remains the same, but this time we actually do something with the HTML. We use find to pull the records which have class “data.” Then we echo the first <b> tag’s text content. The getElementsByTagName method typically returns an array, but if you pass a second integer parameter it returns the array element at that index (0 being the first element in the array, because computer scientists count from zero). The ->plaintext property simply contains the text found in the element; if we echoed the element itself we would see opening and closing <b> tags wrapped around the title. Finally, we append an “end-of-line” (EOL) character just to make the output easier to read.

To see our results, we can run our script on the command line. For Linux or Mac users, that likely means merely opening a terminal (in Applications/Utilities on a Mac) since they come with PHP pre-installed. On Windows, you may need to use WAMP or XAMPP to run PHP scripts. XAMPP gives you a “shell” button to open a terminal, while with WAMP you can add the PHP executable to your PATH environment variable.

Once you have a terminal open, the php command will execute whatever PHP script you pass it as a parameter. If we run php name-of-our-script.php in the same directory as our script, we see ten search result titles printed to the terminal:

> php doaj-search.php
Collaborative Librarianship
Mentoring for Emerging Careers in eScience Librarianship: An iSchool – Academic Library Partnership
Education for Librarianship in Turkey Education for Librarianship in Turkey
Turkish Librarianship: A Selected Bibliography Turkish Librarianship: A Selected Bibliography
Journal of eScience Librarianship
Editorial: Our Philosophies of Librarianship
Embedded Academic Librarianship: A Review of the Literature
Model Curriculum for 'Oriental Librarianship' in India
A General Outlook on Turkish Librarianship and Libraries
The understanding of subject headings among students of librarianship

This is a simple, not-too-useful example. But it could be expanded in many ways. Try copying the script above and attempting some of the following:

  • Make the script return more than the ten items on the first page of results
  • Use some of DOAJ’s advanced search functions, for instance a date limiter
  • Only return journals or articles, not both
  • Return more than just the title of results, for instance the author(s), URLs, or publication date

Accomplishing these tasks involves learning more about DOAJ’s URL and markup structure, but also learning more about the scraping library you’re using.
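
As a hint for the first of those: paging through results usually comes down to finding the query string parameter that controls the result offset and looping over it. Here’s a sketch in Python with Requests and Beautiful Soup; the “page” parameter is strictly hypothetical, so inspect DOAJ’s actual next-page links to find the real one:

import requests
from bs4 import BeautifulSoup

base = 'http://www.doaj.org/doaj?func=search&template=&uiLanguage=en&query=librarianship'

for page in range( 1, 4 ):  # first three pages; 'page' is a guess
  html = requests.get( base + '&page=' + str( page ) ).text
  soup = BeautifulSoup( html, 'html.parser' )
  for data in soup.select( '.record .data, .recordColored .data' ):
    title = data.find( 'b' )  # first <b> holds the title, as before
    if title:
      print( title.get_text( strip=True ) )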

Common Problems

There are a couple possible hangups when web scraping. First of all, many websites employ user-agent sniffing to serve different versions of themselves to different devices. A user agent is a hideous string of text which web browsers and other HTTP clients use to identify themselves.2 If a site misinterprets our script’s user agent, we may end up on a mobile or other version of a site instead of the desktop one we were expecting. Worse yet, some sites try to prevent scraping by blacklisting certain user agents.

Luckily, most web scraping libraries have tools built in to work around this problem. A nice example is Ruby’s Mechanize, which has an agent.user_agent_alias property which can be set to a number of popular web browsers. When using an alias, our script essentially tells the responding web server that it’s a common desktop browser and thus is more likely to get a standard response.
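
Python’s Requests library can accomplish the same thing by sending a browser-like User-Agent header with each request. Here’s a sketch reusing the Chrome user agent string quoted in note 2 below:

import requests

# masquerade as a desktop browser (string copied from note 2)
headers = { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/29.0.1547.76 Safari/537.36' }
html = requests.get( 'http://www.example.com/', headers=headers ).text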

It’s also routine that we’ll want to scrape something behind authentication. While IP authentication can be circumvented by running scripts from an on-campus connection, other sites may require login credentials. Again, most web scraping libraries already have built-in tools for handling authentication. We can find which form controls on the page we need to fill in, insert our username and password into the form, and then submit it programmatically. Storing a login in a plain text script is never a good idea though, so be careful.
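
With Requests, for instance, logging in typically amounts to posting the form fields inside a session that persists cookies. Here’s a sketch where the URL and field names are hypothetical; inspect the real login form for the name attributes it uses:

import requests

session = requests.Session()  # a session carries cookies between requests

# field names must match the form's actual name="" attributes;
# better to read credentials from an environment variable than hard-code them
session.post( 'https://library.example.com/login',
              data={ 'username': 'me', 'password': 'secret' } )

# later requests in the same session are authenticated
html = session.get( 'https://library.example.com/account' ).text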

Considerations

Not all web scraping is legitimate. Taking data which is copyrighted and merely re-displaying it on our site without proper attribution is not only illegal, it’s just not being a good citizen of the web. The Wikipedia article on web scraping has a lengthy section on legal issues with a few historical cases from various countries.

It’s worth noting that web scraping can be very brittle, meaning it breaks often and easily. Scraping typically relies on other people’s markup to remain consistent. If just a little piece of HTML changes, our entire script might be thrown off, looking for elements that no longer exist.

One way to counteract this is to write selectors which are as broad as possible. For instance, let’s return to the DOAJ search results markup. Why did we use such a concise CSS selector to find the title when we could have been much more specific? Here’s a more explicit way of getting the same data:

$html->find( 'div#result > div.record > div.data, div#result > div.recordColored > div.data' );

What’s wrong with these selectors? We’re relying on so much more to stay the same. We need: the result wrapper to be a <div>, the result wrapper to have an id of “result”, the record to be a <div>, and the data inside the record to be a <div>. Our use of the child selector “>” means we need the element hierarchy to stay precisely the same. If any of these properties of the DOAJ markup changed, our selector wouldn’t find anything and our script would need to be updated. Meanwhile, our much more generic line still grabs the right information because it doesn’t depend on particular tags or other aspects of the markup remaining constant:

$html->find( '.record .data, .recordColored .data' );

We’re still relying on a few things—we have to, there’s no getting around that in web scraping—but a lot could change and we’d be set. If the DOAJ upgraded to HTML5 tags, swapping out <div> for <article> or <section>, we would be OK. If the wrapping <div> was removed, or had its id change, we’d be OK. If a new wrapper was inserted in between the “data” and “record” <div>, we’d be OK. Our approach is more resilient.

If you did try running our PHP script, you probably noticed it was rather slow. It’s not like typing a query into Google and seeing results immediately. We have to request a page from an external site, which then queries its backend database, processes the results, and displays HTML which we ultimately don’t use, at least not as intended. This highlights that web scraping isn’t a great option for user-facing searches; it can take too long to return results. One option is to cache searches, for instance storing results of previous scrapings in a database and then checking to see if the database has something relevant before resorting to pulling content off an external site.
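
The cache needn’t be fancy; a single table keyed on the query, with a timestamp for expiring stale entries, covers the basics. Here’s a minimal sketch using Python’s built-in sqlite3 module, where “scrape” stands in for whatever function actually hits the external site:

import sqlite3
import time

db = sqlite3.connect( 'cache.db' )
db.execute( 'CREATE TABLE IF NOT EXISTS cache '
            '(query TEXT PRIMARY KEY, results TEXT, fetched REAL)' )

def cached_search( query, scrape, max_age=3600 ):
  # return cached results if fresh enough, otherwise scrape and store them
  row = db.execute( 'SELECT results, fetched FROM cache WHERE query = ?',
                    (query,) ).fetchone()
  if row and time.time() - row[1] < max_age:
    return row[0]  # fresh cached copy; skip the scrape entirely
  results = scrape( query )
  db.execute( 'INSERT OR REPLACE INTO cache VALUES (?, ?, ?)',
              (query, results, time.time()) )
  db.commit()
  return results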

It’s also worth noting that web scraping projects should try to be reasonable about the number of times they request an external resource. Every time our script pulls in a site’s HTML, it’s another request that site’s server has to process. A site may not have an API because it cannot handle the amount of traffic one would attract. If our web scraping project is going to be sending thousands of requests per hour, we should consider how reasonable that is. A simple email to the third party explaining what we’re doing and the amount of traffic it may generate is a nice courtesy.

Overall, web scraping is handy in certain situations (see below) or for scripts which run seldom or only once. For instance, if we’re doing an analysis of faculty citations at our institution, we might not have access to a raw list of citations. But faculty may have university web pages where they list all their publications in a consistent format. We could write a script which only needs to run once, culling a large list of citations for analysis. Once we’ve scraped that information, we could use OpenRefine or other power tools to extract particular journal titles or whatever else we’re interested in.
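
Everything in the following sketch is hypothetical (the URLs, the markup pattern), but a one-time citation harvest really can be this short:

import csv
import requests
from bs4 import BeautifulSoup

# hypothetical faculty pages sharing a <ul class="publications"> pattern
pages = [ 'http://example.edu/faculty/smith',
          'http://example.edu/faculty/lee' ]

with open( 'citations.csv', 'w', newline='' ) as f:
  writer = csv.writer( f )
  for url in pages:
    soup = BeautifulSoup( requests.get( url ).text, 'html.parser' )
    for citation in soup.select( 'ul.publications li' ):
      writer.writerow( [ url, citation.get_text( strip=True ) ] )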

How is web scraping used in libraries?

I asked Twitter what other libraries are using web scraping for and got a few replies:

@phette23 Pulling working papers off a departmental website for the inst repo. Had to web scrape for metadata.
— Ondatra libskoolicus (@LibSkrat) September 25, 2013

Matthew Reidsma of Grand Valley State University also had several examples:

To fuel a live laptop/iPad availability site by scraping holdings information from the catalog. See the availability site as well as the availability charts for the last few days and the underlying code which does the scraping. This uses the same Simple HTML Dom library as our example above.

It’s also used to create a staff API by scraping the GVSU Library’s Staff Directory and reformatting it; see the code and the result. The result may not look very readable—it’s JSON, a common data format that’s particularly easy to reuse in some languages such as JavaScript—but remember that APIs are for machine-readable data which can be easily reused by programs, not people.

Jacqueline Hettel of Stanford University has a great blog post that describes using a Google Chrome extension and XPath queries to scrape acknowledgments from humanities monographs in Google Books; no coding required! She and Chris Bourg are presenting their results at the Digital Library Federation in November.

Finally, I use web scraping to pull hours information from our main library site into our mobile version. I got tired of updating the hours in two places every time they changed, so now I pull them in using a PHP script. It’s worth noting that this dual-maintenance annoyance is one major reason websites can and should use responsive design.

Most of these library examples are good uses of web scraping because they involve simply transporting our data from one system to another; scraping information from the catalog to display it elsewhere is a prime use case. We own the data, so there are no intellectual property issues, and they’re our own servers so we’re responsible for keeping them up.

Code Libraries

While we’ve used PHP above, there’s no need to limit ourselves to a particular programming language. Popular web scraping choices in a few languages include:

  • PHP: Simple HTML DOM, used in our example above
  • Python: Beautiful Soup and lxml
  • Ruby: Nokogiri and Mechanize
  • JavaScript (Node.js): cheerio

To provide a sense of how the different tools above work, I’ve written a series of gists which use each to scrape titles from the first page of a DOAJ search.

Notes
  1. See the NCSU or Stanford library websites for examples of this search style. Essentially, results from several different search engines—a catalog, databases, the library website, study guides—are all displayed on the same page in separate “bento” compartments.
  2. The browser I’m in right now, Chrome, has this beauty for a user agent string: “Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.76 Safari/537.36”. Yes, that’s right: Mozilla, Mac, AppleWebKit, KHTML, Gecko, Chrome, & Safari all make an appearance.

Demystifying Programming

We talk quite a bit about code here at Tech Connect and it’s not unusual to see snippets of it pasted into a post. But most of us, indeed most librarians, aren’t professional programmers or full-time developers; we had to learn like everyone else. Depending on your background, some parts of coding will be easy to pick up while others won’t make sense for years. Here’s an attempt to explain the fundamental building blocks of programming languages.

The Languages

There are a number of popular programming languages: C, C#, C++, Java, JavaScript, Objective C, Perl, PHP, Python, and Ruby. There are numerous others, but this semi-arbitrary selection covers the ones most commonly in use. It’s important to know that each programming language requires its own software to run. You can write Python code into a text file on a machine that doesn’t have the Python interpreter installed, but you can’t execute it and see the results.

A lot of learners stress unnecessarily over which language to learn first. Once you’ve picked up one language, you’ll understand all of the foundational pieces listed below. Then you’ll be able to transition quickly to another language by understanding a few syntax changes: Oh, in JavaScript I write function myFunction(x) to define a function, while in Python I write def myFunction(x). Programming languages differ in other ways too, but knowing the basics of one provides a huge head start on learning the basics of any other.

Finally, it’s worth briefly distinguishing compiled versus interpreted languages. Code written in a compiled language, such as all the capital C languages and Java, must first be passed to a compiler program which then spits out an executable—think of a file ending in .exe if you’re on Windows—that will run the code. Interpreted languages, like Perl, PHP, Python, and Ruby, are quicker to program in because you just pass your code along to an interpreter program which immediately executes it. There’s one fewer step: for a compiled language you write code, generate an executable, and then run the executable, while interpreted languages skip that middle step.

Compiled languages tend to run faster (i.e. perform more actions or computations in a given amount of time) than interpreted ones, while interpreted ones tend to be easier to learn and more lenient towards the programmer. Again, it doesn’t matter too much which you start out with.

Variables

Variables are just like variables in algebra; they’re names which stand in for some value. In algebra, you might write:

x = 10 + 3

which is also valid code in many programming languages. Later on, if you used the value of x, it would be 13.

The biggest difference between variables in math and in programming is that programming variables can be all sorts of things, not just numbers. They can be strings of text, for instance. Below, we combine two pieces of text which were stored in variables:

name = 'cat'
mood = ' is laughing'
both = name + mood

In the above code, both would have a value of ‘cat is laughing’. Note that text strings have to be wrapped in quotes—often either double or single quotes is acceptable—in order to distinguish them from the rest of the code. We also see above that variables can be the product of other variables.

Comments

Comments are pieces of text inside a program which are not interpreted as code. Why would you want to do that? Well, comments are very useful for documenting what’s going on in your code. Even if your code is never going to be seen by anyone else, writing comments helps you understand what’s going on if you return to a project after not thinking about it for a while.

// This is a comment in JavaScript; code is below.
number = 5;
// And a second comment!

As seen above, comments typically work by having some special character(s) at the beginning of the line which tells the programming language that the rest of the line can be ignored. Common characters that indicate a line is a comment are # (Python, Ruby), // (C languages, Java, JavaScript, PHP), and /* (CSS, multi-line blocks of comments in many other languages).

Functions

As with variables, functions are akin to those in math: they take an input, perform some calculations with it, and return an output. In math, we might see:

f(x) = (x * 3)/4

f(8) = 6

Here, the first line is a function definition. It defines how many parameters can be passed to the function and what it will do with them. The second line is more akin to a function execution. It shows that the function returns the value 6 when passed the parameter 8. This is really, really close to programming already. Here’s the math above written in Python:

def f(x):
  return (x * 3)/4

f(8)
# which returns the number 6

Programming functions differ from mathematical ones in much the same way variables do: they’re not limited to accepting and producing numbers. They can take all sorts of data—including text—process it, and then return another sort of data. For instance, virtually all programming languages allow you to find the length of a text string using a function. This function takes text input and outputs a number. The combinations are endless! Here’s how that looks in Python:

len('how long?')
# returns the number 9

Python abbreviates the word “length” to simply “len” here, and we pass the text “how long?” to the function instead of a number.

Combining variables and functions, we might store the result of running a function in a variable, e.g. y = f(8) would store the value 6 in the variable y if f(x) is the same as above. This may seem silly—why don’t you just write y = 6 if that’s what you want!—but functions help by abstracting out blocks of code so you can reuse them over and over again.

Consider a program you’re writing to manage the e-resource URLs in your catalog, which are stored in MARC field 856 subfield U. You might have a variable named num_URLs (variable names can’t have spaces, thus the underscore) which represents the number of 856 $u subfields a record has. But as you work on records, that value is going to change; rather than manually calculate it each time and set num_URLs = 3 or num_URLs = 2 you can write a function to do this for you. Each time you pass the function a bibliographic record, it will return the number of 856 $u fields, substantially reducing how much repetitive code you have to write.
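
Here’s what such a function might look like in Python, assuming the pymarc library (a real MARC parsing module whose get_fields and get_subfields methods do the record spelunking for us):

from pymarc import MARCReader

def num_urls( record ):
  # count the 856 $u subfields in one bibliographic record
  total = 0
  for field in record.get_fields( '856' ):
    total += len( field.get_subfields( 'u' ) )
  return total

with open( 'records.mrc', 'rb' ) as marc_file:
  for record in MARCReader( marc_file ):
    print( num_urls( record ) )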

Conditionals

Many readers are probably familiar with IFTTT, the “IF This Then That” web service which can glue together various accounts, for instance “If I post a new photo to Instagram, then save it to my Dropbox backup folder.” These sorts of logical connections are essential to programming, because often whether or not you perform a particular action varies depending on some other condition.

Consider a program which counts the number of books by Virginia Woolf in your catalog. You want to count a book only if the author is Virginia Woolf. You can use Ruby code like this:

if author == 'Virginia Woolf'
  total = total + 1
end

There are three parts here: first we specify a condition, then there’s some code which runs only if the condition is true, and then we end the condition. Without some kind of indication that the block of code inside the condition has ended, the entire rest of our program would only run depending on if the variable author was set to the right string of text.

The == is definitely weird to see for the first time. Why two equals? Many programming languages use a variety of double-character comparisons because the single equals already has a meaning: single equals assigns a value to a variable (see the second line of the example above) while double-equals compares two values. There are other common comparisons:

  • != often means “is not equal to”
  • > and < are the typical greater or lesser than
  • >= and <= often mean “greater/lesser than or equal to”

Those can look weird at first, and indeed one of the more common mistakes (made by professionals and newbies alike!) is accidentally putting a single equals instead of a double.[1] While we’re on the topic of strange double-character equals signs, it’s worth pointing out that += and -= are also commonly seen in programming languages. These pairs of symbols respectively add or subtract a given number from a variable; they still assign a value, just one based on the variable’s current value. For instance, above I could have written total += 1, which is identical in outcome to total = total + 1.

Lastly, conditional statements can be far more sophisticated than a mere “if this do that.” You can write code that says “if blah do this, but if bleh do that, and if neither do something else.” Here’s a Ruby script that would count books by Virginia Woolf, books by Ralph Ellison, and books by someone other than those two.

total_vw = 0
total_re = 0
total_others = 0
if author == 'Virginia Woolf'
  total_vw += 1
elsif author == 'Ralph Ellison'
  total_re += 1
else
  total_others += 1
end

Here, we set all three of our totals to zero first, then check to see what the current value of author is, adding one to the appropriate total using a three-part conditional statement. The elsif is short for “else if” and that condition is only tested if the first if wasn’t true. If neither of the first two conditions is true, our else section serves as a kind of fallback.

Arrays

An array is simply a list of variables; in fact, the Python language has an array-like data type named “list.” They’re commonly denoted with square brackets, e.g. in Python a list looks like

stuff = [ "dog", "cat", "tree"]

Later, if I want to retrieve a single piece of the array, I just access it using its index wrapped in square brackets, starting from the number zero. Extending the Python example above:

stuff[0]
# returns "dog"
stuff[2]
# returns "tree"

Many programming languages also support associative arrays, in which the index values are strings instead of numbers. For instance, here’s an associative array in PHP:

$stuff = array(
  "awesome" => "sauce",
  "moderate" => "spice",
  "mediocre" => "condiment",
);
echo $stuff["mediocre"];
// prints out "condiment"

Arrays are useful for storing large groups of like items: instead of having three variables, which requires more typing and remembering names, we just have one array containing everything. While our three strings aren’t a lot to keep track of, imagine a program which deals with all the records in a library catalog, or all the search results returned from a query: having an array to store that large list of items suddenly becomes essential.

Loops

Loops repeat an action a set number of times or until a condition is met. Arrays are commonly combined with loops, since loops make it easy to repeat the same operation on each item in an array. Here’s a concise example in Python which prints every entry in the “names” array to the screen:

names = ['Joebob', 'Suebob', 'Bobob']
for name in names:
  print(name)

Without arrays and loops, we’d have to write:

name1 = 'Joebob'
name2 = 'Suebob'
name3 = 'Bobob'
print(name1)
print(name2)
print(name3)

You see how useful arrays are? As we’ve seen with both functions and arrays, programming languages like to expose tools that help you repeat lots of operations without typing too much text.

There are a few types of loops, including “for” loops and “while” loops. Our “for” loop earlier went through a whole array, printing each item out, but a “while” loop only keeps repeating while some condition is true. Here is a bit of PHP that prints out the first four natural numbers:

$counter = 1;
while ( $counter < 5 ) {
  echo $counter;
  $counter = $counter + 1;
}

Each time we go through the loop, the counter is increased by one. When it hits five, the loop stops. But be careful! If we left off the $counter = $counter + 1 line then the loop would never finish because the while condition would never be false. Infinite loops are another potential bug in a program.

Objects & Object-Oriented Programming

Object-oriented programming (oft-abbreviated OOP) is probably the toughest item in this post to explain, which is why I’d rather people see it in action by trying out Codecademy than read about it. Unfortunately, it’s not until the end of the JavaScript track that you really get to work with OOP, but it gives you a good sense of what it looks like in practice.

In general, objects are simply a means of organizing code. You can group related variables and functions under an object. You can make an object inherit properties from another one if it needs to use all the same variables and functions but also add some of its own.

For example, let’s say we have a program that deals with a series of people, each of whom has a few properties like a name and age but also the ability to say hi. We can create a Person class which is kind of like a template; it helps us stamp out new copies of objects without rewriting the same code over and over. Here’s an example in JavaScript:

function Person(name, age) {
  this.name = name;
  this.age = age;
  this.sayHi = function() {
    console.log("Hi, I'm " + name + ".");
  };
}

Joebob = new Person('Joebob', 39);
Suebob = new Person('Suebob', 40);
Bobob = new Person('Bobob', 3);
Bobob.sayHi();
// prints "Hi, I'm Bobob."
Suebob.sayHi();
// prints "Hi, I'm Suebob."

Our Person function is essentially a class here; it allows us to quickly create three people who are all objects with the same structure, yet they have unique values for their name and age.[2] The code is a bit complicated and JavaScript isn’t a great example, but basically think of this: if we wanted to do this without objects, we’d end up repeating the content of the Person block of code three times over.

The efficiency gained with objects is similar to how functions save us from writing lots of redundant code; identifying common structures and grouping them together under an object makes our code more concise and easier to maintain as we add new features. For instance, if we wanted to add a myAgeIs function that prints out the person’s age, we could just add it to the Person class and then all our people objects would be able to use it.

Modules & Libraries

Lest you worry that every little detail in your programs must be written from scratch, I should mention that all popular programming languages have mechanisms which allow you to reuse others’ code. Practically, this means that most projects start out by identifying a few fundamental building blocks which already exist. For instance, parsing MARC data is a non-trivial task which takes some serious knowledge both of the data structure and the programming language you’re using. Luckily, we don’t need to write a MARC parsing program on our own, because several exist already:

  • pymarc for Python
  • ruby-marc for Ruby
  • MARC::Record for Perl
  • File_MARC for PHP
  • marc4j for Java

The Code4Lib wiki has an even more extensive list of options.

In general, it’s best to reuse as much prior work as possible rather than spend time working on problems that have already been solved. Complicated tasks like writing a full-fledged web application take a lot of time and expertise, but code libraries already exist for this. Particularly when you’re learning, it can be rewarding to use a major, well-developed project at first to get a sense of what’s possible with programming.

Attention to Detail

The biggest hangup for new programmers often isn’t conceptual: variables, functions, and these other constructs are all rather intuitive, especially once you’ve tried them a few times. Instead, many newcomers find out that programming languages are very literal and unyielding. They can’t read your mind and are happy to simply give up and spit out errors if they can’t understand what you’re trying to do.

For instance, earlier I mentioned that text variables are usually wrapped in quotes. What happens if I forget an end quote? Depending on the language, the program may either just tell you there’s an error or it might badly misinterpret your code, treating everything from your open quote down to the next instance of a quote mark as one big chunk of variable text. Similarly, accidentally misusing double equals or single equals or any of the other arcane combinations of mathematical symbols can have disastrous results.

Once you’ve worked with code a little, you’ll start to pick up tools that ease a lot of minor issues. Most code editors use syntax highlighting to distinguish different constructs, which aids error recognition. This very post uses a syntax highlighter for WordPress to color keywords like “function” and distinguish variable names. Other tools can “lint” your code for mistakes or code which, while technically valid, can easily lead to trouble. The text editor I commonly use does wonderful little things like provide closing quotes and parens, highlight lines which don’t pass linting tests, and let me test-run selected snippets of code.

There’s lots more…

Code isn’t magic; coders aren’t wizards. Yes, there’s a lot to programming and one can devote a lifetime to its study and practice. There are also thousands of resources available for learning, from MOOCs to books to workshops for beginners. With just a few building blocks like the ones described in this post, you can write useful code which helps you in your work.

Footnotes

[1]^ True story: while writing the very next example, I made this mistake.

[2]^ Functions which create objects are called constructor functions, which is another bit of jargon you probably don’t need to know if you’re just getting started.

Libraries & Privacy in the Internet Age

Recently, we covered library data collection practices with an eye towards identifying what your library really needs to retain. In an era of seemingly comprehensive surveillance, libraries do their best to afford their patrons some privacy. Limiting our circulation statistics is a prime example: while many libraries track how many times a particular item circulates, it’s common practice to delete loan histories in patron records once items have been returned. Thus, in keeping with the Library Code of Ethics, we can “protect each library user’s right to privacy and confidentiality” while at once using data to improve our services.

However, not all information lives in books and our privacy protections must stay current with technology. Obfuscating the circulation of physical items is one thing, but what about all of our online resources? Most of the data noted in the data collection post is born digital: web analytics, server logs, and heat maps. Today, people expose more and more of their personal information online, and do so mostly on for-profit websites. In this post, I’ll go beyond library-specific data to talk further about how we can offer patrons enhanced privacy even when they’re not using resources we control, such as the library website or ILS.

Public Computers

Libraries are a great bastion of public computer access. We’re pretty much the only institution in modern society that a community can rely upon for free software use and web access. But how much thought do we put into the configuration of our public computers? Are we sure that each user’s session is completely isolated, unable to be accessed by others?

For a while, I tried to do quantitative research on how well libraries handled web browser settings on public computers. I went to whatever libraries I could—public, academic, law, anyone who would let me in the door and sit down at a computer, typically without a library card (which not every library allows). If I could get to a machine, I ran a brief audit of sorts, these being the main items:

  • List the web browsers on the machine, their versions, settings, & any add-ons present
  • Run Mozilla’s Plugin Check to test for outdated plugins, a common security vulnerability for browsers
  • Attempt to install problematic add-ons, such as keyloggers [1]
  • Attempt to change the browser’s settings, e.g. set it to offer to save passwords
  • Close the browser, then reopen it to see if my history and settings changes persisted
  • DELETE ALL THE THINGS

After a while, I gave up on this effort because I became busy with other projects and never achieved a satisfactory sample size. Still, among the fourteen browsers across six (see what I mean about sample size?) libraries I did test, results were discouraging:

  • 93% (all but one) of browsers were outdated
  • On average, browsers had two plug-ins with known security vulnerabilities and two-and-a-half more which were outdated
  • The majority of browsers (79%) retained their history after being closed
  • A few (36%) offered to remember passwords, which could lead to dangerous accidents on shared computers
  • The majority (86%) had no add-ons installed
  • All but one allowed its settings to be changed
  • All but one allowed arbitrary add-ons to be installed

I understand that IT departments often control public computer settings and that there are issues beyond privacy which dictate these settings. But these are miserable figures by any standard. I encourage all librarians to run similar audits on their public computers and see if any improvements can be made. We’re allowing users’ sessions to bleed over into each other and giving them too much power to monitor each other’s activity. Much as libraries commonly anonymize circulation information to protect patrons from invasive government investigations, we should strive to keep web activities safe with sensible defaults. [2]

Many libraries force users to sign in or reserve a computer. Academic libraries may use Active Directory, wherein students sign in with a common login they use for other services like email, while public libraries may use PC reservation software like EnvisionWare. These approaches go a long way towards isolating user sessions, but at the cost of imposing access barriers and slowing start-up times. Now users need an AD account or library card to use your computers. Furthermore, users don’t always remember to sign off at the end of their session, meaning someone else could still sit down at their machine and potentially access their information. These can seem like unimportant edge cases, but they’re still worthy of consideration. Privacy almost always involves some kind of tradeoff, for users and for libraries. We need to ensure we’re making the right tradeoffs with due diligence.

Proactive Privacy

Libraries needn’t be on the defensive about privacy. We can also proactively help patrons in two ways: by modifying the browsers on our public computers to offer enhanced protections and by educating the public about their privacy.

While providing sensible defaults, such as not offering to remember passwords and preventing the installation of keylogging software, is helpful, it does little to offer privacy above and beyond what one would experience on a personal machine. However, libraries can use a little knowledge and research to offer default settings which are unobtrusive and advantageous. The most obvious example is HTTPS. HTTPS is probably familiar to most people; when you see a lock or other security-connoting icon in your browser’s address bar, it’ll be right alongside a URL that begins with the HTTPS scheme. You can think of the S in HTTPS as standing for “Secure,” meaning your web traffic is encrypted as it goes from node to node in between your browser and the server delivering data.

Banking sites, social media, and indeed most web accounts are commonly accessed over HTTPS connections. They operate rather seamlessly, the same as HTTP connections, with one slight caveat: HTTPS sites don’t load HTTP resources (e.g. if https://example.com happens to include the image http://cats.com/lol.jpg) by default, meaning sometimes pieces of a page are missing or broken. This commonly results in a “mixed content” warning which the user can override, though how intuitive that process is varies widely across browser user interfaces.

In any case, mixed content happens rarely enough that HTTPS, when available, is a no-brainer benefit. But here’s the rub: not all sites default to HTTPS, even if they should. Most notably, Facebook doesn’t. Do you want your patrons logging into Facebook with unencrypted credentials? No, you don’t, because anyone monitoring network traffic, using a tool like Firesheep for instance, can grab and reuse those credentials. So installing an extension like the superlative HTTPS Everywhere [3], which uses a crowdsourced set of formulas to deliver HTTPS sites where available, is of immense benefit to users even though they likely will never notice it.

HTTPS is just a start: there are numerous add-ons which offer security and privacy enhancements, from blocking tracking cookies to the NoScript Security Suite which blocks, well, pretty much everything. How disruptive these add-ons are is variable and putting NoScript or a similar script-blocking tool on public computers is probably a bad idea; it’s simply too strange for unacquainted users to understand. But awareness of these tools is vital and some of the less disruptive ones still offer benefits that the majority of your patrons would enjoy. If you’re on the fence about a particular option, a little targeted usability testing could highlight whether it’s worth it or not.

Education

In terms of education, online privacy is a massively under-taught field. Workshops in public libraries and courses in academic libraries are obvious and in-demand services we can provide. They can cater to users of all skill levels. A basic introduction might appeal to people just beginning to use the web, covering core concepts like HTTPS, session data (e.g. cookies), and the importance of auditing account settings. An advanced workshop could cover privacy software, two-factor authentication, and pivotal extensions that have a more niche appeal.

Password management alone is a rich topic. Why? Because it’s a problem for everyone. Being a modern web user virtually necessitates maintaining a double-digit number of accounts. Password best practices are fairly well-known: use lengthy, unique passwords with a mixture of character types (lowercase and uppercase letters, numbers, and punctuation). Applying them is another matter. Repeating one password across accounts means that if one company gets hacked, suddenly all your accounts are potentially vulnerable. Using tricky number-letter replacement strategies can lead to painful forgetting—was it LibrarianFervor with “1”s instead of “i”s, a “3” instead of an “e”, a “0” instead of an “o”, or any combination thereof? Or did I spell it in reverse? These strategies aren’t much more secure, and yet they make remembering passwords tough.

Users aren’t to be blamed: creating a well-considered and scalable approach to managing online accounts is difficult. But many wonderful software packages exist for this, e.g. the open source KeePass or paid solutions like 1Password and LastPass. Merely showing users these options and explaining their immense benefits is a public service.

To use a specific example, I co-taught an interdisciplinary course recently with a title broad enough—”The Nature of Knowledge,” try that on for size—that sneaking in privacy, social media, and web browsers was easy. One task I had willing students perform was to install the PrivacyFix extension [4] and then report back on their findings. PrivacyFix analyzes your Google and Facebook settings, telling you how much you’re worth to each company and pointing out places where you might be overexposing your information. It also includes a database of site ratings, judging sites based on how well they handle users data.

Our class was as diverse as any at my community college: we had adult students, teenage students, working parents, athletes, future teachers, future nurses, future police officers, black students, white students, Latino students, women, men. And you know what? Virtually everyone was shocked by their findings. They gasped, they changed their settings, they did independent research on online privacy, and at the end of the course they said they still wanted to learn more. I hardly think this class was an anomaly. Americans know they’re being monitored at every turn. They want to share information online but they want to do so intelligently. If we offer them the tools to do so, they’ll jump at the chance.

Exeunt

For those who are curious about browser extensions, I wrote (shameless plug) a RUSQ column on web privacy that covers most of this post but goes further in detail in terms of recommendations. The Sec4Lib listserv is worth keeping an eye on as well, and if you really want to go the extra mile you could attend the Security preconference at the upcoming LITA Forum in November. Online privacy is not likely to get any less complicated in the future, but libraries should see that as an opportunity. We’re uniquely poised, both as information professionals with a devotion to privacy and as providers of public computing services, to tackle this issue. And no one is going to do it for us.

Footnotes

[1]^ Keyloggers are software which record each keystroke. As such, they can be used to find username and password information entered into web forms. I couldn’t find a free keylogger add-on for every browser so I only tested in browsers which had one available.

[2]^ I have a GitHub repository of what I consider to be sensible defaults for Mozilla Firefox, which happens to store settings in a JavaScript file and thus makes it easy to version and share them. The settings are liberally commented but not tested in production.

[3]^ As you’ll notice if you visit that link, HTTPS Everywhere is only available for Google Chrome and Mozilla Firefox. In my experience, it almost never causes problems, especially with major websites like Facebook, and there are a few similar extensions which one could try, e.g. KB SSL for Chrome. Unfortunately, Internet Explorer has a much weaker add-on ecosystem with no real HTTPS solution that I’m aware of. Safari also has a weak extension ecosystem, though there is at least one HTTPS Everywhere-type option that I haven’t tried and which has acknowledged limitations. At the very least, installing HTTPS Everywhere on Firefox and Chrome still helps users who employ those browsers, without affecting users who prefer the others.

[4]^ Update 2016-05-27: PrivacyFix has been discontinued, sadly. Here’s a post discussing what your options are.

Advice on Being a Solo Library Technologist

I am an Emerging Technologies Librarian at a small library in the middle of a cornfield. There are three librarians on staff. The vast majority of our books fit on one floor of open stacks. Being so small can pose challenges to a technologist. When I’m banging my head trying to figure out what the heck “this” refers to in a particular JavaScript function, to whom do I turn? That’s but an example of a wide-ranging set of problems:

  • Lack of colleagues with similar skill sets. This has wide-ranging ill effects, from leaving me no one to ask questions of or bounce ideas off, to making it more difficult to sell my ideas.
  • Broad responsibilities that limit time spent on technology
  • Difficulty creating endurable projects that can be easily maintained
  • Difficulty determining which projects are appropriate to our scale

Though listservs and online sources alleviate some of these concerns, there’s a certain knack to being a library technologist at a small institution.[1] While I still have a lot to learn, I want to share some strategies that have helped me thus far.

Know Thy Allies

At my current position, it took me a long time to figure out how the college was structured. Who is responsible for managing the library’s public computers? Who develops the website? If I want some assessment data, where do I go? Knowing the responsibilities of your coworkers is vital and effective collaboration is a necessary element of being a technologist. I’ve been very fortunate to work with coworkers who are immensely helpful.

IT Support can help with both your personal workstation and the library’s setup. Remember that IT’s priorities are necessarily at odds with yours: they want to keep everything up and running; you want to experiment and kick the tires. When IT denies a request or takes ages to fix something that seems trivial to you, remember that they’re just as overburdened as you are. Their assistance in installing and troubleshooting software is invaluable. This is a two-way street: you often have valuable insight into how users behave and what setups are most beneficial. Try to give and take, asking for favors at the same time that you volunteer your services.

Institutional Research probably goes by a dozen different names at any given dozen institutions. These names may include “Assessment Office,” “Institutional Computing,” or even the fearsome “Institutional Review Board” of research universities. These are your data collection and management people and—whether you know it or not—they have some great stuff for you. It took me far too long to browse the IR folder on our shared drive which contains insightful survey data from the CCSSE and in-house reports. There’s a post-graduate survey which essentially says “the library here is awesome,” good to have when arguing for funding. But they also help the library work with the assessment data that our college gathers; we hope to identify struggling courses and offer our assistance.

The web designer should be an obvious contact point. Most technology is administered through the web these days—shocking, I know. The webmaster will not only be able to get you access to institutional servers but may have learned valuable lessons from their own position. They, too, struggle to complete a wide range of tasks. They have to negotiate with many stakeholders who all want a slice of the vaunted homepage, often the subject of territorial battles. They may have a folder of good PR images or a style guide sitting around somewhere; at the very least, some O’Reilly books you want to borrow.

The Learning Management System administrator is similar to the webmaster. They probably have some coding skills and carry an immense, important burden. At my college, we have a slew of educational technologists who work in the “Faculty Development Center” and preside over the LMS. They’re not only technologically savvy, often introducing me to new tools or techniques, but they know how faculty structure their courses and have a handle on pedagogical theory. Their input can not only generate new ideas but help you ground your initiatives in a solid theoretical basis.

Finally, my list of allies is obviously biased towards academic libraries. But public librarians have similar resources available; they just go by different names. Your local government has many of these same positions: data management, web developer, technology guru. Find out who they are and reach out to them. Anyone can look for local hacker/makerspaces or meetups, which can be a great way not only to develop your skills but to meet people who may have brilliant ideas and insight.

Build Sustainably

Building projects that will last is my greatest struggle. It’s not so hard to produce an intricate, beautiful project if I pour months of work into it, but what happens the month after it’s “complete”? A shortage of ideas has never been my problem; the trouble is finding ones that are doable. Too often, I’ll get halfway into a project and realize there’s simply no way I can handle the upkeep on top of my usual responsibilities, which stubbornly do not diminish. I have to staff a reference desk, teach information literacy, and make purchases for our collection. Those are important responsibilities and they often provide a platform for experimentation, but they’re also stable obligations that cannot be shirked.

One of the best ways to determine if a project is feasible is to look around at what other libraries are doing. Is there an established project—for instance, a piece of open source software with a broad community base—which you can reuse? Or are other libraries devoting teams of librarians to similar tasks? If you’re seeing larger institutions struggle to perfect something, then maybe it’s best to wait until the technology is more mature. On the other hand, dipping your toe in the water can quickly give you a sense of how much time you’ll need to invest. Creating a prototype or bringing coworkers on board at early stages lets you see how much traction you have. If others are resistant or if your initial design is shown to have gaping flaws, perhaps another project is more worthy of your time. It’s an art, but saying no, dropping a difficult initiative, or recognizing that an experiment has failed is often the right thing to do.

Documentation, Documentation, Documentation

One of the first items I accomplished on arrival at my current position was setting up a staff-side wiki on PBworks. While I’m still working on getting other staff members to contribute to it (approximately 90% of the edits are mine), it’s been an invaluable information-sharing resource. Part-time staff members in particular have noted how it’s nice to have one consistent place to look for updates and insider information.

How does this relate to technology? In the last couple years, my institution has added or redesigned dozens of major services. I was going to write a ludicrously long list but…just trust me, we’ve changed a lot of stuff. A new technology or service cannot succeed without buy-in, and you don’t get buy-in if no one knows how to use it. You need documentation: well-written, illustrative documentation. I try to keep things short and sweet, providing screencasts and annotated images to highlight important nuances. Beyond helping others, it’s been invaluable to me as well. Remember when I said I wasn’t so great at building sustainably? Well, I’ll admit that there are some workflows or code snippets that are Greek to me each time I revisit them. Without my own instructions or blocks of comments, I would have to reverse engineer the whole process before I could complete it again.

Furthermore, not all my fellow staff are on par with my technical skills. I’m comfortable logging into servers, running Drush commands, analyzing the statistics I collect. And that’s not an indictment of my coworkers; they shouldn’t need to do any of this stuff. But some of my projects are reliant on arcane data schemas or esoteric commands. If I were to win the lottery and promptly retire, sophisticated projects lacking documentation would grind to a halt. Instead, I try to write instructions such that anyone could log into Drupal and apply module updates, for instance, even if they were previously unfamiliar with the CMS. I feel a lot better knowing that our bus factor is a little higher (that’s the number of people who would have to disappear before a project stalls) and that I can perhaps even take a vacation without checking email, some day.

Choose Wisely

The honest truth is that smaller institutions cannot afford to invest in every new and shiny object that crosses their path. I see numerous awesome innovations at other libraries which simply are not wise investments for a college of our size. We don’t have the scale, skills, or budget for much of the technology out there. Even open source solutions are a challenge because they require skill to configure and maintain. Everything I wrote about sustainability and allies is trying to mitigate this lack of scale, but the truth is some things are just not right for us. It isn’t helpful to build projects that only you can continue, or to develop ones which require so much attention that other fundamental responsibilities (doubtless less sexy, but no less important) fall through the cracks.

I record my personal activities in Remember the Milk, tagging tasks according to topic. What do you think was the tag I used most last year? Makerspace? Linked data? APIs? Node.js? Nope, it was infolit. That is hardly an “emerging” field but it’s a vital aspect of my position nonetheless.

I find that the best way to select amongst initiatives is to work backwards: what is crucial to your library? What are the major challenges, obvious issues that you’re facing? While I would not abandon pet projects entirely, because sometimes they can have surprisingly wide-ranging effects, it helps to ground your priorities properly.[2] Working on a major issue virtually guarantees that your work will attract more support from your institution. You may find more allies willing to help, or at least coworkers who are sympathetic when you plead with them to cover a reference shift or swap an instruction session because you’re overwhelmed. The big issues themselves are easy to find: user experience, ebooks, discovery, digital preservation, {{insert library school course title here}}. At my college, developmental education and information literacy are huge. It’s not hard to align my priorities with the institution’s.

Enjoy Yourself

No doubt working on your own or with relatively little support is challenging and stressful. It can be disappointing to pass up new technologies because they’re too tough to implement, or when a project fails due to one of the bullet points listed above. But being a technologist should always be fun and bring feelings of accomplishment. Try to inject a little levity and experimentation into the places where it’s least expected; who knows, maybe you’ll strike a chord.

There are also at least a couple advantages to being at a smaller institution. For one, you often have greater freedom and less bureaucracy. What a single individual does on your campus may be done by a committee (or even—the horror—multiple committees) elsewhere. As such, building consensus or acquiring approval can be a much simplified process. A few informal conversations can substitute for mountains of policies, forms, meetings, and regulations.

Secondly, workers at smaller places are more likely to be jack-of-all-trades librarians. While I’m a technologist, I wear plenty of more traditional librarian hats as well. On the one hand, that certainly means I have less time to devote to each responsibility than a specialist would; on the other, it gives me a uniquely holistic view of the library’s operations. I not only understand how the pieces fit together, but am better able to identify high-level problems affecting multiple areas of service.

I’m still working through a lot of these issues, on my own. How do you survive as a library technologist? Is it just as tough at a large institution? I’m all eyes.

Footnotes

[1]^ Here are a few of my favorite sources for being a technology librarian:

  • Listservs, particularly Code4Lib and Drupal4Lib. Drupal4Lib is a great place to be if you’re using Drupal and are running into issues; there are a lot of “why won’t this work” and “how do you do X at your library” threads, and several helpful experts hang around the list.
  • For professional journals, once again Code4Lib is very helpful. ITAL is also open access, and good tech tips periodically appear in C&RL News or C&RL. Part of being at a small institution is being limited to open access journals; these are the ones I read most often.
  • Google. Google is great. For answering factual questions or figuring out what the most common tool is for a particular task, a quick search can almost always turn up the answer. I’d be remiss if I didn’t mention that Google usually leads me to one of a couple excellent sources, like Stack Overflow or the Mozilla Developer Network.
  • Twitter. Twitter is great, too. I follow many innovative librarians but also leading figures in other fields.
  • GitHub. GitHub can help you find reusable code, but there’s also a librarian community and you can watch as they “star” projects and produce new repositories. I find GitHub useful as a set of instructive code; if I’m wondering how to accomplish a task, I can visit a repo that does something similar and learn from how better developers do it.

[2]^ We’ve covered managing side projects and work priorities previously in “From Cool to Useful: Incorporating hobby projects into library work.”

Coding & Collaboration on GitHub

Previously on Tech Connect we wrote about the Git version control system, walking you through “cloning” a project onto your computer, making some small changes, and committing them to the project’s history. But that post concluded on a sad note: all we could do was work by ourselves, fiddling with Git on our own computer and gaining nothing from the software’s ability to manage multiple contributors. Well, here we will return to Git to specifically cover GitHub, one of the most popular code-sharing websites around.

Git vs. GitHub

Git is open source version control software. You don’t need to rely on any third-party service to use it and you can benefit from many of its features even if you’re working on your own.

GitHub, on the other hand, is a company that hosts Git repositories on their website. If you allow your code to be publicly viewable, then you can host your repository for free. If you want to have a private repository, then you have to pay for a subscription.

GitHub layers some unique features on top of Git. There’s an Issues queue where bug reports and feature requests can be tracked and assigned to contributors. Every project has a Graphs section where interesting information, such as number of lines added and deleted over time, is charted (see the graphs for jQuery, for instance). You can create gists which are mini-repositories, great for sharing or storing snippets of useful code. There’s even a Wiki feature where a project can publish editable documentation and examples. All of these nice features build upon, but ultimately have little to do with, Git.

Collaboration

GitHub is so successful because of how well it facilitates collaboration. Hosted version control repositories are nothing new; SourceForge has been doing this since 1999, almost a decade prior to GitHub’s founding in 2008. But something about GitHub has struck a chord and it’s spread like wildfire. Depending on how you count, it’s the most popular collection of open source code, over SourceForge and Google Code.[1] The New York Times profiled co-founder Tom Preston-Werner. It’s inspired spin-offs, like Pixelapse which has been called “GitHub for Photoshop” and Docracy which TechCrunch called “GitHub for legal documents.” In fact, just like the phrase “It’s Facebook for {{insert obscure user group}}” became a common descriptor for up-and-coming social networks, “It’s GitHub for {{insert non-code document}}” has become commonplace. There are many inventive projects which use GitHub as more than just a collection of code (more on this later).

Perhaps GitHub’s popularity is due to Git’s own popularity, though similar sites host Git repositories too.[2] Perhaps the GitHub website simply implements better features than its competitors. Whatever the reason, it’s certain that GitHub does a marvelous job of allowing multiple people to manage and work on a project.

Fork It, Bop It, Pull It

Let’s focus on two nice features of GitHub—Forking and the Pull Request [3]—to see exactly why GitHub is so great for collaboration.

If you recall our prior post on Git, we cloned a public repository from GitHub and made some minor changes. Then, when reviewing the results of git log, we could see that our changes were present in the project’s history. That’s great, but how would we go about getting our changes back into the original project?

For the actual step-by-step process, see the LibCodeYear GitHub Project’s instructions. There are basically only two changes from our previous process, one at the very beginning and one at the end.

GitHub's Fork Button

First, start by forking the repository you want to work on. To do so, set up a GitHub account, sign in, visit the repository, and click the Fork button in the upper right. After a pretty sweet animation of a book being scanned, a new project (identical to the original in both name and files) will appear on your GitHub account. You can then clone this forked repository onto your local computer by running git clone on the command line and supplying the URL listed on GitHub.
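
For instance, if your GitHub username were “yourname” (a hypothetical stand-in for your own) and you forked the LibCodeYear repository mentioned above, the commands would look something like this:

$ git clone https://github.com/yourname/Codeyear-IG-Github-Project.git
$ cd Codeyear-IG-Github-Project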

Now you can do your editing. This part is the same as using Git without GitHub. As you change files and commit changes to the repository, the history of your cloned version and the one on your GitHub account diverge. By running git push you “push” your local changes up to GitHub’s remote server. Git will prompt you for your GitHub password, which can get annoying after a while, so you may want to set up an SSH key on GitHub so that you don’t need to type it in each time. Once you’ve pushed, if you visit the repository on GitHub and click the “commits” tab right above the file browser, you can see that your local changes have been published to GitHub.
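
As a rough sketch, the whole edit-commit-push cycle might look like this (the file name here is just an example, and master was GitHub’s default branch name at the time of writing):

$ git add "Getting Started/List of People.mdown"
$ git commit -m "add my name to the list"
$ git push origin master

However, your changes are still not in the original repository, which is underneath someone else’s account. How do you add your changes to the original project?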

GitHub's Pull Request Button

In your forked repository on GitHub, something is different: there’s a Pull Request button in the same upper right area where the Fork one is. Click that button to initiate a pull request. After you click it, you can choose which branches on your GitHub repository to push to the original GitHub repository, as well as write a note explaining your changes. When you submit the request, a message is sent to the project’s owners. Part of the beauty of GitHub is in how pull requests are implemented. When you send one, an issue is automatically opened in the receiving project’s Issues queue. Any GitHub account can comment on public pull requests, connecting them to open issues (e.g. “this fixes bug #43”) or calling upon other contributors to review the request. Then, when the request is approved, its changes are merged into the original repository.

diagram of forking & pulling on GitHub

“Pull Request” might seem like a strange term. “Push” is the name of the command that takes commits from your local computer and adds them to some remote server, such as your GitHub account. So shouldn’t it be called a “push request” since you’re essentially pushing from your GitHub account to another one? Think of it this way: you are requesting that your changes be pulled (as in the git pull command) into the original project. Honestly, “push request” might be just as descriptive, but for whatever reason GitHub went with “pull request.”
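
Incidentally, pulling is also how you keep your fork up to date as the original project moves on without you. A minimal sketch, assuming you register the original repository as a remote named “upstream” (that name is a convention, not a requirement):

$ git remote add upstream https://github.com/LibraryCodeYearIG/Codeyear-IG-Github-Project.git
$ git pull upstream master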

GitHub Applications

While hopefully we’ve convinced you that the command line is a fine way to do things, GitHub also offers Mac and Windows applications. These apps are well-designed and turn the entire process of creating and publishing a Git repository into a point-and-click affair. For instance, here is the fork-edit-pull request workflow from earlier except done entirely through a GitHub app:

  • Visit the original repository’s page, click Fork
  • On your repository’s page, select “Clone in Mac” or “Clone in Windows” depending on which OS you’re using. The repository will be cloned onto your computer
  • Make your changes and then, when you’re ready to commit, open up the GitHub app, selecting the repository from the list of your local ones
  • Type in a commit message and press Commit
    writing a commit message in GitHub for Windows
  • To sync changes with GitHub, click Sync
  • Return to the repository on GitHub, where you can click the Pull Request button and continue from there

GitHub without the command line, amazing! You can even work with local Git repositories, using the app to do commits and view previous changes, without ever pushing to GitHub. This is particularly useful on Windows, where installing Git can have a few more hurdles. Since the GitHub for Windows app comes bundled with Git, a simple installation and login can get you up-and-running. The apps also make the process of pushing a local repository to GitHub incredibly easy, whereas there are a few steps otherwise. The apps’ visual display of “diffs” (differences in a file between versions, with added and deleted lines highlighted) and handy shortcuts to revert to particular commits can appeal even to those of us who love the command line.

viewing a diff in GitHub for Windows

More than Code

In my previous post on Git, I noted that version control has applications far beyond coding. GitHub hosts a number of inventive projects that demonstrate this.

  • The Code4Lib community hosts an Antiharassment Policy on GitHub. Those in support can simply fork the repository and add their name to a text file, while the policy’s entire revision history is present online as well
  • The city of Philadelphia experimented with using GitHub for procurements with successful results
  • ProfHacker just wrapped up a series on GitHub, ending by discussing what it would mean to “fork the academy” and combine scholarly publishing with forking and pull requests
  • The Jekyll static-site generator makes it possible to generate a blog on GitHub
  • The Homebrew package manager for Mac makes extensive use of Git to manage the various formulae for its software packages. For instance, if you want to roll back to a previous version of an installed package, you run brew versions $PACKAGE where $PACKAGE is the name of the package. That command prints a list of Git commits associated with older versions of the package, so you can enter the Homebrew repository and run a Git command like git checkout 0476235 /usr/local/Library/Formula/gettext.rb to get the installation formula for version 0.17 of the gettext package.

These wonderful examples aside, GitHub is not a panacea for coding, collaboration, or any of the problems facing libraries. GitHub can be an impediment to those who are intimidated or simply not sold on the value of learning what’s traditionally been a software development tool. On the Code4Lib listserv, it was noted that the small number of signatories on the Antiharassment Policy might actually be due to its being hosted on GitHub. I struggle to sell people on my campus on the value of Google Docs with its collaborative editing features. So, as much as I’d like the Strategic Plan the college is producing to be on GitHub where everyone could submit pull requests and comment on commits, it’s not necessarily the best platform. It is important, however, not to think of GitHub as limited purely to versioning code written by professional developers. It has uses for amateurs and non-coders alike.

Footnotes

[1]^ GitHub Has Passed SourceForge, (June 2, 2011), ReadWrite.

[2]^ Previously-mentioned SourceForge also supports Git, as does Bitbucket.

[3]^ I think this would make an excellent band name, by the way.

Learn to Love the Command Line

“Then the interface-makers went to work on their GUIs, and introduced a new semiotic layer between people and machines. People who use such systems have abdicated the responsibility, and surrendered the power, of sending bits directly to the chip that’s doing the arithmetic, and handed that responsibility and power over to the OS.”
—Neal Stephenson, In the Beginning was the Command Line

Many of us here at Tech Connect are fans of the command line. We’ve written posts on configuring a local server, renaming files en masse, & using the Git version control system that are full of command line incantations, some mysterious and some magical. No doubt about it, the command line is intimidating. I’m sure most computer users, when they see it, think “Didn’t we move beyond this already? Can’t someone just write an app for that?” Up until about a year ago, I felt that way, but I’ve come to love the command line like Big Brother. And I’m here to convince you that you should love the command line, too.

Scripting

So why use a command line when you can probably accomplish almost everything in a graphical user interface (GUI)? Consider the most repetitive, dreary task you do on a daily basis. It might be copying text back-and-forth between applications over and over. You might periodically back up several different folders by copying them to an external drive. We all have repetitive workflows in dire need of automation. Sure, you could write some macros, but macros can be brittle, breaking when the slightest change is introduced. They’re often tricky to write correctly, so tricky that the blinking square at the start of the command prompt is starting to look promising.

One of the first joys of the command line is that everything you do can be scripted. Any set of steps, no matter how lengthy and intricate, can be compiled into a script which completes the process in a matter of seconds. Copy a dozen folders in disparate locations to a backup drive? No problem. Optimize a web app for deployment? A script can minify your JavaScript and CSS, concatenate files, optimize images, test your code for bugs, and even push it out to a remote server.
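
Here’s a minimal sketch of the kind of backup script described above; the folders and the /Volumes/Backup destination are hypothetical stand-ins for your own (this example assumes a Mac, where external drives appear under /Volumes):

#!/usr/bin/env bash
# backup.sh - copy a few disparate folders to an external drive
cp -R ~/Documents/reports /Volumes/Backup/
cp -R ~/projects/website /Volumes/Backup/
cp -R ~/Pictures/scans /Volumes/Backup/

Save it as backup.sh and the whole routine runs with one command: bash backup.sh.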

Available Software

It may seem odd, but moving to the Neanderthal command line can actually increase the amount of software available to you. There are lots of programs that don’t come with a graphical interface, either because they’re lightweight tools that don’t require a GUI or because their author knew that they’d be used almost exclusively by people comfortable with the command line. There are also many software packages that, while they have GUIs available, are more powerful on the command line. Git is a good example: there are numerous graphical interfaces and most are quite good. But learning the commands opens up a wealth of options, configurations, and customizations that just do not exist graphically.

Tying It All Together

There are many wonderful GUIs that allow you to do complex things: generate visualizations, debug code, edit videos. But most applications are in their own separate silos and work you do in one app cannot be easily connected to any other. On the command line, output from one command can easily be “piped” into another, letting you create long chains of commands. For instance, let’s say I want to create a text document with all the filenames in a particular directory in it. With a file manager GUI, this is a royal pain: I can see a list of files but I can only copy their names one by one. If I click-and-drag to select all the names, the GUI thinks I want to select the files themselves and won’t let me paste the text of their names.

On the command line, I simply write the output of the ls command to a file: [1]

ls -a > filenames.txt

Now filenames.txt will list all the files and directories in the current folder. The > writes the output of ls -a (list all files, including hidden ones) to the named file; it’s one of a few ways of redirecting output (>>, for instance, appends to a file rather than overwriting it). Now what if I want only filenames that contain a number in them?

ls -a | grep "[0-9]" > filenames-with-numbers.txt

I already have a list of all the file names, so I “pipe” the output of that command using the vertical bar | to grep, which in turn outputs only lines that have a number zero through nine in them, finally writing the text to filenames-with-numbers.txt. This is a contrived example but it illustrates something incredibly powerful about the command line: the text output of any command can easily be used as input for another command. While it may look like a lot of punctuated gibberish, it’s actually pretty intuitive to work with. Anyone who has tried copying GUI-formatted text into an environment with different formatting should be envious.
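
The chain doesn’t have to stop at two commands, either. For instance, swapping the file redirection for wc -l (word count, in line-counting mode) reports how many matching names there are rather than saving them:

ls -a | grep "[0-9]" | wc -l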

REPLs

So you want to play around with a programming language. You’ve written a file named “hello-world.py” or “hello-world.rb” or “hello-world.php” but you have no idea how to get the code to actually run. Command line to the rescue! If you have the programming language installed and in your path, you can run python hello-world.py or ruby hello-world.rb or php hello-world.php and the output of your script will print right in your terminal. But wait, there’s more!

Many modern scripting languages come with a REPL built-in, which stands for “Read-Eval-Print loop.” In English, this means you write text that’s instantly evaluated as code in the programming language. For instance, below I enter the Python REPL and do some simple math:

$ python
Python 2.7.2
Type "help", "copyright", "credits" or "license" for more information.
>>> 1 + 2
3
>>> 54 * 34
1836
>>> 64 % 3
1
>>> exit()

The $ represents my command prompt, so I simply ran the command python which entered the REPL, printing out some basic information like the version number of the language I’m using. Inside the REPL, I did some addition, multiplication, and a modulus. The exit() function call at the end exited the REPL and put me back in my command prompt. There’s really no better way to try out a programming language than with a REPL. [2] The REPL commands for Ruby and PHP are irb and php -a respectively.

Stumbling Blocks

Everyone who starts out working on the command line will run into a bunch of common hangups. It’s frustrating, but it doesn’t need to be. It’s really like learning common user interface conventions: how would you know an anachronistic floppy disk means “save this file” if you hadn’t already clicked it a million times? Below is my attempt to alleviate some of the most common quirks of the command line.

Where Is It?

Before we get too far, it would probably be good to know how to get to a command prompt in the first place. You use a terminal emulator to open a command prompt in a window on your operating system. On a Mac, the terminal emulator is the aptly-named Terminal.app, which is located in Applications/Utilities. On Ubuntu Linux, there’s also a Terminal application which you can find using the Dash or with the handy Ctrl+Alt+T shortcut. [3] KDE Linux distributions use Konsole as the default command line application instead. Finally, Windows has a couple terminal emulators and, sadly, they are incompatible with *NIX shells like those found on Mac and Linux machines. Windows PowerShell can be found in Start > All Programs > Accessories or by searching after clicking the Start button. You can open PowerShell on Windows 8 by searching as well.

Escaping Spaces

Spaces are a meaningful character on the command line: they separate one command from the next. So when you leave spaces in filenames or text strings, commands tend to return errors. You need to escape spaces in one of two ways: precede a space with a backslash, “\”, which tells the command line to interpret the following character literally (i.e. not as a directive), or wrap the string containing spaces in quotes. So mv File Name.txt New Name.txt will be greeted with the usage information for the mv command because the program assumes you don’t understand how to use it, while mv "Sad Cats.txt" "LOL Cats.txt" will successfully rename “Sad Cats.txt” to “LOL Cats.txt”.

Luckily, you won’t need to type quotes or slashes very often due to the miracle that is tab completion. If you start typing a command or file name, you can hit the tab key and your shell will try to fill in the rest. If there are two names that begin identically then you’ll have to provide enough to disambiguate between the two: if “This is my file.txt” and “This is my second file.txt” are both in the same folder, you’ll have to type at least “This is my ” and then, depending on whether your next letter is an F or an S, the shell can tab complete whatever follows to the appropriate name.

Copy & Paste

Command line interfaces were invented before common GUI applications and their keyboard shortcuts like Ctrl+V and Ctrl+C. Those shortcuts were already assigned different actions on the command line and so they typically do not do what you’d expect, since for backwards compatibility they need to stick to their original meanings. So does that mean you have to type every long URL or string of text into the terminal? No! You can still copy-paste but it works a bit different on most terminals. On Ubuntu Linux, Ctrl+Shift+C or V will perform copy and paste respectively, while I’ve found that right-clicking in Windows PowerShell will paste. Mac OS X’s Terminal.app can actually use the conventional ⌘+C or ⌘+V because they don’t conflict with Ctrl.

Moving Around the Line

Too many times I’ve typed out a long command, perhaps one with a URL or other pasted string of text in it, only to spot a typo dozens of characters back. On the command line, you cannot simply click to move your cursor to a different position, you have to press the back arrow dozens of times to go back and correct a mistake.

Or so I thought. On Mac’s Terminal.app, option-click will move the cursor. Unfortunately, most other terminal applications don’t provide any mouse navigation. Luckily, there’s a series of keyboard shortcuts that work in many popular shells:

  • Ctrl+A jumps to the beginning of the line
  • Ctrl+E jumps to the end of the line
  • Ctrl+U deletes to beginning of line
  • Ctrl+K deletes to end of line
  • Ctrl+W deletes the word before the cursor

Those shortcuts can save you a lot of time and are often quicker than using the mouse. Typing Ctrl+W to delete a lengthy URL is much more convenient than shift-clicking with the mouse to select it all, then hitting the delete key.

Make It Stop!!!

One of the first tips for learning the command line is to “read the friendly manual.” You can run the man command (short for manual) and pass it any other command to learn more about its usage, e.g. man cp will give you the manual of the copy command. However, the manual is often presented in a special format; it doesn’t just print the manual, it gives you a scrollable interface that replaces the command prompt. But how the heck do you exit the manual to get back to the command line? I’ll admit, when I was first learning, I used to simply close the terminal and re-open because I had no idea how to get out of the all-too-friendly manual.

But that’s unnecessary: pressing the letter q will quit the manual, while battering away on the ESC key or typing “exit please dear lord let me exit” won’t do a thing. q exits a few other commands, too.

Another good trick is Ctrl+C, which causes a keyboard interrupt, canceling whatever process is running. I use this all the time because I frequently run a test server from the command line, or a Sass command that watches for file changes, or do something really stupid that will execute forever. Ctrl+C is always the exit strategy.

The PATH

So you installed the “awesomesauce” software, but every time you run awesomesauce in your command prompt you get a message like “command not found.” What gives?

All shells come with a PATH variable that specifies where the shell looks for executable programs. You can’t simply type the name of a script or executable anywhere on your system and expect the shell to know about it. Instead, you have to tell the command line where to look. Let’s say I just installed awesomesauce to ~/bin/awesomesauce where ~ stands for my user’s home folder (on Mac OS X this will be /Users/MyUsername, for example—you can run echo ~ to see where your home folder is). Here’s how I get it working:

$ # command isn't in my path yet - this is a comment by the way
$ awesomesauce
bash: awesomesauce: command not found
$ export PATH=$PATH:~/bin
$ awesomesauce
You executed the Awesome Sauce command! Congratulations!

The export command lets me modify my PATH variable, appending the ~/bin directory to the list of other places in the PATH. Now that the shell is looking in ~/bin, it sees the awesomesauce command and lets me run it from the command line.

It quickly gets tedious appending directories to your PATH every time you want to use something, so it’s wise to use a startup script that runs every time you open a new terminal. In BASH, the common shell that comes with Mac OS X and is the default in most Linux distributions, you can do this by adding the appropriate export commands to a file named .bash_profile in your home folder. The commands in .bash_profile are executed every time a new Terminal window is opened.
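
For example, this one-liner appends the export command from above to that file (the single quotes keep $PATH from being expanded before it’s written):

echo 'export PATH=$PATH:~/bin' >> ~/.bash_profile

New terminal windows will pick the change up automatically; run source ~/.bash_profile to apply it to the window you already have open.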

Use the Google, my Friend

There are certainly other hangups not covered above, but a tremendous amount of good information exists on the web. People have been using command line interfaces for quite a while now and there is documentation for almost everything. There are great question-and-answer forums, like StackExchange’s Superuser section, with tons of content on the command line and writing shell scripts. I’ve basically taught myself the command line using Google and a few select sources. In fact, the man command alone is one of the shining advantages of the command line: imagine if you could just type text like man spot healing brush in Photoshop to figure out how all those crazy tools work?

Learning the command line is a challenge but it’s well worth the investment. Once you have a few basics under your belt, it opens up vast possibilities and can greatly increase your productivity. Still, I wouldn’t say that everyone needs to know the command line. Not everyone needs to learn coding either, but everyone can benefit from it. With a little practice, some trial and error, and many, many Google searches, you’ll be well on your way to commandeering like a boss.

Further Reading

wiki.bash-hackers.org – a great wiki for learning more about the common BASH shell

ss64.com – command line reference

Using the Terminal – Ubuntu’s guide to the command line

In the Beginning was the Command Line – Neal Stephenson’s treatise about the command line, GUIs, Windows, Linux, & BeOS. It actually isn’t that much about the command line but it’s an interesting read.

Footnotes

[1]^ The commands listed in this post will not work on Windows, but that doesn’t mean you can’t do the same things: the names and syntax of the commands are just different. It’s worth noting that Windows users can use the Cygwin project to get a shell experience comparable to *NIX systems.

[2]^ If you’re still too scared to try the command line after this post, the great repl.it site provides REPLs for over a dozen different programming languages.

[3]^ The default terminal on Ubuntu, and many other Linuxes, is technically “GNOME Terminal” but it may show up as just Terminal in search results. While these are the default terminals, most systems have more advanced alternatives that offer a few niceties like arranging terminal windows in a grid, paste history, and configurable keyboard shortcuts. On Mac, iTerm2 is an excellent choice. Lifehacker recommends Terminator for Linux, and it is also available for Windows and Mac. Wikipedia has an unnecessarily long list of terminal emulators for the truly insane.

How to Git

We have written about version control before at Tech Connect, most notably John Fink’s excellent overview of modern version control. But getting started with VC (I have to abbreviate it because the phrase comes up entirely too much in this post) is intimidating. If you are generally afraid of anything that reminds you of the DOS Prompt, you’re not alone and you’re also totally capable of learning Git.

DOS prompt madness
By the end of this post, we will still not understand what’s going on here.

But why should you learn git?

Because Version Control Isn’t Just for Nerds

OK, never mind, it is, it totally is. But VC is for all kinds of nerds, not just l33t programmers lurking in windowless offices.

Are you into digital preservation and/or personal archiving? Then VC is your wildest dream. It records your changes in meaningful chunks, documenting not just the final product but all the steps it took you to get there. VC repositories show who did what, too. If you care about nerdy things like provenance, then you care about VC. If co-authors always used VC for their writing, we’d know all the answers to the truly pressing questions, like whether Gilles Deleuze or Félix Guattari wrote the passage “A concept is a brick. It can be used to build a courthouse of reason. Or it can be thrown through the window.”

Are you a web developer? Then knowing Git can get you on GitHub, and GitHub is an immense warehouse of awesomeness. Sure, you can always just download .zip files of other people’s projects, but GitHub also provides more valuable opportunities: you can showcase your awesome tools, your brilliant tweaks to other people’s projects, and you can give back to the community at whatever level you’re comfortable with, from filing bug reports to submitting actual code fixes.

Are you an instruction librarian? Have you ever shared lesson plans, or edited other people’s lesson plans, or inherited poorly documented lesson plans? Basically, have you been an instruction librarian in the past century? Well, I have good news for you: Git can track any text file, so your lessons can easily be versioned and collaborated upon just like software programs are. Did you forget that fun intro activity you used two years ago? Look through your repository’s previous commits to find it. Want to maintain several similar but slightly different lesson plans for different professors teaching the same class? You’ve just described branching, something that Git happens to be great at. The folks over at ProfHacker have written a series of articles on using Git and GitHub for collaborative writing and syllabus design.

Are you a cataloger? Versioning bibliographic records makes a lot of sense. A presentation at last year’s Code4Lib conference talked not only about versioning metadata but data in general, concluding that the approach had both strengths and weaknesses. It’s been proposed that putting bibliographic records under VC solves some of the issues with multiple libraries creating and reusing them.

As an added bonus, having a record’s history can enable interesting analyses of how metadata changes over time. There are powerful tools that take a Git repository’s history and create animated visualizations; to see this in action, take a look at the visualization of Penn State’s ScholarSphere application. Files are represented as nodes in a network map while small orbs which represent individual developers fly around shooting lasers at them. If we want to be a small orb that shoots lasers at nodes, and we definitely do, we need to learn Git.

Alright, so now we know Git is great, but how do we learn it?

It’s As Easy As git rebase -i 97c9d7d

Actually, it’s a lot easier. The author doesn’t even know what git rebase does, and yet here he is lecturing to you about Git.

First off, we need to install Git like any other piece of software. Head over to the official Git website’s downloads page and grab the version for your operating system. The process is pretty straightforward but if you get stuck, there’s also a nice “Getting Started – Installing Git” chapter of the excellent Pro Git book which is hosted on the official site.

Alright, now that you’ve got Git installed it’s time to start VCing the heck out of some text files. It’s worth noting that there are software packages that put a graphical interface on top of Git, such as Tower and GitHub’s apps for Windows and Mac. There’s a very comprehensive list of graphical Git software on the official Git website. But the most cross-platform and surefire way to understand Git and be able to access all of its features is with the command line so that’s what we’ll be using.

So enough rambling, let’s pop open a terminal (Mac and Linux both have apps simply called “Terminal” and Windows users can try the Git Bash terminal that comes with the Git installer) and make it happen.

$ git clone https://github.com/LibraryCodeYearIG/Codeyear-IG-Github-Project.git
Cloning into 'Codeyear-IG-Github-Project'...
remote: Counting objects: 115, done.
remote: Compressing objects: 100% (73/73), done.
remote: Total 115 (delta 49), reused 108 (delta 42)
Receiving objects: 100% (115/115), 34.38 KiB, done.
Resolving deltas: 100% (49/49), done.
$ cd Codeyear-IG-Github-Project/


The $ above is meant to indicate our command prompt, so anything beginning with a $ is something we’re typing. Here we “cloned” a project from a Git repository existing on the web (the first command), which caused Git to give us a little information in return. All Git commands begin with git and most provide useful info about their usage or results. With the second command, cd, we moved inside the project’s folder (“change directory”).

We now have a Git repository on our computer, if you peek inside the folder you’ll see some text (specifically Markdown) files and an image or two. But what’s more: we have the project’s entire history too, pretty much every state that any file has been in since the beginning of time.

OK, since the beginning of the project, but still, is that not awesome? Oh, you’re not convinced? Let’s look at the project’s history.

$ git log
commit b006c1afb9acf78b90452b284a111aed4daee4ca
Author: Eric Phetteplace <phette23@gmail.com>
Date:   Fri Mar 1 15:27:47 2013 -0500

    a couple more links, write Getting Setup section

commit 83d92e4a1be0fdca571012cb39f84d86b21121c6
Author: Eric Phetteplace <phette23@gmail.com>
Date:   Fri Feb 22 01:04:24 2013 -0500

    link up the YouTube video


We can hit Q to exit the log. In the log, we see the author, date, and a brief description of each change. The terrifying random gibberish which follows the word “commit” is a hash, which is computer science speak for terrifying random gibberish. Think of it as a unique ID for each change in the project’s history.
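
Those IDs are useful, too. Most Git commands will accept a hash, or even just its first several characters, so we can ask for the full details of a commit from the log above:

$ git show b006c1a

git show prints the commit’s author, date, message, and the exact changes it made to each file; as with the log, hitting Q exits the display.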

OK, so we can see previous changes (“commits” in VC-speak, which is like Newspeak but less user friendly), we can even revert back to previous states, but we won’t do that for now. Instead, let’s add a new change to the project’s history. First, we open up the “List of People.mdown” file in the Getting Started folder and add our name to the list. Now the magic sauce.

$ git status
# On branch master
# Changes not staged for commit:
#   (use "git add <file>..." to update what will be committed)
#   (use "git checkout -- <file>..." to discard changes in working directory)
#
#   modified:   Getting Started/List of People.mdown
#
no changes added to commit (use "git add" and/or "git commit -a")
$ git add "Getting Started/List of People.mdown"
$ git status
# On branch master
# Changes to be committed:
#   (use "git reset HEAD <file>..." to unstage)
#
#   modified:   Getting Started/List of People.mdown
#
$ git commit -m "adding my name"
$ git status
# On branch master
nothing to commit, working directory clean
$ git log
commit wTf1984doES8th1s3v3Nm34NWtf2666bAaAaAaAa
Author: Awesome Sauce <awesome@sau.ce>
Date:   Wed Mar 13 12:30:35 2013 -0500

    adding my name

commit b006c1afb9acf78b90452b284a111aed4daee4ca
Author: Eric Phetteplace <phette23@gmail.com>
Date:   Fri Mar 1 15:27:47 2013 -0500

    a couple more links, write Getting Setup section


Our change is in the project’s history! Isn’t it better than seeing your name on the Hollywood Walk of Fame? Here’s precisely what we did:

First we asked for the status of the repository, which is an easy way of seeing what changes you’re working on and how close they are to being added to the history. We ran status throughout this procedure to watch how it changed. Then we added our changes; this tells Git “hey, these are a deliberate set of changes and we’re ready to put them in the project’s history.” It may seem like an unnecessary step, but adding select sets of files can help you segment your changes into meaningful, isolated chunks that make sense when viewing the log later. Finally, we committed our change and added a short description inside quotes. This finalizes the change, which we can see in the log command’s results.

I’m Lonely, So Lonely

Playing around with Git on our local computer can be fun, but it sure gets lonely. Yeah, we can roll back to previous versions or use branches to keep similar but separate versions of our files, but really we’re missing the best part of VC: collaboration. VC as a class of software was specifically designed to help multiple programmers work on the same project. The power and brilliance of Git shines best when we can selectively “merge” changes from multiple people into one master project.

Fortunately, we will cover this in a future post. For now, we can visit the LITA/ALCTS Library Code Year‘s GitHub Project—it’s the very same Git project we cloned earlier, so we already have a copy on our computer!—to learn more about collaboration and GitHub. GitHub is a website where people can share and cooperate on Git repositories. It’s been described as “the Facebook of code” because of its popularity and slick user interface. If that doesn’t convince you that GitHub is worth checking out, the site also has a sweet mascot that’s a cross between an octopus and a cat (an octocat). And that’s really all you need to know.

Gangam Octocat
This is an Octocat. It is Awesome.

The Setup: What We Use at ACRL TechConnect

Inspired by “The Setup” a few of us at Tech Connect have decided to share some of our favorite tools and techniques with you. What software, equipment, or time/stress management tools do you love? Leave us a note in the comments.

Eric – Homebrew package manager for OS X

I love Macs. I love their hardware, their operating system, even some of their apps like Garage Band. But there are certain headaches that Mac OS X comes with. While OS X exposes its inner workings via UNIX command line, it doesn’t provide a package manager like the apt of many Linux distros to install and update software.
Enter Homebrew, a lifesaver that’s helped me to up my game on the command line without much ancillary pain. Homebrew helps you find (“brew search php”), install (“brew install phantomjs”), and update (“brew upgrade git”) software from a central repository. I currently have 36 packages installed, among them utilities that Apple neglected to include like wget, programming tools like Node.js, and brilliant timesavers like z, a bookmarking system for the command line. Installing a lot of these tools can be tougher than using them, requiring permissions tweaks and enigmatic incantations. Homebrew makes installation easy and checking thirty-six separate websites for available updates becomes unnecessary.
As a bonus, some Homebrew commands now produce unicode beer mugs.

Updated Homebrew from bad98b12 to 150b5f96.
==> Updated Formulae
autojump berkeley-db gtk+ imagemagick libxml2
==> Upgrading 1 outdated package, with result:
libxml2 2.9.0
==> Upgrading libxml2
==> Downloading ftp://xmlsoft.org/libxml2/libxml2-2.9.0.tar.gz
####################################### 100.0%
==> Patching
patching file threads.c
patching file xpath.c
==> ./configure --prefix=/usr/local/Cellar/libxml2/2.9.0 --without-python
==> make
==> make install
==> Caveats
This formula is keg-only: so it was not symlinked into /usr/local.
==> Summary
🍺  /usr/local/Cellar/libxml2/2.9.0: 273 files, 11M, built in 94 seconds

[Note: simulation, not verbatim output]

Magic! And a shameless plug: Homebrew has a Kickstarter currently to help them with some automated tests, so if you use Homebrew consider a donation.

Margaret – Pomodoro Technique/using time wisely

Everyone works differently and has more effective times of day to complete certain types of work. Some people have to start writing first thing in the mornings, others can’t do much of anything that early. For me personally I find late afternoon the most effective time to work on code or technical work—but late afternoon is also a time when I’m very prone to distraction. So many funny things have been posted on the internet, and my RSS reader is all full up again. The Pomodoro technique (as well as similar systems) is a promise to yourself: if you just work hard on something for a relatively short amount of time, you will finish it, and then you can have a guilt-free break.

Read the website for the full description of how to use this technique and some tools, but here’s the basic idea. You list the tasks you need to do, and then pick a task to work on for 25 minutes. Then you set a timer and start work. After the timer goes off, you get a 5 minute break to do whatever you want, and then after a few Pomodoros you take a longer break. The timer ideally should have a visual component so that you know how much time you have left and remember to stay on task. My personal favorite is focus booster. This is what mine looks like right now:

Pomodoro status bar

Note that the line changes color as I get closer to the end. It will become blue and count down my break when that starts. Another one I like a lot, especially when I am not at my own computer, is e.ggtimer.com. This is a simple display, and you can bookmark http://e.ggtimer.com/pomodoro to get a Pomodoro started.

I can’t do Pomodoros all day—as a librarian, I need to be available to work with others at certain times—that’s not an interruption, that’s my job. Other times I really need to focus and can’t. This technique is the best way I’ve found to get started—and sometimes once I am started I get so focused on the project that I don’t even notice I am supposed to be on a break.

Jim – Tomcat Server with Jersey servlet: a customizable middleware/API system

The Tomcat/Jersey stack is the backbone of the library’s technology prototyping initiative. With this tool, our staff of research programmers and student programmers can take any webpage/database source and turn it into an API that could then feed into a mobile app, a data feed in a website, or a widget in some other emerging technology. While using and leveraging the Tomcat/Jersey stack does require some Java background, it can be learned in a couple weeks by anyone who has some scripting and server experience. The hardest thing about this whole pipeline is finding enough time to keep cranking out the library APIs — one that I got running over the winter holiday is a feed of group rooms that are available to be checked out/scheduled within the next hour at the library.

The data feed sends back a JSON array of available rooms, like this (abbreviated):

[{"roomName":"Collaboration Room 02 - Undergraduate Library",

"startTime":"10:00 AM",

"endTime":"11:00 AM",

"date":"1/27/2013"}, …
Bohyun – Get into the mood for concentration and focus

I am one of those people who are easily excited by new happenings around me. I am also one of those people who often would do anything but the thing that I must be doing. That is, I am prone to distraction and procrastination. My work often requires focus and concentration but I have an extremely hard time getting into the right mood.
there are no limits to what you can accomplish when you are supposed to be doing something else
The two tools that I found help me quite a bit are (a) Scribblet and (b) Rainy Mood. Scribblet (http://scribblet.org/) is a simple JavaScript bookmarklet that lets you literally scribble on your web browser. If you tend to read more efficiently while annotating, this simple tool will help you a great deal with online reading. Rainy Mood (http://www.rainymood.com/) is a website that recreates a rainy day, right down to the occasional rumble of thunder. I tend to get much calmer on a rainy day, which can do wonders for my writing and other projects that require a calm and focused state of mind. This tool instantly gives me a rainy day regardless of the weather.
rainy mood website
scribblet website

Meghan – Evernote

Evernote is not a terribly technical tool, but it is one I love and constantly use. It lets you take notes, clip items from the web, attach files to notes, organize notes into notebooks, share notebooks (or keep them private), and search existing notes. It is available to download for desktops but I use the web version primarily, along with the web clipper and the Android app on my phone. Everything syncs together, so it is easy to locate notes from any location. Here are three examples of how it fits into my daily life:

An enormous pile of classified bookmarks: I am currently trying to get up to speed on Drupal development as well as looking at examples of online image collections and brainstorming for my next TechConnect blog entry. The web clipper allows me to save things into specific piles by using notebooks and then add tags for classification and easier searching. For example, I can classify an issue description or resolution in my web development reference notebook, but tag it with the name of the site affected by the issue. This is especially useful when I know I have to change tasks and am likely to navigate away from my tabs in the browser. When I return to the task in a day or so, I can search for the helpful pages I saved. Classifying in notebooks is also good for building a list of sources that I consult every time I do a certain task, like building a server.

Evernote library

Course and conference notes: Using the web or phone version, I can type notes during a lecture or conference session.  I can also attach a pdf of the slides from a presentation for reference later.  Frequently, I create a notebook for a particular conference that I can opt to share with others.

Conference notes in Evernote

Personal uses:  I am learning to cook, and this tool has been really useful.  Say I find a great recipe that I decide I want to (try and) make for dinner tonight.  Clip the recipe using the web clipper, save it to my recipes notebook and then pull it up on my phone while I’m cooking to follow along (which also explains all the flour on my phone).  In a few months if I want to use it again, I’ll probably have to search for it, because all I will remember is that it had chickpeas in it.  But, that’s all I have to remember.

recipe in Evernote
There are lots of other add-ins for this application, but I love and use the base service most often.

Event Tracking with Google Analytics

Note: Google recently replaced the “ga.js” analytics script with the new “analytics.js” and changed the syntax of event tracking. While the code examples in this post won’t work with the new analytics script, the same conceptual principles apply, and you can read how to update your code to the new syntax. — Eric (2016-04-18)

In a previous post by Kelly Sattler and Joel Richard, we explored using web analytics to measure a website’s success. That post provides a clear, high-level picture of how to create an analytics strategy by evaluating our users, web content, and goals. This post will explore a single topic in depth: how to set up event tracking in Google Analytics.

Why Do We Need Event Tracking?

Finding solid figures to demonstrate a library’s value and make strategic decisions is a topic of increasing importance. It can be tough to stitch together the right information from a hodgepodge of third-party services; we rely on our ILSs to report circulation totals, our databases to report usage like full-text downloads, and our web analytics software to show visitor totals. But are pageviews and bounce rates the only meaningful measures of website success? Luckily, Google Analytics provides a way to track arbitrary events which occur on web pages. Event tracking lets us define what is important. Do we want to monitor how many people hover over a carousel of book covers, but only in the first second after the page has loaded? How about how many people first hover over the carousel, then the search box, but end up clicking a link in the footer? As long as we can imagine it and JavaScript has an event for it, we can track it.

How It Works

Many people are probably familiar with Google Analytics as a snippet of JavaScript pasted into their web pages. But Analytics also exposes some of its inner workings to manipulation. We can use the _gaq.push method to queue a “_trackEvent” command which sends information about our event back to Analytics. The basic structure of a call to _trackEvent is:

_gaq.push( [ '_trackEvent', 'the category of the event', 'the action performed', 'an optional label for the event', 'an optional integer value that quantifies something about the event' ] );
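
For instance, a click on a highlighted science database might queue an event like the one below (the category, action, label, and value here are hypothetical stand-ins):

_gaq.push( [ '_trackEvent', 'database', 'Sciences', 'PubMed', 1 ] );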

Looking at the array parameter of _gaq.push is telling: we should have an idea of what our event categories, actions, labels, and quantitative details will be before we go crazy adding tracking code to all our web pages. Once events are recorded, they cannot be deleted from Analytics. Developing a firm plan helps us to avoid the danger of switching the definition of our fields after we start collecting data.

We can be a bit creative with these fields. “Action” and “label” are just Google’s way of describing them; in reality, we can set up anything we like, using category->action->label as a triple-tiered hierarchy or as three independent variables.
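
As a quick illustration (all of the names below are hypothetical), the same call can express a strict hierarchy or three unrelated facets:

// hierarchical: a "video" category, a "play" action, labeled with the file played
_gaq.push( [ '_trackEvent', 'video', 'play', 'library-tour.mp4' ] );
// three independent variables: page area, device type, and user group
_gaq.push( [ '_trackEvent', 'homepage-carousel', 'mobile', 'undergraduate' ] );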

Example: A List of Databases

Almost every library has a web page listing third-party databases, be they subscription or open access. This is a prime opportunity for event tracking because of the numerous external links. Default metrics can be misleading on this type of page. Bounce rate (the proportion of visitors who start on one of our pages and then immediately leave without viewing another page) is typically considered a negative metric; if a page has a high bounce rate, then visitors are not engaged with its content. But the purpose of a databases page is to get visitors to their research destinations as quickly as possible, so here a high bounce rate is a positive figure. Similarly, time spent on page is typically considered a positive sign of engagement, but on a databases page it is more likely to indicate confusion or difficulty browsing. With event tracking, we can not only record which links were clicked but also keep database links from counting towards bounce rate, giving us a more realistic picture of the page’s success.

One way of structuring “database” events is:

  • The top-level Category is “database”
  • The Action is the topical category, e.g. “Social Sciences”
  • The Label is the name of the database itself, e.g. “Academic Search Premier”

The final, quantitative piece could be the position of the database in the list or the number of seconds after page load it took the user to click its link. Or we could report a boolean value, such as whether the database is open access or highlighted in some way.
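
As a minimal sketch (assuming jQuery, as in the example below, plus a pageLoadTime variable we record ourselves), those candidate values might be computed like so:

// run once, when the page loads
var pageLoadTime = new Date();

// later, inside the click handler:
// 0-based position of the clicked link within the databases list
var position = $( '#databases-list a' ).index( this );
// whole seconds elapsed between page load and the click
var secondsElapsed = Math.round( ( new Date() - pageLoadTime ) / 1000 );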

To implement this, we set up a JavaScript function which will be called every time one of our events occurs. We will store some contextual information in variables, push that information to Google Analytics, and then delay the page’s navigation so the event has a chance to be recorded. Let’s walk through the code piece by piece:

function databaseTracking( event ) {
    var destination = $( this )[ 0 ].href,
        resource = $( this ).text(),
        // move up from <a> to parent element, then find the nearest preceding <h2> section header
        section = $( this ).parent().prevAll( 'h2' )[ 0 ].innerText,
        highlighted = $( this ).hasClass( 'highlighted' ) ? 1 : 0;

    _gaq.push( [ '_trackEvent', 'database', resource, section, highlighted ] );

The top of our function just grabs information from the page. We’re using jQuery to make our lives easier, so all the $( this ) pieces of our code refer to the element that initiated the event. In our case, that’s the link pointing to an external database which the user just clicked. So we set destination to the link’s href attribute, resource to its text (e.g. the database’s name), section to the text inside the h2 element that labels a topical set of databases, and highlighted to a boolean value: 1 if the element has a class of “highlighted” and 0 otherwise. Next, this data is pushed onto the _gaq array, which is a queue of functions and their parameters that Analytics fires asynchronously. In this instance, we’re telling Analytics to run the _trackEvent function with the parameters that follow. Analytics will then record an event of type “database” with an action of [database name], a label of [section header], and a boolean representing whether it was highlighted or not.

    // give _trackEvent a chance to fire before leaving the page
    setTimeout( function () {
        window.location = destination;
    }, 200 );
    event.preventDefault();
}

Next comes perhaps the least obvious piece: we prevent the default browser behavior from occurring, which in the case of a link means navigating away from our page, but then send the user to destination 200 milliseconds later anyway. This gives the _trackEvent call a chance to fire; if we let the user follow the link right away, the request might not complete and our event would not be recorded.1

$( document ).ready( function () {
    // target all anchors in the list of databases
    $( '#databases-list a' ).on( 'click', databaseTracking );
} );

There’s one last step: merely defining the databaseTracking function won’t cause it to execute when we want it to. JavaScript uses event handlers to execute certain functions based on various user actions, such as mousing over or clicking an element. Here, we add click event handlers to all <a> elements in the list of databases. Now whenever a user clicks a link in the databases list (which has a container with id “databases-list”), databaseTracking will run and send data to Google Analytics.

There is a demo on JSFiddle which uses the code above with some sample HTML. Every time you click a link, a pop-up shows you what the _gaq.push array looks like.

Though we used jQuery in our example, any JavaScript library can be used with event tracking.2 The procedure is always the same: write a function that gathers data to send back to Google Analytics and then add that function as a handler to an appropriate event, such as click or mouseover, on an element.

For another example, complete with code samples, see the article “Discovering Digital Library User Behavior with Google Analytics” in Code4Lib Journal. In it, Kirk Hess of the University of Illinois Urbana-Champaign details how to use event tracking to see how often external links are clicked or files are downloaded. While these events are particularly meaningful to digital libraries, most libraries offer PDFs or other documents online and could benefit from the same approach.
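
A minimal sketch of that download-tracking idea, assuming jQuery and hypothetical category and action names (for brevity, it omits the navigation delay discussed above):

// record an event whenever a link ending in .pdf is clicked
$( 'a[href$=".pdf"]' ).on( 'click', function () {
    _gaq.push( [ '_trackEvent', 'download', 'pdf', $( this ).attr( 'href' ) ] );
} );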

Some Ideas

The true power of event tracking is that it is not limited to the mere clicking of hyperlinks; any interaction which JavaScript knows about can be recorded and categorized. Google’s own Event Tracking Guide uses the example of a video player, recording when control buttons like play, pause, and fast forward are activated. Here are some more use cases for event tracking:

  • Track video plays on particular pages; we may already know how many views a video gets, but how many come from particular embedded instances of the video?
  • Clicks on external content, such as a vendor’s database or another library’s study materials.
  • If there is a print or “download to PDF” button on our site, we can track each time it’s clicked. Unfortunately, only Internet Explorer and Firefox (versions >= 6.0) have an onbeforeprint event in JavaScript which could be used to detect when a user hits the browser’s native print command (see the sketch after this list).
  • Web applications are particularly suited to event tracking. Many modern web apps have a single-page architecture, so while the user is constantly clicking and interacting within the app, they rarely generate typical interaction statistics like pageviews or exits.
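
Here is a minimal sketch of the print-tracking idea from the list above (the category and action names are hypothetical, and onbeforeprint only fires in Internet Explorer and Firefox 6+):

// record an event when the user invokes the browser's print command
window.onbeforeprint = function () {
    _gaq.push( [ '_trackEvent', 'page', 'print', document.location.pathname ] );
};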


Notes
  1. There is a discussion on the best way to delay outbound links long enough to record them as events. A Google Analytics support page endorses the setTimeout approach. For other methods, there are threads on StackOverflow and various blog posts around the web. Alternatively, we could use the onmousedown event, which fires slightly earlier than onclick but might also record false positives due to click-and-drag scrolling.
  2. Below is an attempt at rewriting the jQuery tracking code in pure JavaScript. It will only work in modern browsers because it uses querySelectorAll, parentElement, and previousElementSibling. Versions of Internet Explorer prior to 9 also use a unique attachEvent syntax for event handlers. Yes, there’s a reason people use libraries to do anything the least bit sophisticated with JavaScript.
function databaseTracking( event ) {
        var destination = event.target.href,
            resource = event.target.innerHTML,
            section = "none",
            highlighted = event.target.className.match( /highlighted/ ) ? 1: 0;

        // getting a parent element's nearest <h2> sibling is non-trivial without a library
        var currentSibling = event.target.parentElement;
        while ( currentSibling !== null ) {
            if ( currentSibling.tagName !== "H2" ) {
                currentSibling = currentSibling.previousElementSibling;
            }
            else {
                section = currentSibling.innerHTML;
                currentSibling = null;
            }
        }

        _gaq.push( [ '_trackEvent', 'database', resource, section, highlighted ] );

        // delay navigation to ensure event is recorded
        setTimeout( function () {
            window.location = destination;
        }, 200 );
        event.preventDefault();
    }

document.addEventListener( 'DOMContentLoaded', function () {
        var dbLinks = document.querySelectorAll( '#databases-list a' ),
            len = dbLinks.length;
        for ( var i = 0; i < len; i++ ) {
            dbLinks[ i ].addEventListener( 'click', databaseTracking, false );
        }
    }, false );

References

Association of College & Research Libraries. (n.d.). ACRL Value of Academic Libraries. Retrieved January 12, 2013, from http://www.acrl.ala.org/value/
Event Tracking – Web Tracking (ga.js) – Google Analytics — Google Developers. (n.d.). Retrieved January 12, 2013, from https://developers.google.com/analytics/devguides/collection/gajs/eventTrackerGuide
Hess, K. (2012). Discovering Digital Library User Behavior with Google Analytics. The Code4Lib Journal, (17). Retrieved from http://journal.code4lib.org/articles/6942
Marek, K. (2011). Using Web Analytics in the Library: A Library Technology Report. Chicago, IL: ALA Editions. Retrieved from http://public.eblib.com/EBLPublic/PublicView.do?ptiID=820360
Sattler, K., & Richard, J. (2012, October 30). Learning Web Analytics from the LITA 2012 National Forum Pre-conference. ACRL TechConnect Blog. Retrieved January 18, 2013, from http://acrl.ala.org/techconnect/?p=2133
Tracking Code: Event Tracking – Google Analytics — Google Developers. (n.d.). Retrieved January 12, 2013, from https://developers.google.com/analytics/devguides/collection/gajs/methods/gaJSApiEventTracking
window.onbeforeprint – Document Object Model (DOM) | MDN. (n.d.). Mozilla Developer Network. Retrieved January 12, 2013, from https://developer.mozilla.org/en-US/docs/DOM/window.onbeforeprint