Fear No Longer Regular Expressions

Regex, it’s your friend

You may have heard the term, “regular expressions” before. If you have, you will know that it usually comes in a notation that is quite hard to make out like this:

(?=^[0-5\- ]+$)(?!.*0123)\d{3}-\d{3,4}-\d{4}

Despite its appearance, regular expressions (regex) is an extremely useful tool to clean up and/or manipulate textual data.  I will show you an example that is easy to understand. Don’t worry if you can’t make sense of the regex above. We can talk about the crazy metacharacters and other confusing regex notations later. But hopefully, this example will help you appreciate the power of regex and give you some ideas of how to make use of regex to make your everyday library work easier.

What regex can do for you – an example

I looked for the zipcodes for all cities in Miami-Dade County and found a very nice PDF online (http://www.miamifocusgroup.com/usertpl/1vg116-three-column/miami-dade-zip-codes-and-map.pdf). But when I copy and paste the text from the PDF file to my text editor (Sublime), the format immediately goes haywire. The county name, ‘Miami-Dade,’ which should be in one line becomes three lines, each of which lists Miami, -, Dade.

Ugh, right? I do not like this one bit.  So let’s fix this using regular expressions.

(**Click the images to bring up the larger version.)

Screen Shot 2013-07-24 at 1.43.19 PM

Screen Shot 2013-07-24 at 1.43.39 PM

Like many text editors, Sublime offers the find/replace with regex feature. This is a much powerful tool than the usual find/replace in MS Word, for example, because it allows you to match many complex cases that fit a pattern. Regular expressions lets you specify that pattern.

In this example, I capture the three lines each saying Miami,-,Dade with this regex:

Miami\n-\nDade.

When I enter this into the ‘Find What’ field, Sublime starts highlighting all the matches.  I am sure you already guessed that \n means a new line. Now let me enter Miami-Dade in the ‘Replace With’ field and hit ‘Replace All.’

Screen Shot 2013-07-24 at 2.11.43 PM

As you can see below, things are much better now. But I  want each set of three lines – Miami-Dade, zipcode, and city – to be one line and each element to be separated by comma and a space such as ‘Miami-Dade, 33010, Hialeah’. So let’s do some more magic with regex.

Screen Shot 2013-07-24 at 2.18.17 PM

How do I describe the pattern of three lines – Miami-Dade, zipcode, and city? When I look at the PDF, I notice that the zipcode is a 5 digit number and the city name consists of alphabet characters and space. I don’t see any hypen or comma in the city name in this particular case. And the first line is always ‘Miami-Dade.” So the following regular expression captures this pattern.

Miami-Dade\n\d{5}\n[A-Za-z ]+

Can you guess what this means? You already know that \n means a new line. \d{5} means a 5 digit number. So it will match 33013, 33149, 98765, or any number that consists of five digits.  [A-Za-z ] means any alphabet character either in upper or lower case or space (N.B. Note the space at the end right before ‘]’).

Anything that goes inside [ ] is one character. Just like \d is one digit. So I need to specify how many of the characters are to be matched. if I put {5}, as I did in \d{5}, it will only match a city name that has five characters like ‘Miami,’ The pattern should match any length of city name as long as it is not zero. The + sign does that. [A-Za-z ]+ means that any alphabet character either in upper or lower case or space should appear at least or more than once. (N.B. * and ? are also quantifiers like +. See the metacharacter table below to find out what they do.)

Now I hit the “Find” button, and we can see the pattern worked as I intended. Hurrah!

Screen Shot 2013-07-24 at 2.24.47 PM

Now, let’s make these three lines one line each. One great thing about regex is that you can refer back to matched items. This is really useful for text manipulation. But in order to use the backreference feature in regex, we have to group the items with parentheses. So let’s change our regex to something like this:

(Miami-Dade)\n\(d{5})\n([A-Za-z ]+)

This regex shows three groups separated by a new line (\n). You will see that Sublime still matches the same three line sets in the text file. But now that we have grouped the units we want – county name, zipcode, and city name – we can refer back to them in the ‘Replace With’ field. There were three units, and each unit can be referred by backslash and the order of appearance. So the county name is \1, zipcode is \2, and the city name is \3. Since we want them to be all in one line and separated by a comma and a space, the following expression will work for our purpose. (N.B. Usually you can have up to nine backreferences in total from \1 to\9. So if you want to backreference the later group, you can opt not to create a backreference from a group by using (?: ) instead of (). )

\1, \2, \3

Do a few Replaces and then if satisfied, hit ‘Replace All’.

Ta-da! It’s magic.

Screen Shot 2013-07-24 at 2.54.13 PM

Regex Metacharacters

Regex notations look a bit funky. But it’s worth learning them since they enable you to specify a general pattern that can match many different cases that you cannot catch without the regular expression.

We have already learned the four regex metacharacters: \n, \d, { }, (). Not surprisingly, there are many more beyond these. Below is a pretty extensive list of regex metacharacters, which I borrowed from the regex tutorial here: http://www.hscripts.com/tutorials/regular-expression/metacharacter-list.php . I also highly recommend this one-page Regex cheat sheet from MIT (http://web.mit.edu/hackl/www/lab/turkshop/slides/regex-cheatsheet.pdf).

Note that \w will match not only a alphabetical character but also an underscore and a number. For example, \w+ matches Little999Prince_1892. Also remember that a small number of regular expression notations can vary depending on what programming language you use such as Perl, JavaScript, PHP, Ruby, or Python.

Metacharacter Description
\ Specifies the next character as either a special character, a literal, a back reference, or an octal escape.
^ Matches the position at the beginning of the input string.
$ Matches the position at the end of the input string.
* Matches the preceding subexpression zero or more times.
+ Matches the preceding subexpression one or more times.
? Matches the preceding subexpression zero or one time.
{n} Matches exactly n times, where n is a non-negative integer.
{n,} Matches at least n times, n is a non-negative integer.
{n,m} Matches at least n and at most m times, where m and n are non-negative integers and n <= m.
. Matches any single character except “\n”.
[xyz] A character set. Matches any one of the enclosed characters.
x|y Matches either x or y.
[^xyz] A negative character set. Matches any character not enclosed.
[a-z] A range of characters. Matches any character in the specified range.
[^a-z] A negative range characters. Matches any character not in the specified range.
\b Matches a word boundary, that is, the position between a word and a space.
\B Matches a nonword boundary. ‘er\B’ matches the ‘er’ in “verb” but not the ‘er’ in “never”.
\d Matches a digit character.
\D Matches a non-digit character.
\f Matches a form-feed character.
\n Matches a newline character.
\r Matches a carriage return character.
\s Matches any whitespace character including space, tab, form-feed, etc.
\S Matches any non-whitespace character.
\t Matches a tab character.
\v Matches a vertical tab character.
\w Matches any word character including underscore.
\W Matches any non-word character.
\un Matches n, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol
Matching modes

You also need to know about the Regex matching modes. In order to use these modes, you write your regex as shown above, and then at the end you add one or more of these modes. Note that in text editors, these options often appear as checkboxes and may apply without you doing anything by default.

For example, [d]\w+[g] will match only the three lower case words in ding DONG dang DING dong DANG. On the other hand, [d]\w+[g]/i will match all six words whether they are in the upper or the lower case.

Look-ahead and Look-behind

There are also the ‘look-ahead’ and the ‘look-behind’ features in regular expressions. These often cause confusion and are considered to be a tricky part of regex. So, let me show you a simple example of how it can be used.

Below are several lines of a person’s last name, first name, middle name, separated by his or her department name. You can see that this is a snippet from a csv file. The problem is that a value in one field – the department name- also includes a comma, which is supposed to appear only between different fields not inside a field. So the comma becomes an unreliable separator. One way to solve this issue is to convert this csv file into a tab limited file, that is, using a tab instead of a comma as a field separater. That means that I need to replace all commas with tabs ‘except those commas that appear inside a department field.’

How do I achieve that? Luckily, the commas inside the department field value are all followed by a space character whereas the separator commas in between different fields are not so. Using the negative look-ahead regex, I can successfully specify the pattern of a comma that is not followed by (?!) a space \s.

,(?!\s)

Below, you can see that this regex matches all commas except those that are followed by a space.

lookbehind

For another example, the positive look-ahead regex, Ham(?=burg), will match ‘Ham‘ in Hamburg when it is applied to the text:  Hamilton, Hamburg, Hamlet, Hammock.

Below are the complete look-ahead and look-behind notations both positive and negative.

  • (?=pattern)is a positive look-ahead assertion
  • (?!pattern)is a negative look-ahead assertion
  • (?<=pattern)is a positive look-behind assertion
  • (?<!pattern)is a negative look-behind assertion

Can you think of any example where you can successfully apply a look-behind regular expression? (No? Then check out this page for more examples: http://www.rexegg.com/regex-lookarounds.html)

Now that we have covered even the look-ahead and the look-behind, you should be ready to tackle the very first crazy-looking regex that I introduced in the beginning of this post.

(?=^[0-5\- ]+$)(?!.*0123)\d{3}-\d{3,4}-\d{4}

Tell me what this will match! Post in the comment below and be proud of yourself.

More tools and resources for practicing regular expressions

There are many tools and resources out there that can help you practice regular expressions. Text editors such as EditPad Pro (Windows), Sublime, TextWrangler (Mac OS), Vi, EMacs all provide regex support. Wikipedia (https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features) offers a useful comparison chart of many text editors you can refer to. RegexPal.com is a convenient online Javascript Regex tester. FireFox also has Regular Expressions add-on (https://addons.mozilla.org/en-US/firefox/addon/rext/).

For more tools and resources, check out “Regular Expressions: 30 Useful Tools and Resources” http://www.hongkiat.com/blog/regular-expression-tools-resources/.

Library problems you can solve with regex

The best way to learn regex is to start using it right away every time you run into a problem that can be solved faster with regex. What library problem can you solve with regular expressions? What problem did you solve with regular expressions? I use regex often to clean up or manipulate large data. Suppose you have 500 links and you need to add either EZproxy suffix or prefix to each. With regex, you can get this done in a matter of a minute.

To give you an idea, I will wrap up this post with some regex use cases several librarians generously shared with me. (Big thanks to the librarians who shared their regex use cases through Twitter! )

  • Some ebook vendors don’t alert you to new (or removed!) books in their collections but do have on their website a big A-Z list of all of their titles. For each such vendor, each month, I run a script that downloads that page’s HTML, and uses a regex to identify the lines that have ebook links in them. It uses another regex to extract the useful data from those lines, like URL and Title. I compare the resulting spreadsheet against last month’s (using a tool like diff or vimdiff) to discover what has changed, and modify the catalog accordingly. (@zemkat)
  • Sometimes when I am cross-walking data into a MARC record, I find fields that includes irregular spacing that may have been used for alignment in the old setting but just looks weird in the new one. I use a regex to convert instances of more than two spaces into just two spaces. (@zemkat)
  • Recently while loading e-resource records for government documents, we noted that many of them were items that we had duplicated in fiche: the ones with a call number of the pattern “H, S, or J, followed directly by four digits”. We are identifying those duplicates using a regex for that pattern, and making holdings for those at the same time. (@zemkat)
  • I used regex as part of crosswalking metadata schemas in XML. I changed scientific OME-XML into MODS + METS to include research images into the library catalog.  (@KristinBriney)
  • Parsing MARC just has to be one. Simple things like getting the GMD out of a 245 field require an expression like this: |h\[(.*)\]  MARCedit supports regex search and replace, which catalogers can use. (@phette23)
  • I used regex to adjust the printed label in Millennium depending on several factors, including the material type in the MARC record. (@brianslone )
  • I strip out non-Unicode characters from the transformed finding aids daily with  regex and  replace them with Unicode equivalents. (@bryjbrown)

 

Effectively Learning How To Code: Tips and Resources

Librarians’ strong interest in programming is not surprising considering that programming skills are crucial and often essential to making today’s library systems and services more user-friendly and efficient for use. Not only for system-customization, computer-programming skills can also make it possible to create and provide a completely new type of service that didn’t exist before. However, programming skills are not part of most LIS curricula, and librarians often experience difficulty in picking up programming skills.

In this post, I would like to share some effective strategies to obtain coding skills and cover common mistakes and obstacles that librarians make and encounter while trying to learn how to code in the library environment based upon the presentation that I gave at Charleston Conference last month, “Geek out: Adding Coding Skills to Your Professional Repertoire.” (slides: http://www.slideshare.net/bohyunkim/geek-out-adding-coding-skills-to-your-professional-repertoire). At the end of this post, you will also find a selection of learning and community resources.

How To Obtain Coding Skills, Effectively

1. Pick a language and concentrate on it.

There are a huge number of resources available on the Web for those who want to learn how to program. Often librarians start with some knowledge in markup languages such as HTML and CSS. These markup languages determine how a block of text are marked up and presented on the computer screen. On the other hand, programming languages involve programming logic and functions. An understanding of the basic programming concepts and logic can be obtained by learning any programming language. There are many options, and some popular choices are JavaScript, PHP, Python, Ruby, Perl, etc. But there are many more.  For example, if you are interested in automating tasks in Microsoft applications such as Excel, you may want to work with Visual Basic. If you are unsure about which language to pick, search for a few online tutorials for a few languages to see what their different syntaxes and examples are like. Even if you do not understand the content completely, this will help you to pick the language to learn first.

2. Write and run the code.

Once you choose a language to learn, there are many paths that you can follow. Taking classes at a local community college or through an online school may speed up the initial process of learning, but it could be time-consuming and costly. Following online tutorials and trying each example is a good alternative that many people take. You may also pick up a few books along the way to supplement the tutorials and use them for reference purposes.

If you decide on self-study, make sure that you actually write and run the code in the examples as you follow along the books and the tutorials. Most of the examples will appear simple and straightforward. But there is a big difference between reading through a code example and being actually able to write the code on your own and to run it successfully. If you read through programming tutorials and books without actually doing the hands-on examples on your own, you won’t get much benefit out of your investment. Programming is a hands-on skill as much as an intellectual understanding.

3. Continue to think about how coding can be applied to your library.

Also important is to continue to think about how your knowledge can be applied to your library systems and environment, which is often the source of the initial motivation for many librarians who decide to learn how to program. The best way to learn how to program is to program, and the more you program the better you will become at programming. So at every chance of building something with the new programming language that you are learning, no matter how small it is, build it and test out the code to see if it works the way you intended.

4. Get used to debugging.

While many who struggle with learning how to code cite lack of time as a reason, the real cause is likely to be failing to keep up the initial interest and persist in what you decided to learn. Learning how to code can be exciting, but it can also be a huge time-sink and the biggest source of frustration from time to time. Since the computer code is written for a machine to read, not for a human being, one typo or a missing semicolon can make the program non-functional. Finding out and correcting this type of error can be time-consuming and demoralizing. But learning how to debug is half of programming. So don’t be discouraged.

5. Find a community for social learning and support.

Having someone to talk to about coding problems while you are learning can be a great help. Sign up for listservs where coding librarians or library coders frequent, such as code4lib and web4lib to get feedback when you need. Research the cause of the problem that you encounter as much as possible on your own. When you still are unsure about how to go about tackling it, post your question to the sites such as Stack Overflow for suggestions and answers from more experienced programmers. It is also a good idea to organize a study group with like-minded people and get support for both coding-related and learning-related problems. You may also find local meet-ups available in your area using sites like MeetUp.com.

Don’t be intimidated by those who seem to know much more than you in those groups (as you know much more about libraries than they do and you have things to contribute as well), but be aware of the cultural differences between the developer community and the librarian community. Unlike the librarian community that is highly accommodating for new librarians and sometimes not-well-thought-out questions, the developer community that you get to interact with may appear much less accommodating, less friendly, and less patient. However, remember that reading many lines of code, understanding what they are supposed to do, and helping someone to solve a problem occurring in those lines can be time-consuming and difficult even to a professional programmer. So it is polite to do a thorough research on the Web and with some reference resources first before asking for others’ help. Also, always post back a working solution when your problem is solved and make sure to say thank you to people who helped you. This way, you are contributing back to the community.

6. Start working on a real-life problem ‘now.’ Don’t wait!

Librarians are often motivated to learn how to code in order to solve real-life problems they encounter at their workplace. Solving a real-life problem with programming is therefore the most effective way to learn and to keep up the interest in programming. One of the greatest mistake in learning programming is putting off writing one’s own code and waiting to work on a real-life problem for the reason that one doesn’t know yet enough to do so. While it is easy to think that once you learn a bit more, it would be easier to approach a problem, this is actually a counter-productive learning strategy as far as programming is concerned because often the only way to find out what to learn is by trying to solve a problem.

7. Build on what you learned.

Another mistake to avoid in learning how to program is failing to build on what one has learned. Having solved one set of problem doesn’t mean that you will remember that programming solution you created next time when you have to solve a similar problem. Repeating what one has succeeded at and expanding on that knowledge will lead to a stronger foundation for more advanced programming knowledge. Also instead of trying to learn more than one programming language (e.g. Python, PHP, Ruby, etc.) and/or a web framework (e.g. Django, cakePHP, Ruby On Rails, etc.) at the same time, first try to become reasonably good at one. This will make it much easier to pick up another language later in the future.

8. Code regularly and be persistent.

It is important to understand that learning how to program and becoming good at it will take time. Regular coding practice is the only way to get there. Solving a problem is a good way to learn, but doing so on a regular basis as often as possible is the best way to make what you learned stick and stay in your head.

While is it easy to say practice coding regularly and try to apply it as much as possible to the library environment, actually doing so is quite difficult. There are not many well-established communities for fledgling coders in libraries that provide needed guidance and support. And while you may want to work with library systems at your workplace right away, your lack of experience may prove problematic in gaining a necessary permission to tinker with them. Also as a full-time librarian, programming is likely to be thrown to the bottom of your to-do list.

Be aware of these obstacles and try to find a way to overcome them as you go. Set small goals and use them as milestones. Be persistent and don’t be discouraged by poor documentation, syntax errors, and failures. With consistent practice and continuous learning, programming can surely be learned.

Resources

A. Resources for learning

B. Communities

 

More APIs: writing your own code (2)

My last post “The simpest AJAX: writing your own code (1)” discussed a few Javascript and JQuery examples that make the use of the Flickr API. In this post, I try out APIs from providers other than Flickr. The examples will look plain to you since I didn’t add any CSS to dress them up. But remember that the focus here is not in presentation but in getting the data out and re-displaying it on your own. Once you get comfortable with this process, you can start thinking about a creative and useful way in which you can present and mash up the same data. We will go through 5 examples I created with three different APIs.  Before taking a look at the codes, check out the results below first.

 

I. Pinboard API

The first example is Pinboard. Many libraries moved their bookmarks in Del.icio.us to a different site when there was a rumor that Del.cio.us may be shut down by Yahoo. One of those sites were Pinboard. By getting your bookmark feeds from Pinboard and manipulating them, you can easily present a subset of your bookmark as part of your website.

(a) Display bookmarks in Pinboard using its API

The following page uses JQuery to access the JSONP feed of my public bookmarks in Pinboard. $.ajax() method is invoked on line 13. Line 15, jsonp:”cb”, gives the name to a callback function that will wrap the JSON feed data in it. Note line 18 where I print out data received into the console. This way, you can check if you are receiving JSONP feed in the console of Firebug. Line 19-22 uses $.each() function to access each element in the JSONP feed and the .append() method to add each bookmark’s title and url to the “pinboard” div. JQuery documentation has detailed explanation and examples for its functions and methods. So make sure to check it out if you have any questions about a JQuery function or method.

Pinboard API – example 1

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Pinboard: JSONP-AJAX Example</title>
<script type="text/javascript" src="http://jqueryjs.googlecode.com/files/jquery-1.3.2.min.js"></script>
</head>
<body>
<p> This page takes the JSON feed from <a href="http://pinboard.in/u:bohyunkim/">my public links in Pinboard</a> and re-presents them here. See <a href="http://feeds.pinboard.in/json/u:bohyunkim/">the Pinboard's JSON feed of my links</a>.</p>
<div id="pinboard"><h1>All my links in Pinboard</h1></div>
<script>
$(document).ready(function(){	
$.ajax({
  url:'http://feeds.pinboard.in/json/u:bohyunkim',
  jsonp:"cb",
  dataType:'jsonp',
  success: function(data) {
  	console.log(data); //dumps the data to the console to check if the callback is made successfully
    $.each(data, function(index, item){
      $('#pinboard').append('<div><a href="' + item.u + '">' + item.d
+ '</a></div>');
      }); //each
    } //success
  }); //ajax

});//ready
</script>
</body>
</html>

Here is the screenshot of the page. I opened up the Console window of the Firebug (since I dumped the received in line 18) and you can see the incoming data here. (Note. Click the images to see the large version.)

But it is much more convenient to see the organization and hierarchy of the JSONP feed in the Net panel of Firebug.

And each element of the JSONP feed can be drilled down for further details by clicking the object in each row.

(b) Display only a certain number of bookmarks

Now, let’s display only a certain number of bookmarks. In order to do this, one more line is needed. Line 9 checks the position of each element and breaks the loop when the 5th element is processed.

Pinboard API – example 2

$.ajax({
  url:'http://feeds.pinboard.in/json/u:bohyunkim',
  jsonp:"cb",
  dataType:'jsonp',
  success: function(data) {
    $.each(data, function(index, item){
      	$('#pinboard').append('<div><a href="' + item.u + '">' + item.d
+ '</a></div>');
    	if (index == 4) return false; //only display 5 items
      }); //each
    } //success
  }); //ajax

(c) Display bookmarks with a certain tag

Often libraries want to display bookmarks with a particular tag. Here I add a line using JQuery method $.inArray() to display only bookmarks tagged with ‘fiu.’ $.inArray()  method takes value and array as parameters and returns 0 if the value is found in the array otherwise -1. Line 7 checks if the tag array of a bookmark (item.t) does include ‘fiu,’ and only in such case displays the bookmark. As a result, only the bookmarks with the tag ‘fiu’ are shown in the page.

Pinboard API – example 3

$.ajax({
  url:'http://feeds.pinboard.in/json/u:bohyunkim',
  jsonp:"cb",
  dataType:'jsonp',
  success: function(data) {
    $.each(data, function(index, item){
    	if ($.inArray("fiu", item.t)!==-1) // if the tag is 'fiu'
      		$('#pinboard').append('<div><a href="' + item.u + '">' + item.d
+ '</a></div>');
	}); //each
    } //success
  }); //ajax
II. Reddit API

My second example uses Reddit API. Reddit is a site where people comment on news items of interest. Here I used $.getJSON() instead of $.ajax() in order to process the JSONP feed from the Science section of Reddit. In the case of Pinboard API, I could not find out a way to construct a link that includes a call back function in the url. Some of the parameters had to be specified such as jsonp:”cb”, dataType:’jsonp’. For this reason, I needed to use $.ajax() function. On the other hand, in Reddit, getting the JSONP feed url was straightforward: http://www.reddit.com/r/science/.json?jsonp=cb.

Line 19 adds a title of the page. Line 20-22 extracts the title and link to the news article that is being commented and displays it. Under the news item, the link to the comments for that article in Reddit is added as a bullet item. You can see that, in Line 17 and 18, I have used the console to check if I get the right data and targeting the element I want and then commented out later.

This is just an example, and for that reason, the result is a rather simplified version of the original Reddit page with less information. But as long as you are comfortable accessing and manipulating data at different levels of the JSONP feed sent from an API, you can slice and dice the data in a way that suits your purpose best. So in order to make a clever mash-up, not only the technical coding skills but also your creative ideas of what different sets of data and information to connect and present to offer something new that has not been available or apparent before.

Reddit API – example

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Reddit-Science: JSONP-AJAX Example</title>
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.8.0/jquery.min.js"></script>
</head>

<body>
<p> This page takes the JSONP feed from <a href="http://www.reddit.com/r/science/">Reddit's Science section</a> and presents the link to the original article and the comments in Reddit to the article as a pair. See <a href="http://www.reddit.com/r/science/.json?jsonp=?">JSONP feed</a> from Reddit.</p>

<div id="feed"> </div>
<script type="text/javascript">
	//run function to parse json response, grab title, link, and media values, and then place in html tags
$(document).ready(function(){		
	$.getJSON('http://www.reddit.com/r/science/.json?jsonp=?', function(rd){
		//console.log(rd);
		//console.log(rd.data.children[0].data.title);
		$('#feed').html('<h1>*Subredditum Scientiae*</h1>');
		$.each(rd.data.children, function(k, v){
      		$('#feed').append('<div><p>'+(k+1)+': <a href="' + v.data.url + '">' + v.data.title+'</a></p><ul><li style="font-variant:small-caps;font-size:small"><a href="http://www.reddit.com'+v.data.permalink+'">Comments from Reddit</a></li></ul></div>');
      }); //each
	}); //getJSON
});//ready	
</script>

</body>
</html>

The structure of a JSON feed can be confusing to make out particularly. So make sure to use the Firebug Net window to figure out the organization of the feed content and the property name for the value you want.

But what if the site from which you would like to get data doesn’t offer JSONP feed? Fortunately you can convert any RSS or XML feed into JSONP feed. Let’s take a look!

III. PubMed Feed with Yahoo Pipes API

Consider this PubMed search. This is simple search that looks for items in PubMed that has to do with Florida International University College of Medicine where I work. You may want to access the data feed of this search result, manipulate, and display in your library website. So far, we have performed a similar task with the Pinboard and the Reddit API using JQuery. But unfortunately PubMed does not offer any JSON feed. We only get RSS feed instead from PubMed.

This is OK, however. You can either manipulate the RSS feed directly or convert the RSS feed into JSON, which you are more familiar with now. Yahoo Pipes is a handy tool for just that purpose. You can do the following tasks with Yahoo Pipes:

  • combine many feeds into one, then sort, filter and translate it.
  • geocode your favorite feeds and browse the items on an interactive map.
  • power widgets/badges on your web site.
  • grab the output of any Pipes as RSS, JSON, KML, and other formats.

Furthermore, there may be a pipe that has been already created for exactly what you want to do by someone else. PubMed is a popular resource. As I expected, I found a pipe for PubMed search. I tested, copied the pipe, and changed the search term. Here is the screenshot of my Yahoo Pipe.

If you want to change the pipe, you can click “View Source” and make further changes. Here I just changed the search terms and saved the pipe.

After that, you want to get the results of the Pipe as JSON. If you hover over the “Get as JSON” link in the first screenshot above, you will get a link: http://pipes.yahoo.com/pipes/pipe.run?_id=e176c4da7ae8574bfa5c452f9bb0da92&_render=json&limit=100&term=”Florida International University” and “College of Medicine” But this returns JSON, not JSONP.

In order to get that JSON feed wrapped into a callback function, you need to add this bit, &_callback=myCallback, at the end of the url: http://pipes.yahoo.com/pipes/pipe.run?_id=e176c4da7ae8574bfa5c452f9bb0da92&_render=json&limit=10&term=%22Florida+International+University%22+and+%22College+of+Medicine%22&_callback=myCallback. Now the JSON feed appears wrapped in a callback function like this: myCallback( ). See the difference?

Line 25 enables you to bring in this JSONP feed and invokes the callback function named “myCallback.” Line 14-23 defines this callback function to process the received feed. Line 18-20 takes the JSON data received at the level of data.value. item, and prints out each item’s title (item.title) with a link (item.link). Here I am giving a number for each item by (index+1). If you don’t put +1, the index will begin from 0 instead of 1. Line 21 stops the process when the processed item reaches 5 in number.

Yahoo Pipes API/PubMed – example

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>PubMed and Yahoo Pipes: JSONP-AJAX Example</title>
<script type="text/javascript" src="http://jqueryjs.googlecode.com/files/jquery-1.3.2.min.js"></script>
</head>
<body>
<p> This page takes the JSONP feed from <a href="http://pipes.yahoo.com/pipes/pipe.info?_id=e176c4da7ae8574bfa5c452f9bb0da92"> a Yahoo Pipe</a>, which creates JSONP feed out of a PubMed search results and re-presents them here. 
<br/>See <a href="http://pipes.yahoo.com/pipes/pipe.run?_id=e176c4da7ae8574bfa5c452f9bb0da92&_render=json&limit=10&term=%22Florida+International+University%22+and+%22College+of+Medicine%22&_callback=myCallback">the Yahoo Pipe's JSON feed</a> and <a href="http://www.ncbi.nlm.nih.gov/pubmed?term=%22Florida%20International%20University%22%20and%20%22College%20of%20Medicine%22">the original search results in PubMed</a>.</p>
<div id="feed"></div>
<script type="text/javascript">
	//run function to parse json response, grab title, link, and media values - place in html tags
	function myCallback(data) {
		//console.log(data);
		//console.log(data.count);
		$("#feed").html('<h1>The most recent 5 publications from <br/>Florida International University College of Medicine</h1><h2>Results from PubMed</h2>');
		$.each(data.value.items, function(index, item){
      		$('#feed').append('<p>'+(index+1)+': <a href="' + item.link + '">' + item.title
+ '</a></p>');
        if (index == 4) return false; //display the most recent five items
      }); //each
	} //function
	</script>
<script type="text/javascript" src="http://pipes.yahoo.com/pipes/pipe.run?_id=e176c4da7ae8574bfa5c452f9bb0da92&_render=json&limit=10&term=%22Florida+International+University%22+and+%22College+of+Medicine%22&_callback=myCallback"></script>
</body>
</html>

Do you feel more comfortable now with APIs? With a little bit of JQuery and JSON, I was able to make a good use of third-party APIs. Next time, I will try the Worldcat Search API, which is closer to the library world and see how that works.

 

The simplest AJAX: writing your own code (1)

It has been 8 months since the Code Year project started. Back in January, I have provided some tips. Now I want to check in to see if how well you have been following along. Falling behind? You are not alone. Luckily there are still 3-4 months left.

Teaching oneself how to code is not easy. One of the many challenges is keeping at it on a regular basis. Both at home and at work, there seems to be always a dozen things higher in priority than code lessons. Another problem is that often we start a learning project by reading a book with some chosen examples. The Code Year project is somewhat better since it provides interactive tutorials. But at the end of many tutorials, you may have experienced the nagging feeling of doubt about whether you can now go out to the real world and make something that works. Have you done any real-time project yet?

If you are like me, the biggest obstacle in starting your own first small coding project is not so much the lack of knowledge as the fantasy that you still have yet more to learn before trying any such real-life-ish project. I call this ‘fantasy’ because there is never such a time when you are in full possession of knowledge before jumping into a project. In most cases, you discover what you need to learn only after you start a project and run into a problem that you need to solve.

So for this blog post, I tried building something very small. During the process, I had to fight constantly with the feeling that I should go back to the Code Year Project and take those comforting lessons in Javascript and JQeury that I didn’t have time to work through yet. But I also knew that I would be so much more motivated to keep learning if I can just see myself making something on my own. I decided to try some very simple AJAX stuff and started by looking at two examples on the Web.  Here I will share those examples and my review process that enabled me to write my own bit of code. After looking at these, I was able to use different APIs to get the same result. My explanation below is intentionally detailed for beginners. But if you can understand the examples without my line-by-line explanation, feel free to skip and go directly to the last section where the challenge is.  For what would your AJAX skill be useful?  There are a lot of useful data in the cloud. Using AJAX, you can dynamically display your library’s photos stored in Flickr in your library’s website or generate a custom bibliography on the fly using the tags in Pinboard or MESH (Medical Subject Heading) and other filters in PubMed. You can mash up data feeds from multiple providers and create something completely new and interesting such as HealthMap, iSpiecies, and Housing Maps.

Warm-up 1: Jason’s Flickr API example

I found this example, “Flickr API – Display Photos (JSON)” quite useful. This example is at Jason Clark’s website. Jason has many cool code examples and working programs under the Code & Files page. You can see the source of the whole HTML page here . But let’s see the JS part below.

<script type="text/javascript">
//run function to parse json response, grab title, link, and media values - place in html tags
function jsonFlickrFeed(fr) {
    var container = document.getElementById("feed");
    var markup = '<h1>' + '<a href="' + fr.link+ '">' + fr.title + '</a>'+ '</h1>';
    for (var i = 0; i < fr.items.length; i++) {
    markup += '<a title="' + fr.items[i].title + '" href="' + fr.items[i].link + '"><img src="' + fr.items[i].media.m + '" alt="' + fr.items[i].title + '"></a>';
}
container.innerHTML = markup;
}
</script>
<script type="text/javascript" src="http://api.flickr.com/services/feeds/photos_public.gne?tags=cil2009&format=json">
</script>

After spending a few minutes looking at the source of the page, you can figure out the following:

  • Line 12 imports data formatted in JSON from Flickr, and the JSON data is wrapped in a JS function called jsonFlickrFeed. You can find these data source urls in API documentation usually. But many API documentations are often hard to decipher. In this case, this MashupGuide page by Raymond Yee was quite helpful.
  • Line 3-8 are defining the jsonFlickrFeed function that processes the JSON data.

You can think of JSON as a JS object or an associative array of them. Can you also figure out what is going on inside the jsonFlickrFeed function? Let’s go through it line by line.

  • Line 4 creates a variable, container, and sets it to the empty div given the id of the “feed.”
  • Line 5 creates another variable, markup, which will include a link and a title of “fr,” which is an arbitrary name that refers to the JSON data thrown inside the jsonFlickrFeed fucntion.
  • Line 6-8 are a for-loop that goes through every object in the items array and extracts its title and link as well as the image source link and title. The loop also adds the resulting HTML string to the markup variable.
  • Line 9 assigns the content of the markup variable as the value of the HTML content of the variable, container. Since the empty div with the “feed” id was assigned to the variable container, now the feed div has the content of var markup as its HTML content.

So these two JS snippets take an empty div like this:

<div id="feed"></div>

Then they dynamically generate the content inside with the source data from Flickr following some minimal presentation specified in the JS itself. Below is the dynamically generated content for the feed div. The result like this.

<div id="feed">
<h1>
<a href="http://www.flickr.com/photos/tags/cil2009/">Recent Uploads tagged cil2009</a>
</h1>
<a href="http://www.flickr.com/photos/matthew_francis/3458100856/" title="Waiting at Vienna metro (cropped)">
<img alt="Waiting at Vienna metro (cropped)" src="http://farm4.staticflickr.com/3608/3458100856_d01b26cf1b_m.jpg">
</a>
<a href="http://www.flickr.com/photos/libraryman/3448484629/" title="Laptop right before CIL2009 session">
<img alt="Laptop right before CIL2009 session" src="http://farm4.staticflickr.com/3389/3448484629_9874f4ab92_m.jpg">
</a>
<a href="http://www.flickr.com/photos/christajoy42/4814625142/" title="Computers in Libraries 2009">
<img alt="Computers in Libraries 2009" src="http://farm5.staticflickr.com/4082/4814625142_f9d9f90118_m.jpg">
</a>
<a href="http://www.flickr.com/photos/librarianinblack/3613111168/" title="David Lee King">
<img alt="David Lee King" src="http://farm4.staticflickr.com/3354/3613111168_02299f2b53_m.jpg">
</a>
<a href="http://www.flickr.com/photos/librarianinblack/3613111084/" title="Aaron Schmidt">
<img alt="Aaron Schmidt" src="http://farm4.staticflickr.com/3331/3613111084_b5ba9e70bd_m.jpg">
</a>
<a href="http://www.flickr.com/photos/librarianinblack/3612296027/" block"="" libraries"="" in="" computers="" title="The Kids on the ">
<img block"="" libraries"="" in="" computers="" alt="The Kids on the " src="http://farm3.staticflickr.com/2426/3612296027_6f4043077d_m.jpg">
</a>
<a href="http://www.flickr.com/photos/pegasuslibrarian/3460426841/" title="Dave and Greg look down at CarpetCon">
<img alt="Dave and Greg look down at CarpetCon" src="http://farm4.staticflickr.com/3576/3460426841_ef2e57ab49_m.jpg">
</a>
<a href="http://www.flickr.com/photos/pegasuslibrarian/3460425549/" title="Jason and Krista at CarpetCon">
<img alt="Jason and Krista at CarpetCon" src="http://farm4.staticflickr.com/3600/3460425549_55443c5ddb_m.jpg">
</a>
<a href="http://www.flickr.com/photos/pegasuslibrarian/3460422979/" title="Lunch with Dave, Laura, and Matt">
<img alt="Lunch with Dave, Laura, and Matt" src="http://farm4.staticflickr.com/3530/3460422979_96c020a440_m.jpg">
</a>
<a href="http://www.flickr.com/photos/jezmynne/3436564507/" title="IMG_0532">
<img alt="IMG_0532" src="http://farm4.staticflickr.com/3556/3436564507_551c7c5c0d_m.jpg">
</a>
<a href="http://www.flickr.com/photos/jezmynne/3436566975/" title="IMG_0529">
<img alt="IMG_0529" src="http://farm4.staticflickr.com/3328/3436566975_c8bfe9b081_m.jpg">
</a>
<a href="http://www.flickr.com/photos/jezmynne/3436556645/" title="IMG_0518">
<img alt="IMG_0518" src="http://farm4.staticflickr.com/3579/3436556645_9b01df7f93_m.jpg">
</a>
<a href="http://www.flickr.com/photos/jezmynne/3436569429/" title="IMG_0530">
<img alt="IMG_0530" src="http://farm4.staticflickr.com/3371/3436569429_92d0797719_m.jpg">
</a>
<a href="http://www.flickr.com/photos/jezmynne/3436558817/" title="IMG_0524">
<img alt="IMG_0524" src="http://farm4.staticflickr.com/3331/3436558817_3ff88a60be_m.jpg">
</a>
<a href="http://www.flickr.com/photos/jezmynne/3437361826/" title="IMG_0521">
<img alt="IMG_0521" src="http://farm4.staticflickr.com/3371/3437361826_29a38e0609_m.jpg">
</a>
<a href="http://www.flickr.com/photos/jezmynne/3437356988/" title="IMG_0516">
<img alt="IMG_0516" src="http://farm4.staticflickr.com/3298/3437356988_5aaa94452c_m.jpg">
</a>
<a href="http://www.flickr.com/photos/jezmynne/3437369906/" title="IMG_0528">
<img alt="IMG_0528" src="http://farm4.staticflickr.com/3315/3437369906_01015ce018_m.jpg">
</a>
<a href="http://www.flickr.com/photos/jezmynne/3436560613/" title="IMG_0526">
<img alt="IMG_0526" src="http://farm4.staticflickr.com/3579/3436560613_98775afc79_m.jpg">
</a>
<a href="http://www.flickr.com/photos/jezmynne/3437359398/" title="IMG_0517">
<img alt="IMG_0517" src="http://farm4.staticflickr.com/3131/3437359398_7e339cf161_m.jpg">
</a>
<a href="http://www.flickr.com/photos/jezmynne/3436535739/" title="IMG_0506">
<img alt="IMG_0506" src="http://farm4.staticflickr.com/3646/3436535739_c164062d6b_m.jpg">
</a>
</div>

Strictly speaking, Flickr is returning data in JSONP rather than JSON here. You will see what JSONP means in a little bit. But for now, don’t worry about that distinction. What is cool is that you can grab the data from a third party like Flickr and then you can remix and represent them in your own page.

Warm-up 2: Doing the same with JQuery using $.getJSON()

Since I had figured out how to display data from Flickr using Javascript (thanks to Jason’s code example), the next I wanted to try was to do the same with JQuery.  After some googling, I discovered that there is a convenient JQeury method called $.getJSON().  The official JQuery page on this $.getJSON() method includes not only the explanation about JSONP (which allows you to load the data from the domain other than yours in your browser and manipulate it unlike JSON which will be restricted by the same origin policy) but also the JQuery example of processing the same Flickr JSONP data. This is the example from the JQuery website.

$.getJSON("http://api.flickr.com/services/feeds/photos_public.gne?jsoncallback=?",
  {
    tags: "mount rainier",
    tagmode: "any",
    format: "json"
  },
  function(data) {
    $.each(data.items, function(i,item){
      $("<img/>").attr("src", item.media.m).appendTo("#images");
      if ( i == 3 ) return false;
    });
  });

As you can see in the first line, the data feed urls for JSONP response have a part similar to &jasoncallback=? at the end. The function name can vary and the API documentation of a data provider provides that bit of information. Let’s go through the codes line by line:

  • Line 1-6 requests and takes in the data feed from the speicified URL in JSONP format.
  • Once the data is received and ready, the script invokes the anonymous function from line 7-11. This function makes use of the JQuery method $.each().
  • For each of data.items, the anonymous function applies another anonymous function from line 9-10.
  • Line 9 creates an image tag – $(“<img/>”), attaches each item’s media.m element as the source attribute to the image tag – .attr(“src”, item.media.m), and lastly appends the resulting string to the empty div with the id of “images” – .appendTo(“#images”).
  • Line 10 makes sure that no more than 4 items in data.items is processed.

You can see the entire HTML page codes in the JQuery website’s $.getJSON() page.

Your Turn: Try out an API other than Flickr

So far we have looked through two examples.  Not too bad, right? To keep the post at a reasonable length, I will get to the little bit of code that I wrote in the next post.  This means that you can try the same and we can compare the result next time. Now here is the challenge. Both examples we saw used the Flickr API. Could you write code for a different API provider that does the same thing? Remember that you have to pick a data provider that offers feeds in JSONP if you want to avoid dealing with the same origin policy.

Here are a few providers you might want to check out. They all offer their data feeds in JSONP.

First, find out what data URLs you can use to get JSONP responses. Then write several lines of codes in JS and JQuery to process and display the data in the way you like in your own webpage. You may end up with some googling and research while you are at it.

Here are a few tips that will help you along the way:

  • Verify the data feed URL to see if you are getting the right JSONP responses. Just type the source url into the browser window and see if you get something like this.
  • Get the Firebug for debugging if you don’t already have it.
  • Use the Firebug’s NET panel to see if you are receiving the data OK.
  • Use the Console panel for debugging. The part of data that you want to pick up may be in several levels deep. So it is useful to know if you are getting the right item first before trying to manipulate it.

Happy Coding! See the following screenshots for the Firebug NET panel and Console panel. (Click the images to see the bigger and clearer version.) Don’t forget to share your completed project in the comments section as well as any questions, comments, advice, suggestions!

Net panel in Firebug

 

Console panel in Firebug

Library Code Year IG Meeting at ALA Annual Conference 2012

The LITA/ALCTS Library Code Year Interest Group was born from the wide spread interest in computer programming among librarians which coincided with Codecademy‘s Code Year program. The Library Code Year IG is active on both ALA Connect and on the Catcode wiki and held its inaugural meeting at ALA Annual last month.

The meeting started with introductions, which gave the membership an opportunity to share our goals for the group while also learning about common problems and frustrations that people have encountered while learning to code. The group came together over shared concerns and frustrations ranging from getting stuck on problems that can’t be solved alone to finding the lessons too dry when there is no real life application. Members also discussed the sense of frustration that comes from knowing that you need to know more about computer programming to keep up-to-date and simultaneously feeling guilty about time spent on computer programming lessons that aren’t directly related to a specific job duty.

Participants discussed techniques that they found helpful in teaching themselves to program, including:

  • reviewing lesson walkthroughs or keys (though some avoid this because it feels like “cheating”),
  • working through the problem with another student/mentor,
  • setting aside an allotted time daily or weekly to practice coding skills,
  • saving up multiple lessons or projects to work through in a single day of non-stop coding, and
  • finding code online that you can learn from and adapt for your own purposes.

These suggestions highlighted the importance of learning style and schedule flexibility when it comes to successfully teaching oneself to program. Just as importantly, the conversation showed that for most participants, committing to a long-term practice of regularly using these skills was key to success. This discussion provided an excellent foundation for the topics covered in the rest of the meeting.

The second portion of the meeting was devoted to lightning talks. Eric Phetteplace offered an introduction to bookmarklets. These relatively simple programs can be created with just a small amount of Javascript and can allow users to exert a powerful effect on websites through their browser. Bookmarklets run the gamut from the fun, such as Kick Ass, a bookmarklet which allows you to play Asteroids on any website, to Instapaper, which allows you to save and reformat webpages for future reading. Eric discussed some of the possible uses for libraries, including data harvesting or adding proxy server information to all links on a page. Any data on the web page can be accessed and changed with a simple script.

For those inspired to get started writing their own bookmarklets, Eric also provided concrete information on how to get started. He advocates using a template found online, echoing the meeting’s recurring theme that coders, particularly beginners, shouldn’t feel the need to reinvent the wheel for every project. Instead, finding templates online that can be adapted for your purposes is often a much more efficient way to start a project and a great way to learn from the work that other coders have already done. Eric also discussed tricks and tips for bookmarklets, such as having the bookmarklet point to code hosted elsewhere for easy updates, the importance of not making assumptions about the types of websites on which the bookmarklet will be used and the difficulty (to the point of virtual impossibility) of using bookmarklets on mobile browsers.

I gave the second lightning talk, which covered resources that can be used for learning or teaching programming. As was evident from our introductions, members of the group have a wide range of different interests and approaches to learning. While Code Year has worked for some people who want to learn more about Javascript, JQuery and web programming, my talk highlighted other tools that can be used to learn Python, Ruby, Java and other languages through tutorials, videos and exercises. I also discussed options for finding in-person programming classes locally for those who prefer to work with a group in person. Those interested in finding these alternative tools can refer to the handout I prepared for this talk or to my Pearltree on the topic.

The final, and arguably most important, agenda item for the meeting was discussing plans for the future. The group brainstormed and settled on focusing our efforts on a number of different types of how-to projects including:

  • A Python preconference event for beginners based on the curriculum developed by Boston Python Users Group,
  • A project based on OCLC’s APIs,
  • A Git and GitHub how-to session,
  • An IRC how-to session, and
  • A collection of resources to support those who want to host a Hackathon.

You can see the full list of volunteers for these projects on our ALA Connect space, but we are definitely looking for more helpers for these and other projects, so let us know if you want to help out! We also hope to maintain a list of members’ areas of expertise to facilitate helping each other out. If you want to coordinate this project, or if you would just like to be included on the list, add a comment on our ALA Connect space.

This first meeting is just the first step in what we hope will be a long history for this interest group. Even if you weren’t able to attend the meeting, we want you to be able to get as much as possible out of our activities. Be sure to stay in touch and please think about getting involved with us!

About Our Guest Author: Carli Spina is the Emerging Technologies and Research Librarian at the Harvard Law School Library. She has an MSLIS from Simmons College and a JD from the University of Chicago Law School and she is one of the co-chairs of the LITA/ALCTS Library Code Year Interest Group. Her interests include emerging technologies, innovation in libraries and coding. She can be found on Twitter @CarliSpina.

A Gentle Introduction to Modern Version Control

What if I told you there was something very much like Dropbox but had a difficult learning curve? Something that, rather than being easy to install and use, actually required a not-insignificant time investment to learn? And that rather than operating automatically, required you to execute a number of steps beyond hitting save in Word? Would you ask me why I’m telling you about this?

Writing software is hard business — don’t believe anyone who puts forward the front that they can birth fully-functioning code apropos of nothing. Programming is iterative; you write something, you test it, it breaks, you fix it, you write some more, it breaks something else. And it’s a common scenario to have a bug arise in your code that you wrote ages ago, sometimes years ago, and you need to go back to how your code looked at a specific point in time to fix it. Common also is adding features to already mature code; features that may break existing functionality, features that need some working out to actually function, and you want to keep them separate from what’s already working.

Enter version control. In programming, a version control system is a program that, at a very basic level, records the state of a project and has functionality to view past states, commonly called a repository and allows people to collaborate on that repository. Modern version control systems have the ability to incorporate changes to that state from multiple contributors, keep all revisions of a project indefinitely, have as many backups as the project as there are people with copies of the repository, and make it easy to collaborate with others. For simple projects it might be overkill, but when you’re dealing with million-line codebases with hundreds to thousands of contributors, a system that keeps track of who did what when makes administration of a project not easier, but possible. Version control’s been around a number of years, and there have been a number of implementations of it, but modern version control systems break down into basically two camps: decentralized (or distributed) version control and centralized version control (and, like most things in the software world, into proprietary and open source as well).

Dropbox can be thought of as a version of centralized control. You have your client machine, it has a Dropbox folder, and that folder hooks up to Dropbox HQ. Dropbox HQ will always try to have the most up to date version of your Dropbox folder; if you log into Dropbox’s web interface after moving a file into your local folder, you’ll see that file there. If you have a friend you want to share a file with via Dropbox, that person also connects to Dropbox HQ to read the files. If your friend is absolutely unable to connect to Dropbox, they’re not going to be able to read the files you put there as Dropbox is the central repository of your files. If you lose connectivity to Dropbox — let’s say if you’re working on something on the airplane — then your files aren’t backed up until you get back on the network.

A lot of version control systems work like Dropbox. Subversion is the most popular one and it, like Dropbox, requires that you have a central place to put files. Unlike Dropbox, there is no single Subversion company that runs a repository; you can put a Subversion server on your own machine. But in other respects it is largely similar — your friend again needs to be able to reach your Subversion server to be able to read your files. Having one workflow like this has a distinct advantage — one workflow means one way of doing things and cuts down on confusion — but the moment you have an unusual workflow you’ll find yourself working against the system rather than with it.

If you’re a serious Dropbox wonk, you’ve probably noticed that Dropbox keeps the last ten or so saves of each file — handy if you nuke something and need to go back to an earlier revision. But if you’re a compulsive saver, you’ll quickly run out of those last ten revisions. Any modern version control system will keep every single revision ever made to a project, so going back to any point in development is easy. You can see the revision history of this very post right here and take a look at a brief visualization of Jason Griffey’s librarybox project.

Decentralized version control systems are somewhat of a more recent development than centralized. A decentralized version control system, like git, Bazaar, or Mercurial, doesn’t need a central repository. A decentralized version control system can be run entirely locally, without sharing files at all, or it can be run in a manner similar to a centralized one. This workflow flexibility can be liberating, but it can also take some time to work out what workflow works well for a team. In our local development center, we have one developer and one systems administrator, so consequently our workflows tend to be quite simple. The developer has his own desktop machine at which he works, the application being developed runs live on a server in the basement, and another machine contains a central git repository. The developer writes and tests code on his machine and, when he is happy with the state of development, pushes that code to the central repository. He then tells the system admin to pull those changes to the live server where they are incorporated into the code. Because Git keeps all revisions of a project, it’s easy to roll back changes if they don’t work.

Another aspect of modern version control that is used most effectively by the decentralized version control systems is branching. Branching allows you to take your project, save the state that it’s at, and start a new version of the project that may go in a radically different direction; say to test out new features that may break things. Judicious use of branching can result in always having a version of a project that is ready for release while still maintaining flexibility to make new things. Mature software projects will often have multiple branches for various feature sets. When those feature sets can be integrated into the main branch of the software, they can usually be moved painlessly or, if necessary, thrown away without affecting any working code.

Software version control comes with a learning curve, but as libraries start becoming producers of code as well as consumers of it, it will be an important tool to know about.

About Our Guest Author: John Fink is Digital Scholarship Librarian at McMaster University in Hamilton, Ontario. He holds an MLS  from San Jose State University, and has spent most of his working life in library systems departments. His primary research interests are copyright, physical computing, and emerging technologies.