Linked Data in Libraries: Getting into the W3C Library Linked Data Incubator Group

What are libraries doing (or not doing) about linked data? This was the question that the W3C Library Linked Data Incubator Group investigated between May 2010 and August 2011. In this post, I will take a look at the  final report of the W3C Library Linked Data Incubator Group (October 2011) and provide an overview of their recommendations and my own analysis of the issues. Incubator Groups were a program that the W3C ran from 2006-2012 to get work done quickly on innovative ideas where there wasn’t enough to actually begin working on creating the web standards for which the W3C exists. (The Incubator Group program has transitioned into Community and Business Groups).

In this report, the participants in the group made several key recommendations aimed at library leaders, library standards bodies, data and systems designers, and librarian and archivists. The recommendations indicate just how far we are from really being able to implement open linked data in every library but also reveal the current landscape.
Library Leaders

An illustration of the VIAF authority file for Jane Austen

The report calls on library leaders to identify potentially very useful sets of data that can be exposed easily using current practices. That is, they should not try to revolutionize workflows, but to evolve towards more linked data. They mention authority files as an example of a data set that is ideal for this purpose, since authority files are lists of real world people with attributes that connect to real things. Having some semantic context for authority files helps–we could imagine a scenario in which you are searching for a common name, but the system recognizes that you are searching for a twentieth century American author and so does not show you a sixteenth century British author. Catalogers don’t necessarily have to do anything differently, either, since these authority files can link to other data to make a whole
picture. VIAF (Virtual International Authority File) is a project between OCLC and several national libraries to create such a linked international authority file using linked data and enter into the semantic web.

Library leadership must face the issue of rights in an open data world. It is a trope that libraries hold much valuable cultural and bibliographic data. Yet in many cases we have purchased or leased this data from a vendor rather than creating it ourselves (certainly we do in the case of indexes and often with catalog records)–and the license terms may not allow for open sharing of the data. We must be aware that exposing linked data openly is probably not going to mesh well with the way we have done things traditionally. Harvard recently released 12 million bibliographic records under a CC0 (public domain) license. Many libraries might not be in the position to release their own bibliographic records if they did not create them originally. Of course the same goes for indexes or bibliographies, other categories of traditional library materials that seems ripe for linking semantically. Library leadership will have to address this before open linked data is truly possible.
Library Standards Bodies

The report calls on library standards bodies to attack the problem from both sides. First, librarians need to be involved with standardizing semantic web technologies in a way that meets their needs and ensures that the library world stays in line with the way the technology is moving generally. Second, creators of library data standards need to ensure that those standards are compatible with semantic web technologies. Library data, when encoded in MARC, combines meaning and the structure in one unit. This works well for people who are reading the data, but is not easy for computers to parse semantically.  For instance, consider:

245 10|aPride and prejudice /|cby Jane Austen.
which viewed in the browser or on the catalog card like:

Pride and prejudice /
by Jane Austen.

The 245 tells us that this is a main title, and then the 1 tells us there is an added entry, in this case for Jane Austen. The 0 tells us that the title doesn’t begin with an article, or “nonfiling character”. The |a gives the actual title, followed by a / character, and then the |c is the statement of responsibility, followed by a period. Note that there is semantic meaning mixed together with punctuation and words that are helpful for people, such as “by”, which follow the rules of AACR2. There are good reasons for these rules, but the rules were meant to serve the information needs of humans. Given the capabilities of computers to parse and present structured data meaningfully to humans, it seems vital to make library data understandable to computers and know that we can use it to make something more useful to people. You may have noticed that HTML has changed over the past few years in the same way that library data will have to change. If you, for instance, want to give emphasis to a word, you use the <em></em>  tags. People know the word is emphasized because it’s in italics, the computer knows it’s emphasized because you told it that it was. Indicating that a word should be italicized using the <i></i> tags looks the same to a human reader who can understand the context for the use of italics, but doesn’t tell the computer that the word is particularly important. HTML 5 has even more use of semantic tags to make more of the standard ways of presenting information on the web meaningful to computers.

Systems Designers
The recommendations for data and systems designers are to start building tools that use linked data. Without a “killer app”, it’s hard to get excited about semantic technologies. Just after my last post went up, Google released its “Knowledge Graph”. This search takes words that traditionally would be matched as words, and matches them with “things.” For instance, if I type the search string Lincoln Hall into Google. Google guesses that I probably mean a concert venue in Chicago with that name and shows me that as the first result. It also displays a map, transit directions, reviews, and an upcoming schedule on the sidebar–certainly very convenient if that’s what I was looking for. But below the results for the concert venue, I get a box stating “See results about Lincoln Hall, Climber.” When I click on this, my results change to information about the Australian climber who recently died, and the side bar changes to information about him. Now as a librarian, I know that there would have been many ways to improve my search. But because semantic web technologies allow Google’s algorithms to understand that despite having the same name, an entity of a concert venue and a mountaineer are very different. This neatly disposes of the need for sophisticated searching for facts about things.  Whether this is, indeed, revolutionary remains to be seen. But try it as a user. You might be pleasantly surprised by how it makes your search easier. It may be that web-scale discovery will do the same thing for libraries, but this is a tool that remains out of reach of many libraries.
Librarians and Archivists
Librarians and archivists have, as always, a duty to collect and preserve linked data sets. We know how valuable the earliest examples of any piece of data storage are–whether it’s a clay tablet, a book, or an index. We create bibliographies to see how knowledge changed over time or in different contexts. We need to be careful to  preserve important data sets currently being produced, and maintain them over time so they remain accessible for future needs. But there’s another danger inherent in not being scrupulous about data integrity. Maintaining accurate and diverse data sets will help keep future information factual and unbiased. When a fact is one step removed from its source, it becomes even more difficult to check it for accuracy. While outright falsehood or misstatement is possible to correct, it will also be important to present alternate perspectives to ensure that scholarship can progress. (For an example of the issues in only presenting the most mainstream understanding of history, see “The ‘Undue Weight’ of Truth on Wikipedia”). If linked data doesn’t help us find out anything novel, will there have been a point in linking it?
Conclusion 
If you  haven’t yet read it, the report is a quick read and clear to people without a technical background, so I encourage you to take a look at it, particularly with reference to the use cases and data sets already extant. I hope you will get excited about the possibilities, and even if you are not in a position to use linked data yet, be thinking about what the future could hold. As I mentioned in my last post, the LODLAM (International Linked Open Data in Libraries, Archives, and Museums Summit) blog and the Digital Library Federation sponsored LOD-LAM Zotero group have lots of resources. There is also an ALA Library Linked Data Interest Group which sponsors discussions and has a mailing list.

A Gentle Introduction to Modern Version Control

What if I told you there was something very much like Dropbox but had a difficult learning curve? Something that, rather than being easy to install and use, actually required a not-insignificant time investment to learn? And that rather than operating automatically, required you to execute a number of steps beyond hitting save in Word? Would you ask me why I’m telling you about this?

Writing software is hard business — don’t believe anyone who puts forward the front that they can birth fully-functioning code apropos of nothing. Programming is iterative; you write something, you test it, it breaks, you fix it, you write some more, it breaks something else. And it’s a common scenario to have a bug arise in your code that you wrote ages ago, sometimes years ago, and you need to go back to how your code looked at a specific point in time to fix it. Common also is adding features to already mature code; features that may break existing functionality, features that need some working out to actually function, and you want to keep them separate from what’s already working.

Enter version control. In programming, a version control system is a program that, at a very basic level, records the state of a project and has functionality to view past states, commonly called a repository and allows people to collaborate on that repository. Modern version control systems have the ability to incorporate changes to that state from multiple contributors, keep all revisions of a project indefinitely, have as many backups as the project as there are people with copies of the repository, and make it easy to collaborate with others. For simple projects it might be overkill, but when you’re dealing with million-line codebases with hundreds to thousands of contributors, a system that keeps track of who did what when makes administration of a project not easier, but possible. Version control’s been around a number of years, and there have been a number of implementations of it, but modern version control systems break down into basically two camps: decentralized (or distributed) version control and centralized version control (and, like most things in the software world, into proprietary and open source as well).

Dropbox can be thought of as a version of centralized control. You have your client machine, it has a Dropbox folder, and that folder hooks up to Dropbox HQ. Dropbox HQ will always try to have the most up to date version of your Dropbox folder; if you log into Dropbox’s web interface after moving a file into your local folder, you’ll see that file there. If you have a friend you want to share a file with via Dropbox, that person also connects to Dropbox HQ to read the files. If your friend is absolutely unable to connect to Dropbox, they’re not going to be able to read the files you put there as Dropbox is the central repository of your files. If you lose connectivity to Dropbox — let’s say if you’re working on something on the airplane — then your files aren’t backed up until you get back on the network.

A lot of version control systems work like Dropbox. Subversion is the most popular one and it, like Dropbox, requires that you have a central place to put files. Unlike Dropbox, there is no single Subversion company that runs a repository; you can put a Subversion server on your own machine. But in other respects it is largely similar — your friend again needs to be able to reach your Subversion server to be able to read your files. Having one workflow like this has a distinct advantage — one workflow means one way of doing things and cuts down on confusion — but the moment you have an unusual workflow you’ll find yourself working against the system rather than with it.

If you’re a serious Dropbox wonk, you’ve probably noticed that Dropbox keeps the last ten or so saves of each file — handy if you nuke something and need to go back to an earlier revision. But if you’re a compulsive saver, you’ll quickly run out of those last ten revisions. Any modern version control system will keep every single revision ever made to a project, so going back to any point in development is easy. You can see the revision history of this very post right here and take a look at a brief visualization of Jason Griffey’s librarybox project.

Decentralized version control systems are somewhat of a more recent development than centralized. A decentralized version control system, like git, Bazaar, or Mercurial, doesn’t need a central repository. A decentralized version control system can be run entirely locally, without sharing files at all, or it can be run in a manner similar to a centralized one. This workflow flexibility can be liberating, but it can also take some time to work out what workflow works well for a team. In our local development center, we have one developer and one systems administrator, so consequently our workflows tend to be quite simple. The developer has his own desktop machine at which he works, the application being developed runs live on a server in the basement, and another machine contains a central git repository. The developer writes and tests code on his machine and, when he is happy with the state of development, pushes that code to the central repository. He then tells the system admin to pull those changes to the live server where they are incorporated into the code. Because Git keeps all revisions of a project, it’s easy to roll back changes if they don’t work.

Another aspect of modern version control that is used most effectively by the decentralized version control systems is branching. Branching allows you to take your project, save the state that it’s at, and start a new version of the project that may go in a radically different direction; say to test out new features that may break things. Judicious use of branching can result in always having a version of a project that is ready for release while still maintaining flexibility to make new things. Mature software projects will often have multiple branches for various feature sets. When those feature sets can be integrated into the main branch of the software, they can usually be moved painlessly or, if necessary, thrown away without affecting any working code.

Software version control comes with a learning curve, but as libraries start becoming producers of code as well as consumers of it, it will be an important tool to know about.

About Our Guest Author: John Fink is Digital Scholarship Librarian at McMaster University in Hamilton, Ontario. He holds an MLS  from San Jose State University, and has spent most of his working life in library systems departments. His primary research interests are copyright, physical computing, and emerging technologies.


How to make peace with error messages

At last, after all those hours toiling under the glow of the computer screen, your first script is completed. All those hours learning to code have finally paid off. Holding your breath, you enter the command to execute the script… only to have an error message appear on the screen. You shake your fist to the sky and curse whatever deities you believe in, but the error message remains unchanged, almost like it is staring into your soul…

Error messages != fun, or the Art of Debugging

Error messages happen to everyone. The causes of error messages vary; sometimes the error is caused by a bug hidden in the system, other times the error stems from the human typing at the keyboard. In every case, error messages can be frustrating and time consuming (For those of you who have hunted for hours to find that missing closing bracket in your script, you know the pain that I speak of). Error messages are here to stay no matter how thorough you are with your code, so you’ll need to know how to deal with them in an efficient manner.

Making bugs visible

The first step of debugging your script is to make sure that you’re actually receiving error messages when things go wrong. Depending on the language you use, you can change your error settings to output errors in a log, in the command window, on a web page, etc. You can also change how detailed the error messages are or even create custom error messages to help pinpoint where in the script the code is failing. If your script is not running and you don’t see any errors pop up, your best bet is to look at the documentation for that programming language for error reporting.

Speaking of documentation…

RTFM

For the purposes of this post, RTFM means “Read the Friendly Manual”. Most programming languages have various documentations available in both online and print formats. The documentation is a good place to start when you suspect that the error is caused due to syntax errors or a function that is not being used properly.

Example: You have a python script that is returning the following error when you run the script [1]

>>> while True print 'Hello world'
  File "<stdin>", line 1, in ?
    while True print 'Hello world'
                   ^
SyntaxError: invalid syntax

The  “SyntaxError: invalid syntax” tells you that your code is not formatted correctly and you have a little carrot pointing to the last letter on the print function, so one would assume that the error involves the use of print in the while statement in some way. A quick search on the python website shows the proper syntax for constructing the while statement:

while True: print 'Hello world'

Note that we were missing a colon between the expression (while True) and the suite (print ‘Hello world’)

Reading the documentation is also a good way to reduce errors, so it’s worth your time to seek out good resources about the language you’re using and study them before and while you’re building the script.

GIYF

GIYF – Google is your friend. Librarians have a love-hate relationship with Google for reasons that have been covered extensively elsewhere. Google, however, is a staple in the programmer’s life. Copy the error message, paste it in the search box, and you’ll receive a multitude of hits of varying quality. If I do an exact phrase search on the (very common) PHP error “Parse error: syntax error, unexpected T_VARIABLE”, Google comes back to me with around 82,200 results. That’s a lot of results to wade through. A good thing to remember is that the same criteria that many librarians teach their user in evaluating sites for research can be applied while seeking out help with troubleshooting errors. By applying what is taught in many Information Literacy sessions, you will quickly narrow down the number of sites to use in your bug squashing process.

Once you have done a few searches for errors in one language, you hopefully will have found a few web sites that are consistent in providing detailed, accurate information about troubleshooting errors.  Some sites and resources that have been useful when I tracked down various error messages:

Some library-related resources for those dealing with errors while working on various library modules/script libraries in various programming languages:

There are many other places where you can search for help which you will find in your error tracking time.

Now for some audience participation:

  • Do you have a resource that helped you with errors and bugs?
  • What errors have you come across with coding for library related projects, and how did you find a fix?

Please share in the comments below. Happy bug squashing!

 

[1] Error example from http://docs.python.org/tutorial/errors.html#syntax-errors