A Librarian’s Guide to OpenRefine

Academic librarians working in technical roles may rarely see stacks of books, but they doubtless see messy digital data on a daily basis. OpenRefine is an extremely useful tool for dealing with this data, one that requires no sophisticated scripting skills and has a very low learning curve. Once you learn a few tricks with it, you may never need to force a student worker to copy and paste items into Excel spreadsheets again.

As this comparison by the creator of OpenRefine shows, the tool is best used to explore and transform data: it allows you to edit many cells and rows at once while still seeing your data. This lets you experiment and undo mistakes easily, which is a great advantage over databases or scripting, where you can’t always see what’s happening or undo the typo you made. It’s also a lot faster than editing cell by cell as you would in a spreadsheet.

Here’s an example: a project that took hours when I did it in a spreadsheet took far less time when I redid it in OpenRefine. One of the quickest things to do with OpenRefine is to spot words or phrases that are almost the same and possibly refer to the same thing. Recently I needed to turn a large export of data from the catalog into data that I could load into my institutional repository. Only certain values were allowed in the repository’s controlled vocabulary, so I had to modify the bibliographic data from the catalog (which was of course in more or less proper AACR2 style) to match the vocabularies available in the repository. The problem was that the data I had wasn’t consistent–there were multiple types of abbreviations, extra spaces, extra punctuation, and outright misspellings. An example is the History Department. I can look at “Department of History”, “Dep. of History”, and “Dep of Hist.” and tell these are probably all referring to the same thing, but it’s difficult to predict those potential spellings. While I could deal with much of this with regular expressions in a text editor and find and replace in Excel, I kept running into additional problems that I couldn’t spot until I got an error. It took several rounds of loading the data before I had cleared out all the errors.

In OpenRefine this is a much simpler task, since you can use it to find everything that is probably the same thing despite slight differences in spelling, punctuation, and abbreviation. So rather than trying to write a regular expression that accounts for all the differences between “Department of History”, “Dep. of History”, and “Dep of Hist.”, you can find all the clusters of text that include those variants and change them all in one shot to “History”. More detailed instructions on how to do this appear below.

Installation and Basics

Until last October, OpenRefine was called Google Refine, and while the content from the Google Refine site is still being moved to the OpenRefine site, you should plan to look at both. Documentation and video tutorials refer interchangeably to Google Refine and OpenRefine. The official and current documentation is on the OpenRefine GitHub wiki. For specific questions you will probably want to use the OpenRefine Custom Search Engine, which brings together the mix of documentation and tutorials on the web. OpenRefine is a web app that runs on your own computer, so you don’t need an internet connection to use it. You can get the installation instructions on this page.

While you can jump right in and start playing around, it is well worth your time to watch the tutorial videos, which cover the basic actions you need to take to start working with data. As I said, the learning curve is low, but not all of the commands will make sense until you see them in action. These videos will also give you an idea of what you might be able to do with a data set you have lying around. You may also want to browse the “recipes” on the OpenRefine site, as well as search online for additional interesting things people have done; you will probably come away with more ideas about what to try. The most important thing to know about OpenRefine is that you can undo anything and go back to the state of the project before you messed up.

A basic understanding of the Google Refine Expression Language, or GREL, will improve your ability to work with data. There isn’t a whole lot of detailed documentation, so you should feel free to experiment and see what happens when you try different functions; the tutorial videos cover the basics you need to know. Another essential tool is regular expressions. Much of the data you will start with is structured (even if it’s not perfectly structured), and regular expressions help you find the patterns you can use to break strings apart and turn them into something else. Spending a few minutes understanding regular expression syntax will save hours of inefficient find and replace. There are many tutorials–my go-to source is this one. The good news for librarians is that if you can construct a Dewey Decimal call number, you can construct a regular expression!
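To give a flavor of what GREL plus regular expressions can do, here are a few expressions of the kind you might experiment with. These are untested sketches against the messy department data described above, so treat the patterns as starting points rather than finished recipes:

value.trim().replace(/\s+/, " ")
value.replace(/(?i)^dep(artment|t)?\.?\s+of\s+/, "")
value.toTitlecase()

The first collapses runs of extra whitespace, the second strips the various “Department of”/“Dept. of”/“Dep of” prefixes, and the third normalizes capitalization.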

Some ideas for librarians


(A) Typos

Above I described how you can use OpenRefine to clean up messy and inconsistent catalog data. Here’s how to do it. Load in the data, and select “Text Facet” on the column in question. OpenRefine will list the distinct values in that column, including clusters of text that are similar and probably the same thing.

AcademicDept Text Facet

Click on Cluster to get a menu for working with multiple values. You can click the “Merge” check box next to a cluster and then edit the text to whatever you need it to be, repeating this for each cluster that needs correcting.

Cluster and Edit

You can merge and re-cluster until you have fixed all the typos. Back on the first Text Facet, you can hover over any value to edit it. That way, even if the automatic clustering misses some values, you can fix the errors yourself, or change anything that is correct but needs to look different–for instance, change “Dept. of English” to just “English”.

(B) Bibliographies

The main thing I have used OpenRefine for in my daily work is changing a plain-text bibliography into spreadsheet columns that I can run against an API. This was inspired by an article in the Code4Lib Journal: “Using XSLT and Google Scripts to Streamline Populating an Institutional Repository” by Stephen X. Flynn, Catalina Oyler, and Marsha Miles. I wanted to find a way to turn a text CV into something that would work with the SHERPA/RoMEO API, so that I could find out which past faculty publications could be posted in the institutional repository. Since CVs are lists of data presented in a structured format but with some inconsistencies, OpenRefine makes it easy to reshape the data and remove the inconsistencies, and then to extend the data with a web service. What follows is a very basic set of instructions for how to accomplish this.

The main thing to accomplish is to put the journal title in its own column. Here’s an example citation in APA format, in which I’ve colored all the “separator” punctuation in red:

Heller, M. (2011). A Review of “Strategic Planning for Social Media in Libraries”. Journal of Electronic Resources Librarianship, 24(4), 339-340.

From the drop-down menu at the top of the column, choose “Edit column” and then “Split into several columns…”. You will get a menu like the one below. This example finds the opening parenthesis and removes it in creating a new column. The author’s name ends up in its own column, and the rest of the text in another column.

Split into columns


The rest of the columns work the same way–find the next piece of text, punctuation, or spacing that indicates a separation. You can then rename each column to something that makes sense. You will end up with something like this:

Split columns
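If the built-in split doesn’t produce clean columns for a messier citation style, you can also pull out individual pieces with “Add column based on this column” and a GREL expression. Here is a rough sketch against the APA example above (the regular expressions are illustrative and would need tuning for real data):

value.split(" (")[0]
value.match(/.*\((\d{4})\).*/)[0]

The first expression grabs everything before the year’s opening parenthesis (the author names); the second captures the four-digit year itself.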

When you have the journal titles in their own column, you may want to cluster the text to make sure the titles are consistent, and do any other cleanup the titles need. Now you are ready to build on this data by fetching data from a web service. The third video tutorial posted above explains the basic idea, and this tutorial is also helpful. Use the pull-down menu at the top of the journal column to select “Edit column” and then “Add column by fetching URLs…”. You will get a box that will help you construct the right URL. You need to format your URL in the way required by SHERPA/RoMEO, and you will need a free API key. For the purposes of this example, you can use 'http://www.sherpa.ac.uk/romeo/api29.php?ak=[YOUR API KEY HERE]&qtype=starts&jtitle=' + escape(value,'url'). Note that it will give you a preview so you can see whether the URL is formatted the way you expect. Give your column a name, and set the Throttle delay, which keeps the service from rejecting too many requests in a short time; I found 1000 milliseconds worked fine.


After this runs, you will get a new column with the XML returned by SHERPA/RoMEO. You can use this to pull out anything you need, but for this example I want to get the pre-archiving and post-archiving policies, as well as the conditions. A quick way to do this is to use the Google Refine Expression Language’s parseHtml function. To use it, click on “Add column based on this column” from the “Edit Column” menu, and you will get a menu to fill in an expression.


In this example I use the code value.parseHtml().select("prearchiving")[0].htmlText(), which selects just the text from within the prearchiving element. Conditions are a little different, since there are multiple conditions for each journal. In that case, you would use the following syntax (after join you can put whatever separator you want): forEach(value.parseHtml().select("condition"),v,v.htmlText()).join(". ")

So in the end, you will end up with a neatly structured spreadsheet from your original CV with all the bibliographic information in its own column and the publisher conditions listed. You can imagine the possibilities for additional APIs to use–for instance, the WorldCat API could help you determine which faculty published books the library owns.

Once you find a set of actions that gets your desired result, you can save them for the future or to share with others. Click on Undo/Redo and then the Extract option. You will get a description of the actions you took, plus those actions represented in JSON.
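To give a sense of what the extracted JSON looks like, here is a hand-written sketch of a single cluster-and-merge operation like the one in the typo example above. The field values are illustrative, not copied from a real project, so compare them against your own extract:

[
  {
    "op": "core/mass-edit",
    "description": "Mass edit cells in column AcademicDept",
    "engineConfig": {"facets": [], "mode": "row-based"},
    "columnName": "AcademicDept",
    "expression": "value",
    "edits": [
      {
        "from": ["Department of History", "Dep. of History", "Dep of Hist."],
        "fromBlank": false,
        "fromError": false,
        "to": "History"
      }
    ]
  }
]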


Unselect the checkboxes next to any mistakes you made, and then copy and paste the text somewhere you can find it again. I have the full JSON for the example above in a Gist here. Make sure that if you save your JSON publicly you remove your personal API key! When you want to run the same recipe in the future, click on the Undo/Redo tab and then choose Apply. It will run through the steps for you. Note that if you have a mistake in your data you won’t catch it until it’s all finished, so make sure that you check the formatting of the data before running this script.

Learning More and Giving Back

Hopefully this quick tutorial got you excited about OpenRefine and thinking about what you can do. I encourage you to read through the list of External Resources to get additional ideas, some of which are library related. There is lots more to learn and lots of recipes you can create to share with the library community.

Have you used OpenRefine? Share how you’ve used it, and post your recipes.



Revisiting PeerJ

A few months ago, as part of a discussion on open peer review, I described the early stages of planning for a new type of journal, called PeerJ. Last month, on February 12, PeerJ launched with its first 30 articles. By last week, the journal had published 53 articles. There are a number of remarkable attributes of the journal so far, so in this post I want to look at what PeerJ is actually doing, and some lessons that academic libraries can take away–particularly those who are getting into publishing.

What PeerJ is Doing

In the opening day blog post (since there are no editorials or issues in PeerJ, communication from the editors has to be done via blog post 1), the PeerJ team outlined their mission under four headings: to make their content open and help make that standard practice, to practice constant innovation, to “serve academia”, and to make this happen at minimal cost to researchers and no cost to the public. The list of advisory board members and academic editors is impressive–it is global and diverse, and includes some big names and Nobel laureates. To someone judging the quality of the work likely to be published, this is a good sign. PeerJ’s members come from a range of disciplines, with the majority in molecular biology. To submit and/or publish work requires a fee, but there is a free plan that allows one pre-print to be posted on the forthcoming PeerJ PrePrints.

PeerJ’s publication methods are based on those of PLoS ONE, which accepts articles based on scientific and methodological soundness, with no emphasis placed on subjective measures of novelty or interest (see more on this). As with all peer-reviewed journals, articles are sent to an academic editor in the field, who then sends the article to peer reviewers. Everything is kept confidential until the article is actually published, but authors are free to talk about their work in other venues like blogs.

Look and Feel
PeerJ on an iPhone-sized screen

There are several striking differences between PeerJ and standard academic journals. The home page of the journal emphasizes striking visuals and is responsive to devices, so the large image scales to a small screen for easy reading. The “timeline” display emphasizes new and interesting content. 2 The code they used to make this all happen is available openly on the PeerJ Github account. The design of the page reflects best practices for non-profit web design, as described by the non-profit social media guide Nonprofit Tech 2.0. The page tells a story, makes it easy to get updates, works on all devices, and integrates social media. The design of the page has changed iteratively even in the first month to reflect the realities of what was actually being published and how people were accessing it. 3 PDFs of articles were designed to be readable on screens, especially tablets; rather than trying to fit as much text as possible on one page, as many PDFs do, they use a single column, wider margins, fewer words per line, and references hyperlinked in the text. 4

How Open Peer Review Works

One of the most notable features of PeerJ is open peer review. This is not mandatory, but approximately half the reviewers and authors have chosen to participate. 5 This article is an example of open peer review in practice. You can read the original article, the (in this case anonymous) reviewer’s comments, the editor’s comments, and the author’s rebuttal letter. Anyone who has submitted an article to a peer-reviewed journal before will recognize this structure, but if you have not, this might be an exciting glimpse of something you have never seen before. As a non-scientist, I personally find this most useful as a didactic tool to show the peer review process in action, but I can imagine how helpful it would be to see this process for articles in areas of library science in which I am knowledgeable.

With only 53 articles published and such a short history, it is difficult to measure what impact open peer review has on articles, or to generalize about which authors and reviewers choose an open process. So far, however, PeerJ reports that several authors have been very positive about their experience publishing with the journal. The speed of review is very fast, and reviewers have been constructive and kind in their language. One author goes into more detail in his original post: “One of the reviewers even signed his real name. Now, I’m not totally sure why they were so nice to me. They were obvious experts in the system that I studied …. But they were nice, which was refreshing and encouraging.” He also points out that the exciting thing about PeerJ, for him, is that all it requires is that projects be technically well-executed and carefully described; this encourages publication of negative or unexpected results and so helps avoid the file drawer effect. 6

This last point is perhaps the most important to note. We often talk of peer-reviewed articles as being particularly significant and “high-impact.” But in the case of PeerJ, the impact is not necessarily due to the results of the research or the type of research, but to the fact that it was done well. One great example of this is the article “Significant Changes in the Skin Microbiome Mediated by the Sport of Roller Derby”. 7 This was a study about the transfer of bacteria during roller derby matches, and it supported its hypothesis that contact sports are a good environment in which to study movements of bacteria among people. The (very humorous) review history indicates that the reviewers were positive about the article, and felt that it had promise for setting a research paradigm. (Incidentally, one of the reviewers remained anonymous, since he/she felt that this could “[free] junior researchers to openly and honestly critique works by senior researchers in their field,” and signed the letter “Diligent but human postdoc reviewer”.) This article was published at the beginning of March, already has 2,307 unique visits to its page, and has been shared widely on social media. We can assume that one of the motivations for sharing this article was the potential for roller derby jokes or similar, but will this ultimately make the article’s long-term impact stronger? This will be something to watch.

What Can Academic Libraries Learn?

A recent article in In the Library with the Lead Pipe discussed the open ethos in two library publications, In the Library with the Lead Pipe and Code4Lib Journal. 8 The article concluded that more LIS publications need to open up the peer review process, though the publications mentioned are not peer reviewed in the traditional sense. There are very few, if any, open peer-reviewed publications on the model of PeerJ outside of the sciences. Could libraries or library-related publications match this process? Would they want to?

I think we can learn a few things from PeerJ. First, the rapid publication cycle means that more work is getting published more quickly. This is partly because they have so many reviewers that no one reviewer is overburdened–and due to their membership model, it is in the best financial interests of potential future authors to be current reviewers. In the Library with the Lead Pipe points out that a central academic library journal, College & Research Libraries, is now open access with early content available as pre-prints; however, those pre-prints reflect content that in some cases will not be published for well over a year. A year is a long time to wait, particularly for work that looks at current technology. Information Technology and Libraries (ITAL), the LITA journal, is also open access and provides pre-prints as well–but this page appears to be out of date.

Another thing we can learn is how to make reading easier and more convenient while still maintaining a professional appearance and clean visuals. Blogs like ACRL Tech Connect and In the Library with the Lead Pipe deliver quality content fairly quickly, but look like blogs. Journals like the Journal of Librarianship and Scholarly Communication have a faster turnaround time for review and publication (though it can still take several months), but even this online journal is geared toward a print world. Viewing an article requires downloading a PDF with text presented in two columns–hardly the ideal online reading experience. In these cases, the publication is somewhat at the mercy of the platform (WordPress in the former case, BePress Digital Commons in the latter), but as libraries become publishers, they will have to develop platforms that meet the needs of modern researchers.

A question put to the ACRL Tech Connect contributors about preferred reading methods for articles suggests that there is no one right answer, and so the safest course is to release content in a variety of formats or make it flexible enough for readers to transform to a preferred format. A new journal to watch is Weave: Journal of Library User Experience, which will use the Digital Commons platform but present content in innovative ways. 9 Any libraries starting new journals or working with their campuses to create new journals should be aware of who their readers are and make sure that the solutions they choose work for those readers.


  1. “The Launch of PeerJ – PeerJ Blog.” Accessed February 19, 2013. http://blog.peerj.com/post/42920112598/launch-of-peerj.
  2. “Some of the Innovations of the PeerJ Publication Platform – PeerJ Blog.” Accessed February 19, 2013. http://blog.peerj.com/post/42920094844/peerj-functionality.
  3. http://blog.peerj.com/post/45264465544/evolution-of-timeline-design-at-peerj
  4. “The Thinking Behind the Design of PeerJ’s PDFs.” Accessed March 18, 2013. http://blog.peerj.com/post/43558508113/the-thinking-behind-the-design-of-peerjs-pdfs.
  5. http://blog.peerj.com/post/43139131280/the-reception-to-peerjs-open-peer-review
  6. “PeerJ Delivers: The Review Process.” Accessed March 18, 2013. http://edaphics.blogspot.co.uk/2013/02/peerj-delivers-review-process.html.
  7. Meadow, James F., Ashley C. Bateman, Keith M. Herkert, Timothy K. O’Connor, and Jessica L. Green. “Significant Changes in the Skin Microbiome Mediated by the Sport of Roller Derby.” PeerJ 1 (March 12, 2013): e53. doi:10.7717/peerj.53.
  8. Ford, Emily, and Carol Bean. “Open Ethos Publishing at Code4Lib Journal and In the Library with the Lead Pipe.” In the Library with the Lead Pipe (December 12, 2012). http://www.inthelibrarywiththeleadpipe.org/2012/open-ethos-publishing/.
  9. Personal communication with Matthew Reidsma, March 19, 2013.

Reflections on Code4Lib 2013

Disclaimer: I was on the planning committee for Code4Lib 2013, but this is my own opinion and does not reflect the views of the other organizers of the conference.

We have mentioned Code4Lib before on this blog, but for those who are unfamiliar, it is a loose collective of programmers working in libraries, librarians, and others interested in code and libraries. (You can read more about it on the website.) The Code4Lib conference has emerged as a venue to share very new technology and have discussions with a wide variety of people who might not attend conferences more geared to librarians. Presentations at the conference are decided by the votes of anyone interested in selecting the program, and lightning talks and breakout sessions additionally allow wide participation and exposure to extremely new projects that have not made it into the literature or to conferences with a longer lead time. The Code4Lib 2013 conference ran February 11-14 at the University of Illinois at Chicago. You can see a list of all programs here, which includes links to the video archive of the conference.

While there were many types of projects presented, I want to focus on the talks that illustrated what I saw as a thread running through the conference: care and emotion. This is perhaps unexpected for a technical conference. Yet those themes underlie a great deal of the work that takes place in academic library technology and the types of projects presented at Code4Lib. We tend to work in academic libraries because we care about the collections and the people using those collections. That intrinsic motivation focuses our work.

Caring about the best way to display collections is central to successful projects. Most (though not all) of the presenters and topics came out of academic libraries, and many of the presentations dealt with creating platforms for library and archival metadata and collections. To highlight a few: Penn State University has developed its own institutional repository application called ScholarSphere that provides a better user experience for researchers and managers of the repository. The libraries and archives of the Rock and Roll Hall of Fame dealt with the increasingly common problem of wanting to present digital content alongside more traditional finding aids, and so developed a system for doing so. Corey Harper from New York University presented an extremely interesting and still experimental project to use linked data to enrich interfaces for interacting with library collections. Note that all these projects combined various pieces of open source software and library/web standards to create solutions that solve a problem facing academic or research libraries in a particular setting. I think an important lesson for most academic librarians looking at descriptions of projects like these is that it takes more than development staff to make them happen. It takes purpose, vision, and dedication to collecting and preserving content–in other words, emotion and care. A great example of this was the presentation about DIYHistory from the University of Iowa. This project started out as an extremely low-tech solution for crowdsourcing archival transcription, but got so popular that it required a more robust solution. They were able to adapt open source tools to meet their needs, keeping the project well within the means of most libraries (the code is here).

Another view of emotion and care came from Mark Matienzo, who gave a lightning talk (his blog post gives a longer version with more details). His talk discussed the difficulties of acknowledging and dealing with the emotional content of archives, even though emotion drives interactions with materials and collections. The records provided are emotionless and affectless, despite the fact that they represent important moments in history and in people’s lives. The type of sharing of what someone “likes” on Facebook does not satisfactorily answer the question of what they care about, or represent the emotion in their lives. Mark suggested that a tool like Twine, which allows writing interactive stories, could approach the difficult question of bringing together the real with the emotional narrative that makes up experience.

One of the ways we express care for our work and for our colleagues is by taking time to be organized and consistent in code. Naomi Dushay of Stanford University Library presented on code handoffs, describing some excellent practices for documenting and clarifying code and processes. One of the major takeaways is that being clear, concise, and straightforward is always preferable, however much we may want to create cute names for our servers and classes. To preserve a spirit of fun, you can use the cute name and attach a description of what the item actually does.

Originally Bess Sadler, also from Stanford, was going to present with Naomi, but she ended up presenting a different talk, the last one of the conference, on Creating a Commons (the full text is available here). This was a very moving look at what motivates her to create open source software and at how to create better open source software projects. She used the framework of the Creative Commons licenses to discuss open source software–that it needs to be “[m]achine readable, human readable, and lawyer readable.” Machine readable means that code needs to be properly structured so that it allows contributions from multiple people without breaking; lawyer readable means that the project has the correct structure and licensing to support collaboration across institutions. Bess focused particularly on the “human readable” aspect of creating communities and understanding the “hacker epistemology,” as she so eloquently put it: “[t]he truth is what works.” Part of understanding that requires being willing to reshape default expectations–for instance, the Code4Lib community developed a Code of Conduct at Bess’s urging to underline the fact that the community aims at inclusion and creating a safe space. She encouraged everyone to keep working to do better and to “file bug reports” about open source communities.

This year’s Code4Lib conference was a reminder to me about why I do the work I do as an academic librarian working in a technical role. Even though I may spend a lot of time sitting in front of a computer looking at code, or workflows, or processes, I know it makes access to the collections and exploration of those collections better.


What Should Technology Librarians Be Doing About Alternative Metrics?

Bibliometrics–used here to mean the statistical analysis of the output and citation of periodical literature–is a huge and central field of library and information science. In this post, I want to address the general controversy surrounding these metrics when evaluating scholarship, introduce the emerging alternative metrics (often called altmetrics) that aim to address some of these controversies, and consider how these can be used in libraries. Librarians are increasingly becoming focused on the publishing side of the scholarly communication cycle, as well as on supporting faculty in new ways (see, for instance, David Lankes’s thought experiment of the tenure librarian). What is a reasonable approach for technology-focused academic librarians to these issues? And what tools exist to help?

There have been many articles and blog posts expressing frustration with the practice of using journal impact factors to judge the quality of a journal or an individual researcher (see especially Seglen). One vivid illustration of this frustration is a recent blog post by Stephen Curry titled “Sick of Impact Factors”. Librarians have long used journal impact factors in making purchasing decisions, which is one of the less controversial uses of these metrics. 1 The essential message of all of this research about impact factors is that traditional methods of counting citations or determining journal impact do not answer questions about which articles are influential and how individual researchers contribute to the academy. For individual researchers looking to make a case for promotion and tenure, questions about the use of metrics can be all or nothing propositions–hence the slightly hysterical edge in some of the literature. Librarians, too, have become frustrated with attempting to prove the return on investment for decisions (see “How ROI Killed the Academic Library”); going by metrics alone potentially makes the tools available to researchers more homogeneous and ignores niche work. As the altmetrics manifesto suggests, the traditional “filters” in scholarly communication of peer review, citation metrics, and journal impact factors are becoming obsolete in their current forms.

Traditional Metrics

It would be of interest to determine, if possible, the part which men of different calibre [sic] contribute to the progress of science.

Alfred Lotka (a statistician at the Metropolitan Life Insurance Company, famous for his work in demography) wrote these words in reference to his 1926 statistical analysis of the journal output of chemists. 2 Given the tools available at the time, it was a fairly limited sample, looking at just the first two letters of an author index over a period of 16 years, compared with a slim 100-page volume of important works “from the beginning of history to 1900.” His analysis showed that most authors in a field publish only a single paper, while a small number publish a great many. As Per Seglen puts it, this showed the “skewness” of science. 3
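Lotka’s finding is usually summarized as an inverse-square law (the clean exponent is the textbook simplification; Lotka fit it empirically). If a_1 authors in a field publish exactly one paper each, then roughly

a_n = a_1 / n^2

authors publish n papers, which works out to about 60 percent of authors never publishing a second paper.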

The original journal impact factor was developed by Garfield in the 1970s, and used the “mean number of citations to articles published in two preceding years”. 4 Quite clearly, this is supposed to measure the general amount a journal was cited, and hence serve as a guide to how likely a researcher was to read that journal and find its contents immediately useful in his or her own work. This is helpful for librarians trying to make decisions about how to stretch a budget, but the literature has not found that a journal’s impact has much to do with an individual article’s citedness and usefulness. 5 As one researcher suggests, using it for anything other than its intended original purpose constitutes pseudoscience. 6 Another issue, with which those at smaller institutions are very familiar, is the cost of accessing traditional metrics. The major resources that provide these are Thomson Reuters’ Journal Citation Reports and Web of Science, and Elsevier’s Scopus, all of which are outside the price range of many schools.
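Spelled out, the standard calculation (this is the general JCR formulation, not anything particular to Vanclay’s analysis) is:

2012 impact factor = A / B

where A is the number of citations received during 2012 by items the journal published in 2010 and 2011, and B is the number of citable items (substantive articles and reviews) the journal published in 2010 and 2011.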

Metrics that attempt to remedy some of these difficulties have been developed. At the journal level, the Eigenfactor® uses network theory to estimate “the percentage of time that library users spend with that journal”, and the related Article Influence Score™ tracks the influence of the journal over five years. 7 At the researcher level, the h-index tracks the impact of a specific researcher (it was developed with physicists in mind), balancing the number of papers published against how heavily each has been cited. 8
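Concretely, Hirsch’s definition works like this: sort a researcher’s papers by citation count, c_1 ≥ c_2 ≥ … ≥ c_N, and take

h = max{ i : c_i ≥ i }

So a researcher whose five papers have been cited 10, 8, 5, 4, and 3 times has an h-index of 4: four papers have at least four citations each, but there are not five papers with at least five.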

These are included under the rubric of alternative metrics since they are an alternative to the JCR, but rely on citations in traditional academic journals, something which the “altmetric” movement wants to move beyond.

Alt Metrics

In this discussion of alt metrics I will be referring to the arguments and tools suggested by Altmetrics.org. In the altmetrics manifesto, Priem et al. point to several manifestations of scholarly communication that are unlike traditional article publications, including raw data, “nanopublication”, and self-publishing via social media (which was predicted as so-called “scholarly skywriting” at the dawn of the World Wide Web 9). Combined with the readier sharing of traditional articles through open access journals and social media, these all create new possibilities for indicating impact. Yet the manifesto also cautions that we must be sure that the numbers alt metrics collect “really reflect impact, or just empty buzz.” The research done so far is equally cautious. A 2011 study suggests that tweets about articles (“tweetations”) do correlate with citations, but that we cannot say the number of tweets about an article really measures its impact. 10

A criticism expressed in the media is that alternative metrics are no more likely to be able to judge the quality or true impact of a scientific paper than traditional metrics. 11 As Per Seglen noted in 1992, “Once the field boundaries are broken there is virtually no limit to the number of citations an article may accrue.” 12 So an article that is interdisciplinary in nature is likely to do far better in the alternative metrics realm than a specialized article in a single discipline that may nevertheless be very important. Mendeley’s list of top research papers demonstrates this–many (though not all) of the top articles are about scientific publication in general rather than about specific scientific results.

What can librarians use now?

Librarians are used to questions like “What is the impact factor of Journal X?” For librarians lucky enough to have access to Journal Citation Reports, this is a matter of looking up the journal and reporting the score. They could answer “How many times has my article been cited?” in Web of Science or Scopus using some care in looking for typos. Alt metrics, however, remind us that these easy answers are not telling the whole story. So what should librarians be doing?

One thing librarians can start doing is helping their campus community sign up for the many different services that will promote their research and provide article-level citation information. Below is a small selection of services (there are certainly others out there) that you may want to consider using yourself or recommending to your campus community. Some, like PubMed, won’t be relevant to all disciplines. Altmetrics.org lists several tools beyond those below to provide additional ideas.

These tools offer various methods for sharing. PubMed allows one to embed “My Bibliography” in a webpage, as well as to create delegates who can help curate the bibliography. A developer can use the APIs provided by some of these services to embed data for individuals or institutions on a library website or institutional repository; ImpactStory’s API, for example, makes this relatively easy. Altmetric.com also has an API that is free for non-commercial use. Mendeley has many helpful apps that integrate with popular content management systems.
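As a minimal sketch of what working with one of these APIs might look like, the following Google Apps Script function (in keeping with the scripting examples elsewhere on this blog) fetches the Altmetric.com record for a single DOI. The v1 endpoint and the score field here are taken from Altmetric.com’s public documentation, so verify both against the current docs before relying on them:

function getAltmetricScore() {
  // Altmetric.com's v1 API returns a JSON record for a given DOI;
  // light, non-commercial use does not require an API key
  var doi = "10.7717/peerj.53"; // the roller derby paper discussed elsewhere on this blog
  var response = UrlFetchApp.fetch("http://api.altmetric.com/v1/doi/" + doi);
  var data = JSON.parse(response.getContentText());
  Logger.log(data.score); // the article's Altmetric attention score
}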

Since this is such a new field, it’s a great time to get involved. Altmetrics.org held a hackathon in November 2012 and has a Google Doc with the ideas for the hackathon. This is an interesting overview of what is going on with open source hacking on alt metrics.

Conclusion

The altmetrics manifesto calls for a complete overhaul of scholarly communication–alternative research metrics are just a part of its critique. And yet, for librarians trying to help researchers, they are often the main concern. While science in general debates changes to the use of these metrics, we can help to shape the discussion by educating our communities about alternative metrics and putting them to use.

 

Works Cited and Suggestions for Further Reading
Bourg, Chris. 2012. “How ROI Killed the Academic Library.” Feral Librarian. http://chrisbourg.wordpress.com/2012/12/18/how-roi-killed-the-academic-library/.
Cronin, Blaise, and Kara Overfelt. 1995. “E-Journals and Tenure.” Journal of the American Society for Information Science 46 (9) (October): 700-703.
Curry, Stephen. 2012. “Sick of Impact Factors.” Reciprocal Space. http://occamstypewriter.org/scurry/2012/08/13/sick-of-impact-factors/.
Eigenfactor.org. 2012. “Methods.”
Eysenbach, Gunther. 2011. “Can Tweets Predict Citations? Metrics of Social Impact Based on Twitter and Correlation with Traditional Metrics of Scientific Impact.” Journal Of Medical Internet Research 13 (4) (December 19): e123-e123.
Gisvold, Sven-Erik. 1999. “Citation Analysis and Journal Impact Factors – Is the Tail Wagging the Dog?” Acta Anaesthesiologica Scandinavica 43 (November): 971-973.
Hirsch, J. E. “An Index to Quantify an Individual’s Scientific Research Output.” Proceedings of the National Academy of Sciences of the United States of America 102, no. 46 (November 15, 2005): 16569–16572. doi:10.1073/pnas.0507655102.
Howard, Jennifer. 2012. “Scholars Seek Better Ways to Track Impact Online.” The Chronicle of Higher Education, January 29, sec. Technology. http://chronicle.com/article/As-Scholarship-Goes-Digital/130482/.
Jump, Paul. 2012. “Alt-metrics: Fairer, Faster Impact Data?” Times Higher Education, August 23, sec. Research Intelligence. http://www.timeshighereducation.co.uk/story.asp?storycode=420926.
Lotka, Alfred J. 1926. “The Frequency Distribution of Scientific Productivity.” Journal of the Washington Academy of Sciences 26 (12) (June 16): 317-324.
Mayor, Julien. 2010. “Are Scientists Nearsighted Gamblers? The Misleading Nature of Impact Factors.” Frontiers in Quantitative Psychology and Measurement: 215. doi:10.3389/fpsyg.2010.00215.
Oransky, Ivan. 2012. “Was Elsevier’s Peer Review System Hacked to Get More Citations?” Retraction Watch. http://retractionwatch.wordpress.com/2012/12/18/was-elseviers-peer-review-system-hacked-to-get-more-citations/.
Priem, J., D. Taraborelli, P. Groth, and C. Neylon. 2010. “Altmetrics: A Manifesto.” Altmetrics.org. http://altmetrics.org/manifesto/.
Seglen, Per O. 1992. “The Skewness of Science.” Journal of the American Society for Information Science 43 (9) (October): 628-638.
———. 1994. “Causal Relationship Between Article Citedness and Journal Impact.” Journal of the American Society for Information Science 45 (1) (January): 1-11.
Vanclay, Jerome K. 2011. “Impact Factor: Outdated Artefact or Stepping-stone to Journal Certification?” Scientometrics 92 (2) (November 24): 211-238. doi:10.1007/s11192-011-0561-0.
Notes
  1. Jerome K. Vanclay,  “Impact Factor: Outdated Artefact or Stepping-stone to Journal Certification?” Scientometrics 92 (2) (2011):  212.
  2. Alfred Lotka, “The Frequency Distribution of Scientific Productivity.” Journal of the Washington Academy of Sciences 26 (12) (1926): 317.
  3. Per Seglen, “The Skewness of Science.” Journal of the American Society for Information Science 43 (9) (1992): 628.
  4. Vanclay, 212.
  5. Per Seglen, “Causal Relationship Between Article Citedness and Journal Impact.” Journal of the American Society for Information Science 45 (1) (1994): 1-11.
  6. Vanclay, 211.
  7. “Methods”, Eigenfactor.org, 2012.
  8. J.E. Hirsch, “An Index to Quantify an Individual’s Scientific Research Output.” Proceedings of the National Academy of Sciences of the United States of America 102, no. 46 (2005): 16569–16572.
  9. Blaise Cronin and Kara Overfelt, “E-Journals and Tenure.” Journal of the American Society for Information Science 46 (9) (1995): 700.
  10. Gunther Eysenbach, “Can Tweets Predict Citations? Metrics of Social Impact Based on Twitter and Correlation with Traditional Metrics of Scientific Impact.” Journal Of Medical Internet Research 13 (4) (2011): e123.
  11. see in particular Jump.
  12. Seglen, 637.

Taking Google Forms to the Next Level

Many libraries use Google Forms for collecting information from patrons, particularly for functions like registering for a one-time event or filling out a survey. It’s a popular option because these forms are very easy to set up and start using with no overhead. With a little additional effort and a very small amount of code you can make these forms even more functional.

In this post, we’ll look at the process of adapting a simple library workshop registration form to send a confirmation email, and introduce you to the Google Apps Script documentation. This is adapted from a tutorial for creating a help desk application, which you can see here. I talked about the overall process of creating simple applications for free with minimal coding skills at this year’s LITA Forum, and you can see the complete presentation here. In this post I will focus on the Google Forms tricks.

A few things to keep in mind before you get started. Use a library account when you actually deploy the applications, since that way the application remains “owned” by the library even if the person who created it moves on. These instructions are also intended for regular “consumer” Google accounts–there are additional tools available for Google Apps business customers, which I don’t address here.

Creating Your Form

Create a form as you normally would. Here’s an example of a simple workshop registration form.

There are a few potential problems with the way this form is set up, but here’s an even bigger problem. Once the person signing up clicks submit, the form disappears, and he receives a page saying “Thank you for registering!”

If this person did not make a note of the workshop details, he now has no record of what he signed up for. What he intended to do and what he actually did may not be the same thing!

What comes next? You, the librarian hosting the workshop, go into the spreadsheet to see if anyone has signed up. If you want to confirm the sign-up, you can copy the patron’s email address into your email program, and then copy in a message confirming the sign-up. If only a few people have signed up, this may not take long, but it adds many unnecessary steps and requires you to remember to do it.

Luckily, Google has provided all the tools you need to add in an email confirmation function, and it’s not hard to use as long as you know some basic Javascript. Let’s look at an example.

Adding in an email confirmation

To access these functions, visit your spreadsheet, and click on Script Editor in the Tools menu.

You will get many options to choose from, or you can simply create a script for a Blank Project (the first option). You will get this in your blank project:

function myFunction() {

}

Change the name of the function to something meaningful. Now you can fill in the details. Basically, we grab the value of each column we want to include from the form submission event and store it in a variable. You just put in the column number–but remember that we are counting from 0 (which is the Timestamp column in our current example).

function emailConfirm(e) {
  // e.values holds the submitted row; index 0 is the Timestamp column
  var userEmail = e.values[3];
  var firstName = e.values[1];
  var lastName = e.values[2];
  var workshopDate = e.values[4];
  // Send a confirmation message to the address entered on the form
  MailApp.sendEmail(userEmail,
                    "Registration confirmation",
                    "Thanks for registering for the library workshop on " + workshopDate + " \n\nYou will " +
                    "receive a reminder email 24 hours prior. \n\nLibrary",
                    {name:"Library"});
}

The MailApp class is another built-in Google Apps Script service. The sendEmail method takes the following arguments: recipient, subject, body, and optional advanced arguments. You can see in the example above that the userEmail variable (the patron’s email address from the form) is the recipient, the subject is “Registration confirmation”, and the body contains a generic thank you plus the date of the workshop, which we’ve stored in the workshopDate variable. Then in the advanced arguments we’ve set the sender name to “Library”–this is optional, particularly if the message is coming from a library email account.

Note that if a patron hits “reply” to cancel or ask a question, the email will automatically go to the account that deployed the application. If you want replies to go somewhere else, you can add another advanced argument, replyTo, set to some other email address. (Note that this doesn’t always work–and people can see that the email comes from elsewhere, so make sure that someone is checking the account from which the application is deployed.)

 {name:"Library", replyTo:"mheller@dom.edu"});
Running the script

Once you’ve filled in your script and hit save (it will do a quick debug when you save), you have to set up when the script should run. Select “Current script’s triggers…” from the Resources menu.

Now select the trigger “On form submit”. While you’re here, also click on notifications.

The notifications will tell you any time your script fails to run. For your first script, choose “immediately” so you can see what went wrong if it didn’t work. In the future you can select daily or weekly.

Before you can save either your trigger or your failure notifications, you need to authorize Google to run the script for you.

Now your script will work! Next time a patron fills out your form to register for a workshop, he will receive this email:
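Based on the script above, that email will read something like this (the workshop date is just a sample value; your form supplies the real one):

From: Library
Subject: Registration confirmation

Thanks for registering for the library workshop on April 2.

You will receive a reminder email 24 hours prior.

Library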

Doing More

After working with this very basic script you can explore the Google Apps Script documentation. If you are working with Google Forms, you will find the Spreadsheet Services classes very useful. There are also some helpful tutorials you can work through to learn how to use all the features. This one will teach you how to send emails from the spreadsheet–something you can use when it’s time to remind patrons of which workshops they have signed up for!
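As a starting point, here is a minimal sketch of such a reminder function. It is untested, it assumes the sign-up sheet’s columns match the registration form above, and a real version would check which registrations are 24 hours away rather than emailing everyone:

function sendReminders() {
  // Read every row from the sheet that the form feeds into
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheets()[0];
  var rows = sheet.getDataRange().getValues();
  // Skip the header row; column indexes match the form example above:
  // 0 = Timestamp, 1 = First name, 2 = Last name, 3 = Email, 4 = Workshop date
  for (var i = 1; i < rows.length; i++) {
    var userEmail = rows[i][3];
    var workshopDate = rows[i][4];
    MailApp.sendEmail(userEmail,
                      "Workshop reminder",
                      "A reminder that you are registered for the library workshop on " +
                      workshopDate + ".\n\nLibrary",
                      {name:"Library"});
  }
}

You could run this on a schedule with a time-driven trigger, set from the same “Current script’s triggers…” menu described above.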


PeerJ: Could it Transform Open Access Publishing?

Open access publication makes access to research free for the end reader, but in many fields it is not free for the author of the article. When I told a friend in a scientific field I was working on this article, he replied “Open access is something you can only do if you have a grant.” PeerJ, a scholarly publishing venture that started up over the summer, aims to change this and make open access publication much easier for everyone involved.

While the first publication isn’t expected until December, in this post I want to examine in greater detail the variation on the “gold” open-access business model that PeerJ states will make it financially viable 1, and the open peer review that will drive it. Both of these models are still very new in the world of scholarly publishing, and require new mindsets for everyone involved. Because PeerJ comes out of funding and leadership from Silicon Valley, it can more easily break from traditional scholarly publishing and experiment with innovative practices. 2

PeerJ Basics

PeerJ is a platform that will host a scholarly journal called PeerJ and a pre-print server (similar to arXiv), publishing biological and medical scientific research. Its founders are Peter Binfield (formerly of PLoS ONE) and Jason Hoyt (formerly of Mendeley), both of whom are familiar with disruptive models in academic publishing. While the “J” in the title stands for Journal, Jason Hoyt explains on the PeerJ blog that while the journal as such is no longer a necessary model for publication, we still hold on to it: “The journal is dead, but it’s nice to hold on to it for a little while.” 3 The project launched in June of this year, and while no major updates have been posted yet on the PeerJ website, they seem to be moving toward their goal of publishing in late 2012.

To submit a paper for consideration in PeerJ, authors must buy a “lifetime membership” starting at $99. (You can submit a paper without paying, but it costs more in the end to publish it.) This allows the author to publish one paper in the journal per year. The lifetime membership is only valid as long as you meet certain participation requirements, which at minimum means reviewing at least one article a year; reviewing in this case can mean as little as posting a comment on a published article. Without that, the author might have to pay the $99 fee again (though it is as yet unclear how strictly PeerJ will enforce this rule). The idea behind this is to “incentivize” community participation, a practice that has met with limited success in other arenas. Each author on a paper, up to 12 authors, must pay the fee before the article can be published. The Scholarly Kitchen blog did some math and determined that for most lab setups, publication fees would come to about $1,124, 4 which is comparable to other similar open access journals. Of course, some of those researchers wouldn’t have to pay the fee again; for others, it might have to be paid again if they are unable to review other articles.

Peer Review: Should it be open?

PeerJ, as the name and the lifetime membership model imply, will certainly be peer-reviewed. But, in keeping with its innovative practices, it will use open peer review, a relatively new model. Peter Binfield explained PeerJ’s thinking behind open peer review in this interview.

…we believe in open peer review. That means, first, reviewer names are revealed to authors, and second, that the history of the peer review process is made public upon publication. However, we are also aware that this is a new concept. Therefore, we are initially going to encourage, but not require, open peer review. Specifically, we will be adopting a policy similar to The EMBO Journal: reviewers will be permitted to reveal their identities to authors, and authors will be given the choice of placing the peer review and revision history online when they are published. In the case of EMBO, the uptake by authors for this latter aspect has been greater than 90%, so we expect it to be well received. 5

In single-blind peer review, the reviewers know the name of the author(s) of the article, but the author does not know who reviewed the article. The reviewers can write whatever sorts of comments they want without the author being able to communicate with them. For obvious reasons, this lends itself to abuse, where reviewers might reject articles by people they do not know or like, or tend to accept articles from people they do like. 6 Even people who are trying to be fair can accidentally fall prey to bias when they know the names of the submitters.

Double-blind peer review in theory takes away the ability of reviewers to abuse the system. A link that has been passed around library conference planning circles in the past few weeks is JSConf EU 2012, which managed to improve its ratio of female presenters by going to a double-blind system. Double-blind is the gold standard of peer review for many scholarly journals. Of course, it is not a perfect system either. It can be hard to obscure the identity of a researcher in a small field in which everyone is working on unique topics. It also makes for a much lengthier process, with more steps involved in the review. For this reason, it is less than ideal for breaking medical or technology research that needs to be made public as soon as possible.

In open peer review, the reviewers and the authors are known to each other. By allowing direct communication between reviewer and researcher, this speeds up the process of revision and allows for greater clarity. 7 While open peer review does not appear to affect the quality of the reviews or the articles negatively, it does make it more difficult to find qualified reviewers willing to participate, and it might make a less well-known researcher more likely to accept the work of a senior colleague or well-known lab. 8

Given the experience of JSConf and a great deal of anecdotal evidence from women in technical fields, it seems likely that open peer review is open to the same potential abuse as single-blind peer review. While open peer review might allow a rejected author to challenge an unfair rejection, this requires that the rejected author feel empowered enough in that community to speak up. Junior scholars who know they have been rejected by senior colleagues may not want to cause a scene that could affect future employment or publication opportunities. On the other hand, if they can get useful feedback directly from respected senior colleagues, that could make all the difference in crafting a stronger article and going forward with a research agenda. Therein lies the dilemma of open peer review.

Who pays for open access?

A related problem for junior scholars exists in open access funding models, at least in STEM publishing. As open access stands now, there are a few different models that are still being fleshed out. Green open access refers to self-archiving articles in a repository at no cost to author or reader, with the infrastructure usually funded by grants, institutions, or scholarly societies. Gold open access is free to the end reader but has a publication fee charged to the author(s).

This situation is very confusing for researchers: when confronted with a gold open access journal, they have to be sure the journal is legitimate (Jeffrey Beall’s list of predatory open access journals can aid in this) as well as secure funding for publication. While there are many schemes in place for paying publication fees, there are no well-defined practices that demonstrate long-term viability. Often fees are covered by grants for the research, but not always. The UK government recently approved a report suggesting that issuing “block grants” to institutions to pay these fees would ultimately cost less due to reduced library subscription fees. As one article suggests, the practice of block grants or other such funding strategies is likely not to be advantageous to junior scholars or those in more marginal fields. 9 A large research grant for millions of dollars with a relatively small line item for publication fees for a well-known PI is one thing–what about the junior humanities scholar who has to scramble for a few-thousand-dollar research stipend? If an institution only gets so much money for publication fees, who gets the money?

By offering a $99 lifetime membership for the lowest level of publication, PeerJ offers hope to the junior scholar or graduate student to pursue projects on their own or with a few partners without worrying about how to pay for open access publication. Institutions could more readily afford to pay even $250 a year for highly productive researchers who were not doing peer review than the $1000+ publication fee for several articles a year. As above, some are skeptical that PeerJ can afford to publish at those rates, but if it is possible, that would help make open access more fair and equitable for everyone.

Conclusion

Open access with a low cost paid up front could be very advantageous to researchers and institutional bottom lines, but only if the quality of articles, peer reviews, and science is very good. It could provide a social model for publication that takes advantage of the web and the network effect for high-quality reviewing and dissemination of information, but only if enough people participate. The network effect that made Wikipedia (for example) so successful relies on a high level of participation and engagement very early on. 2 A community has to build around the idea of PeerJ.

Taking almost the opposite approach, but aiming for the same effect, the Sponsoring Consortium for Open Access Publishing in Particle Physics (SCOAP3) announced this last week that, after years of negotiations, it is set to convert publishing in the field to open access starting in 2014 10. This means that researchers (and their labs) would not have to do anything special to publish open access; they would do so by default in the twelve journals in which most particle physics articles are published. The fees for publication will be paid up front by libraries and funding agencies.

So is it better to start a whole new platform, or to work within the existing system to create open access? If open (and, through a commenting system, ongoing) peer review makes for a lively and engaging network, and low-cost open access makes publication cheaper, then PeerJ could accomplish something extraordinary in scholarly publishing. But until then, it is encouraging that organizations are working from both sides.

  1. Brantley, Peter. “Scholarly Publishing 2012: Meet PeerJ.” PublishersWeekly.com, June 12, 2012. http://www.publishersweekly.com/pw/by-topic/digital/content-and-e-books/article/52512-scholarly-publishing-2012-meet-peerj.html.
  2. Davis, Phil. “PeerJ: Silicon Valley Culture Enters Academic Publishing.” The Scholarly Kitchen, June 14, 2012. http://scholarlykitchen.sspnet.org/2012/06/14/peerj-silicon-valley-culture-enters-academic-publishing/.
  3. Hoyt, Jason. “What Does the ‘J’ in ‘PeerJ’ Stand For?” PeerJ Blog, August 22, 2012. http://blog.peerj.com/post/29956055704/what-does-the-j-in-peerj-stand-for.
  4. “Is PeerJ Membership Publishing Sustainable?” The Scholarly Kitchen, June 14, 2012. http://scholarlykitchen.sspnet.org/2012/06/14/is-peerj-membership-publishing-sustainable/
  5. Brantley
  6. Wennerås, Christine, and Agnes Wold. “Nepotism and sexism in peer-review.” Nature 387, no. 6631 (May 22, 1997): 341–3.
  7. For an ingenious way of demonstrating this, see Leek, Jeffrey T., Margaret A. Taub, and Fernando J. Pineda. “Cooperation Between Referees and Authors Increases Peer Review Accuracy.” PLoS ONE 6, no. 11 (November 9, 2011): e26895.
  8. Mainguy, Gaell, Mohammad R Motamedi, and Daniel Mietchen. “Peer Review—The Newcomers’ Perspective.” PLoS Biology 3, no. 9 (September 2005). http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1201308/.
  9. Crotty, David. “Are University Block Grants the Right Way to Fund Open Access Mandates?” The Scholarly Kitchen, September 13, 2012. http://scholarlykitchen.sspnet.org/2012/09/13/are-university-block-grants-the-right-way-to-fund-open-access-mandates/.
  10. Van Noorden, Richard. “Open-access Deal for Particle Physics.” Nature 489, no. 7417 (September 24, 2012): 486.

The Digital Public Library of America: What Does a New Platform Mean for Academic Research?

Robert Darnton asked in the New York Review of Books blog nearly two years ago: “Can we create a National Digital Library?” 1 Anyone who recalls reference homework exercises checking bibliographic information for United States imprints versus British or French will certainly remember that the United States does not have a national library in the sense of a library that collects all the works of the country and creates a national bibliography 2. Certain libraries, such as the Library of Congress, have certain prerogatives for collection and for the dissemination of standards 3, but no single library creates a national bibliography. So it was for print, and so it remains, even more so, for digital materials. When Darnton asks that question–as he goes on to illuminate further in his article–he is asking a much larger question about libraries in the United States. European and Asian countries have created national digital libraries as part of, or in addition to, their national print libraries. The question is: if others can do it, why can’t we? Furthermore, why can’t we join those libraries with our national digital library? The DPLA has announced a collaboration with Europeana, which has already had notable successes in digitizing content and making it and its metadata freely available. This indicates that we could potentially create a useful worldwide digital library, or at least a North American/European one. The dream of Paul Otlet’s universal bibliography seems once again to be just out of reach.

In this post, I want to examine what the Digital Public Library of America claims to do and what approaches it is taking. It is still too new, and there are still too many unanswered questions, to give any sort of final answer as to whether this will actually become the national digital library. Nonetheless, there seems to be enough traction and, perhaps more importantly, funding that we should pay close attention to what is delivered in April 2013.

Can we reach a common vision about the nature of the DPLA?

The planning for the DPLA started in the fall of 2010, when Harvard’s Berkman Center received a grant from the Sloan Foundation to begin planning the project in earnest. The initial idea was to digitize all the materials that it was legal to digitize and create a platform that would be accessible to all people in the US. Google had already proved that it was possible, so it seemed that with many libraries working together it would be conceivable to repeat Google's successes, but with solely non-commercial motives 4.

The initial stages of planning brought out many different ideas and perspectives about the philosophical and practical components of the DPLA, many of which remain unresolved. The central debate that has emerged is whether the DPLA would be a true “public” library, and what in fact ought to be in such a library. David Rothman argues that the DPLA as described by Darnton would be a wonderful tool for making humanities research easy and viable for more people, but would not solve the problems of making popular e-books accessible through libraries or getting students up-to-date textbooks. The latter two aims are much more challenging than providing access to public domain or academic materials because a lot more money is at stake 5.

One of the projects for the Audience and Content workstream is to figure out how average Americans might actually use a digital public library of America. One of the potential use cases is a student who can use the DPLA alone to write a whole paper on the Iroquois Nations. Teachers and librarians posted questions about this in the comments, including whether it is appropriate to tell students to use one portal for all research. We generally counsel students to check multiple sources–and getting students used to searching one place that happens to be appropriate for one topic may not work if the DPLA has nothing available on, say, the latest computer technology.

Digital content and the DPLA

What content the DPLA will provide will surely become clearer over the coming months. The DPLA has appointed Emily Gore as Director of Content and continues to hold working groups on content and audience. The DPLA website promises a remarkable vision for content:

The DPLA will incorporate all media types and formats including the written record—books, pamphlets, periodicals, manuscripts, and digital texts—and expanding into visual and audiovisual materials in concert with existing repositories. In order to lay a solid foundation for its collections, the DPLA will begin with works in the public domain that have already been digitized and are accessible through other initiatives. Further material will be added incrementally to this basic foundation, starting with orphan works and materials that are in copyright but out-of-print. The DPLA will also explore models for digital lending of in-copyright materials. The content that is contributed to or funded by the DPLA will be made available, including through bulk download, with no new restrictions, via a service available to libraries, museums, and archives in the United States, with use and reuse governed only by public law.  6

All of these models exist in one way or another already, however, so how is this something new?

The major purveyors of out of copyright digital book content are Google Books and HathiTrust. The potential problems with Google Books are obvious just in the name–Google is a publicly traded company with aspirations to be the hub of all world information. Privacy and availability, not to mention legality, are a few of the concerns. HathiTrust is a collective of research universities digitizing collections, many in concert with Google Books, but the full text of these books in a convenient format is generally only available to members of HathiTrust. HathiTrust faced a lawsuit from the Authors Guild about its digitization of orphan works, which is an issue the DPLA is also planning to address.

Other projects exist that try to make in-copyright digital books more accessible, of which Unglue.it is probably the best known. Unglue.it requires a critical mass of people willing to pay to release a book into the public domain, and so may not serve the scholar with a unique research project. Some future plans for the DPLA include obtaining funds to pay authors for use–but this may or may not include releasing books into the public domain.

The DPLA is not meant to include books alone; planning so far simply suggests that books make a logical jumping-off point. The “Concept Note” points out that “if it takes the sky as its limit, it will never get off the ground.” Despite this caution, ideally the DPLA would eventually be a portal to all types of materials already made available by cultural institutions, including datasets and government information.

Do we need another platform?

The first element of the DPLA is code–it will use open source technologies in developing a platform, and will release all code (and the tools and services this code builds) as open source software. The so-called “Beta Sprint” that took place last year invited people to “grapple, technically and creatively, with what has already been accomplished and what still need to be developed…” 7. The winning “betas” deal largely with issues of interoperability and linked data. Certainly if a platform could be developed that solved these problems, it would be a huge boon to the library world.

Getting involved with the DPLA and looking to the future

While the governance structure is becoming more formal, there are plenty of opportunities to become involved with the DPLA. Six working groups (called workstreams) were formed to discuss content, audience, legal issues, business models, governance, and technical issues. Becoming involved with the DPLA is as easy as signing up for an account on the wiki and noting your name and comments on the page of the working group in which you are interested. You can also sign up for mailing lists to stay involved in the project. Like many such projects, the work is done by the people who show up and speak up. If you read this and have an opinion on the direction the DPLA should take, it is not difficult to make sure your opinion gets heard by the right people.

As in nearly all writing about the DPLA since the planning began, turning to a thought experiment seems the next logical rhetorical step. Let’s say that the DPLA succeeds to the point where all public domain books in the United States are digitized and available in multiple formats to any person in the country, and a significant number of in-copyright works are also available. What does this mean for libraries as a whole? Does it make public libraries research libraries? How does it change the nature of research libraries? And lastly, will all this information create a new desire for knowledge among the American people?

References
  1. Darnton, Robert. “A Library Without Walls.” NYRblog, October 4, 2010. http://www.nybooks.com/blogs/nyrblog/2010/oct/04/library-without-walls/.
  2. McGowan, Ian. “National Libraries.” In Encyclopedia of Library and Information Sciences, Third Edition, 3850–3863.
  3. “Frequently Asked Questions – About the Library.” Library of Congress, n.d. http://www.loc.gov/about/faqs.html#every_book
  4. Dillon, Cy. “Planning the Digital Public Library of America.” College & Undergraduate Libraries 19, no. 1 (March 2012): 101–107.
  5. Rothman, David H. “It’s Time for a National Digital-Library System.” The Chronicle of Higher Education, February 24, 2011, sec. The Chronicle Review. http://chronicle.com/article/Its-Time-for-a-National/126489/.
  6. “Elements of the DPLA.” Digital Public Library of America, n.d. http://dp.la/about/elements-of-the-dpla/.
  7. “Digital Public Library of America Steering Committee Announces ‘Beta Sprint’.” May 20, 2011. http://cyber.law.harvard.edu/newsroom/Digital_Public_Library_America_Beta_Sprint.

Creating quick solutions and having fun: the joy of hackathons

[Image: Women Who Code. Photo credit Adria Richards, used CC BY-SA 2.0.]

Hackathons–aka “hackfests,” “codefests,” or “codeathons”–are time periods dedicated to “hacking” on a problem, or creating a quick and dirty technical solution. (They have nothing to do with “hackers” in the sense of writing viruses or breaking into computers.) Traditionally, hackathons gave developers a chance to meet in person to work on specific technologies or platforms. But increasingly, the hackathon concept is used to solve technical problems or develop new ideas using technology in fields such as law, public data, water supply, and making the world a better place. Academic librarians should be thinking about hackathons for several reasons: first, we help researchers learn about innovative tools and resources in their areas, and these days a lot of that work is happening in hackathon settings. Second, hackathons often improve library technology, in open source and proprietary products alike. And third, hackathons are sometimes taking place in academic libraries (such as the University of Michigan and the University of Florida). Even non-coders can and should keep an eye on what’s going on with hackathons and start getting involved.

Origins of hackathons

People have, of course, hacked at technical problems and created innovative technical solutions since the beginning of computing. But the first known use of the term “hackathon” to describe a specific event was in June of 1999 when a group of OpenBSD developers met in Calgary to work on cryptography (see more on the record of OpenBSD hackathons). Later that same month, Sun Microsystems used the term on a Palm V project. 1 Just as in a marathon, individuals came together to accomplish a very challenging project in a short and fixed amount of time.

The term and concept became increasingly popular over the course of the first decade of the 2000s. The details can vary widely, but a hackathon is usually understood to mean a short time period (often a weekend) during which a specific problem is addressed by a group of developers working together–often individually, but in close enough proximity to meet and discuss issues. Hackathons are usually in-person events where everyone meets in one location, but they can also be distributed, virtual events. Often hackathons have prizes for the best solution and give developers a chance to show off their talent to potential employers–sometimes companies sponsor them specifically to find new employees. But they can also be an opportunity for incubating newer developers who are still learning (Layer 7).

Hackathons can be organized around an existing open source software community, but they also frequently take place within a company to give developers a chance to come up with innovative ideas. One notable example is Facebook. In his post, Pedram Keyani describes the excitement that regular hackathons provide for Facebook’s engineers by giving them a chance to work on an idea without worrying about whether it scales to 900 million people. After the hackathon, developers present their prototypes to the rest of the team and have two minutes to make the case that their hacks should become part of Facebook. Features developed during hackathons include the “Like” button and the ability to tag users in comments–huge pieces of functionality that might not exist without hackathons.

Hackathons in library technology

The first library technology hackathon we know about happened at the Access 2002 conference, and was modeled after PyCon code sprints (Art Rhyno, email message to author, July 18, 2012). The developers at this hackathon worked on projects related to content management systems for cultural content, citation digests, and EZProxy tools. Since then, each Access conference has had a hackathon as part of the conference. The Code4Lib conference has also had elements of hackathons (often as pre-conferences) throughout the years.

Another example of hackathons is those sponsored by library vendors to promote the use of their products’ APIs. Simply put, APIs are ways for data to pass between platforms or programs so that you can create new tools with pieces of data from other systems. In 2008, OCLC sponsored a hackathon in New York City where it provided special access to various pieces of WorldCat and other OCLC products. Staff from OCLC were on hand to answer questions and facilitate breakout sessions. Hacks included work with controlled vocabularies, “find more like this” recommendation services, and several other items (Morgan). Eric Morgan, one of the participants, described the event as a success partly because it was a good example of how librarians can take control of their vendor-provided tools by learning how to get the data out and use it in other ways.
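
To make the API idea concrete, here is a minimal sketch in Python of the usual pattern: request structured data from another system over the web, then keep just the pieces you need for a new tool. Everything here–the endpoint URL, the parameters, and the response fields–is invented for illustration; a real vendor API such as OCLC’s has its own documented URLs, parameters, and authentication requirements.

import requests

def find_titles(keyword):
    # Ask the (hypothetical) vendor API for records matching a keyword.
    response = requests.get(
        "https://api.example.org/catalog/search",  # invented endpoint
        params={"q": keyword, "format": "json"},
    )
    response.raise_for_status()
    # Keep only the pieces we need from the structured response,
    # ready to feed into a recommendation service, display, or mashup.
    return [record["title"] for record in response.json()["records"]]

for title in find_titles("controlled vocabularies"):
    print(title)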

How to get involved with hackathons

It’s easy to be discouraged or overwhelmed by the idea of participating in a hackathon if you are new to the open source software world. First of all, it’s important to remember that librarians who work with technology on a daily basis have a lot of ideas about how to improve the tools in their libraries. An example of this is the list of ideas submitted for the Access 2011 Hackfest, which included bookmarklets, augmented reality in the library, and using iPads for self-checkout, among many others. Reading that list may jog your own memory for tools you would love to see in your library but haven’t had a chance to work on yet or don’t completely understand.

But how do you take those ideas and get involved with fellow developers who can help complete those projects? Many resources exist to help with this, but a few are specifically geared toward hackathons. First, OpenHatch is an open source project with the mission of making it easier to participate in open source software. One feature helpful to those just starting out is the set of “Training Missions” that walk through basic skills you need, such as working on the command line and using version control systems. Another area of OpenHatch shows lists of projects suitable for beginners and information on how non-coders can participate in projects. Keep an eye on the events listed there to find ones geared toward beginners or people still learning. Another resource for finding out about and signing up for hackathons is Hackathon.io.

Try to participate in a hackathon at the next technical library conference you attend. You can also start small by meeting up with librarians in your area for a very informal library technology hackathon. Make sure that you document what you work on and what the results were. Don’t worry about having judges or prizes–just make it a fun and collaborative event that allows everyone to participate and learn something new. You don’t need to create something new, either. This could be a great opportunity to learn how to work all the bells and whistles of a vendor platform or a social media tool.

Don’t worry–just start hacking

You can approach hackathons in whatever way works for you. For some, hackathons provide the excitement of competing for prizes or great jobs by staying up all night coding among fellow developers. If the idea of staying up all night looking at a computer screen leaves you cold, don’t worry. In an April blog post, Andromeda Yelton shared her experience attending her first hackathon and encouraged those new to this type of event to “sit at the table,” both physically and by understanding that they have something to contribute even if they are not experts. She suggests that the minimum it should take to be involved in hackathons or similar projects is “interest, aptitude… [and a] drive to contribute” (Yelton).

There are a lot of problems out there in the library world. Hackathons show us that sometimes all it takes is a weekend to get closer to a solution. But don’t worry about solving all the problems. Just pick the one you are most concerned about, find some friends, and start hacking on it.

Works cited
Layer 7 Technologies. “How to Run a Successful Hackathon for Your Open APIs.” July 12, 2012. http://www.slideshare.net/rnewton/how-to-run-a-successful-hackathon-for-your-open-apis.
Morgan, Eric Lease. “WorldCat Hackathon « Infomotions Mini-Musings.” Infomotions Mini-Musings, November 9, 2008. http://infomotions.com/blog/2008/11/worldcat-hackathon/.
Yelton, Andromeda. “My First Hackathon; or, Gender, Status, Code, and Sitting at the Table.” Across Divided Networks, April 6, 2012. http://andromedayelton.com/blog/2012/04/06/my-first-hackathon-or-gender-status-code-and-sitting-at-the-table/.
  1. This information comes from Wikipedia, where the claim is uncited, and I am unable to independently verify it. It is presented as common knowledge in a variety of sources, but without citation.

Linked Data in Libraries: Getting into the W3C Library Linked Data Incubator Group

What are libraries doing (or not doing) about linked data? This was the question that the W3C Library Linked Data Incubator Group investigated between May 2010 and August 2011. In this post, I will take a look at the final report of the W3C Library Linked Data Incubator Group (October 2011) and provide an overview of its recommendations along with my own analysis of the issues. Incubator Groups were a program that the W3C ran from 2006 to 2012 to get work done quickly on innovative ideas that were not yet mature enough to begin the formal process of creating web standards, which is the W3C's reason for being. (The Incubator Group program has since transitioned into Community and Business Groups.)

In this report, the participants made several key recommendations aimed at library leaders, library standards bodies, data and systems designers, and librarians and archivists. The recommendations indicate just how far we are from really being able to implement open linked data in every library, but they also reveal the current landscape.

Library Leaders

[Image: An illustration of the VIAF authority file for Jane Austen]

The report calls on library leaders to identify potentially very useful sets of data that can be exposed easily using current practices. That is, they should not try to revolutionize workflows, but evolve towards more linked data. The report mentions authority files as an example of a data set that is ideal for this purpose, since authority files are lists of real-world people with attributes that connect to real things. Having some semantic context for authority files helps–we could imagine a scenario in which you are searching for a common name, but the system recognizes that you are searching for a twentieth-century American author and so does not show you a sixteenth-century British author. Catalogers don’t necessarily have to do anything differently, either, since these authority files can link to other data to make a whole picture. VIAF (Virtual International Authority File) is a project between OCLC and several national libraries to create such a linked international authority file using linked data, bringing authority data into the semantic web.
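
To make the disambiguation scenario concrete, here is a toy sketch in Python–not a real VIAF lookup; the records and attributes are invented–showing how the attributes attached to authority records let a system skip a same-named author from the wrong century:

# Each authority record describes a real-world person with attributes.
AUTHORITY_FILE = [
    {"name": "Jane Smith", "born": 1920, "died": 1995, "nationality": "American"},
    {"name": "Jane Smith", "born": 1540, "died": 1601, "nationality": "British"},
]

def disambiguate(name, century, nationality):
    # Return only the records for this name active in the given century.
    matches = []
    for record in AUTHORITY_FILE:
        born_century = record["born"] // 100 + 1
        if (record["name"] == name
                and born_century == century
                and record["nationality"] == nationality):
            matches.append(record)
    return matches

# The twentieth-century American author, without her
# sixteenth-century British namesake:
print(disambiguate("Jane Smith", 20, "American"))

In a linked data environment those attributes would link out to other data sets rather than live in one local list, but the filtering idea is the same.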

Library leadership must also face the issue of rights in an open data world. It is a truism that libraries hold much valuable cultural and bibliographic data. Yet in many cases we have purchased or leased this data from a vendor rather than creating it ourselves (certainly in the case of indexes, and often with catalog records)–and the license terms may not allow for open sharing of the data. We must be aware that exposing linked data openly is probably not going to mesh well with the way we have done things traditionally. Harvard recently released 12 million bibliographic records under a CC0 (public domain) license, but many libraries might not be in a position to release their own bibliographic records if they did not create them originally. The same goes for indexes and bibliographies, other categories of traditional library materials that seem ripe for linking semantically. Library leadership will have to address this before open linked data is truly possible.

Library Standards Bodies

The report calls on library standards bodies to attack the problem from both sides. First, librarians need to be involved with standardizing semantic web technologies in a way that meets their needs and ensures that the library world stays in line with the way the technology is moving generally. Second, creators of library data standards need to ensure that those standards are compatible with semantic web technologies. Library data, when encoded in MARC, combines meaning and structure in one unit. This works well for people who are reading the data, but it is not easy for computers to parse semantically. For instance, consider:

245 10|aPride and prejudice /|cby Jane Austen.
which viewed in the browser or on a catalog card looks like:

Pride and prejudice /
by Jane Austen.

The 245 tells us that this is the title statement. The first indicator, 1, tells us that an added entry should be made for the title, and the second indicator, 0, tells us that the title doesn’t begin with an article, or “nonfiling character.” The |a gives the actual title, followed by a / character, and then the |c is the statement of responsibility, followed by a period. Note that semantic meaning is mixed together with punctuation and words that are helpful for people, such as “by,” which follow the rules of AACR2. There are good reasons for these rules, but the rules were meant to serve the information needs of humans. Given the capabilities of computers to parse and present structured data meaningfully to humans, it seems vital to make library data understandable to computers so that we can use it to make something more useful to people. You may have noticed that HTML has changed over the past few years in the same way that library data will have to change. If, for instance, you want to give emphasis to a word, you use the <em></em> tags. People know the word is emphasized because it’s in italics; the computer knows it’s emphasized because you told it so. Indicating that a word should be italicized using the <i></i> tags looks the same to a human reader, who can understand the context for the use of italics, but doesn’t tell the computer that the word is particularly important. HTML5 makes even more use of semantic tags so that the standard ways of presenting information on the web are meaningful to computers.
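
As a quick illustration of what parsing this semantically might look like, here is a minimal sketch in Python that pulls the field above apart into labeled pieces. It is only a sketch: real MARC records use control characters as subfield delimiters rather than the human-readable “|” notation shown in this post, and production code would use a library such as pymarc rather than hand-rolled string handling.

FIELD = "245 10|aPride and prejudice /|cby Jane Austen."

def parse_field(raw):
    tag = raw[0:3]                    # "245": the title statement
    indicators = raw[4:6]             # "10": title added entry; no nonfiling characters
    subfields = {}
    for chunk in raw.split("|")[1:]:  # "aPride and prejudice /", "cby Jane Austen."
        code, value = chunk[0], chunk[1:]
        # Strip the ISBD punctuation ("/", ".") meant for human readers.
        subfields[code] = value.rstrip(" /.")
    return {"tag": tag, "indicators": indicators, "subfields": subfields}

parsed = parse_field(FIELD)
print(parsed["subfields"]["a"])  # Pride and prejudice
print(parsed["subfields"]["c"])  # by Jane Austen

Once the title and statement of responsibility are separate, labeled values rather than display punctuation, they can be mapped onto semantic web vocabularies.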

Systems Designers

The recommendations for data and systems designers are to start building tools that use linked data. Without a “killer app,” it’s hard to get excited about semantic technologies. Just after my last post went up, Google released its “Knowledge Graph.” This search takes words that traditionally would be matched as strings and matches them with “things.” For instance, if I type the search string Lincoln Hall into Google, it guesses that I probably mean a concert venue in Chicago with that name and shows me that as the first result. It also displays a map, transit directions, reviews, and an upcoming schedule in the sidebar–certainly very convenient if that’s what I was looking for. But below the results for the concert venue, I get a box stating “See results about Lincoln Hall, Climber.” When I click on this, my results change to information about the Australian climber who recently died, and the sidebar changes to information about him. As a librarian, I know there would have been many ways to improve my search. But semantic web technologies allow Google’s algorithms to understand that, despite having the same name, a concert venue and a mountaineer are very different entities. This greatly reduces the need for sophisticated search techniques to find facts about things. Whether this is, indeed, revolutionary remains to be seen. But try it as a user; you might be pleasantly surprised by how much easier it makes your search. Web-scale discovery may do the same thing for libraries, but that is a tool that remains out of reach of many libraries.
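
The underlying shift–matching typed entities rather than strings–can be sketched in a few lines of Python. The entities below are invented stand-ins for whatever Google actually stores:

# Two very different real-world things behind one query string.
ENTITIES = {
    "Lincoln Hall": [
        {"type": "concert venue", "location": "Chicago"},
        {"type": "mountaineer", "nationality": "Australian"},
    ],
}

def search(query):
    candidates = ENTITIES.get(query, [])
    if not candidates:
        print("No known entities; fall back to keyword matching.")
        return
    best, *others = candidates  # assume the first sense is the likeliest
    print("Results for %s (%s)" % (query, best["type"]))
    for other in others:
        print("See results about %s, %s" % (query, other["type"]))

search("Lincoln Hall")
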
Librarians and Archivists

Librarians and archivists have, as always, a duty to collect and preserve linked data sets. We know how valuable the earliest examples of any form of data storage are–whether a clay tablet, a book, or an index. We create bibliographies to see how knowledge changed over time or in different contexts. We need to be careful to preserve important data sets currently being produced, and to maintain them over time so they remain accessible for future needs. But there is another danger inherent in not being scrupulous about data integrity. Maintaining accurate and diverse data sets will help keep future information factual and unbiased. When a fact is one step removed from its source, it becomes even more difficult to check for accuracy. While outright falsehood or misstatement is possible to correct, it will also be important to present alternate perspectives to ensure that scholarship can progress. (For an example of the issues in presenting only the most mainstream understanding of history, see “The ‘Undue Weight’ of Truth on Wikipedia.”) If linked data doesn’t help us find out anything novel, will there have been a point in linking it?
Conclusion

If you haven’t yet read it, the report is a quick read and clear to people without a technical background, so I encourage you to take a look, particularly at the use cases and the data sets already extant. I hope you will get excited about the possibilities and, even if you are not in a position to use linked data yet, start thinking about what the future could hold. As I mentioned in my last post, the LODLAM (International Linked Open Data in Libraries, Archives, and Museums) Summit blog and the Digital Library Federation-sponsored LOD-LAM Zotero group have lots of resources. There is also an ALA Library Linked Data Interest Group, which sponsors discussions and has a mailing list.