Last month we got the long-awaited ruling in favor of Google in the Authors Guild vs. Google Books case, which by now has been analyzed extensively. Ultimately the judge in the case decided that Google’s digitization was transformative and thus constituted fair use. See InfoDocket for detailed coverage of the decision.
The Google Books project was part of the Google mission to index all the information available, and as such could never have taken place without libraries, which hold all those books. While most, if not all, the librarians I know use Google Books in their work, there has always been a sense that the project should not have been started by a commercial enterprise using the intellectual resources of libraries, but should have been started by libraries themselves working together. Yet libraries are often forced to be more conservative about digitization than we might otherwise be due to rules designed to protect the college or university from litigation. This ruling has made it seem as though we could afford to be less cautious. As Eric Hellman points out, the decision seems to imply that with copyright the ends are the important part, not the means. “In Judge Chin’s analysis, copyright is concerned only with the ends, not the means. Copyright seems not to be concerned with what happens inside the black box.” 1 As long as the end use of the books was fair, which was deemed to be the case, the initial digitization was not a problem.
Looking at this from the perspective of a repository manager, I want to address a few of the theoretical and logistical issues behind such a conclusion for libraries.
What does this mean for digitization at libraries?
At the beginning of 2013 I took over an ongoing digitization project, and as a first-time manager of a large-scale long-term project, I learned a lot about the processes involved in such a project. The project I work with is extremely small-scale compared with many such projects, but even at this scale the project is expensive and time-consuming. What makes it worth it is that long-buried works of scholarship are finally being used and read, sometimes for reasons we do not quite understand. That gets at the heart of the Google Books decision—digitizing books in library stacks and making them more widely available does contribute to education and useful arts.
There are many issues that we need to address, however. Some of the most important ones are what access can and should be provided to what works, and making mass digitization more available to smaller and international cultural heritage institutions. Google Books could succeed because it had the financial and computing resources of Google matched with the cultural resources of the participating research libraries. This problem is international in scope. I encourage you to read this essay by Amelia Sanz, in which she argues that digitization efforts so far have been inherently unequal and a reflection of colonialism. 2 But is there a practical way of approaching this desire to make books available to a wider audience?
There are several separate issues in providing access. Books that are in the public domain are unquestionably fine to digitize, though differences in international copyright law make it difficult to determine what can be provided to whom. As Amelia Sanz points out, Google can only digitize Spanish works prior to 1870 in Spain, but may digitize the complete work in the United States. The complete work is not available to Spanish researchers, but it is available in full to US researchers.
That aside, there are several reasons why it is useful to digitize works still unquestionably under copyright. One of the major reasons is textual corpus analysis–you need to have every word of many texts available to draw conclusions about use of words and phrases in those texts. Google Books ngram viewer is one such tool that comes out of mass digitization. Searching for phrases in Google and finding that phrase as a snippet in a book is an important way to find information in books that might otherwise be ignored in favor of online sources. Some argue that this means that those books will not be purchased when they might have otherwise been, but it is equally possible that this leads to greater discovery and more purchases, which research into music piracy suggests may be the case.
Another reason to digitize works still under copyright is to highlight the work of marginalized communities, though in that case it is imperative to work with those communities to ensure that the digitization is not exploitative. Many orphan works, for which a rights-holder cannot be located, fall under this, and I know from some volunteer work that I have done that small cultural heritage institutions are eager to digitize material that represents the cultural and intellectual output of their communities.
In all the above cases, it is crucial to put into place mechanisms for ensuring that works under copyright are not abused. Google Books uses an algorithm that makes it impossible to read an entire book, which is probably beyond the abilities of most institutions. (If anyone has an idea for how to do this, I would love to hear it.) A simpler and more practical way to limit access is to make only a chapter or sample of a book available for public use, which many publishers already allow. For instance, Oxford University Press allows up to 10% of a work (within certain limits) on personal websites or institutional repositories. (That is, of course, assuming you can get permission from the author.) Many institutions maintain “dark archives”, which are digitized and (usually) indexed archives of material inaccessible to the public, whether institutional or research information. For instance, the US Department of Energy Office of Scientific and Technical Information maintains a dark archive index of technical reports comprising the equivalent of 6 million pages, which makes it possible to quickly find relevant information.
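To make the "sample of a book" idea concrete, here is a minimal sketch of how a repository might decide which pages to expose in a public preview. This is not how Google Books does it; the function name `preview_pages`, the contiguous-sample approach, and the 10% default are all assumptions chosen for illustration.

```python
def preview_pages(total_pages: int, fraction: float = 0.10, first_page: int = 1) -> range:
    """Return the contiguous range of page numbers exposed in a public preview.

    Hypothetical helper: the 10% default mirrors the publisher allowance
    mentioned above, but the right limit depends on your own permissions.
    """
    if total_pages < 1:
        raise ValueError("total_pages must be at least 1")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be greater than 0 and at most 1")
    # Round down so the preview stays within the allowed share,
    # but always show at least one page.
    allowed = max(1, int(total_pages * fraction))
    # Clamp the starting page so the whole sample stays inside the book.
    start = min(max(first_page, 1), total_pages - allowed + 1)
    return range(start, start + allowed)


if __name__ == "__main__":
    pages = preview_pages(200, first_page=15)
    print(f"Public preview: pages {pages[0]} to {pages[-1]}")
```

A check like this only helps if the server enforces it, so the full page images would need to live somewhere the public web server cannot reach.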
In any case where an institution makes the decision to digitize and make available the full text of in-copyright materials for reasons they determine are valid, there are a few additional steps that institutions should take. Institutions should research rights-holders or at least make it widely known to potential rights-holders that a project is taking place. The Orphan Works project at the University of Michigan is an example of such a project, though it has been fraught with controversy. Another important step is to have a very good policy for taking down material when a rights-holder asks–it should be clear to the rights-holder whether any copies of the work will be maintained and for what purposes (for instance archival or textual analysis purposes).
Digitizing, Curating, Storing, Oh My!
The above considerations are only useful when it is even possible for institutions without the resources of Google to start a digitization program. There are many examples of DIY digitization by individuals, for instance see Public Collectors, which is a listing of collections held by individuals open for public access–much of it digitized by passionate individuals. Marc Fischer, the curator of Public Collectors, also digitizes important and obscure works and posts them on his site, which he funds himself. Realistically, the entire internet contains examples of digitization of various kinds and various legal statuses. Most of this takes place on cheap and widely available equipment such as flatbed scanners. But it is possible to build an overhead book scanner for large-scale digitization with individual parts and at a reasonable cost. For instance, the DIY Book Scanning project provides instructions and free software for creating a book scanner. As they say on the site, all the process involves is to “[p]oint a camera at a book and take pictures of each page. You might build a special rig to do it. Process those pictures with our free programs. Enjoy reading on the device of your choice.”
“Processing the pictures” is a key problem to solve. Turning images into PDF documents is one thing, but providing high quality optical character recognition is extremely challenging. Free tools such as FreeOCR make it possible to do OCR from image or PDF files, but this takes processing power and results vary widely, particularly if the scan quality is lower. Even expensive tools like Adobe Acrobat or ABBYY FineReader have the same problems. Karen Coyle points out that uncorrected OCR text may be sufficient for searching and corpus analysis, but does not provide a faithful reproduction of the text and thus cannot, for instance, provide access to visually impaired persons. 3 This is a problem well known in the digital humanities world, and one solved by projects such as Project Gutenberg with the help of dedicated volunteer distributed proofreaders. Additionally, a great deal of material clearly in the public domain is in manuscript form or has text that modern OCR cannot recognize. In that case, crowdsourcing transcriptions is the only financially viable way for institutions to make text of the material available. 4 Examples of successful projects using volunteer transcribers or proofreaders include Ancient Lives to transcribe ancient papyrus, What’s on the Menu at the New York Public Library, and DIYHistory at the University of Iowa libraries. (The latter has provided step by step instructions for building your own version using open source tools.)
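One way to get a feel for how widely OCR results vary is to compare raw OCR output against a page that a volunteer has already proofread. The sketch below uses only Python's standard `difflib` module; the function name `ocr_similarity` is my own, and the score is a rough similarity measure rather than a formal character error rate.

```python
from difflib import SequenceMatcher


def ocr_similarity(ocr_text: str, corrected_text: str) -> float:
    """Return a score from 0 to 1 comparing raw OCR output with a proofread transcription."""
    # Collapse whitespace so that line-break differences between the scan
    # and the transcription do not count as recognition errors.
    raw = " ".join(ocr_text.split())
    clean = " ".join(corrected_text.split())
    return SequenceMatcher(None, raw, clean, autojunk=False).ratio()


if __name__ == "__main__":
    raw_page = "Tbe qnick brown f0x"
    proofread_page = "The quick brown fox"
    print(f"similarity: {ocr_similarity(raw_page, proofread_page):.2f}")
```

Running this over a handful of proofread sample pages per collection could help decide which scans are good enough for search and which need human correction first.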
So now you’ve built your low-cost DIY book scanner, and put together a suite of open source tools to help you process your collections for free. Now what? The whole landscape of storing and preserving digital files is far beyond the scope of this post, but the cost of accomplishing this is probably the highest of anything other than staffing a digitization project, and it is here where Google clearly has the advantage. The Internet Archive is a potential solution to storing public domain texts (though they are not immune to disaster), but if you are making in-copyright works available in any capacity you will most likely have to take the risk on your own servers. I am not a lawyer, but I have never rented server space that would allow copyrighted materials to be posted.
Conclusion: Is it Worth It?
Obviously from this post I am in favor of taking on digitization projects of both public domain and copyrighted materials when the motivations are good and the policies are well thought out. From this perspective, I think the Google Books decision was a good thing for libraries and for providing greater access to library collections. Libraries should be smart about what types of materials to digitize, but there are more possibilities for large-scale digitization, and by providing more access, the research community can determine what is useful to them.
If you have managed a DIY book scanning project, please let me know in the comments, and I can add links to your project.
1. Hellman, Eric. “Google Books and Black-Box Copyright Jurisprudence.” Go To Hellman, November 18, 2013. http://go-to-hellman.blogspot.com/2013/11/google-books-and-black-box-copyright.html.
2. Sanz, Amelia. “Digital Humanities or Hypercolonial Studies?” Responsible Innovation in ICT (June 26, 2013). http://responsible-innovation.org.uk/torrii/resource-detail/1249#_ftnref13.
3. Coyle, Karen. “It’s FAIR!” Coyle’s InFormation, November 14, 2013. http://kcoyle.blogspot.com/2013/11/its-fair.html.
4. For more on this, see Ben Brumfield’s work on crowdsourced transcription, for example Brumfield, Ben W. “Collaborative Manuscript Transcription: ‘The Landscape of Crowdsourcing and Transcription’ at Duke University.” Collaborative Manuscript Transcription, November 23, 2013. http://manuscripttranscription.blogspot.com/2013/11/the-landscape-of-crowdsourcing-and.html.
In honor of Open Access Week, I want to look at some troubling recent discussions about open access, and what academic librarians who work with technology can do. As the manager of an open access institutional repository, I strongly believe that providing greater access to academic research is a good worth pursuing. But I realize that this comes at a cost, and that we have a responsibility to ensure that open access also means integrity and quality.
On “stings” and quality
By now, the article by John Bohannon in Science has been thoroughly dissected in the blogosphere 1. This was not a study per se, but rather a piece of investigative journalism looking into the practices of open access journals. Bohannon submitted variations on an article written under African pseudonyms from fake universities that “any reviewer with more than a high-school knowledge of chemistry…should have spotted the paper’s short-comings immediately.” Over the course of 10 months, he submitted these articles to 304 open access journals whose names he drew from the Directory of Open Access Journals and Jeffrey Beall’s list of predatory open access publishers. Ultimately 157 of the journals accepted the article and 98 rejected it, when any real peer review would have meant that it was rejected in all cases. It is well worth noting that, in an analysis of the raw data that Bohannon supplied, some publishers on Beall’s list rejected the paper immediately, which is a good reminder to take all curation efforts with an appropriate amount of skepticism 2.
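The headline numbers are easier to compare once you notice that the accepted and rejected counts do not add up to the number of submissions. A quick bit of arithmetic, using only the three figures quoted above (the variable names are mine):

```python
submitted = 304
accepted = 157
rejected = 98

decided = accepted + rejected      # journals that reached a final decision
undecided = submitted - decided    # neither accepted nor rejected in these figures

print(f"decided: {decided}, undecided: {undecided}")
print(f"accepted, as a share of decisions: {accepted / decided:.1%}")
print(f"accepted, as a share of all submissions: {accepted / submitted:.1%}")
```

So the acceptance rate is roughly 62% of journals that made a decision, or roughly 52% of all journals that received the paper, depending on which denominator you think is fair.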
There are certainly many methodological flaws in this investigation. Mike Taylor outlines them in detail in his post 3, and concludes that the investigation was specifically aimed at discrediting open access journals in favor of journals such as Science. As Michael Eisen outlines, Science has not been immune to publishing articles that should have been rejected after peer review–though Bohannon informed Eisen that he had intended to look at a variety of journals, that this proved impractical, and that the decision was not influenced by editors at Science. Eisen’s conclusion is that “peer review is a joke” and that we need to stop regarding the publication of an article in any journal as evidence that the article is worthwhile 4. Phil Davis at the Scholarly Kitchen took issue with this conclusion (among others noted above), since despite the flaws, this did turn up incontrovertible evidence that “a large number of open access publishers are willfully deceiving readers and authors that articles published in their journals passed through a peer review process…” 5. His conclusion is that open access agencies such as OASPA and DOAJ should be better at policing themselves, and that on the other side Jeffrey Beall should be cautious about suggesting a potential for guilt without evidence. I think one of the more level-headed responses to this piece comes from outside the library and scholarly publishing world in Steven Novella’s post on Neurologica, a blog focused on science and skepticism written by an academic neurologist. He is a fan of open access and wider access to information, but makes the point familiar to all librarians that the internet creates many more opportunities to distribute both good and bad information.
Open access journals are one response to the opportunities of the internet. As Novella says with author-pays journals particularly in mind, “all new ‘funding models’ have the potential of creating perverse incentives.” Traditional journals fall into the same trap when they rely on impact factor to drive subscriptions, which means they may end up publishing “sexy” studies of questionable validity or failing to publish replication studies which are the backbone of the scientific method–and in fact the only real way to establish results no matter what type of peer review has been done 6.
More “perverse incentives”
So far the criticisms of open access have revolved around one type of “gold” open access, wherein the author (or a funding agency) pays article publication fees. “Green” open access, in which a version of the article is posted in a repository, is not susceptible to abuse in quite the same way. Yet a new analysis of embargo policies by Shan Sutton shows that some publishers are targeting green open access through new policies. Springer used to have a 12 month embargo for mandated deposit in repositories such as PubMed, but now has extended it to all institutional repositories. Emerald changed its policy so that any mandated deposit to a repository (whether by funder or institutional mandate) was subject to a 24 month embargo 7.
In both cases, paid immediate open access is available for $1,595 (Emerald) or $3,000 (Springer). It seems that the publishers are counting on a “mandate” meaning that funds are available for this sort of hybrid gold open access, but that ignores the philosophy behind such mandates. While federal open access mandates do in theory have a financial incentive that the public should not have to pay twice for research, Sutton argues that open access “mandates” at institutions are actually voluntary initiatives by the faculty, and provide waivers without question 8. Additionally, while this type of open access does provide public access to the article, it does not address larger issues of reuse of the text or data in the true sense of open access.
What should a librarian do?
The issues above are complex, but there are a few trends that we can draw on to understand our responsibilities to open access. First, there is the issue of quality, both in terms of researcher experience in working with a journal, and that of being able to trust the validity of an individual article. Second, we have to be aware of the terms under which institutional policies may place authors. As with many such problems, the technological issues are relatively trivial. To actually address them meaningfully will not happen with technology alone, but with education, outreach, and network building.
The major thing we can take away from Bohannon’s work is that we have to help faculty authors to make good choices about where they submit articles. Anyone who works with faculty has stories of extremely questionable practices by journals of all types, both open access and traditional. Speaking up about those practices on an individual basis can result in lawsuits, as we saw earlier this year. Are there technical solutions that can help weed out predatory publishers and bad journals and articles? The Library Loon points out that many factors, some related to technology, have meant that both positive and negative indicators of journal quality have become less useful in recent years. The Loon suggests that “[c]reating a reporting mechanism where authors can rate and answer relatively simple questions about their experiences with various journals seems worthwhile.” 9
The comments to this post have some more suggestions, including open peer review and a forum backed by a strong editor that could be a Yelp-type site for academic publisher reputation. I wrote about open peer review earlier this year in the context of PeerJ, and participants in that system did indeed find the experience of publishing in a journal with quick turnarounds and open reviews pleasant. (Bohannon did not submit a fake article to PeerJ). This solution requires that journals have a more robust technical infrastructure as well as a new philosophy to peer review. More importantly, this is not a solution librarians can implement for our patrons–it is something that has to come from the journals.
The idea that seems to be catching on more is the “Yelp” for scholarly publishers. This seems like a good potential solution, albeit one that would require a great deal of coordinated effort to be truly useful. The technical parts of this type of solution would be relatively easy to carry out. But how to ensure that it is useful for its users? The Yelp analog may be particularly helpful here. When it launched in 2004, it asked users who were searching to give some basic information about their question, and to provide the email addresses of additional people whom they would have traditionally asked for this information. Yelp then emailed those people as well as others with similar searches to get reviews of local businesses to build up its base of information. 10 Yelp took a risk in pursuing content in that way, since it could have been off-putting to potential users. But local business information was valuable enough to early users that they were willing to participate, and this seems like a perfect model to build up a base of information on journal publisher practices.
This helps address the problem of predatory publishers and shifting embargoes, but it doesn’t help as much with the issue of quality assurance for the article content. Librarians teach students how to find articles that claim to be peer reviewed, but long before Bohannon we knew that peer review quality varies greatly, and even when done well tells us nothing about the validity of the research findings. Education about the scholarly communication cycle, the scientific method, and critical thinking skills are the most essential tools to ensure that students are using appropriate articles, open access or not. However, those skills are difficult to bring to bear for even the most highly experienced researchers trying to keep up with a large volume of published research. There are a few technical solutions that may be of help here. Article level metrics, particularly alternative metrics, can aid in seeing how articles are being used. (For more on altmetrics, see this post from earlier this year).
One of the easiest options for article level metrics is the Altmetric.com bookmarklet. This provides article level metrics for many articles with a DOI, or articles from PubMed and arXiv. Altmetric.com offers an API with a free tier to develop your own app. An open source option for article level metrics is PLOS’s Article-Level Metrics, a Ruby on Rails application. These solutions do not guarantee article quality, of course, but hopefully help weed out more marginal articles.
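As a rough illustration of what "develop your own app" might look like, here is a minimal sketch of looking up one DOI. The endpoint `https://api.altmetric.com/v1/doi/` and the response field names (`title`, `score`, `cited_by_tweeters_count`) are my assumptions about the free tier and should be checked against the current Altmetric API documentation; the helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed endpoint for DOI lookups; confirm against the current API docs.
API_BASE = "https://api.altmetric.com/v1/doi/"


def altmetric_url(doi: str) -> str:
    """Build the lookup URL for a DOI, keeping the slash between prefix and suffix."""
    return API_BASE + quote(doi.strip(), safe="/")


def summarize(record: dict) -> dict:
    """Pick a few fields out of a response; the key names are assumptions."""
    return {
        "title": record.get("title"),
        "score": record.get("score"),
        "tweeters": record.get("cited_by_tweeters_count", 0),
    }


def fetch_metrics(doi: str, timeout: float = 10.0) -> dict:
    """Fetch and summarize metrics for one DOI (needs network access)."""
    with urlopen(altmetric_url(doi), timeout=timeout) as response:
        return summarize(json.load(response))


if __name__ == "__main__":
    print(altmetric_url("10.1126/science.342.6154.60"))
```

Keeping the URL building and the field selection separate from the network call makes it easy to adjust either one if the service changes, and a real application would also need to handle articles that the service has no record of.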
No one needs to be afraid of open access
For those working with institutional repositories or other open access issues, it sometimes seems very natural for Open Access Week to fall so near Halloween. But it does not have to be frightening. Taking responsibility for thoughtful use of technical solutions and on-going outreach and education is essential, but can lead to important changes in attitudes to open access and changes in scholarly communication.
1. Bohannon, John. “Who’s Afraid of Peer Review?” Science 342, no. 6154 (October 4, 2013): 60–65. doi:10.1126/science.342.6154.60.
2. “Who Is Afraid of Peer Review: Sting Operation of The Science: Some Analysis of the Metadata.” Scholarlyoadisq, October 9, 2013. http://scholarlyoadisq.wordpress.com/2013/10/09/who-is-afraid-of-peer-review-sting-operation-of-the-science-some-analysis-of-the-metadata/.
3. Taylor, Mike. “Anti-tutorial: How to Design and Execute a Really Bad Study.” Sauropod Vertebra Picture of the Week. Accessed October 17, 2013. http://svpow.com/2013/10/07/anti-tutorial-how-to-design-and-execute-a-really-bad-study/.
4. Eisen, Michael. “I Confess, I Wrote the Arsenic DNA Paper to Expose Flaws in Peer-review at Subscription Based Journals.” It Is NOT Junk, October 3, 2013. http://www.michaeleisen.org/blog/?p=1439.
5. Davis, Phil. “Open Access ‘Sting’ Reveals Deception, Missed Opportunities.” The Scholarly Kitchen. Accessed October 17, 2013. http://scholarlykitchen.sspnet.org/2013/10/04/open-access-sting-reveals-deception-missed-opportunities/.
6. Novella, Steven. “A Problem with Open Access Journals.” Neurologica Blog, October 7, 2013. http://theness.com/neurologicablog/index.php/a-problem-with-open-access-journals/.
7. Sutton, Shan C. “Open Access, Publisher Embargoes, and the Voluntary Nature of Scholarship: An Analysis.” College & Research Libraries News 74, no. 9 (October 1, 2013): 468–472.
8. Ibid., 469.
9. Loon, Library. “A Veritable Sting.” Gavia Libraria, October 8, 2013. http://gavialib.com/2013/10/a-veritable-sting/.
10. Cringely, Robert. “The Ears Have It.” I, Cringely, October 14, 2004. http://www.pbs.org/cringely/pulpit/2004/pulpit_20041014_000829.html.
You may not think much about cryptography on a daily basis, but it underpins your daily work and personal existence. In this post I want to talk about a few realms of cryptography that affect the work of academic librarians, and talk about some interesting facets you may never have considered. I won’t discuss the math or computer science basis of cryptography, but look at it from a historical and philosophical point of view. If you are interested in the math and computer science, I have a few resources listed at the end in addition to a bibliography.
Note that while I will discuss some illegal activities in this post, neither I nor anyone connected with the ACRL TechConnect blog is suggesting that you actually do anything illegal. I think you’ll find the intellectual part of it stimulation enough.
What is cryptography?
Keeping information secret is as simple as hiding it from view in, say, an envelope, and trusting that only the person to whom it is addressed will read that information and then not tell anyone else. But we all know that this doesn’t actually work. A better system would only allow a person with secret credentials to open the envelope, and then for the information inside to be in a code that only she could know.
The idea of codes to keep important information secret goes back thousands of years, but for the purposes of computer science, most of the major advances have been made since the 1970s. In the 1960s, with the advent of computing for business and military uses, it was necessary to come up with ways to encrypt data. In 1976, the concept of public-key cryptography was developed, but it wasn’t realized practically until 1978 with the paper by Rivest, Shamir, and Adleman–if you’ve ever wondered what RSA stood for, there’s the answer. There were some advancements to this system, which resulted in the digital signature algorithm as the standard used by the federal government.1 Public-key systems work basically by creating a private and a public key–the private one is known only to each individual user, and the public key is shared. Anyone can use the public key to lock a message, but without the matching private key the message can’t be opened. See the resources below for more on the math that makes up these algorithms.
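If you would like to see the public and private key relationship in action without the full math, here is a toy sketch of RSA-style arithmetic. The primes, exponent, and message are tiny textbook-style values I picked for illustration; real keys use primes hundreds of digits long along with padding schemes, so treat this as a demonstration of the idea and nothing more.

```python
# Toy numbers only, chosen for illustration; never use values like these in practice.
p, q = 61, 53
n = p * q                    # the modulus, part of both keys
phi = (p - 1) * (q - 1)      # used only while generating the keys
e = 17                       # public exponent, chosen to share no factors with phi
d = pow(e, -1, phi)          # private exponent: modular inverse of e (Python 3.8+)

message = 65
ciphertext = pow(message, e, n)    # anyone can do this with the public key (e, n)
recovered = pow(ciphertext, d, n)  # only the holder of the private exponent d can undo it

assert recovered == message
print(f"public key: ({e}, {n})  private exponent: {d}")
print(f"message {message} -> ciphertext {ciphertext} -> recovered {recovered}")
```

The security of the real thing rests on the fact that recovering `d` from the public values requires factoring `n`, which is trivial for 3233 but impractical for numbers of the size used in practice.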
Another important piece of cryptography is the cryptographic hash function, first developed in the late 1980s. These functions turn a block of data into a short, fixed-length digest that cannot practically be reversed. For instance, passwords stored in databases should be hashed (ideally with a salt) using one of these functions rather than stored as plain text, so that even if someone unauthorized gets access to the database they cannot read the passwords. Hash functions can also be used to verify the identity and integrity of a piece of digital content, which is probably how most librarians think about these functions, particularly if you work with a digital repository of any kind.
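Both uses of hash functions can be tried with Python's standard library. The sketch below is a minimal example under my own assumptions: SHA-256 for file checksums, PBKDF2 with 200,000 iterations for passwords, and helper names that are purely illustrative. Your repository software or local security policy may call for different algorithms and settings.

```python
import hashlib
import hmac
import os


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file, reading it in chunks to save memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_password(password: str, salt: bytes = b"") -> tuple:
    """Return (salt, derived_key); a fresh random salt is generated if none is given."""
    salt = salt or os.urandom(16)
    derived = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 200_000)
    return salt, derived


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Re-derive the key with the stored salt and compare in constant time."""
    _, derived = hash_password(password, salt)
    return hmac.compare_digest(derived, expected)


if __name__ == "__main__":
    salt, stored = hash_password("correct horse battery staple")
    print(verify_password("correct horse battery staple", salt, stored))
    print(verify_password("a wrong guess", salt, stored))
```

For a digital repository, storing the output of `sha256_of_file` at ingest and recomputing it on a schedule is a simple way to notice if a file has changed or been corrupted.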
Why do you care?
You probably send emails, log into servers, and otherwise transmit all kinds of confidential information over a network (whether a local network or the internet). Encrypted access to these services and the data being transmitted is the only way that anybody can trust that any of the information is secret. Anyone who has had a credit card number stolen and had to deal with fraudulent purchases knows first-hand how upsetting it can be when these systems fail. Without cryptography, the modern economy could not work.
Of course, we all know a recent example of cryptography not working as intended. It’s no secret (see above where keeping something a secret requires that no one who knows the information tells anyone else) by now that the National Security Agency (NSA) has sophisticated ways of breaking codes or getting around cryptography through other methods. 2 Continuing with our envelope analogy from above, the NSA coerced companies to allow them to view the content of messages before the envelopes were sealed. If the messages were encoded, they got the keys to decode the data, or broke the code using their vast resources. While these practices were supposedly limited to potential threats, there’s no denying that this makes it more difficult to trust any online communications.
Librarians certainly have a professional obligation to keep data about their patrons confidential, and so this is one area in which cryptography is on our side. But let’s now consider an example in which it is not so much.
Breaking DRM: e-books and DVDs
Librarians are exquisitely aware of the digital rights management realm of cryptography (for more on this from the ALA, see the ALA Copyright Office page on digital rights). These are algorithms that encode media in such a way that you are unable to copy or modify the material. Of course, like any code, once you break it, you can extract the material and do whatever you like with it. As I covered in a recent post, if you purchase a book from Amazon or Apple, you aren’t purchasing the content itself, but a license to use it in certain prescribed ways, so legally you have no recourse to break the DRM to get at the content. That said, you might have an argument under fair use, or some other legitimate reason to break the DRM. It’s quite simple to do once you have the tools to do so. For e-books in proprietary formats, you can download a plug-in for the Calibre program and follow step by step instructions on this site. This allows you to change proprietary formats into more open formats.
As above, you shouldn’t use software like that if you don’t have the rights to convert formats, and you certainly shouldn’t use it to pirate media. But just because it can be used for illegal purposes, does that make the software itself illegal? Breaking DVD DRM offers a fascinating example of this (for a lengthy list of CD and DVD copy protection schemes, see here and for a list of DRM breaking software see here). The case of CSS (Content Scramble System) descramblers illustrates some of the strange philosophical territory into which this can lead. The original code was developed in 1999, and distributed widely, which was initially ruled to be illegal. This was protested in a variety of ways; the Gallery of CSS Descramblers has a lot more on this 3. One of my favorite protest CSS descramblers is the “illegal” prime number, which is a prime number that contains the entire code for breaking the CSS DRM. The first illegal prime number was discovered in 2001 by Phil Carmody (see his description here) 4. This number is, of course, only illegal inasmuch as the information it represents is illegal–in this case it was a secret code that helped break another secret code.
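If it seems odd that a number could "contain" a program, remember that any file is just a sequence of bytes, and any sequence of bytes can be read as one very large integer. Here is a minimal sketch of that idea using a harmless two-letter message; the function names are mine, and this shows only the general principle, not the specific construction behind the illegal prime.

```python
def text_to_int(text: str) -> int:
    """Read the UTF-8 bytes of a string as one big-endian integer."""
    return int.from_bytes(text.encode("utf-8"), "big")


def int_to_text(number: int) -> str:
    """Turn the integer back into bytes and decode them as UTF-8."""
    length = (number.bit_length() + 7) // 8   # number of bytes needed to hold the integer
    return number.to_bytes(length, "big").decode("utf-8")


if __name__ == "__main__":
    number = text_to_int("Hi")
    print(number)               # the message, written as an ordinary integer
    print(int_to_text(number))  # and recovered from that integer
```

Seen this way, asking whether a number can be illegal is really asking whether a particular piece of information can be illegal regardless of how it is written down.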
In 2004, after years of court hearings, the California Court of Appeal overturned one of the major injunctions against posting the code, based on the fact that source code is protected speech under the First Amendment, and that the CSS was no longer a trade secret. So you’re no longer likely to get in trouble for posting this code–but again, using it should only be done for reasons protected under fair use. 5 One of the major reasons you might legitimately need to break the DRM on a DVD is to play DVDs on computers running the Linux operating system, which still has no free legal software that will play DVDs (there is legal software with the appropriate license for $25, however). Given that DVDs are physical media and subject to the first sale doctrine, it is unfair that they are manufactured with limitations to how they may be played, and therefore this is a code that seems reasonable for the end consumer to break. That said, as more and more media is streamed or otherwise licensed, that argument no longer applies, and the situation becomes analogous to e-book DRM.
The Gambling With Secrets video series explains the basic concepts of cryptography, including the mathematical proofs using colors and other visual concepts that are easy to grasp. This comes highly recommended from all the ACRL TechConnect writers.
Since it’s a fairly basic part of computer science, you will not be surprised to learn that there are a few large open courses available about cryptography. This Coursera class from Stanford is currently running, and this Udacity class from University of Virginia is a self-paced course. These don’t require a lot of computer science or math skills to get started, though of course you will need a great deal of math to really get anywhere with cryptography.
A surprising but fun way to learn a bit about cryptography is from the NSA’s Kids website–I discovered this years ago when I was looking for content for my X-Files fan website, and it is worth a look if for nothing else than to see how the NSA markets itself to children. Here you can play games to learn basics about codes and codebreaking.
- Menezes, A., P. van Oorschot, and S. Vanstone. Handbook of Applied Cryptography. CRC Press, 1996. http://cacr.uwaterloo.ca/hac/. 1-2. ↩
- See the New York Times and The Guardian for complete details. ↩
- Touretzky, D. S. (2000) Gallery of CSS Descramblers. Available: http://www.cs.cmu.edu/~dst/DeCSS/Gallery, (September 18, 2013). ↩
- For more, see Caldwell, Chris. “The Prime Glossary: Illegal Prime.” Accessed September 17, 2013. http://primes.utm.edu/glossary/xpage/Illegal.html. ↩
- “DVDCCA v Bunner and DVDCCA v Pavlovich.” Electronic Frontier Foundation. Accessed September 23, 2013. https://www.eff.org/cases/dvdcca-v-bunner-and-dvdcca-v-pavlovich. ↩
Omeka is an easy-to-use content management system for digital exhibits created by the Roy Rosenzweig Center for History and New Media. It’s very modular, so you can customize it for various functions. I won’t go into the details here on how to set up Omeka, but you can read documentation and see example collections at Omeka.org. If you want to experiment with Omeka without installing it on your own server, you can set up a hosted account at Omeka.net.
Earlier this year Omeka was completely rewritten and released as version 2.0 (now 2.1). As with many open source content management systems, it took a while for the contributed plug-ins and themes to catch up to the new release. As of July, most of the crucial contributed plug-ins were available, and if you haven’t yet updated your installation this is a good time to think about doing so. In this post I’m going to focus on the process of customizing Omeka 2.0 to your institution, and specifically creating a custom theme. While there are now several good themes available for 2.0, you will probably want to make a default theme that matches the rest of your website. One of the nice features of Omeka that is quite different from other content management systems is that it is very easy for the people who create exhibits to pick a custom theme that differs from the default theme. That said, providing a custom theme for your institution makes it easy for visitors to know where they are, and will also make it easier on the staff who are creating exhibits since you can adapt the theme to their needs.
Like any design project, you should start with a discussion with the people who use the system most. (If you are new to design, check the ACRL TechConnect posts on design). In my case, there are two archives on campus who both use Omeka for their exhibits. Mock up what the layout should look like–you may not be able to get it perfectly, but use this as a guide to future development. We came up with a rough sketch based on what the archivist liked and didn’t like about templates available, and worked together on determining the priorities for the design. (Side note: if you can get your whole wall painted with whiteboard paint this is a very fun collaborative project.)
Development is very easy to start when you are modifying an existing theme. Start with a theme (there are only a few that are 2.0 compatible) that is close to what you need. Rather than the subtheme system you may be used to with Drupal or WordPress, with Omeka you can pick the theme you want to hack on and copy the entire directory and rename it.
Here was the process I followed to build my theme. I suggest that you set up a local development environment (I used XAMPP) to do this work, but make sure that you have at least one exhibit to test, since some of the CSS is different for exhibits than for the rest of the site.
- Pick a theme
I started with the Seasons theme. I copied the seasons directory from the themes directory and pasted it back with a new name of luctest (which I renamed when it was time to move it to a production environment).
- Modify theme.ini
This is what you will start with. You really only need to edit the author, title, and description unless you want to edit the rest.
    [theme]
    author = "Roy Rosenzweig Center for History and New Media"
    title = "Seasons"
    description = "A colorful theme with a configuration option to switch style sheets for a particular season, plus 'night'."
    license = "GPLv3"
    website = "http://omeka.org"
    support_link = "http://omeka.org/forums/forum/themes-and-public-display"
    omeka_minimum_version="2.0"
    omeka_target_version="2.0"
    version="2.1.1"
    tags="yellow, blue, summer, season, fall, orange, green, dark"
- Modify config.ini
Check which elements are set in the configuration (i.e. the person such as an archivist who is creating the exhibit can set them) and which you need to set in the theme. This can cause a lot of frustration when you attempt to style an element whose value is actually set by the user. If you don’t want to allow the user to change anything, you can take that option out of the config.ini, just make sure you’ve set it elsewhere.
    [config]
    ; Style Sheet
    style_sheet.type = "select"
    style_sheet.options.label = "Style Sheet"
    style_sheet.options.description = "Choose a style sheet"
    style_sheet.options.multiOptions.spring = "Spring"
    style_sheet.options.multiOptions.summer = "Summer"
    style_sheet.options.multiOptions.autumn = "Autumn"
    style_sheet.options.multiOptions.winter = "Winter"
    style_sheet.options.multiOptions.night = "Night"
    style_sheet.options.value = "winter"
    logo.type = "file"
    logo.options.label = "Logo File"
    logo.options.description = "Choose a logo file. This will replace the site title in the header of the theme. Recommended maximum width for the logo is 500px."
    logo.options.validators.count.validator = "Count"
    logo.options.validators.count.options.max = "1"
    display_featured_item.type = "checkbox"
    display_featured_item.options.label = "Display Featured Item"
    display_featured_item.options.description = "Check this box if you wish to show the featured item on the homepage."
    display_featured_item.options.value = "1"
    display_featured_collection.type = "checkbox"
    display_featured_collection.options.label = "Display Featured Collection"
    display_featured_collection.options.description = "Check this box if you wish to show the featured collection on the homepage."
    display_featured_collection.options.value = "1"
    display_featured_exhibit.type = "checkbox"
    display_featured_exhibit.options.label = "Display Featured Exhibit"
    display_featured_exhibit.options.description = "Check this box if you wish to show the featured exhibit on the homepage."
    display_featured_exhibit.options.value = "1"
    homepage_recent_items.type = "text"
    homepage_recent_items.options.label = "Homepage Recent Items"
    homepage_recent_items.options.description = "Choose a number of recent items to be displayed on the homepage."
    homepage_recent_items.options.maxlength = "2"
    homepage_text.type = "textarea"
    homepage_text.options.label = "Homepage Text"
    homepage_text.options.description = "Add some text to be displayed on your homepage."
    homepage_text.options.rows = "5"
    homepage_text.options.attribs.class = "html-input"
(This is just a sample of part of the config.ini file).
- Modify CSS
Open up css/style.css and check which elements you need to modify (note that some themes may have the style sheets divided up differently). Some items are obvious; for others you will have to use Firebug or another tool to determine which class styles the element. You can always ask in the Omeka themes and display forum if you can’t figure it out.
The Seasons theme has different styles for each color scheme, and in the interests of time I picked the color scheme closest to the color scheme I wanted to end with. You could use the concept of different schemes to identify the collections and/or exhibits of different units. Make sure you read through the whole style sheet first to determine which elements are theme-wide, and which are set in the color scheme.
- Test, test, test
The 2.0 themes that I’ve experimented with are all responsive and work well with different browsers. This probably goes without saying, but if you have changed the spacing at all, make sure you test your design in multiple window sizes and devices.
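The first two steps above (copying a theme directory under a new name and editing a few values in theme.ini) can also be scripted if you find yourself doing them repeatedly. This is a minimal sketch rather than anything Omeka provides: the function names are my own, the paths in the usage note are placeholders, and it assumes the simple `key = "value"` layout shown in the sample theme.ini.

```python
import re
import shutil
from pathlib import Path


def clone_theme(themes_dir: Path, source: str, target: str) -> Path:
    """Copy themes/<source> to themes/<target> and return the new path.

    copytree refuses to overwrite an existing directory, which protects
    a theme you have already started customizing.
    """
    new_theme = themes_dir / target
    shutil.copytree(themes_dir / source, new_theme)
    return new_theme


def set_ini_values(ini_path: Path, values: dict) -> None:
    """Rewrite matching `key = "value"` lines, leaving other lines alone."""
    updated = []
    for line in ini_path.read_text(encoding="utf-8").splitlines():
        match = re.match(r"\s*([A-Za-z_]+)\s*=", line)
        if match and match.group(1) in values:
            key = match.group(1)
            line = f'{key} = "{values[key]}"'
        updated.append(line)
    ini_path.write_text("\n".join(updated) + "\n", encoding="utf-8")
```

For example, `clone_theme(Path("omeka/themes"), "seasons", "luctest")` followed by `set_ini_values(theme / "theme.ini", {"author": "Your Library", "title": "Your Theme"})` would reproduce what I did by hand. Values that themselves contain double quotes are not handled by this sketch.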
We have a few additional items to add to this design, but it’s met our immediate needs very well, and most importantly matches the design of the Archives and Special Collections website so it’s clear to users that they are still in the right place.
Since this was a new content management system to me, I still have a lot to learn about the best ways to do certain things. This experience was helpful not just in learning Omeka, but also as a small-scale test of planning a new theme for our entire library website, which runs on Drupal.
Whether a small university press focusing on niche markets or one of the Big Six giants looking for the next massive bestseller, publishers across the industry have been struggling to come to terms with the reality of new distribution models. Those models tend to favor cheaper and faster production with a much lower threshold for access, which generally has been good news for consumers. Several recent rulings and statements have brought the issues to the forefront of conversation and perhaps indicated some common themes in publishing which are relevant to all libraries and their ability to purchase and/or provide digital content.
Academic Publishing: Dissertation == Monograph?
On July 22 the American Historical Association issued a “Statement on Policies Regarding the Embargoing of Completed History PhD Dissertations”. In this statement, the American Historical Association recommended that all libraries and graduate programs allow dissertations to be embargoed for up to six years. This is, in theory, to allow junior scholars enough time to publish a monograph based on the dissertation in order to receive tenure. This would be under the assumption that academic publishers would not publish a book based on a dissertation freely available online. Reactions to this statement prompted the AHA to release a Q & A page to clarify and support their position, including pointing out that publishers’ positions are too unclear to be sure there is no risk to an open access dissertation, and “like it or not”, junior faculty must produce a monograph to get tenure. They claim that in some cases an embargo benefits junior scholars by giving them more time to revise their work before publication–while this is true, it indicates that a dissertation is not equivalent to a published scholarly monograph. The argument from the publisher’s side appears to be that libraries (who are the main purchasers of scholarly monographs) will not purchase books based on revised dissertations freely available online, the truth of which has been debated widely. Libraries do purchase print copies of titles (both monographs and serials) which are freely available online.
From my personal experience as an institutional repository manager, I know the attitude to embargoing dissertations varies widely by advisor and department. Like most people making an argument about this topic, I do not have much more than anecdotes to provide. I checked the most commonly downloaded dissertations from the past year, and it appears that the most frequently downloaded title (over 2,000 downloads during 2012-2013) is also the only one that has been published as a book that has been purchased by at least one library. Clearly this does not control for all variables and warrants further study, but it is a useful clue that open access availability does not always affect publication and later purchase. Further, from the point of view of open access creating more equal access to resources across the world, Google Analytics for that dissertation indicates that the sessions over the past year with the most engaged users came from, in order, the UK, the United States, Mauritius, and Sri Lanka.
What Should a Digital Book Cost?
In mid-July Denise Cote, the judge in the Apple e-book price fixing case, issued an opinion stating that Apple did collude with the publishers to set prices on ebooks. Reading the story of the negotiations in the opinion is a thrilling behind the scenes look at companies trying to get a handle on a fairly new market and trying to understand how they will make money. Below I summarize the 160 page opinion, which is well worth reading in its entirety.
The problem with ebook pricing started with Amazon, which set a price of $9.99 for new releases that normally would have had list prices of $25-$30. This was frustrating to the major publishing houses, who worried (probably rightly so) that consumers would be unwilling to pay more than $10 for books after getting used to this low price point. Amazon would effectively price everyone else out of the market. Even after publishers raised the wholesale price of new releases, Amazon would sell them at a loss to preserve the $9.99 price. The publishers spent 2009 developing strategies to combat Amazon, but it wasn’t until late 2009 with the entry of Apple into the ebook market that they saw a real opportunity.
Apple agreed with the Big Six publishers that setting all books at $9.99 was too low, but was unwilling to enter into a market in which they could not compete with Amazon. To accomplish this, they wanted the publishers to agree to the same terms, which included lower wholesale prices for ebooks. The negotiations that followed over late 2009 and early 2010 started positively, but ended in dissatisfaction. Because Apple was unwilling to sell anything as a loss leader, they felt that a wholesale model would leave them too vulnerable to Amazon. To address that, they proposed to sell books with an agency model (which several publishers had suggested). With an agency model, Apple would collect a 30% commission on sales just as they did with the App Store. To ensure that publishers did not set unrealistically high prices, Apple would set pricing caps. The other crucial move that Apple made was to insist that publishers move all retailers of ebooks to the agency model in order to ensure Apple would be able to compete on price across the board. Amazon had no interest in the agency model, and in early 2010 had a series of meetings with the publishers that made this clear. After all the agreements were signed with Apple (the only Big Six publisher who did not participate was Random House), the publishers needed to move Amazon to an agency model to fulfill the terms of their contract. Macmillan was the first publisher to set up a meeting with Amazon to discuss this requirement. The response to the meeting was for Amazon to remove the “buy” button from all Macmillan books, both print and Kindle editions. Amazon eventually had to capitulate to the publishers and move to an agency model, a change that was complete by mid-2010, but submitted a complaint to the Federal Trade Commission. Random House finally agreed to an agency model with Apple in early 2011, thanks to a spot of blackmail on Apple’s part (it wouldn’t allow any Random House apps without an agency deal).
Ultimately the court determined that Apple violated the Sherman Act by conspiring with the publishers to force all their retailers to sell books at the same prices and thus removing competition. A glance at Amazon’s Kindle store bestsellers today shows books priced from $1.99 to $13.99 for the newest Stephanie Plum mystery (the same price as it is in the Apple bookstore). For all titles priced higher than $9.99, Amazon notes that the “price is set by the publisher.” Whether this means anything to the average consumer is debatable. Compare these negotiations to the on-going struggle libraries have had with availability of ebooks for lending–publishers have a lot to learn about libraries in addition to new models for digital sales, some of which was covered at the series of talks with the Big Six publishers that Maureen Sullivan held in early 2012. Over recent months publishers have made more ebooks available to libraries. But some libraries, most notably the Douglas County, Colorado libraries, are setting their own terms for purchasing and lending ebooks.
What Can You Do With a Digital File?
The last ruling I want to address is about the music resale service ReDigi, about which Kevin Smith goes into detail. This was a service that provided a way for people to re-sell purchased MP3s, but ultimately the judge ruled that it was impossible to transfer the original file and so this did not fit under the first sale doctrine. The first sale doctrine (17 USC § 109) holds that “the owner of a particular copy or phonorecord lawfully made … is entitled, without the authority of the copyright owner, to sell or otherwise dispose of the possession of that copy or phonorecord.” Another case that was decided in April by the Supreme Court, Kirtsaeng v. Wiley, upheld this in the case of international sales of physical items, which was an important decision for libraries. But digital materials are more complicated. First sale applies to computer programs on physical media (except in certain circumstances), but does not cover material that has been licensed rather than sold, which is how most digital files are distributed. (For how the US Attorney’s Office approaches this in criminal investigations, see this document.) So when you “buy” that Kindle book from Amazon or load a book onto your iPad you are licensing the product for limited use on a limited number of devices, with no legal recourse for lending or getting rid of the content, even if you try hard to follow the law as ReDigi did. Librarians are well aware of this and its implications, and license quite a bit of content that we can loan and/or distribute under limited circumstances. Libraries are safest in the long term if they can own the content outright rather than licensing, as are consumers. But it will be a long time before there is clarity about the legal way to transfer ownership of a digital file at the consumer level.
Librarians and publishers have a complicated relationship. We need each other if either is to succeed, but even if our ends are ultimately the same, our means are very different. These recent events indicate that there is still much in flux and plenty of room for constructive dialog with content creators and publishers.
Many of us have had conversations in the past few weeks about data collection due to the reports about the NSA’s PRISM program, but ever since April and the bombings at the Boston Marathon, there has been an increased awareness of how much data is being collected about people in an attempt to track down suspects–or, increasingly, stop potential terrorist events before they happen. A recent Nova episode about the manhunt for the Boston bombers showed one such example of this at the New York Police Department. This program, called the Domain Awareness System, consists of live footage from almost every surveillance camera in New York City playing in one room, with the ability to search for features of individuals and even the ability to detect people acting suspiciously. Add to that a demonstration of cutting edge facial recognition software development at Carnegie Mellon University, and reality seems to be moving ever closer to science fiction movies.
Librarians focused on technical projects love to collect data and make decisions based on that data. We try hard to get data collection systems as close to real time as possible, and work hard to make sure we are collecting and analyzing as much data as possible. The idea of a series of cameras to track exactly what our patrons are doing in the library in real time might seem very tempting. But as librarians, we value the ability of our patrons to access information with as much privacy as possible–like all professions, we treat the interactions we have with our patrons (just as we would clients, patients, congregants, or sources) with care and discretion (See Item 3 of the Code of Ethics of the American Library Association). I will not address the national conversation about privacy versus security in this post–I want to address the issue of data collection right where most of us live on a daily basis inside analytics programs, spreadsheets, and server logs.
What kind of data do you collect?
Let’s start with an exercise. Write a list of all the statistical reports you are expected to provide your library–for most of us, it’s probably a very long list. Now, make a list of all the tools you use to collect the data for those statistics.
Here are a few potential examples:
Website visitors and user experience
- Google Analytics or some other web analytics tool
- Heat map tool
- Server logs
Electronic resource access reports
- Electronic resources management application
- Vendor reports (COUNTER and other)
- Link resolver click-through report
- Proxy server logs
How much is enough?
Think about what type of data these tools are collecting about your users. Some of it may be very private indeed. For instance, the heat map tool I’ve recently started using (Inspectlet) not only tracks clicks, but actually records sessions as patrons use the website. This is fascinating information–we had, for instance, one session that was a patron opening the library website, clicking the Facebook icon on the page, and coming back to the website nearly 7 hours later. It was fun to see that people really do visit the library’s Facebook page, but the question was immediately raised whether it was a visit from on campus. (It was–and it wouldn’t have taken long to figure out if it was a staff machine and who was working that day and time). IP addresses from off campus are very easy to track, sometimes down to the block–again, easy enough to tie to an individual. We like to collect IP addresses for abusive or spamming behavior and block users based on IP address all the time. But what about in this case? During the screen recordings I can see exactly what the user types in the search boxes for the catalog and discovery system. Luckily, Inspectlet allows you to obscure the last two octets of the IP address (which is legally required in some places), so you can have less information collected. All similar tools should allow you the same ability.
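If you keep your own logs, the same kind of masking is easy to apply before anything is stored. The following is a minimal sketch using Python's standard `ipaddress` module; it is not how Inspectlet does it, and the function name `mask_ipv4` is just something I made up for the example. It zeroes out the trailing octets of an IPv4 address, keeping the first two by default.

```python
import ipaddress


def mask_ipv4(address: str, keep_octets: int = 2) -> str:
    """Zero out the trailing octets of an IPv4 address."""
    if not 0 <= keep_octets <= 4:
        raise ValueError("keep_octets must be between 0 and 4")
    ip = ipaddress.IPv4Address(address)  # raises ValueError on bad input
    prefix_bits = 8 * keep_octets
    # strict=False lets us build a network from a host address;
    # the network address is the original with the host bits cleared.
    network = ipaddress.ip_network(f"{ip}/{prefix_bits}", strict=False)
    return str(network.network_address)


masked = mask_ipv4("192.0.2.123")  # "192.0.0.0"
```

Keeping two octets is usually still enough to tell on-campus from off-campus traffic, which is often all a usage report needs.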
Consider another case: proxy server logs. In the past when I did a lot of EZProxy troubleshooting, I found the logs extremely helpful in figuring out what went wrong when I got a report of trouble, particularly when it had occurred a day or two before. I could see the username, what time the user attempted to log in or succeeded in logging in, and which resources they accessed. Let’s say someone reported not being able to log in at midnight– I could check to see the failed logins at midnight, and then that username successfully logging in at 1:30 AM. That was a not infrequent occurrence, as usually people don’t think to write back and say they figured out what they did wrong! But I could also see everyone else’s logins and which articles they were reading, so I could tell (if I wanted) which grad students were keeping up with their readings or who was probably sharing their login with their friend or entire company. Where I currently work, we don’t keep the logs for more than a day, but I know a lot of people are out there holding on to EZProxy logs with the idea of doing “something” with them someday. Are you holding on to more than you really want to?
Let’s continue our exercise. Go through your list of tools, and make a list of all the potentially personally identifying information each tool collects, whether or not you use it. Are you surprised by anything? Make a plan to obscure unused pieces of data on a regular basis if it can’t be done automatically. Consider also what you can reasonably do with the data in your current job requirements, rather than future study possibilities. If you do think the data will be useful for a future study, make sure you are saving anonymized data sets unless it is absolutely necessary to have personally identifying information. In the latter case, you should clear your study in advance with your Institutional Review Board and follow a data management plan.
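As one way to think about saving a data set without names attached, here is a small sketch that replaces usernames with a keyed hash and drops the IP address entirely. The record layout is hypothetical (real proxy logs look different and would need to be parsed first), and the function names and the key are placeholders. Strictly speaking this is pseudonymization rather than full anonymization: the same person still gets the same label, so patterns of use can still identify someone in a small population.

```python
import hashlib
import hmac


def pseudonymize(username: str, secret_key: bytes) -> str:
    """Replace a username with a keyed hash.

    Without the key, nobody can recompute the label from a guessed username.
    """
    digest = hmac.new(secret_key, username.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymize_records(records, secret_key: bytes):
    """Return copies of the records with the username replaced and the IP dropped."""
    cleaned = []
    for record in records:
        cleaned.append({
            "user": pseudonymize(record["user"], secret_key),
            "time": record["time"],
            "resource": record["resource"],
        })
    return cleaned


# Hypothetical records for illustration only.
sample = [
    {"user": "jdoe", "ip": "192.0.2.10", "time": "2013-06-01T00:02", "resource": "db-a"},
    {"user": "jdoe", "ip": "192.0.2.10", "time": "2013-06-01T01:30", "resource": "db-b"},
    {"user": "asmith", "ip": "198.51.100.4", "time": "2013-06-01T09:15", "resource": "db-a"},
]
key = b"replace-with-a-long-random-secret"  # placeholder; keep the real key private
anonymized = anonymize_records(sample, key)
```

If you destroy the key once the data set is built, even you can no longer link a label back to a username by recomputing the hash.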
A privacy and data management policy should include at least these items:
- A statement about what data you are collecting and why.
- Where the data is stored and who has access to it.
- A retention timeline.
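A retention timeline is easiest to keep when it is enforced by a script rather than by memory. This is a minimal sketch under a few assumptions of my own: logs live as individual files in one directory, a file's modification time is a fair proxy for its age, and the function names are invented for the example. It lists the files older than the retention period and deletes them.

```python
import time
from pathlib import Path
from typing import List, Optional


def expired_files(log_dir: Path, retention_days: int,
                  now: Optional[float] = None) -> List[Path]:
    """List files in log_dir last modified before the retention cutoff."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 24 * 60 * 60
    return sorted(
        path for path in log_dir.iterdir()
        if path.is_file() and path.stat().st_mtime < cutoff
    )


def purge(log_dir: Path, retention_days: int) -> List[Path]:
    """Delete expired files and return the paths that were removed."""
    removed = expired_files(log_dir, retention_days)
    for path in removed:
        path.unlink()
    return removed
```

Running `expired_files` on its own first is a safe way to see what a given retention period would remove before scheduling `purge` to run nightly.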
What we can do with data
In all this I don’t at all mean to imply that we shouldn’t be collecting this data. In both the examples I gave above, the data is extremely useful in improving the patron experience even while giving identifying details away. Not collecting data has trade-offs. For years, libraries have not retained a patron’s borrowing record to protect his or her privacy. But now patrons who want to have an online record of what they’ve borrowed from the library must use third-party services with (most likely) much less stringent privacy policies than libraries. By not keeping records of what users have checked out or read through databases, we are unable to provide them personalized automated suggestions about what to read next. Anyone who uses Amazon regularly knows that they will try to tempt you into purchases based on your past purchases or books you were reading the preview of–even if you would rather no one know that you were reading that book and certainly don’t want suggestions based on it popping up when you are doing a collection development project at work and are logged in on your personal account. In all the decisions we make about collecting or not collecting data, we have to consider trade-offs like these. Is the service so important that the benefits of collecting the data outweigh the risks? Or, is there another way to provide the service?
We can see some examples of this trade-off in two similar projects coming out of Harvard Library Labs. One, Library Hose, was a Twitter stream with the name of every book being checked out. The service ran for part of 2010, and has been suspended since September of 2010. In addition to daily tweet limits, this also was a potential privacy violation–even if it was a fun idea (this blog post has some discussion about it). A newer project takes the opposite approach–books that a patron thinks are “awesome” can be returned to the Awesome Box at the circulation desk and the information about the book is collected on the Awesome Box website. This is a great tweak to the earlier project, since this advertises material that’s now available rather than checked out, and people have to opt in by putting the item in the box.
In terms of personal recommendations, librarians have the advantage of being able to form close working relationships with faculty and students so they can make personal recommendations based on their knowledge of the person’s work and interests. But how to automate this without borrowing records? One example is a project that Ian Chan at California State University San Marcos has done to use student enrollment data to personalize the website based on a student’s field of study. (Slides). This provides a great deal of value for the students, who need to log in to check their course reserves and access articles from off campus anyway. This adds on top of that basic need a list of recommended resources for students, which they can choose to star as favorites.
Work to educate your patrons about privacy, particularly online privacy. ALA has a Choose Privacy Week, which is always the first week in May. The site for that has a number of resources you might want to consult in planning programming. Academic librarians may find it easiest to address college students in terms of their presence on social media when it comes to future job hunting, but this is just an opening to larger conversations about data. When you ask patrons to use a third party service (such as a social network) or recommend a service (such as a book recommending site), make sure they are aware of what information they are sharing.
We all know that Google’s slogan is “Don’t be evil”, but it’s not always clear if they are sticking to that. Make sure that you are not being evil in your own data collection.
In April of this year, the two most popular free citation managers–Mendeley and Zotero–both underwent some big changes. On April 8th, TechCrunch announced that Elsevier had purchased Mendeley, which had been surmised in January. 1 Just a few days later, Zotero announced the release of version 4, with a number of new features. 2 Just as with the sunsetting of Google Reader, this has prompted many to consider what citation managers they have been using and think about switching or changing practices. I will not address subscription or paid products like RefWorks and EndNote specifically, though there are certainly many reasons you might prefer one of those products.
Mendeley: a new Star Wars movie in the making?
The rhetoric surrounding Elsevier’s acquisition of Mendeley was generally alarmist in nature, and the hashtag “#mendelete” that popped up immediately after the announcement suggests that many people’s first instinct was to abandon Mendeley. Elsevier has been held up as a model of anti-open access, and Mendeley as a model for open access. Yet Mendeley has always been a for-profit company, and, like Google, benefits itself and its users (particularly the science community) by knowing what they are reading and sharing. After all, the social features of Mendeley wouldn’t have any value if there was no public sharing. Institutional Mendeley accounts allow librarians to see what their users in aggregate are reading and saving, which helps them make collection development decisions– a service beyond what the average institutional citation manager product accomplishes. Victor Henning promises on the Mendeley blog that nothing will change, and that this will give them more freedom to develop more features 3. As for Elsevier, Oliver Dumon promises that Mendeley will remain independent and allowed to follow their own course–and that bringing it together with ScienceDirect and Scopus will create a “central workflow and collaboration site for authors”.4
There are two questions to be answered in this. First, is it realistic to assume that the Mendeley team will have the creative freedom they say they will have? And second, are users comfortable with their data being available to Elsevier? For many, the answers to both these questions seem to be “no” and “no.” A more optimistic point of view is that if Elsevier must placate Mendeley users who are open access advocates, they will allow more openness than before.
It’s too early to say, but I remain hopeful that Mendeley can continue to create a more open spirit in academic publishing. Jason Hoyt (a former employee of Mendeley and co-founder of PeerJ) suggests that much of the work that he oversaw to open up Mendeley was being stymied by Elsevier specifically. For him, this went against his personal ethos and so he was unable to stay at Mendeley–but he is confident in the character and ability of the people remaining at Mendeley. 5 I have never been a heavy user of Mendeley, but I have maintained a free account for the past few years. I use it mainly to create a list of my publications on my personal website, using a WordPress plug-in that uses the Mendeley API.
What’s new with Zotero
Zotero is a very different product than Mendeley. First, it is open-source software, with lots of ways to participate in development. Zotero was developed by the Roy Rosenzweig Center for History and New Media at George Mason University, with foundation and user support. It was developed specifically to support the research work of humanists. Originally a Firefox plug-in, Zotero now works as a standalone piece of software that interacts with Firefox, Chrome, and Safari to recognize bibliographic data on websites and pull them into a database that can be synced across computers (and even some third party mobile software). The newest version of Zotero includes several improvements. The one I am most excited about is detailed download display, which tells you what folder you’re saving a reference into, which is crucial for my workflow. Zotero is the citation manager I use on a daily basis, and I rely on it for formatting the footnotes you see on ACRL TechConnect posts or other research articles I produce. Since much of my research is on the open web, books, or other non-journal article resources, I find the ability of Zotero to pick up library catalog records and similar metadata more useful than the Mendeley import bookmarklet.
Both Zotero and Mendeley offer free storage for metadata and PDFs, with a cost for storage above the free level. (It is also possible to use a WebDAV server for syncing Zotero files).
|Zotero storage|Zotero price|Mendeley storage|Mendeley price|
|---|---|---|---|
|2 GB|$20 / year|2 GB|Free|
|6 GB|$60 / year|5 GB|$55 / year|
|10 GB|$100 / year|10 GB|$110 / year|
|25 GB|$240 / year|Unlimited|$165 / year|
Some concluding thoughts
Several graduate students in science 6 have written blog posts about switching away from Mendeley to Zotero. But they aren’t the same thing at all, and given the backgrounds of their creators, Mendeley is more skewed to the sciences, and Zotero more to the humanities.
Nor, as I like to point out, must they be mutually exclusive. I use Zotero for my daily citation management since I much prefer it for grabbing citations online, but sync my Zotero library with Mendeley to use the social and API features in Mendeley. I can choose to do this as an individual, but consider carefully the implications of your choice if you are considering an institutional subscription or requiring students or members of a research group to use a particular service.
- Lunden, Ingrid. “Confirmed: Elsevier Has Bought Mendeley For $69M-$100M To Expand Its Open, Social Education Data Efforts.” TechCrunch, April 8, 2013. http://techcrunch.com/2013/04/08/confirmed-elsevier-has-bought-mendeley-for-69m-100m-to-expand-open-social-education-data-efforts/. ↩
- Takats, Sean. “Zotero 4.0 Launches.” Zotero, April 11, 2013. http://www.zotero.org/blog/zotero-4-0-launches/. ↩
- Henning, Victor. “Mendeley and Elsevier – Here’s More Info.” Mendeley Blog, April 19, 2013. http://blog.mendeley.com/community-relations/mendeley-and-elsevier-heres-more-info/. ↩
- Dumon, Oliver. “Elsevier Welcomes Mendeley.” Elsevier Connect, April 8, 2013. http://elsevierconnect.com/elsevier-welcomes-mendeley/. ↩
- Hoyt, Jason. “My Thoughts on Mendeley/Elsevier & Why I Left to Start PeerJ,” April 9, 2013. http://enjoythedisruption.com/post/47527556151/my-thoughts-on-mendeley-elsevier-why-i-left-to-start. ↩
- For one, see “Mendeley Sells Out; I’m Moving to Zotero.” LJ Villanueva’s Research Blog. Accessed May 20, 2013. http://research.coquipr.com/archives/492. ↩
Academic librarians working in technical roles may rarely see stacks of books, but they doubtless see messy digital data on a daily basis. OpenRefine is an extremely useful tool for dealing with this data without sophisticated scripting skills and with a very low learning curve. Once you learn a few tricks with it, you may never need to force a student worker to copy and paste items onto Excel spreadsheets.
As this comparison by the creator of OpenRefine shows, the best use for the tool is to explore and transform data, and it allows you to make edits to many cells and rows at once while still seeing your data. This allows you to experiment and undo mistakes easily, which is a great advantage over databases or scripting where you can’t always see what’s happening or undo the typo you made. It’s also a lot faster than editing cell by cell like you would do with a spreadsheet.
Here’s an example of a project that took hours when I did it in a spreadsheet, but took far less time when I redid it in Google Refine. One of the quickest things to do with OpenRefine is spot words or phrases that are almost the same, and possibly are the same thing. Recently I needed to turn a large export of data from the catalog into data that I could load into my institutional repository. There were only certain allowed values that could be used in the controlled vocabulary in the repository, so I had to modify the bibliographic data from the catalog (which was of course in more or less proper AACR2 style) to match the vocabularies available in the repository. The problem was that the data I had wasn’t consistent–there were multiple types of abbreviations, extra spaces, extra punctuation, and outright misspellings. An example is the History Department. I can look at “Department of History”, “Dep. of History”, “Dep of Hist.” and tell these are probably all referring to the same thing, but it’s difficult to predict those potential spellings. While I could deal with much of this with regular expressions in a text editor and find and replace in Excel, I kept running into additional problems that I couldn’t spot until I got an error. It took several attempts at loading the data before I cleared out all the errors.
In OpenRefine this is a much simpler task, since you can use it to find everything that probably is the same thing despite slight differences in spelling, punctuation, and spacing. So rather than trying to write a regular expression that accounts for all the differences between “Department of History”, “Dep. of History”, “Dep of Hist.”, you can find all the clusters of text that include those elements and change them all in one shot to “History”. I will have more detailed instructions on how to do this below.
Installation and Basics
OpenRefine was called Google Refine until last October, and while the content from the Google Refine page is being moved to the OpenRefine page you should plan to look at both sites. Documentation and video tutorials refer interchangeably to Google Refine and OpenRefine. The official and current documentation is on the OpenRefine GitHub wiki. For specific questions you will probably want to use the OpenRefine Custom Search Engine, which brings together the mix of documentation and tutorials on the web. OpenRefine is a web app that runs on your computer, so you don’t need an internet connection to run it. You can get the installation instructions on this page.
While you can jump in right away and get started playing around, it is well worth your time to watch the tutorial videos, which will cover the basic actions you need to take to start working with data. As I said, the learning curve is low, but not all of the commands will make sense until you see them in action. These videos will also give you an idea of what you might be able to do with a data set you have lying around. You may also want to browse the “recipes” on the OpenRefine site, as well as search online for additional interesting things people have done. You will probably think of more ideas about what to try. The most important thing to know about OpenRefine is that you can undo anything, and go back to the beginning of the project before you messed up.
A basic understanding of the Google Refine Expression Language, or GREL, will improve your ability to work with data. There isn’t a whole lot of detailed documentation, so you should feel free to experiment and see what happens when you try different functions. You will see from the tutorial videos the basics you need to know. Another essential tool is regular expressions. So much of the data you will be starting with is structured data (even if it’s not perfectly structured) that you will need to turn into something else. Regular expressions help you find patterns which you can use to break apart strings into something else. Spending a few minutes understanding regular expression syntax will save hours of inefficient find and replace. There are many tutorials–my go-to source is this one. The good news for librarians is that if you can construct a Dewey Decimal call number, you can construct a regular expression!
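To make the idea of pattern matching concrete, here is a minimal sketch in Python (not GREL) of a regular expression that strips a "Department of" style prefix from the kinds of values described earlier. The pattern and the function name are my own assumptions for illustration, not something taken from OpenRefine.

```python
import re

# Hypothetical pattern for this illustration: matches "Department of",
# "Dept. of", "Dep of", and similar prefixes at the start of a value.
DEPT_PREFIX = re.compile(r"^\s*Dep(?:artmen)?t?\.?\s+of\s+", re.IGNORECASE)


def strip_department_prefix(value: str) -> str:
    """Remove a leading 'Department of'-style prefix and tidy whitespace."""
    without_prefix = DEPT_PREFIX.sub("", value)
    # Collapse runs of spaces left over from messy data entry.
    return re.sub(r"\s+", " ", without_prefix).strip()


samples = ["Department of History", "Dep. of  History", "Dep of Hist.", "Dept. of English"]
cleaned = [strip_department_prefix(s) for s in samples]
print(cleaned)  # ['History', 'History', 'Hist.', 'English']
```

Note that "Hist." survives untouched: a pattern only catches the variations you thought to write into it, which is why a tool that lets you see and merge the leftovers is so helpful.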
Some ideas for librarians
Above I described how you would use OpenRefine to clean up messy and inconsistent catalog data. Here’s how to do it. Load in the data, and select “Text Facet” on the column in question. The facet lists every distinct value in the column, so values that are similar and probably the same thing are easy to spot.
Click on Cluster to get a menu for working with multiple values. You can click on the “Merge” check box and then edit the text to whatever you need it to be. You can also edit each text cluster to be the correct text.
You can merge and re-cluster until you have fixed all the typos. Back on the first Text Facet, you can hover over any value to edit it. That way even if the automatic clustering misses some you can edit the errors, or change anything that is the same but you need to look different–for instance, change “Dept. of English” to just “English”.
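If you are curious what clustering is doing, the following Python sketch shows the general idea behind a "key collision" approach: reduce each value to a normalized key and group values whose keys match. This is a simplified illustration under my own assumptions, not OpenRefine's actual implementation, and the function names are hypothetical.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Simplified key: lowercase, drop ASCII punctuation, sort unique words."""
    lowered = value.strip().lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(no_punct.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values by fingerprint; keep only groups with more than one spelling."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


values = [
    "Department of History",
    "department of history.",
    "History,  Department of",
    "Dep. of History",
    "Dep of History",
    "Dep of Hist.",
]
clusters = cluster(values)
for key, members in clusters.items():
    print(key, "->", members)
```

In this sketch the punctuation, capitalization, spacing, and word-order variants fall into two clusters, but "Dep of Hist." is left on its own because its abbreviation produces a different key. That is the kind of value you would still fix by hand in the facet.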
The main thing that I have used OpenRefine for in my daily work is to change a bibliography in plain text into columns in a spreadsheet that I can run against an API. This was inspired by this article in the Code4Lib Journal: “Using XSLT and Google Scripts to Streamline Populating an Institutional Repository” by Stephen X. Flynn, Catalina Oyler, and Marsha Miles. I wanted to find a way to turn a text CV into something that would work with the SHERPA/RoMEO API, so that I could find out which past faculty publications could be posted in the institutional repository. Since CVs are lists of data presented in a structured format but with some inconsistencies, OpenRefine makes it very easy to present the data in a certain way as well as remove the inconsistencies, and then to extend the data with a web service. This is a very basic set of instructions for how to accomplish this.
The main thing to accomplish is to put the journal title in its own column. Here’s an example citation in APA format. The “separator” punctuation (the parentheses, periods, and commas between elements) marks where each column should be split:
Heller, M. (2011). A Review of “Strategic Planning for Social Media in Libraries”. Journal of Electronic Resources Librarianship, 24 (4), 339-340.
From the drop-down menu at the top of the column, open the “Edit Column” menu and click on “Split into several columns…”. You will get a menu asking for the separator to split on. This example finds the opening parenthesis and removes it in creating a new column. The author’s name is in its own column, and the rest of the text is in another column.
The rest of the column works the same way–find the next text, punctuation, or spacing that indicates a separation. You can then rename each column to something that makes sense. In the end, each element of the citation (author, year, article title, journal title, volume, issue, and pages) will be in its own column.
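The same step-by-step splitting can be sketched outside OpenRefine. The Python below is only an illustration that assumes every citation follows the exact punctuation of the example; the function name and column names are my own, and real CVs will need cleanup first.

```python
def split_citation(citation: str) -> dict:
    """Peel off one field at a time at each separator, left to right."""
    author, _, rest = citation.partition(" (")    # opening parenthesis before the year
    year, _, rest = rest.partition("). ")         # closing parenthesis after the year
    title, _, rest = rest.partition(". ")         # period ending the article title
    journal, _, rest = rest.partition(", ")       # comma after the journal title
    volume, _, rest = rest.partition(" (")        # parenthesis opening the issue
    issue, _, rest = rest.partition("), ")        # parenthesis closing the issue
    pages = rest.rstrip(".) ")                    # drop trailing punctuation
    return {
        "author": author,
        "year": year,
        "title": title,
        "journal": journal,
        "volume": volume,
        "issue": issue,
        "pages": pages,
    }


citation = ('Heller, M. (2011). A Review of "Strategic Planning for Social Media '
            'in Libraries". Journal of Electronic Resources Librarianship, 24 (4), 339-340.')
row = split_citation(citation)
print(row["journal"])  # Journal of Electronic Resources Librarianship
```

Each `partition` call mirrors one "Split into several columns" step: pick the next separator, keep what is to the left as a column, and carry on with what remains.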
When you have the journal titles separate, you may want to cluster the text to make sure that the journals have consistent titles and to clean up anything else in them. Now you are ready to build on this data by fetching data from a web service. The third video tutorial posted above will explain the basic idea, and this tutorial is also helpful. Use the pull-down menu at the top of the journal column to select “Edit column” and then “Add column by fetching URLs…”. You will get a box that will help you construct the right URL. You need to format your URL in the way required by SHERPA/RoMEO, and will need a free API key. For the purposes of this example, you can use 'http://www.sherpa.ac.uk/romeo/api29.php?ak=[YOUR API KEY HERE]&qtype=starts&jtitle=' + escape(value,'url'). Note that it will give you a preview to see if the URL is formatted in the way you expect. Give your column a name, and set the Throttle delay (in milliseconds), which spaces out your requests so that the service does not reject them for arriving too quickly. I found that 1000 worked fine.
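For readers who want to see what that expression produces, here is a rough Python equivalent that only builds the URL and makes no request. The endpoint and parameters are the ones quoted above; the function name is hypothetical, and `quote_plus` is my assumption for approximating what `escape(value,'url')` does.

```python
from urllib.parse import quote_plus

# Endpoint as quoted in the expression above.
BASE_URL = "http://www.sherpa.ac.uk/romeo/api29.php"


def build_romeo_url(journal_title: str, api_key: str) -> str:
    """Build a 'title starts with' query URL for one journal title."""
    # URL-encode the title so spaces and ampersands do not break the query string.
    return f"{BASE_URL}?ak={api_key}&qtype=starts&jtitle={quote_plus(journal_title)}"


url = build_romeo_url("College & Research Libraries", "KEY")
print(url)
```

Encoding matters most for titles containing characters such as `&`, which would otherwise be read as the start of a new parameter.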
After the fetch runs in OpenRefine, you will get a new column with the XML returned by SHERPA/RoMEO. You can use this to pull out anything you need, but for this example I want to get pre-archiving and post-archiving policies, as well as the conditions. A quick way to do this is to use the Google Refine Expression Language parseHtml function. To use this, click on “Add column based on this column” from the “Edit Column” menu, and you will get a menu to fill in an expression.
In this example I use the code value.parseHtml().select("prearchiving").htmlText(), which selects just the text from within the prearchiving element. Conditions are a little different, since there are multiple conditions for each journal. In that case, you would loop over every condition element with forEach, take the htmlText() of each one, and combine the results with join (after join you can put whatever separator you want).
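As a rough analogue in Python, the sketch below pulls the same pieces out of an XML response using the standard library. The sample XML is a made-up, heavily simplified stand-in for a real SHERPA/RoMEO response; only the `prearchiving` element name comes from the expression above, and the other element names and the function name are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified response for illustration only.
SAMPLE_XML = """<romeoapi>
  <publishers>
    <publisher>
      <preprints><prearchiving>can</prearchiving></preprints>
      <postprints><postarchiving>restricted</postarchiving></postprints>
      <conditions>
        <condition>Publisher PDF cannot be used</condition>
        <condition>Must link to publisher version</condition>
      </conditions>
    </publisher>
  </publishers>
</romeoapi>"""


def summarize_policy(xml_text: str, separator: str = "; ") -> dict:
    """Extract single-valued policies and join the repeated conditions."""
    root = ET.fromstring(xml_text)
    pre = root.findtext(".//prearchiving", default="")
    post = root.findtext(".//postarchiving", default="")
    # There can be several condition elements, so collect them all, then join.
    conditions = [c.text.strip() for c in root.iter("condition") if c.text]
    return {
        "prearchiving": pre,
        "postarchiving": post,
        "conditions": separator.join(conditions),
    }


policy = summarize_policy(SAMPLE_XML)
print(policy["conditions"])
```

The single-valued elements need only a lookup, while the repeated element needs the collect-then-join pattern, which is the same distinction the two GREL expressions make.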
So in the end, you will have a neatly structured spreadsheet from your original CV with each piece of bibliographic information in its own column and the publisher conditions listed. You can imagine the possibilities for additional APIs to use–for instance, the WorldCat API could help you determine which faculty-published books the library owns.
Once you find a set of actions that gets your desired result, you can save them for the future or to share with others. Click on Undo/Redo and then the Extract option. You will get a description of the actions you took, plus those actions represented in JSON.
Unselect the checkboxes next to any mistakes you made, and then copy and paste the text somewhere you can find it again. I have the full JSON for the example above in a Gist here. Make sure that if you save your JSON publicly you remove your personal API key! When you want to run the same recipe in the future, click on the Undo/Redo tab and then choose Apply. It will run through the steps for you. Note that if you have a mistake in your data you won’t catch it until it’s all finished, so make sure that you check the formatting of the data before running this script.
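If you plan to share recipes regularly, scrubbing the key by hand gets tedious. Here is a small Python sketch that replaces anything following `ak=` in the extracted JSON with a placeholder. The sample history is a simplified, hypothetical example of what an extract might look like, and the function only checks top-level string fields of each operation, so treat it as a starting point and still read the result before publishing.

```python
import json
import re

# Matches the value of an "ak=" query parameter up to the next & or quote.
API_KEY_PATTERN = re.compile(r"(ak=)[^&'\"]+")


def redact_api_key(history_json: str, placeholder: str = "[YOUR API KEY HERE]") -> str:
    """Return the operation history with any ak=... value replaced."""
    operations = json.loads(history_json)
    for op in operations:
        for field, val in op.items():
            if isinstance(val, str):
                op[field] = API_KEY_PATTERN.sub(lambda m: m.group(1) + placeholder, val)
    return json.dumps(operations, indent=2)


# Simplified, hypothetical extract containing a fake key.
sample = json.dumps([{
    "op": "core/column-addition-by-fetching-urls",
    "urlExpression": "grel:'http://www.sherpa.ac.uk/romeo/api29.php?ak=SECRET123"
                     "&qtype=starts&jtitle=' + escape(value,'url')",
    "delay": 1000,
}])
cleaned = redact_api_key(sample)
print(cleaned)
```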
Learning More and Giving Back
Hopefully this quick tutorial got you excited about OpenRefine and thinking about what you can do. I encourage you to read through the list of External Resources to get additional ideas, some of which are library related. There is lots more to learn and lots of recipes you can create to share with the library community.
Have you used OpenRefine? Share how you’ve used it, and post your recipes.
- OpenRefine on Twitter. If you post something about OpenRefine on Twitter, they will usually retweet you as a way to showcase what people are doing.
- OpenRefine Google Group
A few months ago as part of a discussion on open peer review, I described the early stages of planning for a new type of journal, called PeerJ. Last month on February 12 PeerJ launched with its first 30 articles. By last week, the journal had published 53 articles. There are a number of remarkable attributes of the journal so far, so in this post I want to look at what PeerJ is actually doing, and some lessons that academic libraries can take away–particularly for those who are getting into publishing.
What PeerJ is Doing
On the opening day blog post (since there are no editorials or issues in PeerJ, communication from the editors has to be done via blog post 1), the PeerJ team outlined their mission under four headings: to make their content open and help to make that standard practice, to practice constant innovation, to “serve academia”, and to make this happen at minimal cost to researchers and no cost to the public. The list of advisory board and academic editors is impressive–it is global and diverse, and includes some big names and Nobel laureates. To someone judging the quality of the work likely to be published, this is a good indication. The members of PeerJ range in disciplines, with the majority in Molecular Biology. To submit and/or publish work requires a fee, but there is a free plan that allows one pre-print to be posted on the forthcoming PeerJ PrePrints.
PeerJ’s publication methods are based on PLoS ONE, which publishes articles based on objective scientific and methodological soundness, with no emphasis placed on subjective measures of novelty or interest (see more on this). As with all peer-reviewed journals, articles are sent to an academic editor in the field, who then sends the article to peer reviewers. Everything is kept confidential until the article actually is published, but authors are free to talk about their work in other venues like blogs.
Look and Feel
There are several striking dissimilarities between PeerJ and standard academic journals. The home page of the journal emphasizes striking visuals and is responsive to devices, so the large image scales to a small screen for easy reading. The “timeline” display emphasizes new and interesting content. 2 The code they used to make this all happen is available openly on the PeerJ Github account. The design of the page reflects best practices for non-profit web design, as described by the non-profit social media guide Nonprofit Tech 2.0. The page tells a story, makes it easy to get updates, works on all devices, and integrates social media. The design of the page has changed iteratively even in the first month to reflect the realities of what was actually being published and how people were accessing it. 3 PDFs of articles were designed to be readable on screens, especially tablets, so rather than trying to fit as much text as possible on one page as many PDFs are designed, they have single columns with left margins, fewer words per line, and references hyperlinked in the text. 4
How Open Peer Review Works
One of the most notable features of PeerJ is open peer review. This is not mandatory, but approximately half the reviewers and authors have chosen to participate. 5 This article is an example of open peer review in practice. You can read the original article, the (in this case anonymous) reviewer’s comments, the editor’s comments, and the author’s rebuttal letter. Anyone who has submitted an article to a peer-reviewed journal before will recognize this structure, but if you have not, this might be an exciting glimpse of something you have never seen before. As a non-scientist, I personally find this more useful as a didactic tool to show the peer review process in action, but I can imagine how helpful it would be to see this process for articles about areas of library science in which I am knowledgeable.
With only 53 articles and in existence for such a short time, it is difficult to measure what impact open peer review has on articles, or to generalize about which authors and reviewers choose an open process. So far, however, PeerJ reports that several authors have been very positive about their experience publishing with the journal. The speed of review is very fast, and reviewers have been constructive and kind in their language. One author goes into more detail in his original post, “One of the reviewers even signed his real name. Now, I’m not totally sure why they were so nice to me. They were obvious experts in the system that I studied …. But they were nice, which was refreshing and encouraging.” He also points out that the exciting thing about PeerJ for him is that all it requires are projects that were technically well-executed and carefully described, so that this encourages publication of negative or unexpected results, thus avoiding the file drawer effect.6
This last point is perhaps the most important to note. We often talk of peer-reviewed articles as being particularly significant and “high-impact.” But in the case of PeerJ, the impact is not necessarily due to the results of the research or the type of research, but to the fact that it was well done. One great example of this is the article “Significant Changes in the Skin Microbiome Mediated by the Sport of Roller Derby”. 7 This was a study about the transfer of bacteria during roller derby matches, and the study was able to prove its hypothesis that contact sports are a good environment in which to study movements of bacteria among people. The (very humorous) review history indicates that the reviewers were positive about the article, and felt that it had promise for setting a research paradigm. (Incidentally, one of the reviewers remained anonymous, since he/she felt that this could “[free] junior researchers to openly and honestly critique works by senior researchers in their field,” and signed the letter “Diligent but human postdoc reviewer”.) This article was published at the beginning of March, has already had 2,307 unique visits to the page, and has been shared widely on social media. We can assume that one of the motivations for sharing this article was the potential for roller derby jokes or similar, but will this ultimately make the article’s long-term impact stronger? This will be something to watch.
What Can Academic Libraries Learn?
A recent article in In the Library with the Lead Pipe discussed the open ethos in two library publications, In the Library with the Lead Pipe and Code4Lib Journal. 8 This article concluded that more LIS publications need to open the peer review process, though the publications mentioned are not peer reviewed in the traditional sense. There are very few, if any, open peer reviewed publications in the nature of PeerJ outside of the sciences. Could libraries or library-related publications match this process? Would they want to?
I think we can learn a few things from PeerJ. First, the rapid publication cycle means that more work is getting published more quickly. This is partly because they have so many reviewers that no one reviewer is overburdened–and due to their membership model, it is in the best financial interests of potential future authors to be current reviewers. In the Library with the Lead Pipe points out that a central academic library journal, College & Research Libraries, is now open access and that early content is available as pre-prints, but the pre-prints reflect content that will in some cases be published well over a year from now. A year is a long time to wait, particularly for work that looks at current technology. Information Technology and Libraries (ITAL), the LITA journal, is also open access and provides pre-prints as well–but this page appears to be out of date.
Another thing we can learn is how to make reading easier and more convenient while still maintaining a professional appearance and clean visuals. Blogs like ACRL Tech Connect and In the Library with the Lead Pipe deliver quality content fairly quickly, but look like blogs. Journals like the Journal of Librarianship and Scholarly Communication have a faster turnaround time for review and publication (though it could still take several months), but even this online journal is geared for a print world. Viewing the article requires downloading a PDF with text presented in two columns–hardly the ideal online reading experience. In these cases, the publication is somewhat at the mercy of the platform (WordPress in the former, BePress Digital Commons in the latter), but as libraries become publishers, they will have to develop platforms that meet the needs of modern researchers.
A question put to the ACRL Tech Connect contributors about preferred reading methods for articles suggests that there is no one right answer, and so the safest course is to release content in a variety of formats or make it flexible enough for readers to transform to a preferred format. A new journal to watch is Weave: Journal of Library User Experience, which will use the Digital Commons platform but present content in innovative ways. 9 Any libraries starting new journals or working with their campuses to create new journals should be aware of who their readers are and make sure that the solutions they choose work for those readers.
- “The Launch of PeerJ – PeerJ Blog.” Accessed February 19, 2013. http://blog.peerj.com/post/42920112598/launch-of-peerj. ↩
- “Some of the Innovations of the PeerJ Publication Platform – PeerJ Blog.” Accessed February 19, 2013. http://blog.peerj.com/post/42920094844/peerj-functionality. ↩
- http://blog.peerj.com/post/45264465544/evolution-of-timeline-design-at-peerj ↩
- “The Thinking Behind the Design of PeerJ’s PDFs.” Accessed March 18, 2013. http://blog.peerj.com/post/43558508113/the-thinking-behind-the-design-of-peerjs-pdfs. ↩
- http://blog.peerj.com/post/43139131280/the-reception-to-peerjs-open-peer-review ↩
- “PeerJ Delivers: The Review Process.” Accessed March 18, 2013. http://edaphics.blogspot.co.uk/2013/02/peerj-delivers-review-process.html. ↩
- Meadow, James F., Ashley C. Bateman, Keith M. Herkert, Timothy K. O’Connor, and Jessica L. Green. “Significant Changes in the Skin Microbiome Mediated by the Sport of Roller Derby.” PeerJ 1 (March 12, 2013): e53. doi:10.7717/peerj.53. ↩
- Ford, Emily, and Carol Bean. “Open Ethos Publishing at Code4Lib Journal and In the Library with the Lead Pipe.” In the Library with the Lead Pipe (December 12, 2012). http://www.inthelibrarywiththeleadpipe.org/2012/open-ethos-publishing/. ↩
- Personal communication with Matthew Reidsma, March 19, 2013. ↩
Disclaimer: I was on the planning committee for Code4Lib 2013, but this is my own opinion and does not reflect other organizers of the conference.
We have mentioned Code4Lib before on this blog, but for those who are unfamiliar, it is a loose collective of programmers working in libraries, librarians, and others interested in code and libraries. (You can read more about it on the website.) The Code4Lib conference has emerged as a venue to share very new technology and have discussions with a wide variety of people who might not attend conferences more geared to librarians. Presentations at the conference are decided by the votes of anyone interested in selecting the program, and additionally lightning talks and breakout sessions allow wide participation and exposure to extremely new projects that have not made it into the literature or to conferences with a longer lead time. The Code4Lib 2013 conference ran February 11-14 at University of Illinois Chicago. You can see a list of all programs here, which includes links to the video archive of the conference.
While there were many types of projects presented, I want to focus on those talks which illustrated what I saw as a thread running through the conference–care and emotion. This is perhaps unexpected for a technical conference. Yet those themes underlie a great deal of the work that takes place in academic library technology and the types of projects presented at Code4Lib. We tend to work in academic libraries because we care about the collections and the people using those collections. That intrinsic motivation focuses our work.
Caring about the best way to display collections is central to successful projects. Most (though not all) of the presenters and topics came out of academic libraries, and many of the presentations dealt with creating platforms for library and archival metadata and collections. To highlight a few: Penn State University has developed their own institutional repository application called ScholarSphere that provides a better user experience for researchers and managers of the repository. The libraries and archives of the Rock and Roll Hall of Fame dealt with the increasingly common problem of wanting to present digital content alongside more traditional finding aids, and so developed a system for doing so. Corey Harper from New York University presented an extremely interesting and still experimental project to use linked data to enrich interfaces for interacting with library collections. Note that all these projects combined various pieces of open source software and library/web standards to create solutions to a problem facing academic or research libraries in a particular setting. I think an important lesson for most academic librarians looking at descriptions of projects like this is that it takes more than development staff to make them happen. It takes purpose, vision, and dedication to collecting and preserving content–in other words, emotion and care. A great example of this was the presentation about DIYHistory from the University of Iowa. This project started out as an extremely low-tech solution for crowdsourcing archival transcription, but got so popular that it required a more robust solution. They were able to adapt open source tools to meet their needs, still keeping the project well within the means of most libraries (the code is here).
Another view of emotion and care came from Mark Matienzo, who did a lightning talk (his blog post gives a longer version with more details). His talk discussed the difficulties of acknowledging and dealing with the emotional content of archives, even though emotion drives interactions with materials and collections. The records provided are emotionless and affectless, despite the fact that they represent important moments in history and lives. The type of sharing of what someone “likes” on Facebook does not satisfactorily answer the question of what they care about, or represent the emotion in their lives. Mark suggested that a tool like Twine, which allows writing interactive stories, could approach the difficult question of bringing together the real with the emotional narrative that makes up experience.
One of the ways we express care for our work and for our colleagues is by taking time to be organized and consistent in code. Naomi Dushay of Stanford University Library presented best practices for code handoffs, which described some excellent practices for documenting and clarifying code and processes. One of the major takeaways is that being clear, concise, and straightforward is always preferable, even as much as we want to create cute names for our servers and classes. To preserve a spirit of fun, you can use the cute name and attach a description of what the item actually does.
Originally Bess Sadler, also from Stanford, was going to present with Naomi, but ended up presenting a different talk, the last one of the conference, on Creating a Commons (the full text is available here). This was a very moving look at what motivates her to create open source software and how to create better open source software projects. She used the framework of the Creative Commons licenses to discuss open source software–that it needs to be “[m]achine readable, human readable, and lawyer readable.” Machine readable means that code needs to be properly structured and allow for contributions from multiple people without breaking. Lawyer readable means that the project should have the correct structure and licensing to collaborate across institutions. Bess focused particularly on the “human readable” aspect of creating communities and understanding the “hacker epistemology”: as she so eloquently put it, “[t]he truth is what works.” Part of understanding that requires being willing to reshape default expectations–for instance, the Code4Lib community developed a Code of Conduct at Bess’s urging to underline the fact that the community aims at inclusion and creating a safe space. She encouraged everyone to keep working to do better and “file bug reports” about open source communities.
This year’s Code4Lib conference was a reminder to me about why I do the work I do as an academic librarian working in a technical role. Even though I may spend a lot of time sitting in front of a computer looking at code, or workflows, or processes, I know it makes access to the collections and exploration of those collections better.