Last month we got the long-awaited ruling in favor of Google in the Authors Guild vs. Google Books case, which by now has been analyzed extensively. Ultimately the judge in the case decided that Google’s digitization was transformative and thus constituted fair use. See InfoDocket for detailed coverage of the decision.
The Google Books project was part of the Google mission to index all the information available, and as such could never have taken place without libraries, which hold all those books. While most, if not all, the librarians I know use Google Books in their work, there has always been a sense that the project should not have been started by a commercial enterprise using the intellectual resources of libraries, but should have been started by libraries themselves working together. Yet libraries are often forced to be more conservative about digitization than we might otherwise be due to rules designed to protect the college or university from litigation. This ruling has made it seem as though we could afford to be less cautious. As Eric Hellman points out, the decision seems to imply that with copyright the ends are the important part, not the means. “In Judge Chin’s analysis, copyright is concerned only with the ends, not the means. Copyright seems not to be concerned with what happens inside the black box.” 1 As long as the end use of the books was fair, which was deemed to be the case, the initial digitization was not a problem.
Looking at this from the perspective of repository manager, I want to address a few of the theoretical and logistical issues behind such a conclusion for libraries.
What does this mean for digitization at libraries?
At the beginning of 2013 I took over an ongoing digitization project, and as a first-time manager of a large-scale long-term project, I learned a lot about the processes involved in such a project. The project I work with is extremely small-scale compared with many such projects, but even at this scale the project is expensive and time-consuming. What makes it worth it is that long-buried works of scholarship are finally being used and read, sometimes for reasons we do not quite understand. That gets at the heart of the Google Books decision—digitizing books in library stacks and making them more widely available does contribute to education and useful arts.
There are many issues that we need to address, however. Some of the most important ones are what access can and should be provided to what works, and making mass digitization more available to smaller and international cultural heritage institutions. Google Books could succeed because it had the financial and computing resources of Google matched with the cultural resources of the participating research libraries. This problem is international in scope. I encourage you to read this essay by Amelia Sanz, in which she argues that digitization efforts so far have been inherently unequal and a reflection of colonialism. 2 But is there a practical way of approaching this desire to make books available to a wider audience?
There are several separate issues in providing access. Books that are in the public domain are unquestionably fine to digitize, though differences in international copyright law make it difficult to determine what can be provided to whom. As Amelia Sanz points out, Google can only digitize Spanish works prior to 1870 in Spain, but may digitize the complete work in the United States. The complete work is not available to Spanish researchers, but it is available in full to US researchers.
That aside, there are several reasons why it is useful to digitize works still unquestionably under copyright. One of the major reasons is textual corpus analysis–you need to have every word of many texts available to draw conclusions about use of words and phrases in those texts. Google Books ngram viewer is one such tool that comes out of mass digitization. Searching for phrases in Google and finding that phrase as a snippet in a book is an important way to find information in books that might otherwise be ignored in favor of online sources. Some argue that this means that those books will not be purchased when they might have otherwise been, but it is equally possible that this leads to greater discovery and more purchases, which research into music piracy suggests may be the case.
Another reason to digitize works still under copyright is to highlight the work of marginalized communities, though in that case it is imperative to work with those communities to ensure that the digitization is not exploitative. Many orphan works, for whom a rights-holder cannot be located, fall under this, and I know from some volunteer work that I have done that small cultural heritage institutions are eager to digitize material that represents the cultural and intellectual output of their communities.
In all the above cases, it is crucial to put into place mechanisms for ensuring that works under copyright are not abused. Google Books uses an algorithm that makes it impossible to read an entire book, which is probably beyond the abilities of most institutions. (If anyone has an idea for how to do this, I would love to hear it.) Simpler and more practical solutions to limiting access are to only make a chapter or sample of a book available for public use, which many publishers already allow. For instance, Oxford University Press allows up to 10% of a work (within certain limits) on personal websites or institutional repositories. (That is, of course, assuming you can get permission from the author). Many institutions maintain “dark archives“, which are digitized and (usually) indexed archives of material inaccessible to the public, whether institutional or research information. For instance, the US Department of Energy Office of Scientific and Technical Information maintains a dark archive index of technical reports comprising the equivalent of 6 million pages, which makes it possible to quickly find relevant information.
In any case where an institution makes the decision to digitize and make available the full text of in-copyright materials for reasons they determine are valid, there are a few additional steps that institutions should take. Institutions should research rights-holders or at least make it widely known to potential rights-holders that a project is taking place. The Orphan Works project at the University of Michigan is an example of such a project, though it has been fraught with controversy. Another important step is to have a very good policy for taking down material when a rights-holder asks–it should be clear to the rights-holder whether any copies of the work will be maintained and for what purposes (for instance archival or textual analysis purposes).
Digitizing, Curating, Storing, Oh My!
The above considerations are only useful when it is even possible for institutions without the resources of Google to start a digitization program. There are many examples of DIY digitization by individuals, for instance see Public Collectors, which is a listing of collections held by individuals open for public access–much of it digitized by passionate individuals. Marc Fischer, the curator of Public Collectors, also digitizes important and obscure works and posts them on his site, which he funds himself. Realistically, the entire internet contains examples of digitization of various kinds and various legal statuses. Most of this takes place on cheap and widely available equipment such as flatbed scanners. But it is possible to build an overhead book scanner for large-scale digitization with individual parts and at a reasonable cost. For instance, the DIY Book Scanning project provides instructions and free software for creating a book scanner. As they say on the site, all the process involves is to “[p]oint a camera at a book and take pictures of each page. You might build a special rig to do it. Process those pictures with our free programs. Enjoy reading on the device of your choice.”
“Processing the pictures” is a key problem to solve. Turning images into PDF documents is one thing, but providing high quality optical character recognition is extremely challenging. Free tools such as FreeOCR make it possible to do OCR from image or PDF files, but this takes processing power and results vary widely, particularly if the scan quality is lower. Even expensive tools like Adobe Acrobat or ABBYY FineReader have the same problems. Karen Coyle points out that uncorrected OCR text may be sufficient for searching and corpus analysis, but does not provide a faithful reproduction of the text and thus, for instance, provide access to visually impaired persons 3 This is a problem well known in the digital humanities world, and one solved by projects such as Project Gutenberg with the help of dedicated volunteer distributed proofreaders. Additionally, a great deal of material clearly in the public domain is in manuscript form or has text that modern OCR cannot recognize. In that case, crowdsourcing transcriptions is the only financially viable way for institutions to make text of the material available. 4 Examples of successful projects using volunteer transcriptors or proofreaders include Ancient Lives to transcribe ancient papyrus, What’s on the Menu at the New York Public Library, and DIYHistory at the University of Iowa libraries. (The latter has provided step by step instructions for building your own version using open source tools).
So now you’ve built your low-cost DIY book scanner, and put together a suite of open source tools to help you process your collections for free. Now what? The whole landscape of storing and preserving digital files is far beyond the scope of this post, but the cost of accomplishing this is probably the highest of anything other than staffing a digitization project, and it is here where Google clearly has the advantage. The Internet Archive is a potential solution to storing public domain texts (though they are not immune to disaster), but if you are making in-copyright works available in any capacity you will most likely have to take the risk on your own servers. I am not a lawyer, but I have never rented server space that would allow copyrighted materials to be posted.
Conclusion: Is it Worth It?
Obviously from this post I am in favor of taking on digitization projects of both public domain and copyrighted materials when the motivations are good and the policies are well thought out. From this perspective, I think the Google Books decision was a good thing for libraries and for providing greater access to library collections. Libraries should be smart about what types of materials to digitize, but there are more possibilities for large-scale digitization, and by providing more access, the research community can determine what is useful to them.
If you have managed a DIY book scanning project, please let me know in the comments, and I can add links to your project.
- Hellman, Eric. “Google Books and Black-Box Copyright Jurisprudence.” Go To Hellman, November 18, 2013. http://go-to-hellman.blogspot.com/2013/11/google-books-and-black-box-copyright.html. ↩
- Sanz, Amelia. “Digital Humanities or Hypercolonial Studies?” Responsible Innovation in ICT (June 26, 2013). http://responsible-innovation.org.uk/torrii/resource-detail/1249#_ftnref13. ↩
- Coyle, Karen. “It’s FAIR!” Coyle’s InFormation, November 14, 2013. http://kcoyle.blogspot.com/2013/11/its-fair.html. ↩
- For more on this, see Ben Brumfield’s work on crowdsourced transcription, for example Brumfield, Ben W. “Collaborative Manuscript Transcription: ‘The Landscape of Crowdsourcing and Transcription’ at Duke University.” Collaborative Manuscript Transcription, November 23, 2013. http://manuscripttranscription.blogspot.com/2013/11/the-landscape-of-crowdsourcing-and.html. ↩