What is Node.js & why do I care?

At its simplest, Node.js is server-side JavaScript. JavaScript is a popular programming language, but it almost always runs inside a web browser. So JavaScript can, for instance, manipulate the contents of this page by being included inside <script> tags, but it doesn’t get to play around with the files on our computer or tell our server what HTML to send like the PHP that runs this WordPress blog.

Node is interesting for more than just being on the server side. It provides a new way of writing web servers while using an old UNIX philosophy. Hopefully, by the end of this post, you’ll see its potential and how it differentiates itself from other programming environments and web frameworks.

Hello, World

To start, let’s do some basic Node programming. Head over to nodejs.org and click Install.1 Once you’ve run the installer, a node executable will be available for you on the command line. Any script you pass to node will be interpreted and the results displayed. Let’s do the classic “hello world” example. Create a new file in a text editor, name it hello.js, and put the following on its only line:

console.log('Hello, world!');

If you’ve written JavaScript before, you may recognize this already. console.log is a common debugging method which prints strings to your browser’s JavaScript console. In Node, console.log will output to your terminal. To see that, open up a terminal (on Mac, you can use Terminal.app while on Windows both cmd.exe and PowerShell will work) and navigate to the folder where you put hello.js. Your terminal will likely open in your user’s home folder; you can change directories by typing cd followed by a space and the subdirectory you want to go inside. For instance, if I started at “C:\users\me” I could run cd Documents to enter “C:\users\me\Documents”. Below, we open a terminal, cd into the Documents folder, and run our script to see its results.

$ cd Documents
$ node hello.js
Hello, world!

That’s great and all, but it leaves a lot to be desired. Let’s do something a little more sophisticated; let’s write a web server which responds “Hello!” to any request sent to it. Open a new file up, name it server.js, and write this inside:

var http = require('http');
http.createServer(handleRequest).listen(8888);
function handleRequest (request, response) {
  response.end( 'Hello!' );
}

In our terminal, we can run node server.js and…nothing happens. Our prompt seems to hang, not outputting anything but also not letting us type another command. What gives? Well, Node is running a web server and it's waiting for requests. Open up your web browser and navigate to "localhost:8888"; the exclamation "Hello!" should appear. In just a few lines of code, we wrote an HTTP server. Sure, it's the world's dumbest server that only says "Hello!" over and over no matter what we request from it, but it's still an achievement. If you're the sort of person who gets giddy at how easy this was, then Node.js is for you.

Let's walk through server.js line-by-line. First, we import the core HTTP library that comes with Node. The "require" function is a way of loading external modules into your script, similar to require in Ruby or import in Python. The HTTP library gives us a handy "createServer" method which receives HTTP requests and passes them along to a callback function. On the second line, we call createServer, pass it the function we want to handle incoming requests, and set it to listen for requests sent to port 8888. The choice of 8888 is arbitrary; we could choose almost any number above 1024, since operating systems restrict the lower ports, which are reserved for well-known protocols. Finally, we define our handleRequest callback which will receive a request and response object for each HTTP request. Those objects have many useful properties and methods, but here we simply call the response object's end method, which finishes the response and optionally accepts some data to include in it.
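To get a feel for those objects, here is a slightly expanded, hedged variant of server.js (a sketch, not the canonical example); request.url, request.method, and response.writeHead are all part of the core HTTP module.

var http = require('http');

http.createServer(handleRequest).listen(8888);

function handleRequest (request, response) {
  // writeHead sets the HTTP status code and any response headers.
  response.writeHead(200, { 'Content-Type': 'text/plain' });
  // Echo back a couple of the request object's properties.
  response.end('You sent a ' + request.method + ' request for ' + request.url + '\n');
}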

The use of callback functions is very common in Node. If you've written JavaScript for a web browser you may recognize this style of programming; it's the same as when you define an event listener which responds to mouse clicks, or assign a function to process the result of an AJAX request. The callback function doesn't execute synchronously in the order you wrote it in your code; it waits for some "event" to occur, whether that event is a click or an AJAX request returning data.
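Here is a tiny, runnable illustration of that ordering (the browser analogue would be registering a click handler with addEventListener and moving on until the click arrives):

// The callback is handed to setTimeout and only runs later, when the timer fires.
setTimeout(function () {
  console.log('One second later');
}, 1000);

// This line runs immediately, before the callback above.
console.log('This prints first.');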

In our HTTP server example, we also see a bit of what makes Node different from other server-side languages like PHP, Perl, Python, and Ruby. Those languages typically work with a web server, such as Apache, which passes certain requests over to the language and serves up whatever it returns. Node is a server: it gives you low-level access to the inner workings of protocols like HTTP and TCP. You don't need to run Apache and have requests sent to Node; it handles them on its own.

Who cares?

Some of you are no doubt wondering: what exactly is the big deal? Why am I reading about this? Surely, the world has enough programming languages; JavaScript is nothing new, and even server-side JavaScript isn't that new.2 There are already plenty of web servers out there. What need does Node.js fill?

To answer that, we must revisit the origins of Node. The best way to understand is to watch Ryan Dahl present on the impetus for creating Node. He says, essentially, that other programming frameworks are doing IO (input/output) wrong. IO comes in many forms: when you're reading or writing to a file, when you're querying databases, and when you're receiving and sending HTTP requests. In all of these situations, your code asks for data…waits…and waits…and then, once it has the data, it manipulates it or performs some calculation, and then sends it somewhere else…and waits…and waits. Basically, because the code is constantly waiting on some IO operation, it spends most of its time sitting around rather than crunching numbers like it wants to. IO operations are commonly the bottlenecks in programs, so we shouldn't let our code just stop every time it performs one.
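Node's answer is to make IO non-blocking: you hand the runtime a callback and your code carries on while the data is fetched. As a rough sketch (it assumes a file named spreadsheet.csv sits next to the script), compare a blocking read with the asynchronous equivalent:

var fs = require('fs');

// Blocking: nothing else happens until the entire file has been read.
var data = fs.readFileSync('spreadsheet.csv', 'utf8');
console.log('Read ' + data.length + ' characters');

// Non-blocking: Node keeps going and calls back once the data is ready.
fs.readFile('spreadsheet.csv', 'utf8', function (err, contents) {
  if (err) throw err;
  console.log('Read ' + contents.length + ' characters');
});
console.log('This line prints before the asynchronous read finishes.');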

Node not only has a beneficial asynchronous programming model but it has developed other advantages as well. Because lots of people already know how to write JavaScript, Node's adoption has been much quicker than that of languages which are entirely new to developers. It reuses Google Chrome's V8 engine as its JavaScript interpreter, giving it a big speed boost. Node's package manager, NPM, is growing at a tremendous rate, far faster than its sibling package managers for Java, Ruby, and Python. NPM itself is a strong point of Node; it has learned from other package managers and has many excellent features. Finally, other programming languages were developed to be all-purpose tools. Node, while it shares that all-purpose utility, is really intended for the web: it's meant to write web servers and handle HTTP intelligently.

Node also follows many UNIX principles. Doug McIlroy succinctly summarized the UNIX philosophy as “Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.” NPM does a great job letting authors write small modules which work well together. This has been tough previously in JavaScript because web browsers have no “require” function; there’s no native way for modules to define and load their dependencies, which resulted in the popularity of large, complicated libraries.3 jQuery is a good example; it’s tremendously popular and it includes hundreds of functions in its API, while most sites that use it really only need a few. Large, complicated programs are more difficult to test, debug, and reason about, which is why UNIX avoided them.

Many Node modules also support streams, which allow you to pipe data through a series of programs. This is analogous to how Bash and other shells let you pipe text from one command to another, with each command taking the output of the last as its input. To visualize this, see Stream Playground, written by John Resig, creator of jQuery. Streams allow you to plug in different functionality when needed. This pseudocode shows how one might read a CSV from a server's file system (the core "fs" library stands for "file system"), filter out certain rows, and send it over HTTP:

fs.createReadStream('spreadsheet.csv').pipe(filter).pipe(http);
// Want to compress the response? Just add another pipe.
fs.createReadStream('spreadsheet.csv').pipe(filter).pipe(compressor).pipe(http);
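The real stream API is not much more verbose than that pseudocode. Here is a minimal, hedged sketch of the second pipeline: it assumes a spreadsheet.csv file next to the script, and it swaps the hypothetical filter for the core zlib compressor, since a row filter would require writing a custom Transform stream.

var fs = require('fs');
var http = require('http');
var zlib = require('zlib');

http.createServer(function (request, response) {
  response.writeHead(200, {
    'Content-Type': 'text/csv',
    'Content-Encoding': 'gzip'
  });
  // Read the file in small chunks, gzip each chunk, and stream it to the client.
  fs.createReadStream('spreadsheet.csv')
    .pipe(zlib.createGzip())
    .pipe(response);
}).listen(8888);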

Streams have the advantage of limiting how much memory a program uses because only small portions of data are being operated on at once. Think of the difference between copying a million-line spreadsheet all at once or line-by-line; the second is less likely to crash or run into the limit of how much data the system clipboard can hold.

Libraryland Examples

Node is still very new and there aren't a lot of prominent examples of library usage. I'll try to present a few, but I think Node is most worth knowing about as a major trend in web development.

Most amusingly, Ed Summers of the Library of Congress and Sean Hannan of Johns Hopkins University made a Cataloging Highscores page that presents original cataloging performed in WorldCat in a retro arcade-style display. This app uses the popular socket.io module that establishes a real-time connection between your browser and the server, a strength of Node. Any web service that needs to be continually updated is a prime candidate for Node.js: current news articles, social media streams, auto-complete suggestions as a user types in search terms, and chat reference all come to mind. In fact, SpringShare’s LibChat uses socket.io as well, though I can’t tell if it’s using Node on the server or PHP. A similar example of real-time updating, also by Ed Summers, is Wikistream which streams the dizzying number of edits happening on various Wikipedias through your browser.4
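To give a sense of what socket.io code looks like, here is a minimal, hedged server-side sketch; it assumes socket.io has been installed with NPM, and the "update" event name is made up. In the browser, a matching socket.on('update', …) listener would receive each message as it arrives.

var io = require('socket.io').listen(8888);

io.sockets.on('connection', function (socket) {
  // Push a message to this browser every few seconds; a real application would
  // emit something meaningful, like a newly cataloged record or a chat message.
  var timer = setInterval(function () {
    socket.emit('update', { time: new Date().toString() });
  }, 3000);

  // Stop emitting once the browser goes away.
  socket.on('disconnect', function () {
    clearInterval(timer);
  });
});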

There was a lightning talk on Node at Code4Lib 2010 which mentions writing a connector to the popular Apache Solr search platform. Aaron Coburn’s proposed talk for Code4Lib 2014 mentions that Amherst is using Node to build the web front-end to their Fedora-based digital library.

Tools You Can Use

With the explosive growth of NPM, there are already tons of useful tools written in Node. While many of these are tools for writing web servers, like Express, some are command line programs you can use to accomplish a variety of tasks.

Yeoman is a scaffolding application that makes it easy to produce various web apps by giving you expert templates. You can install separate generators that produce templates for things like a Twitter Bootstrap site, a JavaScript bookmarklet, a mobile site, or a project using the Angular JavaScript MVC framework. Running yo angular to invoke the Angular generator gives you a lot more than just a base HTML file and some JavaScript libraries; it also provides a series of Grunt tasks for testing, running a development server, and building a site optimized for production. Grunt is another incredibly useful Node project, dubbed “the JavaScript task runner.” It lets you pick from hundreds of community plugins to automate tedious tasks like minifying and concatenating your scripts before deploying a website.
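As a taste of what Grunt configuration looks like, here is a minimal, hedged Gruntfile sketch that concatenates and then minifies a site's scripts. It assumes the grunt-contrib-concat and grunt-contrib-uglify plugins have been installed with NPM, and the file paths are invented for the example.

// Gruntfile.js
module.exports = function (grunt) {
  grunt.initConfig({
    concat: {
      dist: {
        src: ['src/*.js'],          // combine every script in src/
        dest: 'build/site.js'
      }
    },
    uglify: {
      dist: {
        files: { 'build/site.min.js': ['build/site.js'] }   // then minify the bundle
      }
    }
  });

  grunt.loadNpmTasks('grunt-contrib-concat');
  grunt.loadNpmTasks('grunt-contrib-uglify');

  // Running "grunt" with no arguments performs both steps.
  grunt.registerTask('default', ['concat', 'uglify']);
};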

Finally, another tool that I like is phantomas which is a Node project that works with PhantomJS to run a suite of performance tests on a site. It provides more detailed reports than any other performance tool I’ve used, telling you things like how many DOM queries ran and median latency of HTTP requests.

Learn More

Nodeschool.io features a growing number of lessons on using Node. Better yet, the lessons are actually written in Node, so you install them with NPM and verify your results on the command line. There are several topics, from basics to using streams to working with databases.

Nettuts+, always a good place for coding tutorials, has an introduction to Node which takes you from installation to coding a real-time server. If you want to learn about writing a real-time chat application with socket.io, they have a tutorial for that, too.

If you want a broad and thorough overview, there are a few introductory books on Node, with The Node Beginner Book offering several free chapters. O’Reilly’s Node for Front-End Developers is also a good starting point.

How to Node is a popular blog with articles on various topics, though some are too in-depth for beginners. I’d head here if you want to learn more on a specific topic, such as streams, or working with particular databases like MongoDB.

Finally, the Node API docs are a good place to go when you get stuck using a particular core module.

Notes

  1. If you use a package manager, such as Homebrew on Mac OS X or APT on Linux, Node is likely available within it. One caveat I have noticed is that the stock Debian/Ubuntu apt-get install nodejs is a few major versions behind; you may want to add Chris Lea’s PPA to get a current version. If you’re subject to the whims of your IT department, you may need to convince them to install Node for you, or talk to your sysadmin to get it on your server. Since it’s a rather new technology, don’t be surprised if you have to explain what it is and why you want to try it out.
  2. Previous projects, including Rhino from Mozilla and Narwhal, have let people use JavaScript outside the browser. Node, however, has caught on far more than either of these projects, for some of the reasons outlined in this post.
  3. RequireJS is one project that's trying to address this need. The group behind the ECMAScript standard that defines JavaScript is also working on native modules, but they're in draft form and it'll be a long time before all browsers support them.
  4. If you’re curious, the code for both Cataloging Highscores and Wikistream are open source and available on GitHub.

An Incomplete Solution: Working with Drupal, Isotope, and the jQuery BBQ Plugin

I write a lot of “how-to” posts.  This is fine and I actually think it’s fun…until I have a month like this past one, in which I worry that I have no business telling anyone “how-to” do anything. In which I have written the following comment multiple times:

//MF - this is a bad way to do this.

In the last four weeks I have struggled mightily to make changes to the JavaScript in a Drupal module (the aforementioned Views Isotope).  I have felt lost, I have felt jubilation, I have sworn profusely at my computer and in the end, I have modifications that mostly work.  They basically do what we need.  But the changes I made perform differently in different browsers, are likely not efficient, and I’m not sure the code is included in the right place.  Lots of it might be referred to as “hacky.”

However, the work needed to be done and I did not have time to be perfect.  This needed to mostly work to show that it could work and so I could hand off the concepts and algorithms to someone else for use in another of our sites.  Since I have never coded in JavaScript or jQuery, I needed to learn the related libraries on the fly and try to relate them back to previous coding experiences.  I needed to push to build the house and just hope that we can paint and put on doors and locks later on.

I decided to write this post because I think that maybe it is important to share the narrative of “how we struggle” alongside the “how-to”.  I’ll describe the original problem I needed to tackle, my work towards a solution, and the remaining issues that exist.  There are portions of the solution that work fine and I utilize those portions to illustrate the original requirements.  But, as you will see, there is an assortment of unfinished details.  At the end of the post, I’ll give you the link to my solution and you can judge for yourself.

The Original Problem

We want to implement a new gallery to highlight student work on our department’s main website.  My primary area of responsibility is a different website – the digital library – which is the archival home for student work.  Given my experience in working with these items and also with Views Isotope, the task of developing a “proof of concept” solution fell to me.  I needed to implement some of the technically trickier things in our proposed solution for the gallery pages in order to prove that the features are feasible for the main website.  I decided to use the digital library for this development because

  • the digital library already has the appropriate infrastructure in place
  • both the digital library and our main website are based in Drupal 7
  • the “proof of concept” solution, once complete, could remain in the digital library as a browse feature for our historical collections of student work

The full requirements for the final gallery are outside of the scope of this post, but our problem area is modifying the Views Isotope module to do some advanced things.

First, I need to take the existing Views Isotope module and modify it to use hash history on the same page as multiple filters.  Hash history with Isotope is implemented using the jQuery BBQ plugin, as is demonstrated on the Isotope website. This essentially means that when a user clicks on a filter, a hash value is added to the URL and becomes a line in the browser’s history.  This allows one to click the back button to view the grid with the previous filters applied.

Our specific use case is the following: when viewing galleries of student work from our school, a user can filter the list of works by several filter options, such as degree or discipline, (i.e., Undergraduate Architecture).  These filter options are powered by term references in Drupal, as we saw in my earlier Isotope post.  If the user clicks on an individual work to see details, clicking on the back button should return them to the already filtered list – they should not have to select the filters again.

Let’s take a look at how the end of the URL should progress.  If we start with:

.../student-work-archives

Then, we select Architecture as our discipline, we should see the URL change to :

.../student-work-archives#filter=.architecture

Everything after the # is referred to as the hash.  If we then click on Undergraduate, the URL will change to:

.../student-work-archives#filter=.undergraduate.architecture

If we click our Back button, we should go back to

.../student-work-archives#filter=.architecture

With each move, the Isotope animation should fire and the items on the screen should only be visible if they contain term references to the vocabulary selected.

Further, the selected filter options should be marked as being currently selected.  Items in one vocabulary which require an item in another vocabulary should be hidden and shown as appropriate.  For example, if a user selects Architecture, they should not be able to select PhD from the degree program, because we do not offer a PhD degree in Architecture, therefore there is no student work.  Here is an example of how the list of filters might look.

[Screenshot: the filter lists before and after selections.] Once Architecture is selected, PhD is removed as an option.  Once Undergraduate is selected, our G1, G2 and G3 options are no longer available.

A good real-world example of the types of features we need can be seen at NPR’s Best Books of 2013 site.  The selected filter is marked, options that are no longer available are removed and the animation fires when the filter changes.  Further, when you click the Back button, you are taken back through your selections.

The Solution

It turns out that the jQuery BBQ plugin works quite nicely with Isotope, again as demonstrated on the Isotope website.  It also turns out that support for BBQ is included in Drupal core for the Overlay module.   So theoretically this should all play nicely together.

The existing views-isotope.js file handles filtering as the menu items are clicked.  The process is basically as follows:

  • When the document is ready
    • Identify the container of items we want to filter, as well as the class on the items and set up Isotope with those options.
    • Pre-select All in each set of filters.
    • Identify all of the possible filter options on the page.
    • If a filter item is clicked,
      • first, check to make sure it isn’t already selected, if so, escape
      • then remove the “selected” class from the currently selected option in this set
      • add the “selected” class to the current item
      • set up an “options” variable to hold the value of the selected filter(s)
      • check for other items in other filter sets with the selected class and add them all to the selected filter value
      • call Isotope with the selected filter value(s)

To add the filter to the URL we can use bbq.pushState, which will add “a ‘state’ into the browser history at the current position, setting location.hash and triggering any bound hashchange event callbacks (provided the new state is different than the previous state).”

$.bbq.pushState(options);

We then want to handle what's in the hash when the browser back button is clicked, or if a user enters a URL with the hash value applied.  So we add a handler for the hashchange event mentioned above.  Instead of calling Isotope from the link click function, we call it from the hashchange handler.  Now our algorithm looks more like this, with new steps added for BBQ and the hashchange event (a simplified code sketch follows the list):

  • Include the misc/jquery.ba-bbq.js for BBQ (I have to do this explicitly because I don’t use Overlay)
  • When the document is ready
    • identify the container of items we want to filter, as well as the class on the items and set up Isotope with those options.
    • Pre-select All in each set of filters.
    • Identify all of the possible filter options on the page.
    • If a filter item is clicked,
      • first, check to make sure it isn't already selected, if so, escape
      • then remove the "selected" class from the currently selected option in this set
      • add the “selected” class to the current item
      • set up an “options” variable to hold the value of the selected filter(s)
      • check for other items in other filter sets with the selected class and add them all to the selected filter value
      • push “options” to URL and trigger hashchange (don’t call Isotope yet)
    • If a hashchange event is detected
      • create new “hashOptions” object according to what’s in the hash, using the deparam.fragment function from jQuery BBQ
      • manipulate CSS classes such as "not-available" (i.e., if Architecture is selected, apply it to PhD) and "selected" based on what's in "hashOptions"
      • call Isotope with “hashOptions” as the parameter
    • trigger hashchange event to pick up anything that’s in the URL when the page loads
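
Putting those pieces together, here is a simplified, hedged sketch of the pattern rather than the actual views-isotope.js code; the container, item, and link selectors are invented, it assumes each filter link carries its value in a data-filter attribute, and the "not-available" class handling is left out for brevity.

(function ($) {
  $(document).ready(function () {
    // Set up Isotope on the container of items.
    var $container = $('.isotope-container').isotope({ itemSelector: '.isotope-element' });

    // Clicking a filter link only updates the hash; Isotope itself is called
    // from the hashchange handler so the back button behaves like a click.
    $('.isotope-options a').click(function () {
      var $this = $(this);
      if ($this.hasClass('selected')) { return false; }
      $this.closest('ul').find('a.selected').removeClass('selected');
      $this.addClass('selected');

      // Combine the selected value from every filter set into one string.
      var filter = '';
      $('.isotope-options a.selected').each(function () {
        filter += $(this).data('filter');
      });

      $.bbq.pushState({ filter: filter });   // adds #filter=... to the URL
      return false;
    });

    // Fires on back/forward clicks and whenever the hash changes.
    $(window).bind('hashchange', function () {
      var hashOptions = $.deparam.fragment();   // e.g. { filter: '.undergraduate.architecture' }
      // The real code also updates the "selected" and "not-available" classes here.
      $container.isotope({ filter: hashOptions.filter || '*' });
    }).trigger('hashchange');                   // pick up any hash present on page load
  });
})(jQuery);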

I also updated any available pager links so that they link not just to the appropriate page, but also so that the filters are included in the link.  This is done by appending the hash value to the href attribute for each link with a class “pager”.
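In jQuery terms, that pager tweak looks roughly like the following sketch (the pager class comes from the description above; the real markup may differ):

$('a.pager').each(function () {
  // Drop any existing hash, then append the current filters to the link.
  this.href = this.href.split('#')[0] + window.location.hash;
});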

And it works.  Sort of…

The Unfinished Details

Part of the solution described above only works in Chrome and – believe it or not – Internet Explorer.  In all browsers, clicking the back button works just as described above, as long as one is still on the page with the list of works.  However, when linking directly to a page with the filters included (as we are doing with the pager) or hitting back from a page that does not have the hash present (say, after visiting an individual item), it does not work in Firefox or Safari.  I think this may have to do with the deparam.fragment function, because that appears to be where it gets stuck, but so far I can't track it down.  I could read window.location.hash directly, but I think that's a security issue (what's to stop someone from injecting something malicious after the hash?).

Also, in order to make sure the classes are applied correctly, it feels like I do a lot of “remove it from everywhere, then add it back”.  For example, if I select Architecture, PhD is then hidden from the degree list by assigning the class “not-available”.  When a user clicks on City and Regional Planning or All, I need that PhD to appear again.  Unfortunately, the All filter is handled differently – it is only present in the hash if no other options on the page are selected.  So, I remove “not-available” from all filter lists on hashchange and then reassign based on what’s in the hash.  It seems like it would be more efficient just to change the one I need, but I can’t figure it out.  Or maybe I should alter the way All is handled completely – I don’t know.

I also made these changes directly in the views-isotope.js, which is a big no-no.  What happens when the module is updated?  But, I have never written a custom module for Drupal which included JavaScript, so I’m not even sure how to do it in a way that makes sense.  I have only made changes to this one file.  Can I just override it somewhere?  I’m not sure.  Until I figure it out, we have a backup file and lots and lots of comments.

All of these details are symptoms of learning on the fly.  I studied computer science, so I understand conceptual things like how to loop through an array or return a value from a function, but that was more than ten years ago and my practical coding experience since then has not involved writing jQuery or JavaScript.  Much of what I built I pieced together from the Drupal documentation, the Views Isotope documentation and issue queue, the Isotope documentation, the jQuery BBQ documentation and numerous visits to the w3schools.com pages on jQuery and JavaScript.  I also frequently landed on the jQuery API Documentation page.

It is hard to have confidence in a solution when building while learning.  When I run into a snag, I have to consider whether or not the problem is the entire approach, as opposed to a syntax error or a limitation of the library.  Frequently, I find an answer to an issue I’m having, but have to look up something from the answer in order to understand it.  I worry that the code contains rookie mistakes – or even intermediate mistakes – which will bite us later, but it is difficult to do an exhaustive analysis of all the available resources.  Coding elegantly is an art which requires more than a basic understanding of how the pieces play together.

Inelegant code, however, can still help make progress.  To see the progress I have made, you can visit https://ksamedia.osu.edu/student-work-archives and play with the filters. This solution is good because it proves we can develop our features using Isotope, BBQ and Views Isotope.  The trick now is figuring out how to paint and put locks and doors on our newly built house, or possibly move a wall or two.


Taking a Practical Look at the Google Books Case

Last month we got the long-awaited ruling in favor of Google in the Authors Guild vs. Google Books case, which by now has been analyzed extensively. Ultimately the judge in the case decided that Google’s digitization was transformative and thus constituted fair use. See InfoDocket for detailed coverage of the decision.

The Google Books project was part of the Google mission to index all the information available, and as such could never have taken place without libraries, which hold all those books. While most, if not all, the librarians I know use Google Books in their work, there has always been a sense that the project should not have been started by a commercial enterprise using the intellectual resources of libraries, but should have been started by libraries themselves working together.  Yet libraries are often forced to be more conservative about digitization than we might otherwise be due to rules designed to protect the college or university from litigation. This ruling has made it seem as though we could afford to be less cautious. As Eric Hellman points out, the decision seems to imply that with copyright the ends are the important part, not the means. “In Judge Chin’s analysis, copyright is concerned only with the ends, not the means. Copyright seems not to be concerned with what happens inside the black box.” 1 As long as the end use of the books was fair, which was deemed to be the case, the initial digitization was not a problem.

Looking at this from the perspective of a repository manager, I want to address a few of the theoretical and logistical issues behind such a conclusion for libraries.

What does this mean for digitization at libraries?

At the beginning of 2013 I took over an ongoing digitization project, and as a first-time manager of a large-scale long-term project, I learned a lot about the processes involved in such a project. The project I work with is extremely small-scale compared with many such projects, but even at this scale the project is expensive and time-consuming. What makes it worth it is that long-buried works of scholarship are finally being used and read, sometimes for reasons we do not quite understand. That gets at the heart of the Google Books decision—digitizing books in library stacks and making them more widely available does contribute to education and useful arts.

There are many issues that we need to address, however. Some of the most important ones are what access can and should be provided to what works, and making mass digitization more available to smaller and international cultural heritage institutions. Google Books could succeed because it had the financial and computing resources of Google matched with the cultural resources of the participating research libraries. This problem is international in scope. I encourage you to read this essay by Amelia Sanz, in which she argues that digitization efforts so far have been inherently unequal and a reflection of colonialism. 2 But is there a practical way of approaching this desire to make books available to a wider audience?

Providing Access

There are several separate issues in providing access. Books that are in the public domain are unquestionably fine to digitize, though differences in international copyright law make it difficult to determine what can be provided to whom. As Amelia Sanz points out, Google can only digitize Spanish works prior to 1870 in Spain, but may digitize the complete work in the United States. The complete work is not available to Spanish researchers, but it is available in full to US researchers.

That aside, there are several reasons why it is useful to digitize works still unquestionably under copyright. One of the major reasons is textual corpus analysis–you need to have every word of many texts available to draw conclusions about use of words and phrases in those texts. Google Books ngram viewer is one such tool that comes out of mass digitization. Searching for phrases in Google and finding that phrase as a snippet in a book is an important way to find information in books that might otherwise be ignored in favor of online sources. Some argue that this means that those books will not be purchased when they might have otherwise been, but it is equally possible that this leads to greater discovery and more purchases, which research into music piracy suggests may be the case.

Another reason to digitize works still under copyright is to highlight the work of marginalized communities, though in that case it is imperative to work with those communities to ensure that the digitization is not exploitative. Many orphan works, for which a rights-holder cannot be located, fall into this category, and I know from some volunteer work that I have done that small cultural heritage institutions are eager to digitize material that represents the cultural and intellectual output of their communities.

In all the above cases, it is crucial to put into place mechanisms for ensuring that works under copyright are not abused. Google Books uses an algorithm that makes it impossible to read an entire book, which is probably beyond the abilities of most institutions. (If anyone has an idea for how to do this, I would love to hear it.) Simpler and more practical solutions to limiting access are to only make a chapter or sample of a book available for public use, which many publishers already allow. For instance, Oxford University Press allows up to 10% of a work (within certain limits) on personal websites or institutional repositories. (That is, of course, assuming you can get permission from the author). Many institutions maintain "dark archives", which are digitized and (usually) indexed archives of material inaccessible to the public, whether institutional or research information. For instance, the US Department of Energy Office of Scientific and Technical Information maintains a dark archive index of technical reports comprising the equivalent of 6 million pages, which makes it possible to quickly find relevant information.

In any case where an institution makes the decision to digitize and make available the full text of in-copyright materials for reasons they determine are valid, there are a few additional steps that institutions should take. Institutions should research rights-holders or at least make it widely known to potential rights-holders that a project is taking place. The Orphan Works project at the University of Michigan is an example of such a project, though it has been fraught with controversy. Another important step is to have a very good policy for taking down material when a rights-holder asks–it should be clear to the rights-holder whether any copies of the work will be maintained and for what purposes (for instance archival or textual analysis purposes).

Digitizing, Curating, Storing, Oh My!

The above considerations are only useful when it is even possible for institutions without the resources of Google to start a digitization program. There are many examples of DIY digitization by individuals, for instance see Public Collectors, which is a listing of collections held by individuals open for public access–much of it digitized by passionate individuals. Marc Fischer, the curator of Public Collectors, also digitizes important and obscure works and posts them on his site, which he funds himself. Realistically, the entire internet contains examples of digitization of various kinds and various legal statuses. Most of this takes place on cheap and widely available equipment such as flatbed scanners. But it is possible to build an overhead book scanner for large-scale digitization with individual parts and at a reasonable cost. For instance, the DIY Book Scanning project provides instructions and free software for creating a book scanner. As they say on the site, all the process involves is to “[p]oint a camera at a book and take pictures of each page. You might build a special rig to do it. Process those pictures with our free programs. Enjoy reading on the device of your choice.”

“Processing the pictures” is a key problem to solve. Turning images into PDF documents is one thing, but providing high quality optical character recognition is extremely challenging. Free tools such as FreeOCR make it possible to do OCR from image or PDF files, but this takes processing power and results vary widely, particularly if the scan quality is lower. Even expensive tools like Adobe Acrobat or ABBYY FineReader have the same problems. Karen Coyle points out that uncorrected OCR text may be sufficient for searching and corpus analysis, but does not provide a faithful reproduction of the text and thus cannot, for instance, provide access to visually impaired persons.3 This is a problem well known in the digital humanities world, and one solved by projects such as Project Gutenberg with the help of dedicated volunteer distributed proofreaders. Additionally, a great deal of material clearly in the public domain is in manuscript form or has text that modern OCR cannot recognize. In that case, crowdsourcing transcriptions is the only financially viable way for institutions to make text of the material available.4 Examples of successful projects using volunteer transcribers or proofreaders include Ancient Lives to transcribe ancient papyrus, What’s on the Menu at the New York Public Library, and DIYHistory at the University of Iowa libraries. (The latter has provided step-by-step instructions for building your own version using open source tools).

So now you’ve built your low-cost DIY book scanner, and put together a suite of open source tools to help you process your collections for free. Now what? The whole landscape of storing and preserving digital files is far beyond the scope of this post, but the cost of accomplishing this is probably the highest of anything other than staffing a digitization project, and it is here where Google clearly has the advantage. The Internet Archive is a potential solution to storing public domain texts (though they are not immune to disaster), but if you are making in-copyright works available in any capacity you will most likely have to take the risk on your own servers. I am not a lawyer, but I have never rented server space that would allow copyrighted materials to be posted.

Conclusion: Is it Worth It?

Obviously from this post I am in favor of taking on digitization projects of both public domain and copyrighted materials when the motivations are good and the policies are well thought out. From this perspective, I think the Google Books decision was a good thing for libraries and for providing greater access to library collections. Libraries should be smart about what types of materials to digitize, but there are more possibilities for large-scale digitization, and by providing more access, the research community can determine what is useful to them.

If you have managed a DIY book scanning project, please let me know in the comments, and I can add links to your project.

  1. Hellman, Eric. “Google Books and Black-Box Copyright Jurisprudence.” Go To Hellman, November 18, 2013. http://go-to-hellman.blogspot.com/2013/11/google-books-and-black-box-copyright.html.
  2. Sanz, Amelia. “Digital Humanities or Hypercolonial Studies?” Responsible Innovation in ICT (June 26, 2013). http://responsible-innovation.org.uk/torrii/resource-detail/1249#_ftnref13.
  3. Coyle, Karen. “It’s FAIR!” Coyle’s InFormation, November 14, 2013. http://kcoyle.blogspot.com/2013/11/its-fair.html.
  4. For more on this, see Ben Brumfield’s work on crowdsourced transcription, for example Brumfield, Ben W. “Collaborative Manuscript Transcription: ‘The Landscape of Crowdsourcing and Transcription’ at Duke University.” Collaborative Manuscript Transcription, November 23, 2013. http://manuscripttranscription.blogspot.com/2013/11/the-landscape-of-crowdsourcing-and.html.

Query a Google Spreadsheet like a Database with Google Visualization API Query Language

Libraries make much use of spreadsheets. Spreadsheets are easy to create, and most library staff are familiar with how to use them. But they can quickly get unwieldy as more and more data are entered. The more rows and columns a spreadsheet has, the more difficult it is to browse and quickly identify specific information. Creating a searchable web application with a database at the back-end is a good solution since it will let users quickly perform a custom search and filter out unnecessary information. But due to the staff time and expertise it requires, creating a full-fledged searchable web database application is not always a feasible option at many libraries.

Creating a custom MS Access database or using a free service such as Zoho can be an alternative to building a searchable web database application. But providing a read-only view of an MS Access database, while possible, can be tricky. MS Access is also installed locally on individual PCs, so it is not necessarily available to library staff when they are away from the work machines it is installed on. Zoho Creator offers a way to easily convert a spreadsheet into a database, but its free version is quite limited: a maximum of 3 users, 1,000 records, and 200 MB of storage.

Google Visualization API Query Language provides a quick and easy way to query a Google spreadsheet and return and display a selective set of data without actually converting the spreadsheet into a database. You can display the query result in the form of an HTML table, which can be served as a stand-alone webpage. All you have to do is construct a custom URL.

A free Google spreadsheet has limits on size and complexity. For example, one free Google spreadsheet can have no more than 400,000 total cells. But you can purchase more Google Drive storage and also query multiple Google spreadsheets (or even your own custom databases) by using Google Visualization API Query Language and Google Chart Libraries together.  (This will be the topic of my next post. You can also see examples of using Google Chart Libraries and Google Visualization API Query Language together in my presentation slides at the end of this post.)

In this post, I will explain the parameters of Google Visualization API Query Language and how to construct a custom URL that will query, return, and display a selective set of data in the form of an HTML page.

A. Display a Google Spreadsheet as an HTML page

The first step is to identify the URL of the Google spreadsheet of your choice.

The URL below opens up the third sheet (Sheet 3) of a specific Google spreadsheet. There are two parameters inside the URL that you need to pay attention to: key and gid.

https://docs.google.com/spreadsheet/ccc?key=0AqAPbBT_k2VUdDc3aC1xS2o0c2ZmaVpOQWkyY0l1eVE&usp=drive_web#gid=2

This breaks down the parameters in a way that is easier to view:

  • https://docs.google.com/spreadsheet/ccc
    ?key=0AqAPbBT_k2VUdDc3aC1xS2o0c2ZmaVpOQWkyY0l1eVE
    &usp=drive_web

    #gid=2

Key is a unique identifier for each Google spreadsheet, so you need to use it to create the custom URL later that will query and display the data in this spreadsheet. Gid specifies which sheet in the spreadsheet you are opening up. The gid for the first sheet is 0; the gid for the third sheet is 2.

[Screenshot: the example Google spreadsheet]

Let’s first see how Google Visualization API returns the spreadsheet data as a DataTable object. This is only for those who are curious about what goes on behind the scenes. You can see that for this view, the URL is slightly different but the values of the key and the gid parameter stay the same.

https://spreadsheets.google.com/tq?&tq=&key=0AqAPbBT_k2VUdDc3aC1xS2o0c2ZmaVpOQWkyY0l1eVE&gid=2

[Screenshot: the raw DataTable response returned by the Visualization API]

In order to display the same result as an independent HTML page, all you need to do is to take the key and the gid parameter values of your own Google spreadsheet and construct the custom URL following the same pattern shown below.

  • https://spreadsheets.google.com
    /tq?tqx=out:html&tq=
    &key=0AqAPbBT_k2VUdDc3aC1xS2o0c2ZmaVpOQWkyY0l1eVE
    &gid=2

https://spreadsheets.google.com/tq?tqx=out:html&tq=&key=0AqAPbBT_k2VUdDc3aC1xS2o0c2ZmaVpOQWkyY0l1eVE&gid=2

[Screenshot: the spreadsheet rendered as a plain HTML table]

By the way, if the URL you created doesn't work, it is probably because you have not encoded it properly. Try this handy URL encoder/decoder page to encode it by hand, or use the JavaScript encodeURIComponent() function.
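
If you would rather build such URLs in code than by hand, here is a small JavaScript sketch; the key and gid are the ones from the example above, and the query is only an illustration (query syntax is covered in the next section):

// Assemble a query URL; encodeURIComponent() handles spaces, commas, and quotes.
var base  = 'https://spreadsheets.google.com/tq?tqx=out:html';
var key   = '0AqAPbBT_k2VUdDc3aC1xS2o0c2ZmaVpOQWkyY0l1eVE';
var query = 'SELECT A, B, C';

var url = base + '&tq=' + encodeURIComponent(query) + '&key=' + key + '&gid=2';
console.log(url);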
Also, if you want the URL to display the query result without requiring people to log into Google Drive first, make sure to set the permission setting of the spreadsheet to public. On the other hand, if you need to restrict access to the spreadsheet to a limited number of users, you have to remind your users to first go to the Google Drive webpage and log in with their Google account before clicking your URLs. Only when users are logged into Google Drive will they be able to see the query result.

B. How to Query a Google Spreadsheet

We have seen how to create a URL to show an entire sheet of a Google spreadsheet as an HTML page above. Now let’s do some querying, so that we can pick and choose what data the table is going to display instead of the whole sheet. That’s where the Query Language comes in handy.

Here is an example spreadsheet with over 50 columns and 500 rows.

  • https://docs.google.com/spreadsheet/ccc?
    key=0AqAPbBT_k2VUdDFYamtHdkFqVHZ4VXZXSVVraGxacEE
    &usp=drive_web
    #gid=0

https://docs.google.com/spreadsheet/ccc?key=0AqAPbBT_k2VUdDFYamtHdkFqVHZ4VXZXSVVraGxacEE&usp=drive_web#gid=0

[Screenshot: the example spreadsheet with over 50 columns and 500 rows]

What I want to do is to show only columns B, C, D, and F where C contains 'Florida.' How do I do this? Remember the URL we created to show the entire sheet above?

  • https://spreadsheets.google.com/tq?tqx=out:html&tq=&key=___&gid=___

There we had no value for the tq parameter. This is where we insert our query.

Google Visualization API Query Language is pretty much the same as SQL. So if you are familiar with SQL, forming a query is dead simple. If you aren't, SQL is also easy to learn.

  • The query should be written like this:
    SELECT B, C, D, F WHERE C CONTAINS 'Florida'
  • After encoding it properly, you get something like this:
    SELECT%20B%2C%20C%2C%20D%2C%20F%20WHERE%20C%20CONTAINS%20%27Florida%27
  • Add it to the tq parameter and don’t forget to also specify the key:
    https://spreadsheets.google.com/tq?tqx=out:html&tq=SELECT%20B%2C%20C%2C%20D%2C%20F%20WHERE%20C%20CONTAINS%20%27Florida%27
    &key=0AqAPbBT_k2VUdEtXYXdLdjM0TXY1YUVhMk9jeUQ0NkE

I am omitting the gid parameter here because there is only one sheet in this spreadsheet but you can add it if you would like. You can also omit it if the sheet you want is the first sheet. Ta-da!

[Screenshot: the query result showing only columns B, C, D, and F where C contains 'Florida']

Compare this with the original spreadsheet view. I am sure you can appreciate how the small effort put into creating a URL pays off in making an unwieldy, large spreadsheet much more manageable to view.

You can also easily incorporate functions such as count() or sum() into your query to get an overview of the data you have in the spreadsheet.

  • select D, F, count(C) where (B contains 'author name') group by D, F

For example, this query above shows how many articles a specific author published per year in each journal. The screenshot of the result is below and you can see it for yourself here: https://spreadsheets.google.com/tq?tqx=out:html&tq=select+D,F,count(C)+where+%28B+contains+%27Agoulnik%27%29+group+by+D,F&key=0AqAPbBT_k2VUdEtXYXdLdjM0TXY1YUVhMk9jeUQ0NkE

[Screenshot: the query result listing the number of articles per journal and year]

Take this spreadsheet as another example.

[Screenshot: an example library budget spreadsheet]

The simple query in the screenshot below displays the library budget by year. For those who are unfamiliar with 'pivot,' a pivot table is a data summarization tool. The query asks the spreadsheet to calculate the total of all the values in the B column (the budget amount for each category) by the values found in the C column (the years).

[Screenshot: the pivot query and its result, showing the total budget per year]

This is another example of querying the spreadsheet connected to my library's Literature Search request form. The following query asks the spreadsheet to count the number of literature search requests by Research Topic (=column I) that were received in 2011 (=column G), grouped by the values in column C, i.e., College of Medicine Faculty or College of Medicine Staff.

  • select C, count(I) where (G contains '2011') group by C

[Screenshot: the query result for the literature search requests]

C. More Querying Options

There are many more things you can do with a custom query. Google has an extensive documentation that is easy to follow: https://developers.google.com/chart/interactive/docs/querylanguage#Language_Syntax

These are just a few examples.

  • ORDER BY __ DESC
    : Order the results in the descending order of the column of your choice. Without ‘DESC,’ the result will be listed in the ascending order.
  • LIMIT 5
    : Limit the number of results. Combined with ‘Order by’ you can quickly filter the results by the most recent or the oldest items.
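
Combining the two, a query such as select B, C, D order by D desc limit 5 should return just the five rows with the largest values in column D.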

My presentation slides from the 2013 LITA Forum, below, include more detailed information about Google Visualization API Query Language, parameters, and other options, as well as how to use Google Chart Libraries in combination with Google Visualization API Query Language for data visualization, which is the topic of my next post.

Happy querying Google Spreadsheet!