Taking a Practical Look at the Google Books Case

Last month we got the long-awaited ruling in favor of Google in Authors Guild v. Google, the Google Books case, which by now has been analyzed extensively. Ultimately the judge in the case decided that Google's digitization was transformative and thus constituted fair use. See InfoDocket for detailed coverage of the decision.

The Google Books project was part of the Google mission to index all the information available, and as such could never have taken place without libraries, which hold all those books. While most, if not all, the librarians I know use Google Books in their work, there has always been a sense that the project should not have been started by a commercial enterprise using the intellectual resources of libraries, but should have been started by libraries themselves working together.  Yet libraries are often forced to be more conservative about digitization than we might otherwise be due to rules designed to protect the college or university from litigation. This ruling has made it seem as though we could afford to be less cautious. As Eric Hellman points out, the decision seems to imply that with copyright the ends are the important part, not the means. “In Judge Chin’s analysis, copyright is concerned only with the ends, not the means. Copyright seems not to be concerned with what happens inside the black box.” 1 As long as the end use of the books was fair, which was deemed to be the case, the initial digitization was not a problem.

Looking at this from the perspective of a repository manager, I want to address a few of the theoretical and logistical issues behind such a conclusion for libraries.

What does this mean for digitization at libraries?

At the beginning of 2013 I took over an ongoing digitization project, and as a first-time manager of a large-scale long-term project, I learned a lot about the processes involved in such a project. The project I work with is extremely small-scale compared with many such projects, but even at this scale the project is expensive and time-consuming. What makes it worth it is that long-buried works of scholarship are finally being used and read, sometimes for reasons we do not quite understand. That gets at the heart of the Google Books decision—digitizing books in library stacks and making them more widely available does contribute to education and useful arts.

There are many issues that we need to address, however. Some of the most important are deciding what access can and should be provided to which works, and making mass digitization more attainable for smaller and international cultural heritage institutions. Google Books could succeed because it had the financial and computing resources of Google matched with the cultural resources of the participating research libraries. The problem is also international in scope: I encourage you to read this essay by Amelia Sanz, in which she argues that digitization efforts so far have been inherently unequal and a reflection of colonialism. 2 But is there a practical way of approaching this desire to make books available to a wider audience?

Providing Access

There are several separate issues in providing access. Books that are in the public domain are unquestionably fine to digitize, though differences in international copyright law make it difficult to determine what can be provided to whom. As Amelia Sanz points out, in Spain Google can only digitize Spanish works published prior to 1870, but it may digitize the complete works in the United States. The complete works are not available to Spanish researchers, yet they are available in full to US researchers.

That aside, there are several reasons why it is useful to digitize works still unquestionably under copyright. One of the major reasons is textual corpus analysis: you need to have every word of many texts available to draw conclusions about the use of words and phrases in those texts. The Google Books Ngram Viewer is one such tool that comes out of mass digitization. Searching for a phrase in Google and finding it as a snippet in a book is an important way to surface information in books that might otherwise be ignored in favor of online sources. Some argue that this means those books will not be purchased when they might have otherwise been, but it is equally possible that this leads to greater discovery and more purchases, which research into music piracy suggests may be the case.

Another reason to digitize works still under copyright is to highlight the work of marginalized communities, though in that case it is imperative to work with those communities to ensure that the digitization is not exploitative. Many orphan works, for which a rights-holder cannot be located, fall into this category, and I know from some volunteer work that I have done that small cultural heritage institutions are eager to digitize material that represents the cultural and intellectual output of their communities.

In all the above cases, it is crucial to put into place mechanisms for ensuring that works under copyright are not abused. Google Books uses an algorithm to prevent anyone from reading an entire book, something that is probably beyond the technical abilities of most institutions to replicate. (If anyone has an idea for how to do this, I would love to hear it.) A simpler and more practical way to limit access is to make only a chapter or sample of a book available for public use, which many publishers already allow. For instance, Oxford University Press allows up to 10% of a work (within certain limits) on personal websites or institutional repositories. (That is, of course, assuming you can get permission from the author.) Many institutions maintain "dark archives", which are digitized and (usually) indexed archives of material inaccessible to the public, whether institutional or research information. For instance, the US Department of Energy Office of Scientific and Technical Information maintains a dark archive index of technical reports comprising the equivalent of 6 million pages, which makes it possible to quickly find relevant information.

In any case where an institution makes the decision to digitize and make available the full text of in-copyright materials for reasons they determine are valid, there are a few additional steps that institutions should take. Institutions should research rights-holders or at least make it widely known to potential rights-holders that a project is taking place. The Orphan Works project at the University of Michigan is an example of such a project, though it has been fraught with controversy. Another important step is to have a very good policy for taking down material when a rights-holder asks–it should be clear to the rights-holder whether any copies of the work will be maintained and for what purposes (for instance archival or textual analysis purposes).

Digitizing, Curating, Storing, Oh My!

The above considerations are only useful when it is even possible for institutions without the resources of Google to start a digitization program. There are many examples of DIY digitization by individuals, for instance see Public Collectors, which is a listing of collections held by individuals open for public access–much of it digitized by passionate individuals. Marc Fischer, the curator of Public Collectors, also digitizes important and obscure works and posts them on his site, which he funds himself. Realistically, the entire internet contains examples of digitization of various kinds and various legal statuses. Most of this takes place on cheap and widely available equipment such as flatbed scanners. But it is possible to build an overhead book scanner for large-scale digitization with individual parts and at a reasonable cost. For instance, the DIY Book Scanning project provides instructions and free software for creating a book scanner. As they say on the site, all the process involves is to “[p]oint a camera at a book and take pictures of each page. You might build a special rig to do it. Process those pictures with our free programs. Enjoy reading on the device of your choice.”

“Processing the pictures” is a key problem to solve. Turning images into PDF documents is one thing, but providing high quality optical character recognition is extremely challenging. Free tools such as FreeOCR make it possible to do OCR from image or PDF files, but this takes processing power and results vary widely, particularly if the scan quality is low. Even expensive tools like Adobe Acrobat or ABBYY FineReader have the same problems. Karen Coyle points out that uncorrected OCR text may be sufficient for searching and corpus analysis, but it does not provide a faithful reproduction of the text and thus cannot, for instance, provide access to visually impaired persons. 3 This is a problem well known in the digital humanities world, and one solved by projects such as Project Gutenberg with the help of dedicated volunteer distributed proofreaders. Additionally, a great deal of material clearly in the public domain is in manuscript form or has text that modern OCR cannot recognize. In that case, crowdsourcing transcriptions is the only financially viable way for institutions to make the text of the material available. 4 Examples of successful projects using volunteer transcribers or proofreaders include Ancient Lives to transcribe ancient papyri, What's on the Menu at the New York Public Library, and DIYHistory at the University of Iowa Libraries. (The latter has provided step-by-step instructions for building your own version using open source tools.)
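For institutions experimenting with automated OCR before committing to proofreading, the open source Tesseract engine is one starting point. Below is a minimal sketch using the tesseract.js JavaScript wrapper (not one of the tools named above; the file name is hypothetical), which produces exactly the kind of uncorrected text Coyle describes:

// Requires Node.js and the tesseract.js package (npm install tesseract.js).
const Tesseract = require('tesseract.js');

Tesseract.recognize('scanned-page.png', 'eng')
  .then(function (result) {
    // Uncorrected OCR output: often good enough for keyword searching or
    // corpus analysis, but not a faithful reproduction of the page.
    console.log(result.data.text);
  })
  .catch(function (err) {
    console.error('OCR failed:', err);
  });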

So now you’ve built your low-cost DIY book scanner, and put together a suite of open source tools to help you process your collections for free. Now what? The whole landscape of storing and preserving digital files is far beyond the scope of this post, but the cost of accomplishing this is probably the highest of anything other than staffing a digitization project, and it is here where Google clearly has the advantage. The Internet Archive is a potential solution to storing public domain texts (though they are not immune to disaster), but if you are making in-copyright works available in any capacity you will most likely have to take the risk on your own servers. I am not a lawyer, but I have never rented server space that would allow copyrighted materials to be posted.

Conclusion: Is it Worth It?

Obviously from this post I am in favor of taking on digitization projects of both public domain and copyrighted materials when the motivations are good and the policies are well thought out. From this perspective, I think the Google Books decision was a good thing for libraries and for providing greater access to library collections. Libraries should be smart about what types of materials to digitize, but there are more possibilities for large-scale digitization, and by providing more access, the research community can determine what is useful to them.

If you have managed a DIY book scanning project, please let me know in the comments, and I can add links to your project.

  1. Hellman, Eric. “Google Books and Black-Box Copyright Jurisprudence.” Go To Hellman, November 18, 2013. http://go-to-hellman.blogspot.com/2013/11/google-books-and-black-box-copyright.html.
  2. Sanz, Amelia. “Digital Humanities or Hypercolonial Studies?” Responsible Innovation in ICT (June 26, 2013). http://responsible-innovation.org.uk/torrii/resource-detail/1249#_ftnref13.
  3. Coyle, Karen. “It’s FAIR!” Coyle’s InFormation, November 14, 2013. http://kcoyle.blogspot.com/2013/11/its-fair.html.
  4. For more on this, see Ben Brumfield’s work on crowdsourced transcription, for example Brumfield, Ben W. “Collaborative Manuscript Transcription: ‘The Landscape of Crowdsourcing and Transcription’ at Duke University.” Collaborative Manuscript Transcription, November 23, 2013. http://manuscripttranscription.blogspot.com/2013/11/the-landscape-of-crowdsourcing-and.html.

Building a Dynamic Image Display with Drupal & Isotope

I am in love with Isotope.  It's not often that you hear someone profess their love for a jQuery library (unless it's this), but there it is.  I want to display everything in animated grids.

I also love Views Isotope, a Drupal 7 module that enabled me to create a dynamic image gallery for our school’s Year in Review.  This module (paired with a few others) is instrumental in building our new digital library.

[Screenshot: the Year in Review gallery]

In this blog post, I will walk you through how we created the Year in Review page, and how we plan to extrapolate the design to our collection views in the Knowlton Digital Library.  This post assumes you have some basic knowledge of Drupal, including an understanding of content types, taxonomy terms and how to install a module.

Year in Review Project

Our Year in Review project began over the summer, when our communications team expressed an interest in displaying the news stories from throughout the school year in an online, interactive display.  The designer on our team showed me several examples of card-like interfaces, emphasizing the importance of ease and clean graphics.  After some digging, I found Isotope, which appeared to be the exact solution we needed.  Isotope, according to its website, assists in creating “intelligent, dynamic layouts that can’t be achieved with CSS alone.”  This jQuery library displays items in a masonry or grid-type layout, augmented by filters and sorting options that move the items around the page.
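Stripped of the Drupal integration, the core of Isotope comes down to a couple of calls like the ones below (a minimal sketch assuming the version 1 jQuery plugin that was current at the time; the #container and .item selectors are just placeholders):

// Lay out every .item inside #container as a masonry-style grid.
$( '#container' ).isotope( {
    itemSelector: '.item',
    layoutMode: 'masonry'
} );

// Show only the items matching a selector; Isotope animates the rest away.
$( '#container' ).isotope( { filter: '.architecture' } );

// Pass '*' to clear the filter and show everything again.
$( '#container' ).isotope( { filter: '*' } );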

At first, I was unsure we could make this library work with Drupal, the content management system we employ for our main web site and our digital library.  Fortunately I soon learned – as with many things in Drupal – there’s a module for that.  The Views Isotope module provides just the functionality we needed, with some tweaking, of course.

We set out to display a grid of images, each representing a news story from the year.  We wanted to allow users to filter those news stories based on each of the sections in our school: Architecture, Landscape Architecture and City and Regional Planning.  News stories might be relevant to one, two or all three disciplines.  The user can see the news story title by hovering over the image, and read more about the news story by clicking on the corresponding item in the grid.

Views Isotope Basics

Views Isotope is installed in the same way as other Drupal modules.  There is an example in the module and there are also videos linked from the main module page to help you implement this in Views.  (I found this video particularly helpful.)

You must have the following modules installed to use Views Isotope:

You also need to install the Isotope jQuery library.  It is important to note that Isotope is only free for non-commercial projects.  To install the library, download the package from the Isotope GitHub repository.  Unzip the package and copy the whole directory into your libraries directory.  Within your Drupal installation, this should be in the /sites/all/libraries folder.  Once the module and the library are both installed, you're ready to start.

If you have used Drupal, you have likely used Views.  It is a very common way to query the underlying database in order to display content.  The Views Isotope module provides additional View types: Isotope Grid, Isotope Filter Block and Isotope Sort Block.  These three view types combine to provide one display.  In my case, I have not yet implemented the Sort Block, so I won't discuss it in detail here.

To build a new view, go to Structure > Views > Add a new view.  We'll walk through the steps for our specific example in more detail below.  However, there are a few important tenets of using Views Isotope, regardless of your setup:

  1. There is a grid.  The View type Isotope Grid powers the main display.
  2. The field on which we want to filter is included in the query that builds the grid, but a CSS class is applied which hides the filters from the grid display and shows them only as filters.
  3. The Isotope Filter Block drives the filter display.  Again, a CSS class is applied to the fields in the query to assign the appropriate display and functionality, instead of using default classes provided by Views.
  4. Frequently in Drupal, we are filtering on taxonomy terms.  It is important that when we display these items we do not link to the taxonomy term page, so that a click on a term filters the results instead of taking the user away from the page.

With those basic tenets in mind, let’s look at the specific process of building the Year in Review.

Building the Year in Review

Armed with the Views Isotope functionality, I started with our existing Digital Library Drupal 7 instance and one content type, Item.  Items are our primary content type and contain many, many fields, but here are the important ones for the Year in Review:

  • Title: text field containing the headline of the article
  • Description: text field containing the shortened article body
  • File: File field containing an image from the article
  • Item Class: A reference to a taxonomy term indicating if the item is from the school archives
  • Discipline: Another term reference field which ties the article to one or more of our disciplines: Architecture, Landscape Architecture or City and Regional Planning
  • Showcase: Boolean field which flags the article for inclusion in the Year in Review

The last field was essential so that the communications team liaison could curate the page.  There are more news articles in our school archives than we necessarily want to show in the Year in Review, and the showcase flag solves this problem.

In building our Views, we first wanted to pull all of the Items which have the following characteristics:

  • Item Class: School Archives
  • Showcase: True

So, we build a new View.  While logged in as administrator, we click on Structure, Views then Add a New View.  We want to show Content of type Item, and display an Isotope Grid of fields.  We do not want to use a pager.  In this demo, I’m going to build a Page View, but a Block works as well (as we will see later).  So my settings appear as follows:

[Screenshot: initial view settings]

Click on Continue & edit.  For the Year in Review we next needed to add our filters – for Item Class and Showcase.  Depending on your implementation, you may not need to filter the results, but likely you will want to narrow the results slightly.  Next to Filter Criteria, click on Add.

I first searched for Item Class, then clicked on Apply.

Next, I need to select a value for Item Class and click on Apply.

I repeated the process with the Showcase field.

[Screenshot: adding the Showcase filter]

If you click Update Preview at the bottom of the View edit screen, you’ll see that much of the formatting is already done with just those steps.

[Screenshot: view preview]

Note that the formatting in the image above is helped along by some CSS.  To style the grid elements, the Views Isotope module contains its own CSS in the module folder ([drupal_install]/sites/all/modules/views_isotope).  You can move forward with this default display if it works for your site.  Or, you can override this in the site’s theme files, which is what I’ve done above.  In my theme CSS file, I have applied the following styling to the class “isotope-element”

.isotope-element {
    float: left;
    height: 140px;
    margin: 6px;
    overflow: hidden;
    position: relative;
    width: 180px;
}
I put the above code in the CSS file associated with my theme, and it overrides the default Views Isotope styling.  “isotope-element” is the class applied to the div which contains all the fields being displayed for each item.  Let's add a few more fields and see how the rendered HTML looks.

First, I want to add an image.  In my case, all of my files are fields of type File, and I handle the rendering through Media based on file type, but you could use any image field.

[Screenshot: adding the File field]

I use the Rendered File Formatter and select the Grid View Mode, which applies an Image Style to the file, resizing it to 180 x 140.  Clicking Update Preview again shows that the image has been added to each item.

[Screenshot: preview with images]

This is closer, but in our specific example, we want to hide the title until the user hovers over the item.  So, we need to add some CSS to the title field.

[Screenshot: title field settings]

In my CSS file, I have the following:

.isotope-grid-text {
    background: none repeat scroll 0 0 #4D4D4F;
    height: 140px;
    left: 0;
    opacity: 0;
    position: absolute;
    top: 0;
    width: 100%;
    z-index: 20;
}

Note the opacity is 0 – which means the div is transparent, allowing the image to show through.  Then, I added a hover style which just changes the opacity to mostly cover the image:

.isotope-grid-text:hover {
    opacity: 0.9;
}

Now, if we update preview, we should see the changes.

[Screenshot: preview with hover effect]

The last thing we need to do is add the Discipline field for each item so that we can filter.

There are two very important things here.  First, we want to make sure that the field is not formatted as a link to the term, so we select Plain text as the Formatter.

Second, we need to apply a CSS class here as well, so that the Discipline fields show in filters, not in the grid.  To do that, check the Customize field HTML and select the DIV element.  Then, select Create a class and enter “isotope-filter”.  Also, uncheck “Apply default classes.”  Click Apply.
[Screenshot: Discipline field settings]

Using Firebug, I can now look at the generated HTML from this View and see that the isotope-element <div> contains all the fields for each item, though the isotope-filter class keeps the Discipline value hidden in the grid display.

<div class="isotope-element landscape-architecture" data-category="landscape-architecture">
  <div class="views-field views-field-title"> (collapsed for brevity) </div>
  <div class="views-field views-field-field-file"> (collapsed for brevity) </div>
  <div>  
    <div class="isotope-filter">Landscape Architecture</div>
  </div>
</div>

You might also notice that the data-category for this element is assigned as landscape-architecture, which is our Discipline term for this item.  This data-category will drive the filters.

So, let’s save our View by clicking Save at the top and move on to create our filter block.  Create a new view, but this time create a block which displays taxonomy terms of type Discipline.  Then, click on Continue & Edit.

The first thing we want to do is adjust the view so that the default row wrappers are not applied.  Note: this is the part I ALWAYS forget, and then when my filters don't work it takes me forever to track it down.

Click on Settings next to Fields.

Uncheck the "Provide default field wrapper elements" option.  Click Apply.

[Screenshot: field settings]

Next, we do not want the fields to be links to term pages, because a user click should filter the results, not link back to the term.  So, click on the term name to edit that field.  Uncheck the box next to “Link this field to its taxonomy term page”.  Click on Apply.

[Screenshot: term field settings]

Save the view.

The last thing is to make the block appear on the page with the grid.  In practice, Drupal administrators would use Panels or Context to accomplish this (we use Context), but it can also be done using the Blocks menu.

So, go to Structure, then click on Blocks.  Find our Isotope-Filter Demo block.  Because it’s a View, the title will begin with “View:”

[Screenshot: the filter block in the Blocks list]

Click Configure.  Set block settings so that the Filter appears only on the appropriate Grid page, in the region which is appropriate for your theme.  Click save.

[Screenshot: block settings]

Now, let’s visit our /isotope-grid-demo page.  We should see both the grid and the filter list.

[Screenshot: finished grid with filter list]

It’s worth noting that here, too, I have customized the CSS.  If we look at the rendered HTML using Firebug, we can see that the filter list is in a div with class “isotope-options” and the list itself has a class of “isotope-filters”.

<div class="isotope-options">
  <ul class="isotope-filters option-set clearfix" data-option-key="filter">
    <li><a class="" data-option-value="*" href="#filter">All</a></li>
    <li><a class="filterbutton" href="#filter" data-option-value=".architecture">Architecture</a></li>
    <li><a class="filterbutton selected" href="#filter" data-option-value=".city-and-regional-planning">City and Regional Planning</a></li>
    <li><a class="filterbutton" href="#filter" data-option-value=".landscape-architecture">Landscape Architecture</a></li>
  </ul>
</div>
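How do these links drive the grid? Roughly speaking, a click on a filter link needs to hand its data-option-value to Isotope as a selector, which then matches the classes on the isotope-element divs. The sketch below shows that wiring in generic terms; it is not the module's actual code, and the .isotope-container selector is an assumption:

// When a filter link is clicked, stay on the page and pass its
// data-option-value (e.g. ".landscape-architecture", or "*" for all)
// to Isotope as the filter selector.
$( '.isotope-filters a' ).on( 'click', function ( event ) {
    event.preventDefault();
    var selector = $( this ).attr( 'data-option-value' );
    $( '.isotope-container' ).isotope( { filter: selector } );
    // Move the "selected" class to the active filter link.
    $( this ).closest( 'ul' ).find( 'a' ).removeClass( 'selected' );
    $( this ).addClass( 'selected' );
} );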

I have overridden the CSS for these classes to remove the background from the filters and change the list-style-type to none, but you can obviously make whatever changes you want.  When I click on one of the filters, it shows me only the news stories for that Discipline.  Here, I’ve clicked on City and Regional Planning.

[Screenshot: grid filtered to City and Regional Planning]

Next Steps

So, how do we plan to use this in our digital library going forward?  So far, we have mostly used the grid without the filters, such as in one of our Work pages.  This shows the metadata related to a given work, along with all the items tied to that work.  Eventually, each of the taxonomy terms in the metadata will be a link.  The following grids are all created with blocks instead of pages, so that I can use Context to override the default term or node display.

[Screenshot: a Work page grid]

However, in our recently implemented Collection view, we allow users to filter the items based on their type: image, video or document.  Here, you see an example of one of our lecture collections, with the videos and the poster in the same grid, until the user filters for one or the other.

[Screenshot: a Collection page grid with type filters]

There are two obstacles to using this feature in a more widespread manner throughout the site.  First, I have only recently figured out how to implement multiple filter options.  For example, we might want to filter our news stories by Discipline and Semester.  To do this, we rewrite the sorting fields in our Grid display so that they all display in one field.  Then, we create two Filter blocks, one for each set of terms.  Implementing this across the site so that users can sort by, say, item type and vocabulary term, will make it more useful to us.
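Conceptually, combining two filter groups means intersecting selectors, so that an item must match both the active Discipline and the active Semester. A rough sketch of that logic, using generic Isotope calls rather than the module's code (the block classes and data values are hypothetical):

// Track the active selector for each filter group, then apply the
// combined selector, e.g. ".architecture.autumn-2013".
var activeFilters = { discipline: '', semester: '' };

function applyFilters () {
    var selector = activeFilters.discipline + activeFilters.semester;
    $( '.isotope-container' ).isotope( { filter: selector || '*' } );
}

$( '.discipline-filters a' ).on( 'click', function ( event ) {
    event.preventDefault();
    var value = $( this ).attr( 'data-option-value' );   // e.g. ".architecture"
    activeFilters.discipline = ( value === '*' ) ? '' : value;
    applyFilters();
} );

$( '.semester-filters a' ).on( 'click', function ( event ) {
    event.preventDefault();
    var value = $( this ).attr( 'data-option-value' );   // e.g. ".autumn-2013"
    activeFilters.semester = ( value === '*' ) ? '' : value;
    applyFilters();
} );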

Second, we have several Views that might return upwards of 500 items.  Loading all of the image files for this result set is costly, especially when you add in the additional overhead of a full image loading in the background for a Colorbox overlay, plus general Drupal performance issues.  The filters will not work across pages, so if I use a pager, I will only filter the items on the page I'm viewing.  I believe this can be fixed somehow using Infinite Scroll (as described in several ways here), but I have not tried it yet.

With these two advanced techniques, there are many options for improving the digital library interface.  I am especially interested in how to use multiple filters on a set of search results returned from a Solr index.

What other extensions might be useful?  Let us know what you think in the comments.



One Week, One Tool, Many Lessons

When I was a kid, I cherished the Paula Danziger book Remember Me to Harold Square, in which a group of kids call themselves the Serendipities, named for the experience of making fortunate discoveries accidentally.  Last week I found myself remembering the book over and over again as I helped develop Serendip-o-matic, a tool which introduces serendipity to research, as part of a twelve-person team attending One Week | One Tool at the Roy Rosenzweig Center for History and New Media at George Mason University (RRCHNM).

In this blog post, I’ll take you through the development of the “serendipity machine”, from the convening of the team to the selection and development of the tool.  The experience turned out to be an intense learning experience for me, so along the way, I will share some of my own fortunate discoveries.

(Note: this is a pretty detailed play-by-play of the process.  If you’re more interested in the result, please see the RRCHNM news items on both our process and our product, or play with Serendip-o-matic itself.)

The Eve of #OWOT

@foundhistory: One Week | One Tool is Here! Meet the crew and join in! http://t.co/4EyDPllNnu #owot #buildsomething

Approximately thirty people applied to be part of One Week | One Tool (OWOT), an Institute for Advanced Topics in the Digital Humanities, sponsored by the National Endowment for the Humanities.  Twelve were selected and we arrive on Sunday, July 28, 2013 and convene in the Well, the watering hole at the Mason Inn.

As we circle around the room and introduce ourselves, I can't help but marvel at the myriad skills of this group.  Each person arrives with more than one bag of tricks from which to pull, including skills in web development, historical scholarship, database administration, librarianship, human-computer interaction and literature studies.  It takes about a minute before someone mentions something I've never heard of and so the Googling begins.  (D3, for the record, is a JavaScript library for data visualizations.)

Tom Scheinfeldt (@foundhistory), the RRCHNM director-at-large who organized OWOT, delivers the pre-week pep talk and discusses how we will measure success.  The development of the tool is important, but so is the learning experience for the twelve assembled scholars.  It’s about the product, but also about the process.  We are encouraged to learn from each other, to “hitch our wagon” to another smart person in the room and figure out something new.

As for the product, the goal is to build something that is used.  This means that defining and targeting the audience is essential.

The tweeting began before we arrived, but typing starts in earnest at this meeting and the #owot hashtag is populated with our own perspectives and feedback from the outside.  Feedback, as it turns out, will be the priority for Day 1.

Day 1

@DoughertyJack: “One Week One Tool team wants feedback on which digital tool to build.”

Mentors from RRCHNM take the morning to explain some of the basic tenets of what we’re about to do.  Sharon Leon talks about the importance of defining the project: “A project without an end is not a project.”  Fortunately, the one week timeline solves this problem for us initially, but there’s the question of what happens after this week?

Patrick Murray-John takes us through some of the finer points of developing in a collaborative environment.  Sheila Brennan discusses outreach and audience, and continues to emphasize the point from the night before: the audience definition is key.  She also says the sentence that, as we'll see, would need to be my mantra for the rest of the project: “Being willing to make concrete decisions is the only way you're going to get through this week.”

All of the advice seems spot-on and I find myself nodding my head.  But we have no tool yet, and so how to apply specifics is still really hazy.  The tool is the piece of the puzzle that we need.

We start with an open brainstorming session, which results in a filled whiteboard of words and concepts.  We debate audience, we debate feasibility, we debate openness.  Debate about openness brings us back to the conversation about audience – for whom are we being open?  There's a lot of conversation but at the end, we essentially have just a word cloud associated with projects in our heads.

So, we then take those ideas and try to express them in the following format: X tool addresses Y need for Z audience.  I am sitting closest to the whiteboards so I do a lot of the scribing for this second part and have a few observations:

  • there are pet projects in the room – some folks came with good ideas and are planning to argue for them
  • our audience for each tool is really similar; as a team we are targeting “researchers”, though there seems to be some debate on how inclusive that term is.  Are we including students in general?  Teachers?  What designates “research”?  It seems to depend on the proposed tool.
  • the problem or need is often hard to articulate.  “It would be cool” is not going to cut it with this crowd, but there are some cases where we’re struggling to define why we want to do something.

Clarifying Ideas. Photo by Mia Ridge.

A few group members begin taking the rows and creating usable descriptions and titles for the projects in a Google Doc, as we want to restrict public viewing while still sharing within the group.  We discuss several platforms for sharing our list with the world, and land on IdeaScale.  We want voters to be able to vote AND comment on ideas, and IdeaScale seems to fit the bill.  We adjourn from the Center and head back to the hotel with one thing left to do: articulate these ideas to the world using IdeaScale and get some feedback.

The problem here, of course, is that everyone wants to make sure that their idea is communicated effectively and we need to agree on public descriptions for the projects.  Finally, it seems like there’s a light at the end of the tunnel…until we hit another snag.  IdeaScale requires a login to vote or comment and there’s understandable resistance around the table to that idea.  For a moment, it feels like we’re back to square one, or at least square five.  Team members begin researching alternatives but nothing is perfect, we’ve already finished dinner and need the votes by 10am tomorrow.  So we stick with IdeaScale.

And, not for the last time this week, I reflect on Sheila’s comment, “being willing to make concrete decisions is the only way you’re going to get through this week.”  When new information, such as the login requirement, challenges the concrete decision you made, how do you decide whether or not to revisit the decision?  How do you decide that with twelve people?

I head to bed exhausted, wondering about how many votes we’re going to get, and worried about tomorrow: are we going to make a decision?

Day 2

@briancroxall: “We’ve got a lot of generous people in the #owot room who are willing to kill their own ideas.”

It turns out that I need not have worried.  In the winnowing from 11 choices down to 2, many members of the team are willing to say, “my tool can be done later” or “that one can be done better outside this project.”   Approximately 100 people weighed in on the IdeaScale site, and those votes are helpful as we weigh each idea.  Scott Kleinman leads us in a discussion about the feasibility of implementation and the commitment in the room, and the choices begin to fall away.  At the end, there are four, but after a few rounds of voting we're down to two with equal votes that must be differentiated.  After a little more discussion, Tom proposes a voting system that allows folks to weight their votes in terms of commitment and the Serendipity project wins out.  The drafted idea description reads:

“A serendipitous discovery tool for researchers that takes information from your personal collection (such as a Zotero citation library  or a CSV file) and delivers content (from online libraries or collections like DPLA or Europeana) similar to it, which can then be visualized and manipulated.”

We decide to keep our project a secret until our launch and we break for lunch before assigning teams.  (Meanwhile, #owot hashtag follower Sherman Dorn decides to create an alternative list of ideas – One Week Better Tools – which provides some necessary laughs over the next couple of days).

After lunch, it’s time to break out responsibilities.  Mia Ridge steps up, though, and suggests that we first establish a shared understanding of the tool.  She sketches on one of the whiteboards the image which would guide our development over the next few days.

This was a takeaway moment for me.  I frequently sketch out my projects, but I’m afraid the thinking often gets pushed out in favor of the doing when I’m running low on time.  Mia’s suggestion that we take the time despite being against the clock probably saved us lots of hours and headaches later in the project.  We needed to aim as a group, so our efforts would fire in the same direction.  The tool really takes shape in this conversation, and some of the tasks are already starting to become really clear.  (We are also still indulging our obsession with mustaches at this time, as you may notice.)


Mia Ridge leads discussion of application. Photo by Meghan Frazer.

Tom leads the discussion of teams.  He recommends three: a project management team, a design/dev team and an outreach team.  The project managers should be selected first, and they can select the rest of the teams.  The project management discussion is difficult; there’s an abundance of qualified people in the room.  From my perspective, it makes sense to have the project managers be folks who can step in and pinch hit as things get hectic, but we also need our strongest technical folks on the dev team.  In the end, Brian Croxall and I are selected to be the project management team.

We decide to ask the remaining team members where they would like to be and see where our numbers end up.  The numbers turn out great: 7 for design/dev and 3 for outreach, with two design/dev team members slated to help with outreach needs as necessary.

The teams hit the ground running and begin prodding the components of the idea. The theme of the afternoon is determining the feasibility of this “serendipity engine” we've elected to build.  Mia Ridge, leader of the design/dev team, runs a quick skills audit and gets down to the business of selecting programming languages, frameworks and strategies for the week. They choose to work in Python with the Django framework.  Isotope, a jQuery plugin I use in my own development, is selected to drive the results page.  A private GitHub repository is set up under a code name.  (Beyond Isotope, HTML and CSS, I'm a little out of my element here, so for more technical details, please visit the public repository's wiki.)  The outreach team lead, Jack Dougherty, brainstorms with his team on overall outreach needs and high priority tasks.  The Google document from yesterday becomes a Google Drive folder, with shells for press releases, a contact list for marketing and work plans for both teams.

This is the first point where I realize that I am going to have to adjust to a lack of hands on work.  I do my best when I’m working a keyboard: making lists, solving problems with code, etc.  As one of the project managers, my job is much less on the keyboard and much more about managing people and process.

When the teams come back together to report out, there's a lot of getting each side up to speed, and afterwards our mentors advise us that the meetings have to be shorter.  We're already at the end of day 2, though both teams would be working into the night on their work plans, and Brian and I still need to set the schedule for tomorrow.

We’re past the point where we can have a lot of discussion, except for maybe about the name.

Day 3

@briancroxall: Prepare for more radio silence from #owot today as people put their heads down and write/code.

@DoughertyJack: At #owot we considered 120 different names for our tool and FINALLY selected number 121 as the winner. Stay tuned for Friday launch!

Wednesday is tough.  We have to come up with a name, and all that exploration from yesterday needs to be a prototype by the end of the day. We are still hammering out the language we use in talking to each other and there’s some middle ground to be found on terminology. One example is the use of the word “standup” in our schedule.  “Standup” means something very specific to developers familiar with the Agile development process whereas I just mean, “short update meeting.”  Our approach to dealing with these issues is to identify the confusion and quickly agree on language we all understand.

I spend most of the day with the outreach team.  We have set a deadline for presenting names at lunchtime and are hoping the whole team can vote after lunch.  This schedule turns out to be folly as the name takes most of the day and we have to adjust our meeting times accordingly.  As project managers, Brian and I are canceling meetings (because folks are on a roll, we haven’t met a deadline, etc) whenever we can, but we have to balance this with keeping the whole team informed.

Camping out in a living room type space in RRCHNM, spread out among couches and looking at a Google Doc being edited on a big-screen TV, the outreach team and various interested parties spend most of the day brainstorming names.  We take breaks to work on the process press release and other essential tasks, but the name is the thing for the moment.  We need a name to start working on branding and logos.  Product press releases need to be completed, the dev team needs a named target and of course, swag must be ordered.

It is in this process, however, that an Aha! moment occurs for me.  We have been discussing names for a long time and folks are getting punchy.  The dev team lead and our designer, Amy Papaelias, have joined the outreach team along with most of our CHNM mentors.  I want to revisit something dev team member Eli Rose said earlier in the day.  To paraphrase, Eli said that he liked the idea that the tool automated or mechanized the concept of surprise.  So I repeat Eli's concept to the group, and not long after that Mia says, “what about Serendip-o-matic?”  The group awards the name with head nods and “I like that”s and after running it by developers and dealing with our reservations (e.g., hyphens, really?), history is made.

As relieved as I am to finally have a name, the bigger takeaway for me here is in the role of the manager.  I am not responsible for the inspiration for the name or the name itself; my role was to repeat the concept to the right combination of people at a time when the team was stuck.  The project managers can create an opportunity for the brilliant folks on the team to make connections.  This thought serves as a consolation to me as I continue to struggle without concrete tasks.

Meanwhile, on the other side of the building, the rest of the dev team is pushing to finish code.  We see a working prototype at the end of the day, and folks are feeling good, but it's been a long day.  So we go to dinner as a team, and leave the work behind for a couple of hours, though Amy is furiously sketching at various moments throughout the meal as she tries to develop a look and feel for this newly named thing.

On the way home from dinner, I think, “there's only two days left.”  All of a sudden it feels like we haven't gotten anywhere.

Day 4

@shazamrys: Just looking at first logo designs from @fontnerd. This is going to be great. #owot

We start hectic but excited.  Both teams were working into the night, Amy has a logo strategy, and it goes over great.  Still, there's lots of work to do today.  In the morning, Brian and I sit down to discuss what happens with the tool and the team next week and the week after, before jumping in to help the teams where we can: finding a laptop for a dev team member, connecting someone with JavaScript experience to a team member who is stuck, or helping edit the press release.  This day is largely a blur in my recollection, but some salient points stick out.

The decision to add the Flickr API to our work in order to access the Flickr Commons is made with the dev team, based on the feeling that we have enough time and the images located there enhance our search results and expand our coverage of subject areas and geographic locations.

We also spend today addressing issues.  The work of both teams overlaps in some key areas.  In the afternoon, Brian and I realize that we have mishandled some of the communication regarding language on the front page and both teams are working on the text.  We scramble to unify the approaches and make sure that efforts are not wasted.

This is another learning moment for me.  I keep flashing on Sheila's words from Monday, and worry that our concrete decision-making process is suffering from “too many cooks in the kitchen.”  Everyone on this team has a stake in the success of this project and we have lots of smart people with valid opinions.  But everyone can't vote on everything and we are spending too much time getting consensus now, with a mere twenty-four hours to go.  As a project manager, part of my job is to start streamlining and making executive decisions, but I am struggling with how to do that.

As we prepare to leave the center at 6pm, things are feeling disconnected.  This day has flown by.  Both teams are overwhelmed by what has to get done before tomorrow, and despite hard work throughout the day, we are still trying to get a dev server and a production server up and running.  As we regroup at the Inn, the dev team heads upstairs to a quiet space to work and eat and the outreach team sets up in the lobby.

Then, good news arrives.  Rebecca Sutton-Koeser has managed to get both the dev and production servers up and the code is able to be deployed.  (We are using Heroku and Amazon Web Services specifically, but again, please see the wiki for more technical details.)

The outreach team continues to work on documentation and release strategy, and Brian and I continue to step in where we can.  Everyone is working until midnight or later, but feeling much better about our status than we did at 6pm.

Day 5

@raypalin: If I were to wait one minute, I could say launch is today. Herculean effort in final push by my #owot colleagues. Outstanding, inspiring.

The final tasks are upon us.  Scott Williams moves on from his development responsibilities to facilitate user testing, which was forced to slide from Thursday due to our server problems.  Amanda Visconti works to get the interactive results screen finalized.  Ray Palin hones our list of press contacts and works with Amy to get the swag design in place.  Amrys Williams collaborates with the outreach team and then Sheila to publish the product press release.  Both the dev and outreach teams triage and fix and tweak and defer issues as we move towards our 1pm “code chill”, a point at which we hope to have the code in a fairly stable state.

We are still making too many decisions with too many people, and I find myself weighing not only the options but how attached people are to either option.  Several choices are made because they reflect the path of least resistance.  The time to argue is through and I trust the team’s opinions even when I don’t agree.

We end up running a little behind and the code freeze scheduled for 2pm slides to 2:15.  But at this point we know: we’re going live at 3:15pm.

Jack Dougherty has arranged a Google hangout with Dan Cohen of the Digital Public Library of America and Brett Bobley and Jen Serventi of the NEH Office of Digital Humanities, which the project managers co-host.  We broadcast the conversation live via the One Week | One Tool website.

The code goes live and the broadcast starts but my jitters do not subside…until I hear my teammates cheering in the hangout.  Serendip-o-matic is live.

The Aftermath

At 8am on Day 6, Serendip-o-matic had its first pull request and later in the day, a fourth API – Trove of Australia – was integrated.  As I drafted this blog post on Day 7, I received email after email generated by the active issue queue and the tweet stream at #owot is still being populated.  On Day 9, the developers continue to fix issues and we are all thinking about long term strategy.  We are brainstorming ways to share our experience and help other teams achieve similar results.

I found One Week | One Tool incredibly challenging and therefore a highly rewarding experience.  My major challenge lay in shifting my mindset from that of someone hammering on a keyboard in a one-person shop to that of a project manager for a twelve-person team.  I write for this blog because I like to build things and share how I built them, but I have never experienced the building from this angle before.  The tight timeline ensured that we would not have time to go back and agonize over decisions, so it was a bit like living in a project management accelerator.  We had to recognize issues, fix them and move on quickly, so as not to derail the project.

However, even in those times when I became acutely more aware of the clock, I never doubted that we would make it.  The entire team is so talented; I never lost my faith that a product would emerge.  And, it’s an application that I will use, for inspiration and for making fortunate discoveries.

@meghanfrazer: I am in awe. Amazing to work with such a smart, giving team. #owot #feedthemachine


The #OWOT team. Photo by Sharon Leon.

(More on One Week | One Tool, including other blog entries, can be found by visiting the One Week | One Tool Zotero Group.)


Event Tracking with Google Analytics

In a previous post by Kelly Sattler and Joel Richard, we explored using web analytics to measure a website's success. That post provides a clear high-level picture of how to create an analytics strategy by evaluating our users, web content, and goals. This post will explore a single topic in depth: how to set up event tracking in Google Analytics.

Why Do We Need Event Tracking?

Finding solid figures to demonstrate a library's value and make strategic decisions is a topic of increasing importance. It can be tough to stitch together the right information from a hodgepodge of third-party services; we rely on our ILSs to report circulation totals, our databases to report usage like full-text downloads, and our web analytics software to show visitor totals. But are pageviews and bounce rates the only meaningful measures of website success? Luckily, Google Analytics provides a way to track arbitrary events which occur on web pages. Event tracking lets us define what is important. Do we want to monitor how many people hover over a carousel of book covers, but only in the first second after the page has loaded? How about how many people first hover over the carousel, then the search box, but end up clicking a link in the footer? As long as we can imagine it and JavaScript has an event for it, we can track it.

How It Works

Many people are probably familiar with Google Analytics as a snippet of JavaScript pasted into their web pages. But Analytics also exposes some of its inner workings to manipulation. We can use the _gaq.push method to execute a “_trackEvent” method which sends information about our event back to Analytics. The basic structure of a call to _trackEvent is:

_gaq.push( [ '_trackEvent', 'the category of the event', 'the action performed', 'an optional label for the event', 'an optional integer value that quantifies something about the event' ] );

Looking at the array parameter of _gaq.push is telling: we should have an idea of what our event categories, actions, labels, and quantitative details will be before we go crazy adding tracking code to all our web pages. Once events are recorded, they cannot be deleted from Analytics. Developing a firm plan helps us to avoid the danger of switching the definition of our fields after we start collecting data.
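For example, a filled-in call for the databases page we will build below might look like this (the specific values are only illustrative):

_gaq.push( [ '_trackEvent', 'database', 'Academic Search Premier', 'Social Sciences', 1 ] );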

We can be a bit creative with these fields. “Action” and “label” are just Google’s way of describing them; in reality, we can set up anything we like, using category->action->label as a triple-tiered hierarchy or as three independent variables.
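To illustrate how flexible these fields are, here is a rough sketch of the carousel scenario posed earlier: recording a hover over a carousel of book covers only if it happens within the first second after the page loads. The #book-carousel selector and the category, action, and label values are assumptions made for the example.

var pageLoadedAt;

$( window ).on( 'load', function () {
    pageLoadedAt = new Date().getTime();
} );

$( '#book-carousel' ).on( 'mouseenter', function () {
    var elapsed = new Date().getTime() - pageLoadedAt;
    // Only record hovers that occur within one second of page load, and
    // use the integer value field to store how quickly the hover happened.
    if ( pageLoadedAt && elapsed <= 1000 ) {
        _gaq.push( [ '_trackEvent', 'carousel', 'early-hover', 'home page', elapsed ] );
    }
} );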

Example: A List of Databases

Almost every library has a web page listing third-party databases, be they subscription or open access. This is a prime opportunity for event tracking because of the numerous external links. Default metrics can be misleading on this type of page. Bounce rate—the proportion of visitors who start on one of our pages and then immediately leave without viewing another page—is typically considered a negative metric; if a page has a high bounce rate, then visitors are not engaged with its content. But the purpose of a databases page is to get visitors to their research destinations as quickly as possible, so here a high bounce rate is actually a positive sign. Similarly, time spent on page is typically considered a positive sign of engagement, but on a databases page it's more likely to indicate confusion or difficulty browsing. With event tracking, we can not only track which links were clicked but also make it so database links don't count towards bounce rate, giving us a more realistic picture of the page's success.

One way of structuring “database” events is:

  • The top-level Category is “database”
  • The Action is the name of the database itself, e.g. “Academic Search Premier”
  • The Label is the topical section in which the database is listed, e.g. “Social Sciences”

The final, quantitative piece could be the position of the database in the list or the number of seconds after page load it took the user to click its link. We could instead report some boolean value (as 0 or 1), such as whether the database is open access or highlighted in some way.

To implement this, we set up a JavaScript function which will be called every time one of our events occurs. We will store some contextual information in variables, push that information to Google Analytics, and then delay the page's navigation so the event has a chance to be recorded. Let's walk through the code piece by piece:

function databaseTracking  ( event ) {
    var destination = $( this )[ 0 ].href,
        resource = $( this ).text(),
        // move up from <a> to parent element, then find the nearest preceding <h2> section header
        section = $( this ).parent().prevAll( 'h2' )[ 0 ].innerText,
        highlighted = $( this ).hasClass( 'highlighted' ) ? 1 : 0;

    _gaq.push( [ '_trackEvent', 'database', resource, section, highlighted ] );

The top of our function just grabs information from the page. We're using jQuery to make our lives easier, so all the $( this ) pieces of our code refer to the element that initiated the event. In our case, that's the link pointing to an external database which the user just clicked. So we set destination to the link's href attribute, resource to its text (e.g. the database's name), section to the text inside the h2 element that labels a topical set of databases, and highlighted to a boolean value equal to 1 if the element has a class of “highlighted.” Next, this data is pushed into the _gaq array, which is a queue of functions and their parameters that Analytics fires asynchronously. In this instance, we're telling Analytics to run the _trackEvent function with the parameters that follow. Analytics will then record an event of type “database” with an action of [database name], a label of [section header], and a boolean representing whether it was highlighted or not.

    setTimeout( function () {
        window.location = destination;
    }, 200 );
    event.preventDefault();
}

Next comes perhaps the least obvious piece: we prevent the default browser behavior from occurring, which in the case of a link is navigating away from our page, but then send the user to destination 200 milliseconds later anyway. The _trackEvent function now has a chance to fire; if we let the user follow the link right away it might not complete and our event would not be recorded.1

$( document ).ready( function () {
    // target all anchors in list of databases
    $( '#databases-list a' ).on( 'click', databaseTracking );
} );

There’s one last step; merely defining the databaseTracking function won’t cause it to execute when we want it to. JavaScript uses event handlers to execute certain functions based on various user actions, such as mousing over or clicking an element. Here, we add click event handlers to all <a> elements in the list of databases. Now whenever a user clicks a link in the databases list (which has a container with id “databases-list”), databaseTracking will run and send data to Google Analytics.

There is a demo on JSFiddle which uses the code above with some sample HTML. Every time you click a link, a pop-up shows you what the _gaq.push array looks like.

Though we used jQuery in our example, any JavaScript library can be used with event tracking.2 The procedure is always the same: write a function that gathers data to send back to Google Analytics and then add that function as a handler to an appropriate event, such as click or mouseover, on an element.

For another example, complete with code samples, see the article “Discovering Digital Library User Behavior with Google Analytics” in Code4Lib Journal. In it, Kirk Hess of the University of Illinois Urbana-Champaign details how to use event tracking to see how often external links are clicked or files are downloaded. While these events are particularly meaningful to digital libraries, most libraries offer PDFs or other documents online.

Some Ideas

The true power of Event Tracking is that it does not have to be limited to the mere clicking of hyperlinks; any interaction which JavaScript knows about can be recorded and categorized. Google’s own Event Tracking Guide uses the example of a video player, recording when control buttons like play, pause, and fast forward are activated. Here are some more obvious use cases for event tracking:

  • Track video plays on particular pages; we may already know how many views a video gets, but how many come from particular embedded instances of the video?
  • Clicking to external content, such as a vendor’s database or another library’s study materials.
  • If there is a print or “download to PDF” button on our site, we can track each time it’s clicked. Unfortunately, only Internet Explorer and Firefox (versions >= 6.0) have an onbeforeprint event in JavaScript which could be used to detect when a user hits the browser’s native print command; a short sketch of this idea follows the list.
  • Web applications are particularly suited to event tracking. Many modern web apps have a single page architecture, so while the user is constantly clicking and interacting within the app they rarely generate typical interaction statistics like pageviews or exits.
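
As promised above, here is a minimal sketch of the print-tracking idea. It assumes the same _gaq setup as the earlier examples and only does anything in browsers that implement onbeforeprint:

// record a "print" event for the current page when the print dialog opens
if ( 'onbeforeprint' in window ) {
    window.onbeforeprint = function () {
        _gaq.push( [ '_trackEvent', 'page', 'print', window.location.pathname ] );
    };
}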


Notes
  1. There is a discussion on the best way to delay outbound links enough to record them as events. A Google Analytics support page endorses the setTimeout approach; for other methods, there are threads on StackOverflow and various blog posts around the web. Alternatively, we could use the onmousedown event, which fires slightly earlier than onclick but might also record false positives due to click-and-drag scrolling.
  2. Below is an attempt at rewriting the jQuery tracking code in pure JavaScript. It will only work in modern browsers because of its use of querySelectorAll, parentElement, and previousElementSibling; versions of Internet Explorer prior to 9 also use their own attachEvent syntax for event handlers. Yes, there’s a reason people use libraries to do anything the least bit sophisticated with JavaScript.
function databaseTracking ( event ) {
    var destination = event.target.href,
        resource = event.target.innerHTML,
        section = "none",
        highlighted = event.target.className.match( /highlighted/ ) ? 1 : 0;

    // getting a parent element's nearest <h2> sibling is non-trivial without a library
    var currentSibling = event.target.parentElement;
    while ( currentSibling !== null ) {
        if ( currentSibling.tagName !== "H2" ) {
            currentSibling = currentSibling.previousElementSibling;
        }
        else {
            section = currentSibling.innerHTML;
            currentSibling = null;
        }
    }

    _gaq.push( [ '_trackEvent', 'database', resource, section, highlighted ] );

    // delay navigation to ensure event is recorded
    setTimeout( function () {
        window.location = destination;
    }, 200 );
    event.preventDefault();
}

document.addEventListener( 'DOMContentLoaded', function () {
    var dbLinks = document.querySelectorAll( '#databases-list a' ),
        len = dbLinks.length,
        i;
    for ( i = 0; i < len; i++ ) {
        dbLinks[ i ].addEventListener( 'click', databaseTracking, false );
    }
}, false );

References

Association of College & Research Libraries. (n.d.). ACRL Value of Academic Libraries. Retrieved January 12, 2013, from http://www.acrl.ala.org/value/
Event Tracking – Web Tracking (ga.js) – Google Analytics — Google Developers. (n.d.). Retrieved January 12, 2013, from https://developers.google.com/analytics/devguides/collection/gajs/eventTrackerGuide
Hess, K. (2012). Discovering Digital Library User Behavior with Google Analytics. The Code4Lib Journal, (17). Retrieved from http://journal.code4lib.org/articles/6942
Marek, K. (2011). Using Web Analytics in the Library a Library Technology Report. Chicago, IL: ALA Editions. Retrieved from http://public.eblib.com/EBLPublic/PublicView.do?ptiID=820360
Sattler, K., & Richard, J. (2012, October 30). Learning Web Analytics from the LITA 2012 National Forum Pre-conference. ACRL TechConnect Blog. Blog. Retrieved January 18, 2013, from http://acrl.ala.org/techconnect/?p=2133
Tracking Code: Event Tracking – Google Analytics — Google Developers. (n.d.). Retrieved January 12, 2013, from https://developers.google.com/analytics/devguides/collection/gajs/methods/gaJSApiEventTracking
window.onbeforeprint – Document Object Model (DOM) | MDN. (n.d.). Mozilla Developer Network. Retrieved January 12, 2013, from https://developer.mozilla.org/en-US/docs/DOM/window.onbeforeprint

An Elevator Pitch for File Naming Conventions

As a curator and a coder, I know it is essential to use naming conventions: a consistent approach to naming digital files and software components such as modules or variables.  However, when a student assistant asked me recently why it was important not to use spaces in our image file names, I struggled to come up with an answer.  “Because I said so,” while tempting, is not really an acceptable response.  Why, in fact, is this important?  For this blog entry, I set out to answer this question and to see if, along the way, I could develop an “elevator pitch” – a short spiel on the reasoning behind file naming conventions.

The Conventions

As a habit, I implore my assistants and anyone I work with on digital collections to adhere to the following when naming files:

  • Do not use spaces or special characters (other than “-” and “_”)
  • Use descriptive file names.  Descriptive file names include date information and keywords regarding the content of the file, within a reasonable length.
  • Give date information in the following format: YYYY-MM-DD.

So, 2013-01-03-SmithSculptureOSU.jpg would be an appropriate file name, whereas Smith Jan 13.jpg would not.  But are these practices still necessary?  Current versions of Windows, for example, will accept a wide variety of special characters and spaces in file names, so why is it important to restrict the use of these characters in our work?
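
As a rough illustration (my own sketch, not any kind of standard), the conventions above are simple enough to express as a pattern check in code:

// an optional YYYY-MM-DD prefix, then only letters, digits, hyphens, and
// underscores, followed by a file extension
function followsConvention ( fileName ) {
    return /^(\d{4}-\d{2}-\d{2}[-_])?[A-Za-z0-9_-]+\.[A-Za-z0-9]+$/.test( fileName );
}

followsConvention( '2013-01-03-SmithSculptureOSU.jpg' ); // true
followsConvention( 'Smith Jan 13.jpg' );                 // false (spaces)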

The Search Results

A quick Google search finds support for my assertions, though often for very specific cases of file management.  For example, the University of Oregon publishes recommendations on file naming for managing research data.  A similar guide is available from the University of Illinois library, but takes a much more technical, detailed stance on the format of file names for the purposes of the library’s digital content.

The Bentley Historical Library at University of Michigan, however, provides a general guide to digital file management very much in line with my practices: use descriptive directory and file names, avoid special characters and spaces.  In addition, this page discusses avoiding personal names in the directory structure and using consistent conventions to indicate the version of a file.

The Why – Dates

The Bentley page also provides links to a couple of sources which help answer the “why” question.  First, there is the ISO date standard (or, officially, “ISO 8601:2004: Data elements and interchange formats — Information interchange — Representation of dates and times”).  This standard dictates that dates be ordered from the largest unit to the smallest, so instead of the month-day-year we all wrote on our grade school papers, dates should take the form year-month-day.  Further, since we have passed into a new millennium, a four-digit year is necessary.  This provides a consistent format that eliminates confusion, and it also allows file systems to sort the files appropriately.  For example, let’s look at the following three files:

1960-11-01_libraryPhoto.jpg
1977-01-05_libraryPhoto.jpg
2000-05-01_libraryPhoto.jpg

If we expressed those dates in another format, say, month-day-year, they would not be listed in chronological order in a file system sorting alphabetically.  Instead, we would see:

01-05-1977_libraryPhoto.jpg
05-01-2000_libraryPhoto.jpg
11-01-1960_libraryPhoto.jpg

This may not be a problem if you are visually searching through three files, but what if there were 100?  Now, if we used only a two-digit year, we would see:

00-05-01_libraryPhoto.jpg
60-11-01_libraryPhoto.jpg
77-01-05_libraryPhoto.jpg

If we did not standardize the number of digits, even files from the same year would fall out of order.  Sorted alphabetically, three photographs from 1977 would appear as:

77-1-5_libraryPhoto.jpg
77-11-1_libraryPhoto.jpg
77-2-1_libraryPhoto.jpg

which puts the November photo ahead of the February one, because “11” begins with a “1.”

You can try this pretty easily on your own system.  Create three text files with the names above, sort the files by name and check the order.  Imagine the problems this might create for someone trying to quickly locate a file.
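
If creating test files feels like a chore, the same behavior shows up in a browser’s JavaScript console, since the default string sort orders names much the way a file listing sorted by name does (a file browser with “natural” numeric sorting may differ slightly, but the chronological scrambling is the same):

[ '1960-11-01_libraryPhoto.jpg', '1977-01-05_libraryPhoto.jpg', '2000-05-01_libraryPhoto.jpg' ].sort();
// => chronological order

[ '11-01-1960_libraryPhoto.jpg', '01-05-1977_libraryPhoto.jpg', '05-01-2000_libraryPhoto.jpg' ].sort();
// => [ '01-05-1977...', '05-01-2000...', '11-01-1960...' ] (not chronological)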

You might ask, why include the date at all, when dates are also maintained by the operating system?  There are many situations where the operating system dates are unreliable.  In cases where a file moves to a new drive or computer, for example, the Date Created may reflect the date the file moved to the new system, instead of the initial creation date.  Or, consider the case where a user opens a file to view it and the application changes the Date Modified, even though the file content was not modified.  Lastly, consider our earlier example of a photograph from 1960; the Date Created is likely to reflect the date of digitization.  In each of these examples, it would be helpful to include an additional date in the file name.

The Why – Descriptive File Names

So far we have digressed down a date-specific path.  What about our other conventions?  Why are those important?  Also linked from the Bentley Library and in the Google search results are YouTube videos created by the State Library of North Carolina which answer some of these questions.  The Inform U series on file naming has four parts, and is intended to help users manage personal files.  However, the rationale described in Part 1 for descriptive file names in personal file management also applies in our libraries.

First, we want to avoid the accidental overwriting of files.  Image files provide a good example here: many cameras use the file naming convention of IMG_1234.jpg.  If this name is unique to the system, that is fine, but in a situation where multiple cameras or scanners are generating files for a digital collection, there is potential for problems.  It is better to batch rename image files with more descriptive names. (Tutorials on this can be found all over the web, such as the first item in this list on using Mac’s Automator program to rename a batch of photos).

Second, we want to avoid the loss of files due to non-descriptive names.  While many operating systems will search the text content of files, naming files appropriately makes for more efficient access.  For example, consider mynotes.docx and 2012-01-05WebMeetingNotes.docx – which file’s contents are easier to ascertain?

I should note, however, that there are cases where non-descriptive file names are appropriate.  The use of a unique identifier as a filename is sometimes a necessary approach.  However, in those cases where you must use a non-descriptive filename, be sure that the file names are unique and in a descriptive directory structure.  Overall, it is important that others in the organization have the same ability to find and use the files we currently manage, long after we have moved on to another institution.

The Why – Special Characters & Spaces

We have now covered descriptive names and reasons for including dates, which leaves us with spaces and special characters to address.  Part 3 of the Inform U video series addresses this as well.  Special characters can carry special meaning in programming languages and operating systems, and they might be misinterpreted when included in file names.  For instance, the $ character marks the beginning of a variable name in the PHP programming language, and the \ character separates locations in a Windows file path.

Spaces may make names easier for humans to read, but systems generally do better without them.  While operating systems attempt to handle spaces gracefully and generally do so, browsers and other software are not consistent in how they treat them.  For example, consider a file stored in a digital repository system with a space in its name.  A user downloads the file and their browser truncates the file name after the first space.  Any descriptive information after that first space is lost, along with the file extension, which may make the file harder for less tech-savvy users to open.
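
A small illustration of why spaces are fragile once a file leaves your hands: in a URL, every space has to be escaped, and whether those pieces survive the round trip depends on software you do not control.  In a browser console, for example:

encodeURIComponent( 'Smith Jan 13.jpg' );                 // "Smith%20Jan%2013.jpg"
encodeURIComponent( '2013-01-03-SmithSculptureOSU.jpg' ); // unchanged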

The Pitch

That example leads us to the heart of the issue: we never know where our files are going to end up, especially files disseminated to the public.  Our content is useless if our users cannot open it due to a poorly formatted file name.  And, in the case of non-public files or our personal archives, it is essential to facilitate the discovery of items in the piles and piles of digital material accumulated every day.

So, do I have my elevator pitch?  I think so.  When asked about file naming standards in the future, I think I can safely reply with the following:  “It is impossible to accurately predict all of the situations in which a file might be used.  Therefore, in the interest of preserving access to digital files, we choose file name components that are least likely to cause a problem in any environment.  File names should provide context and be easily understood by humans and computers, now and in the future.”

And, like a good file name, that is much more effective and descriptive than, “Because I said so.”