My library has used Omeka as part of our suite of platforms for creating digital collections and exhibits for many years now. It’s easy to administer and use, and many of our students, particularly in history or digital humanities, learn how to create exhibits with it in class or have experience with it from other institutions, which makes it a good solution for student projects. It has also created challenges, however, since it’s been difficult to have multiple sites or distributed administration. A common scenario is that we have a volunteer student, often in history, working on a digital exhibit as part of a practicum, and we want the donor to review the exhibit before it goes live. We had to create administrative accounts for both the student and the donor, which required a lot of explanation about how to get into just the one part of the system they were supposed to be in (it’s possible to create a special account to view collections that aren’t public, but not exhibits). Even though the admin accounts can’t do everything (there’s a super admin level for that), it’s a bit alarming to hand out administrative accounts to people I barely know.
This problem goes away with Omeka S, which is the new and completely rebuilt Omeka. It supports having multiple sites (which is the new name for exhibits) and distributed administration by site. Along with this, there are sophisticated metadata templates that you can assign to sites or users, which takes away the need for lots of documentation on what metadata to use for which item type. When I showed a member of my library’s technical services department the metadata templates in Omeka S, she gasped with excitement. This should indicate that, at least for those of us working on the back end, this is a fun system to use.
Trying it Out For Yourself
I have included some screenshots below, but you might want to use the Omeka S Sandbox to follow along. You can experiment with anything, and the data is reset every Monday, Wednesday, Friday, and Sunday. The sandbox includes a variety of sample exhibits; one is “A Battered Tin Dispatch Box,” from which I include some screenshots below.
A Quick Tour Through Omeka S
This is what the Omeka Classic administrative dashboard looks like for a super administrator. And this is the dashboard for Omeka S. It’s not all that different functionally, but definitely a different aesthetic experience.
Most things in Omeka S work analogously to classic Omeka, but some things have been renamed or moved around. The documentation walks through everything in order, so it’s a great place to start learning. Overall, my feeling about Omeka S is that it’s much easier to tap into the powerful features with less of a learning curve. I first learned Omeka S at the DLF Forum conference in fall 2017 directly from Patrick Murray-John, the Omeka Development Team Manager, and some of what is below is from his description.
Omeka S has the very useful concept of Sites, which again function like exhibits in classic Omeka. Each site has its own set of administrative functions and user permissions, which allow for viewer, editor, or admin by site. I really appreciate this, since it allowed me to give student volunteers access to just the site they needed, and when we need to give other people access to view the site before it’s published we can do that. It’s easier to add outside or supplementary materials to the exhibit navigation. On the individual pages there are a variety of blocks available, and the layout is easier for people without a lot of HTML skills to set up.
Resource templates existed in Omeka Classic, but were less straightforward. Now you can set a resource template with properties from multiple vocabularies and build the documentation right into the template. The data type can be text or URI, or draw from vocabularies with autosuggest. For example, you can set the Rights field to draw from Rights Statement options.
Items work in a similar fashion to Omeka Classic. Items exist at the installation level, so can be reused across multiple sites. What’s great is that the nature of an item can be much more flexible. They can include URIs, maps, and multiple types of media such as a URL, HTML, IIIF image, oEmbed, or YouTube. This reflects the actual way that we were using Omeka Classic, but without the technical overhead to make it all work. This will make it easier for more people to create much more interactive and web-integrated exhibits.
Item Sets are the new name given to Collections and, like Items, they can have metadata from multiple vocabularies. Unlike in Collections, an item can belong to multiple Item Sets, and Item Sets can be associated with sites to limit what people see. The tools for batch adding and editing are similar, but more powerful because you can actually remove or edit metadata in bulk.
Themes in Omeka S have changed quite a bit, and as Murray-John explained, it is more complicated to do theming than in the past. Rather than calling local functions, Omeka S uses patterns from Zend Framework 3, and so the process of theming will require more careful thought and planning. That said, the base themes provided are a great starting point, and thanks to the multiple options for layouts in sites, it’s less critical to be able to create custom themes for certain exhibits. I wrote about how to create themes in Omeka in 2013, and while some of that still holds true, you would want to consult the updated documentation to see how to do this in Omeka S.
One of my favorite things in Omeka S is the Mapping module, which allows you to add geolocation metadata to items, and create a map on site pages. Here’s an example from the Omeka S Sandbox with locations related to Scotland Yard mapped for an item in the Battered Tin Dispatch Box exhibit.
This can then turn into an interactive map on the front end.
For the vast majority of mapping projects that our students want to do, this works in a very straightforward manner. Neatline is a plugin for Omeka Classic that allows much more sophisticated mapping and timelines–while it should be ported over to Omeka S, it currently is not listed as a module. In my experience, however, Neatline is more powerful than what many people are trying to do, and that added complexity can be a challenge. So I think the Mapping module looks like a great compromise.
Possible Approaches to Migration
Migration between Omeka Classic and Omeka S works well for items. For that, there’s the Omeka2 Importer module. Because exhibits work differently, they would have to be recreated. Omeka.net, the hosted version of Omeka, will stay on Omeka Classic for the foreseeable future, so there’s no concern that it will stop being supported any time soon, according to Patrick Murray-John.
We are still working on setting up Omeka S. My personal approach is that as new ideas for exhibits come up we will start them first in Omeka S. As we have time and interest, we may start to migrate older exhibits if they need continual management. Some of our older exhibits rely on Omeka Classic plugins, but we are planning to mostly create new exhibits in Omeka S that don’t rely on Omeka Classic plugins. I am excited to pair this with our other digital collection platforms to build exhibits that use content across our platforms and extend into the wider web.
The 2017 Digital Library Federation (DLF) Forum will take place October 23-25 in Pittsburgh, and throughout the program there are multiple opportunities to interact with several of the DLF Groups. For those who are new to DLF, or have never been to a Forum before, it may be hard to know what to expect or how these Groups are different from other associations’ interest groups or committees.
It can be helpful to remember that DLF is an institutional member organization. You don’t need a personal membership to belong to a working group of DLF. Actually, you don’t even need to belong to an institution to sign up to work with a group. DLF practices a very welcoming and inclusive approach to community. Membership does grant discounts on the Forum or other programs, like the eResearch Network, but more importantly, it signals an institution’s commitment to the work that DLF supports and coordinates – such as these groups.
DLF’s groups are not just interest groups or working groups. They are essentially communities that drive a conversation around a topic, or have a particular focus, and usually have some kind of an output. Here is the current list of active groups, with a brief description from their website – those that have programming at this year’s Forum are noted with an asterisk:
The DLF Assessment Interest Group (DLF AIG) was formed in 2014 as an informal interest group within the larger DLF community. The group meets during the DLF Forum to share problems, ideas, and solutions [related to digital library assessment]. The group also has a dedicated Google Group, DLF-supported wiki, and project documentation available in the Open Science Framework.
The DLF Digital Library Pedagogy group is an informal community within the larger DLF community that was formed thanks to practitioner interest following the 2015 DLF Forum. The group, which has a dedicated Google Group, is open to anyone interested in learning about or collaborating on digital library pedagogy.
The DLF eResearch Network brings together teams from research-supporting libraries to strengthen and advance their data services and digital scholarship roles within their organizations. The core of the 2017 network is a working curriculum that guides participants through 6 monthly webinars that address current topics and strategic methods for supporting and facilitating data services and digital scholarship locally.
DLF has created a new framework for establishing mentoring relationships among our community members, centered around face-to-face interaction at our annual Forum. The program is meant to be lightweight, collegial, and mostly focused around the annual DLF Forum.
In 2015, a volunteer planning committee from within our Liberal Arts College community organized a first, one-day Liberal Arts Colleges Pre-conference, specifically created for those who work with digital libraries and/or digital scholarship at teaching-focused institutions, held before the DLF Forum in Vancouver. Both this event and the one that followed in Milwaukee (2016) were huge successes, including concurrent sessions of presentations and panels on pedagogical, organizational, and technological approaches to the digital humanities and digital scholarship, data curation, digital collections, and digital preservation.
All DLF practitioners with museum interests or who engage in college and university museum-based projects are welcome to join. Likewise, current DLF member institutions with museums, galleries, and museum libraries are invited to participate in Museums Cohort conversations.
The DLF Project Managers group is an informal community within the larger DLF community. They meet at the annual DLF Forum and also have a dedicated listserv. The DLF PM Group was formed in 2008 to acknowledge the intersection of the discipline of project management and library technology. The group provides a forum for sharing project management methodologies and tools, alongside broader discussions that consider issues such as portfolio management and cross-organizational communication. The group also maintains an eye towards keeping pace with the dynamic digital library landscape, by bringing new and evolving project management practices to the attention and mutual benefit of our colleagues.
A new DLF group, looking for all levels of commitment, from willingness to be a co-leader of the Working Group to dropping in to point out a good article/blog post/someone-doing-this-already we may not have seen. A Google Group is used for coordination of meetings and work.
Metadata is hard. The Metadata Support Group aims to help. This is a place to share resources, strategies for working through some common metadata conundrums, and reassurances that you’re not the only one that has no idea how that happened. If you’re coming here with a problem we hope you’ll find a solution or a strategy to move you towards a solution!
These groups are excellent ways to learn more about a topic, contribute to problem-solving strategies, and to network with others who share your interests. As you can see, some of these groups have been around for nearly a decade, while others just started this year. There have also been several groups that have sunsetted, reflecting DLF groups’ strength as responsive and current communities, based on need and interest.
If you are at the 2017 Forum, consider learning more by joining a group’s working lunch or presentation. And remember, these groups are based on need and interest. If you don’t see something that stirs your passion reflected in the current DLF community, consider proposing it!
I have a confession: XSLT (eXtensible Stylesheet Language Transformations) is one of those library technologies I’ve dreaded learning. While it’s come up several times in my career, I’ve always managed to avoid it. I’m not normally like this—I truly enjoy learning new skills, especially programming languages or coding tools, so much so that I’ll find myself diving into tutorials or books right after returning home from work.
Why is XSLT so abhorrent? I have more opinions about programming languages than I do expertise. Here are some opinions: programming languages should be readable. If they have shorthands, they should be elegant & intuitive. They shouldn’t need much boilerplate. And they should be recognizable to a degree; for better or worse, many programming languages share one of a very small number of lineages which means that recognizing fundamental constructs like functions & arrays isn’t too difficult if you can identify the family a language belongs to.
XSLT’s family is XML. An XSLT script is, in fact, valid XML. And XML is not a programming language, it’s a markup language that’s used for virtually everything. While I understand XML’s ubiquity, I’m not a fan of it in general & I think other serializations make more sense in many situations. As an example, take a list of elements in XML:
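For illustration only, here is a sketch of how a short list might be serialized in XML. The element names & values are hypothetical, invented for this example rather than taken from any real schema:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical element names, chosen only for this illustration -->
<fruits>
  <fruit>apple</fruit>
  <fruit>banana</fruit>
  <fruit>cherry</fruit>
</fruits>
```

Every value needs both an opening & a closing tag, & the list itself needs a wrapper element of its own.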
Background: IR OAI OMG
Alright, now that you’ve read entirely too much about What Eric Thinks of XSLT, why was he forced to learn it? And why did he start speaking in the third person all of a sudden? He should switch back to first person.
My institution wants to start exposing our digital collections more. While the human-facing web presence of our institutional repository is growing, we also need to start publishing our metadata in a machine-readable format. This will allow our collections to be consumed by large aggregators, specifically Calisphere, WorldCat, & DPLA.[1] Luckily, libraries already have a well-established standard to turn to for these needs: OAI-PMH. OAI-PMH lets us expose our repository metadata in XML in a way that allows a harvesting application to periodically fetch batches of records, adding new ones & updating ones that have changed.
Right away, there are challenges to exposing our EQUELLA repository’s metadata. We use MODS in our repository, while OAI expects you to use Dublin Core. Luckily, these are two common formats & there’s a lot of information on how to map information between them.[2] Unfortunately, our MODS schema is heavily modified. Decisions were made to add or alter elements based on local needs. What’s worse, in order to make certain user-facing fields easier to use, our repository software makes us insert wrapper elements into our metadata schema. All of this combines to make our MODS-to-DC mapping utterly unique & more complicated than usual.
When I went to inspect our repository’s OAI implementation out of the box, I was greeted by records like this one:
<dc:description>Explored the experience of pregnant and parenting teenagers via an installation and a symposium, "Cribs, Classrooms, and Communities: The Teen Pregnancy Controversy."</dc:description>
The default settings were showing only a Dublin Core title & description for each item, which was certainly not bad for no configuration at all, but not desirable. We heavily invested in designing our repository’s upload form & cataloging. We want that work to shine through in the metadata we provide to external services. Now, after applying some very basic XSLT which I’ll cover in the remainder of this post, our OAI endpoint looks quite a bit better. There are numerous fields represented for each record, though our original records are still a bit richer in information.
A Metadata Transformation Language
Let’s talk first about what XSLT is before we talk about how to use it. XSLT is unique in that it’s specifically designed to transform XML documents. This doesn’t necessarily mean it was designed for mapping from one metadata schema to another, as XML is used for more than just metadata & XSLT can do more than just shuffle around values, but it does mean that the language is uniquely suited to that task. XSLT allows us to change certain elements in the original document, alter text, & add or drop pieces of information.
In this post, we’ll specifically look at converting the Library of Congress’ MODS schema to Dublin Core. LOC has provided a handy map between the two which illustrates the complexity of the task. A few things that we need to address:
All data is going to need a new field; there isn’t really a way to simply “leave” an element in its place
MODS is hierarchical, meaning some elements have child elements, while Dublin Core is flat with only a top level bearing no children
Sometimes multiple MODS elements will collapse into a single Dublin Core field e.g. subject/name, subject/occupation, subject/topic, & classification all file under DC’s Subject field
Inversely, a MODS name/namePart element might map to either the DC Creator or Contributor fields depending upon the role of the person being referred to (captured in the role/roleTerm child of the name element)
Both schemas have their own slightly differing vocabulary of resource types, stored in MODS’ typeOfResource & DC’s Type, so what one schema considers a “sound recording-nonmusical” the other considers merely “Sound”, for instance
I see mapping between metadata schemas as a subset of data wrangling, & data wrangling is one of the most common aspects of my Systems Librarian position: I take CSV reports from our student data system & map them into a weird MARC-like format for the Millennium ILS to ingest, I take course lists & turn them into a controlled vocabulary of sorts in our repository, & of course I convert our repository’s MODS into OAI DC. All of these procedures are duct-taped together with custom scripts filled with comments detailing bizarre data behaviors. XSLT is one of the few languages designed for this type of work. While there are software packages like OpenRefine or Stanford’s Data Wrangler, sometimes the power & flexibility of a programming language is preferable. It’s disappointing to me that the only prominent choice is also XML-based.
To try out the examples below, you’ll need an XML document to use as input (preferably MODS, as that’s what the examples target) & an XSLT processor. On Mac OS X, the built-in program xsltproc does the trick. You can run a document through a transformation with a command of the form xsltproc stylesheet.xsl input.xml, which prints the transformed document to standard output (the two file names are placeholders for your own stylesheet & input).
For Linux, you can install xsltproc as part of the libxslt package (some distributions package it separately, simply as xsltproc). Windows users can install the Saxon XSLT processor. Most web browsers support XSLT processing, too. If you add a line like <?xml-stylesheet type='text/xsl' href='name-of-stylesheet.xsl'?> up at the top of an XML file, the browser will automatically run the XML through your stylesheet & present you with the results.
Pretty much all XSLT starts with some boilerplate that looks like this:
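What follows is a minimal sketch of that boilerplate, not a definitive version. The namespace URIs are the standard ones for XSLT, OAI Dublin Core, & Dublin Core, but the match="/" pattern (the document root) & the xsl:output settings are assumptions made so that the sketch runs on its own:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">

  <!-- Assumption: serialize the result as indented XML -->
  <xsl:output method="xml" indent="yes"/>

  <!-- Assumption: match="/" starts processing at the document root -->
  <xsl:template match="/">
    <!-- Root element of the output document -->
    <oai_dc:dc>
      <!-- ... mapping instructions go here ... -->
    </oai_dc:dc>
  </xsl:template>

</xsl:stylesheet>
```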
You start an XML document, open up an xsl:stylesheet element with a certain namespace, & define a template inside which you’re going to apply to the incoming document. The xsl:template element’s match attribute can differ depending on your objective—here we’re selecting all data in the input document, attributes & nodes—but that’s the basic setup. You then declare the root element of your output document inside the template, which here is the <oai_dc:dc> tag. What matters is what goes on inside the template, represented in the example with an ellipsis, since that gives us the ability to map elements to new locations.
Confession: I haven’t taken the time to fully learn the usage of the xsl:template element. You can use multiple of them within an xsl:stylesheet, they can target different elements in the origin document with their match attributes, & you can invoke them later with xsl:apply-templates (or, for named templates, xsl:call-template). This lets you modularize your stylesheet into several smaller, focused templates. But I won’t discuss them further, in the hopes that showing other features provides enough detail to communicate the substance of XSLT.
Let’s look at another example, which simply copies the MODS’ identifier element to Dublin Core’s dc:identifier, skipping everything else in the input document. Going forward, I will leave off the XML prolog <?xml …?> & all of the wrapping xsl:stylesheet, xsl:template, & oai_dc:dc elements for brevity’s sake.
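A minimal sketch of such a stylesheet follows, written out in full (wrappers included) so that it can be run on its own. It assumes an input document whose root element is an un-namespaced mods element; the sample input in the comment, including the identifier value, is invented for illustration:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!--
  Assumed input (no namespaces, invented values):
    <mods>
      <identifier>abc123</identifier>
      <titleInfo><title>An Example Title</title></titleInfo>
    </mods>
-->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <oai_dc:dc>
      <!-- xsl:value-of is replaced by the text of the selected element -->
      <dc:identifier><xsl:value-of select="mods/identifier"/></dc:identifier>
    </oai_dc:dc>
  </xsl:template>
</xsl:stylesheet>
```

Run against the assumed input, this should yield an oai_dc:dc element holding a single dc:identifier; the title is dropped because nothing in the template selects it.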
We can see that the root element of the document has been transformed into oai_dc:dc & our identifier is present inside a dc:identifier element. So two things from our template above happened: first of all, anything that’s not an xsl: prefixed element is mostly output on the other end in exactly the same form. This goes for text as well as markup language tags, such as our oai_dc:dc opening & closing tags which are recreated exactly. Inside dc:identifier, something else happens: the xsl:value-of element selects an element from the origin document & replaces itself with that value. The key piece is the select attribute, which accepts an XPath query it uses on the input.
XPath is another entire language you need to know to use XSLT to transform XML documents. Luckily, if you’ve worked much with XML at all, you are probably familiar with its basics. XPath lets us query a document by providing hierarchical paths such that “mods/identifier” matches an “identifier” element which is the child of a “mods” element. The XPath queries in this post won’t be more sophisticated than that.
Let’s do the same thing but map several elements to different places. The example below is starting to resemble a more fully-fledged stylesheet which could actually be of use.
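Here is a sketch of what that might look like, again as a complete stylesheet & again assuming an un-namespaced mods root element. The choice of dc:description as the target for the MODS abstract follows the usual MODS-to-DC crosswalk:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <oai_dc:dc>
      <dc:identifier><xsl:value-of select="mods/identifier"/></dc:identifier>
      <!-- The nested titleInfo/title is pulled up into DC's flat structure -->
      <dc:title><xsl:value-of select="mods/titleInfo/title"/></dc:title>
      <!-- The MODS abstract becomes a DC description -->
      <dc:description><xsl:value-of select="mods/abstract"/></dc:description>
    </oai_dc:dc>
  </xsl:template>
</xsl:stylesheet>
```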
We’ve now mapped the identifier, abstract, & title from MODS into their appropriate Dublin Core locations. Yay! While this may seem straightforward, it is worth noting we took the MODS’ titleInfo/title element from out of its parent titleInfo element & placed it in the top-level of Dublin Core’s flat schema, handling one of the mapping complexities we’d noted earlier.
However, what happens when we pass a document lacking a title or abstract through our stylesheet? Say we start with input.xml:
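A hypothetical input.xml of that kind, with an invented identifier value & no namespaces, could be as small as this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical record: an identifier but no title or abstract -->
<mods>
  <identifier>abc123</identifier>
</mods>
```

Passing it through a stylesheet that always emits dc:title & dc:description leaves those two elements in the output, just with nothing inside them.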
While the xsl:value-of element didn’t find any content & thus returned an empty string, we were still telling our transformation to produce dc:title & dc:description fields. One way to work around this is using conditional if statements to first check if a field has any text in it before returning its corresponding element in the output schema. If statements in general are extremely useful during transformations, so let’s take a look at some.
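As a sketch, a stylesheet that guards its optional fields this way might look like the following. The normalize-space() comparison is one way to express “has any content” & is an assumption on my part; a simpler test would also work in many cases:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <oai_dc:dc>
      <!-- Always emitted -->
      <dc:identifier><xsl:value-of select="mods/identifier"/></dc:identifier>
      <!-- Emitted only when the title contains non-whitespace text -->
      <xsl:if test="normalize-space(mods/titleInfo/title) != ''">
        <dc:title><xsl:value-of select="mods/titleInfo/title"/></dc:title>
      </xsl:if>
      <!-- The same guard applied to the abstract -->
      <xsl:if test="normalize-space(mods/abstract) != ''">
        <dc:description><xsl:value-of select="mods/abstract"/></dc:description>
      </xsl:if>
    </oai_dc:dc>
  </xsl:template>
</xsl:stylesheet>
```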
This transformation does our standard mods/identifier->dc:identifier mapping (hopefully every document has an identifier…), but it uses the xsl:if element to first test if mods/titleInfo/title has any content before outputting anything. The xsl:if syntax is <xsl:if test="condition"> where the “condition” can be one of many standard comparisons that return a “true” or “false” value: equal to “=”, not equal to “!=”, greater than “&gt;”, & less than “&lt;”. Note that, since we’re working with XML, a literal less-than sign is not allowed inside an attribute value (& the greater-than sign is usually escaped alongside it), so we use their XML character entity forms.
Hyper Advanced (not really)
We can get a long way with just xsl:value-of, xsl:if, & our innate verve, but let’s learn a few more useful XSLT constructs. Remember earlier when we noticed that, depending on the person’s role, the “namePart” value in MODS might be mapped to one of two DC elements, Creator or Contributor? We can’t actually handle that situation with what we know, because we’ve only mapped singular & not repeating elements. Our “select” & “test” attributes will only select the first matching element in the origin MODS document. We need some kind of loop that lets us iterate over repeated elements while also testing their values.
Below, we use xsl:for-each to loop over a selected element, & then we apply two tests to each element to determine if it’s referring to an author or an editor.
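A sketch of that loop, as a complete stylesheet, is shown here. It assumes un-namespaced MODS in which each name element has a namePart child & a role/roleTerm child holding text such as “author” or “editor”; sending authors to dc:creator & editors to dc:contributor is my assumption about the mapping:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <oai_dc:dc>
      <!-- Visit every name element, not just the first one -->
      <xsl:for-each select="mods/name">
        <!-- Inside the loop, paths are relative to the current name -->
        <xsl:if test="contains(role/roleTerm, 'author')">
          <dc:creator><xsl:value-of select="namePart"/></dc:creator>
        </xsl:if>
        <xsl:if test="contains(role/roleTerm, 'editor')">
          <dc:contributor><xsl:value-of select="namePart"/></dc:contributor>
        </xsl:if>
      </xsl:for-each>
    </oai_dc:dc>
  </xsl:template>
</xsl:stylesheet>
```

Note that contains() is case-sensitive, so a roleTerm of “Author” would not match the first test as written.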
Using xsl:for-each in combination with xsl:if accomplished one of the more onerous data mapping tasks we’re faced with in a rather elegant manner. Inside our test attributes we also see something new: we’re using a contains() function. XSLT has many functions which can be used inside certain attributes. The contains() syntax is straightforward: it accepts two parameters, an XPath & a string of text, returning “true” if the text is found in the value of the XPath or false otherwise.
Alright, now let’s tackle something truly intimidating: how could we possibly handle mapping the two “resource type” vocabularies to one another? We can see now that a very long series of xsl:if statements inside a xsl:for-each loop should get the job done. But there’s a slightly nicer method available in xsl:choose & xsl:when:
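The sketch below shows one way such a crosswalk might be written. It covers only a handful of MODS typeOfResource values, paired with DCMI Type terms, & assumes un-namespaced MODS; a real crosswalk would need more xsl:when branches:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <xsl:output method="xml" indent="yes"/>
  <xsl:template match="/">
    <oai_dc:dc>
      <xsl:for-each select="mods/typeOfResource">
        <!-- Look up the node's text once, then reuse it as $type -->
        <xsl:variable name="type" select="text()"/>
        <xsl:choose>
          <xsl:when test="$type = 'text'">
            <dc:type>Text</dc:type>
          </xsl:when>
          <!-- Covers "sound recording" plus its -musical & -nonmusical forms -->
          <xsl:when test="starts-with($type, 'sound recording')">
            <dc:type>Sound</dc:type>
          </xsl:when>
          <xsl:when test="$type = 'still image'">
            <dc:type>StillImage</dc:type>
          </xsl:when>
          <xsl:when test="$type = 'moving image'">
            <dc:type>MovingImage</dc:type>
          </xsl:when>
          <xsl:when test="$type = 'three dimensional object'">
            <dc:type>PhysicalObject</dc:type>
          </xsl:when>
          <!-- Unlisted types produce no dc:type in this sketch -->
        </xsl:choose>
      </xsl:for-each>
    </oai_dc:dc>
  </xsl:template>
</xsl:stylesheet>
```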
There are a few little tricks in here, but the general frame should be apparent: we loop over all the typeOfResource elements and, when they match one of the tests in our xsl:when element, the text or XSLT commands inside the when block are produced. This lets us set up a nice crosswalk between the MODS & DC type vocabularies. It’s also a bit faster than a series of “if” statements, since the xsl:choose block will exit as soon as a test returns true, while all if statements would execute even if only the very first one was necessary. Finally, it’s also possible to provide an xsl:otherwise block at the end of a series of xsl:when statements which is used as a fallback: if none of the “when” tests were true, the fallback is output. If there were overarching “base type” or “unknown type” values in Dublin Core then this would be useful.
The two other new pieces we saw in the above XSLT snippet: we used xsl:variable to store the text value of the typeOfResource node, obtained with the text() function, & then referred to it later by attaching a dollar sign sigil to the variable’s name, much like BASH, Perl, & PHP do with their variables. Looking up the value of the node only once & then using the stand-in variable is another minor speed optimization. We also used the XSLT function starts-with to test three MODS type values at once. How did I know that function exists? I Googled it, like a professional software developer. The W3Schools reference on functions is a thorough overview. In general, if you find yourself thinking “I bet there’s a handy shortcut function that would make this less painful…” then you should search for one.
After much stumbling, followed by trial & error, followed by more stumbling, our OAI endpoint is looking much better thanks to my XSLT stylesheet. We’re publishing format, creator, contributor, type, & rights information in Dublin Core. I’m certain that my examples here, & the code in my final script, don’t follow XSLT best practices. I’m quite shaky on some of the fundamental mechanics of XSLT, like xsl:template.[3] Nonetheless, I hope this post gives a broad overview of the technology & its application. Go forth & transform!
[1] Incidentally, I’m also trying to get our collections better indexed by search engines like Google & Google Scholar using Schema.org & other metadata embedded in HTML. That’s a very different beast though & not strongly related to the XSLT work that I am discussing here.
[2] I referred to the Library of Congress’ guide on the MODS site when constructing our crosswalk.
[3] I suspect my stylesheets would need far fewer xsl:if statements if I used templates effectively & not just as a boilerplate wrapper element.
The takeaway from these for me was twofold. First, digital preservation doesn’t have to be hard, but it does have to be intentional, and secondly, it does require institutional commitment. If you’re new to the world of digital preservation, understanding all the basic issues and what your options are can be daunting. I’ve been fortunate enough to lead a group at my institution that has spent the last few years working through some of these issues, and so in this post I want to give a brief overview of the work we’ve done, as well as the current landscape for digital preservation systems. This won’t be an in-depth exploration, more like a key to the map. Note that ACRL TechConnect has covered a variety of digital preservation issues before, including data management and preservation in “The Library as Research Partner” and using bash scripts to automate digital preservation workflow tasks in “Bash Scripting: automating repetitive command line tasks”.
The committee I chair started examining born digital materials, but expanded focus to all digital materials, since our digitized materials were an easier test case for a lot of our ideas. The committee spent a long time understanding the basic tenets of digital preservation, and in truth, we’re still working on this. For this process, we found working through the NDSA Levels of Digital Preservation an extremely helpful exercise; you can find a helpfully annotated version with tools by Shira Peltzman and Alice Sara Prael, as well as an additional explanation by Shira Peltzman. We also relied on the Library of Congress Signal blog and the work of Brad Houston, among other resources. A few of the tasks we accomplished were to create a rough inventory of digital materials, to write a workflow manual, and to acquire many terabytes (currently around 8) of secure networked storage space for files to replace all removable hard drives being used for backups. While backups aren’t exactly digital preservation, we wanted to at the very least secure the backups we did have. An inventory and workflow manual may sound impressive, but I want to emphasize that these are living and somewhat messy documents. The major advantage of having these is not so much for what we do have, but for identifying gaps in our processes. Through this process, we were able to develop a lengthy (but prioritized) list of tasks that need to be completed before we’ll be satisfied with our processes. For example, one of the major workflow gaps we discovered is that we have many items on obsolete digital media formats, such as floppy disks, that need to be imaged before they can even be inventoried. We identified the tool we wanted to use for that, but time and staffing pressures have left the completion of this project in limbo. We’re now working on hiring a graduate student who can help work on this and similar projects.
The other piece of our work has been trying to understand what systems are available for digital preservation. I’ll summarize my understanding of this below, with several major caveats. First, this is a world that is currently undergoing a huge amount of change as many companies and people work on developing new systems or improving existing systems, so there is a lot missing from what I will say. Second, none of these solutions are necessarily mutually exclusive. Some by design require various pieces to be used together, some may not require it, but your circumstances may dictate a different solution. For instance, you may not like the access layer built into one system, and so will choose something else. The dream that you can just throw money at the problem and it will go away is, at present, still just a dream, as is the case with so many library technology problems.
The closest to such a dream is the end-to-end system. This is something where at one end you load in a file or set of files you want to preserve (for example, a large set of donated digital photographs in TIFF format), and at the other end have a processed archival package (which might include the TIFF files, some metadata about the processing, and a way to check for bit rot in your files), as well as an access copy (for example, a smaller sized JPG appropriate for display to the public) if you so desire–not all digital files should be available to the public, but still need to be preserved.
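To make the "way to check for bit rot" part concrete, here is a minimal sketch of a fixity check in Python. It is not how any of the products discussed here work internally; it only illustrates the general idea of recording a checksum for each file at ingest and comparing it again later. The function names and the sample file names are made up for the illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file under folder (by relative path) to its checksum."""
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(folder: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose checksum no longer matches."""
    current = build_manifest(folder)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )


def demo() -> list[str]:
    """Create two fake TIFFs, record checksums, corrupt one, and re-check."""
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "photo_001.tif").write_bytes(b"original image bytes")
        (folder / "photo_002.tif").write_bytes(b"another image")
        manifest = build_manifest(folder)  # stored at ingest time
        # Simulate silent corruption of a single byte in one file.
        (folder / "photo_002.tif").write_bytes(b"another imagf")
        return changed_files(folder, manifest)
```

Calling `demo()` reports only the file that was altered after the manifest was built. A real workflow would also store the manifest somewhere separate from the files and run the check on a schedule.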
Examples of such systems include Preservica, ArchivesDirect, and Rosetta. All of these are hosted vended products, but ArchivesDirect is based on open source Archivematica, so it is possible to get some idea of the experience of using it if you are able to install the tools on which it is based. The issues with end-to-end systems are similar to any other choice you make in library systems. First, they come at a high price–Preservica and ArchivesDirect are open about their pricing, and for a plan that will meet the needs of medium-sized libraries you will be looking at $10,000-$14,000 annual cost. Second, you are pretty much stuck with the options offered in the product, though you still have many decisions to make within that framework. Migrating from one system to another if you change your mind may involve some very difficult processes, so inertia dictates that you will be using that system for the long haul, and a short trial period or a few demos may not be enough to tell you whether it’s a good idea. But you do have the potential for more simplicity and therefore a stronger likelihood that you will actually use them, and they are much more manageable for smaller staffs that lack dedicated positions for digital preservation work–or even room in the current positions for digital preservation work. A hosted product is ideal if you don’t have the staff or servers to install anything yourself, and helps you get your long-term archival files onto Amazon Glacier. Amazon Glacier is, by the way, where pretty much all the services we’re discussing store everything you are submitting for long-term storage. It’s dirt cheap to store on Amazon Glacier and, if you can restore slowly, not too expensive to restore–only expensive if you need to restore a lot quickly. But using it is somewhat technically challenging since you only interact with it through APIs–there’s no way to log in and upload files or download files as with a cloud storage service like Dropbox.
For that reason, when you’re paying a service hundreds of dollars a terabyte that ultimately stores all your material on Amazon Glacier, which costs pennies per gigabyte, you’re paying for the technical infrastructure to get your stuff on and off of there as much as anything else. In another sense you’re paying for an insurance policy for accessing materials in a catastrophic situation where you do need to recover all your files–theoretically, you don’t have to pay extra for such a situation.
A related option to an end-to-end system that has some attractive features is to join a preservation network. Examples of these include Digital Preservation Network (DPN) or APTrust. In this model, you pay an annual membership fee (right now $20,000 annually, though this could change soon) to join the consortium. This gives you access to a network of preservation nodes (either Amazon Glacier or nodes at other institutions), access to tools, and a right (and requirement) to participate in the governance of the network. Another larger preservation goal of such networks is to ensure long-term access to material even if the owning institution disappears. Of course, $20,000 plus travel to meetings and work time to participate in governance may be out of reach of many, but it appears that both DPN and APTrust are investigating new pricing models that may meet the needs of smaller institutions who would like to participate but can’t contribute as much in money or time. This is a world that I would recommend watching closely.
Up until recently, the way that many institutions were achieving digital preservation was through some kind of repository that they created themselves, either with open source repository software such as Fedora Repository or DSpace or some other type of DIY system. With open source Archivematica, and a few other tools, you can build your own end-to-end system that will allow you to process files, store the files and preservation metadata, and provide access as is appropriate for the collection. This is theoretically a great plan. You can make all the choices yourself about your workflows, storage, and access layer. You can do as much or as little as you need to do. But in practice for most of us, this just isn’t going to happen without a strong institutional commitment of staff and servers to maintain this long term, at possibly a higher cost than any of the other solutions. That realization is one of the driving forces behind Hydra-in-a-Box, which is an exciting initiative that is currently in development. The idea is to make it possible for many different sizes of institutions to take advantage of the robust feature sets for preservation in Fedora and workflow management/access in Hydra, but without the overhead of installing and maintaining them. You can follow the project on Twitter and by joining the mailing list.
After going through all this, I am reminded of one of my favorite slides from Julie Swierczek’s Code4Lib presentation. She works through the Open Archival Information System (OAIS) model graph to explain it in depth, and comes to a point in the workflow that calls for “Sustainable Financing”, and then zooms in on this. For many, this is the crux of the digital preservation problem. It’s possible to do a sort of ok job with digital preservation for nothing or very cheap, but to ensure long term preservation requires institutional commitment for the long haul, just as any library collection requires. Given how much attention digital preservation is starting to receive, we can hope that more libraries will see this as a priority and start to participate. This may lead to even more options, tools, and knowledge, but it will still require making it a priority and putting in the work.
Recently a faculty member working in the Digital Humanities on my campus asked the library to explore International Image Interoperability Framework (IIIF) image servers, with the ultimate goal of determining whether it would be feasible for the library to support a IIIF server as a service for the campus. I typically am not very involved in supporting work in the Digital Humanities on my campus, despite my background in (and love for) the humanities (philosophy majors, unite!). Since I began investigating this technology, I seem to see references to IIIF-compliance popping up all over the place, mostly in discussions related to IIIF compatibility in Digital Asset Management System (DAMS) repositories like Hydra 1 and Rosetta 2, but also including ArtStor3 and the Internet Archive 4.
IIIF was created by a group of technologists from Stanford, the British Library, and Oxford to solve three problems: 1) slow loading of high resolution images in the browser, 2) high variation of user experience across image display platforms, requiring users to learn new controls and navigation for different image sites, and 3) the complexity of setting up high performance image servers.5 Image servers traditionally have also tended to silo content, coupling back-end storage with either customized or commercial systems that do not allow additional 3rd party applications to access the stored data.
By storing your images in a way that multiple applications can access them and render them, you enable users to discover your content through a variety of different portals. With IIIF, images can be stored in a way that facilitates API access to them. This enables a variety of applications to retrieve the data. For example, if you have images stored in a IIIF-compatible server, you could have multiple front-end discovery platforms access the images through the API, either at your own institution or at other institutions that would be interested in providing gateways to your content. You might have images that are relevant to multiple repositories or collections; for instance, you might want your images to be discoverable through your institutional repository, discovery system, and digital archives system.
IIIF systems are designed to work with two components: an image server (such as the Python-based Loris application)6 and a front-end viewer (such as Mirador 7 or OpenSeadragon8). There are other viewer options out there (IIIF Viewer 9, for example), and you could conceivably write your own viewer application, or write a IIIF display plugin that can retrieve images from IIIF servers. Your image server can serve up images via APIs (discussed below) to any IIIF-compatible front-end viewer, and any IIIF-compatible front-end viewer can be configured to access information served by any IIIF-compatible image server.
IIIF Image API and Presentation API
IIIF-compatible software enables retrieval of content from two APIs: the Image API and the Presentation API. As you might expect, the Image API is designed to enable the retrieval of actual images. Supported file types depend on the image server application being used, but API calls enable the retrieval of specific file type extensions including .jpg, .tif, .png, .gif, .jp2, .pdf, and .webp.10 A key feature of the API is the ability to request images to be returned by image region – meaning that if only a portion of the image is requested, the image server can return precisely the area of the image requested.11 This enables faster, more nimble rendering of detailed image regions in the viewer.
The basic structure of a request to a IIIF image server follows a standard scheme:
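As a sketch of that scheme, drawn from the IIIF Image API specification rather than from any particular server: a request URI takes the form `{scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The small Python helper below assembles such a URI. The function name, the `example.org` server, and the identifier are hypothetical, and the `max` size keyword assumes a server implementing version 2.1 or later of the API (older servers may expect `full` instead).

```python
from urllib.parse import quote


def iiif_image_url(base_url, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Assemble a IIIF Image API request URI from its path segments.

    base_url covers {scheme}://{server}{/prefix}. The identifier is
    percent-encoded so that slashes inside it are not read as path separators.
    """
    ident = quote(identifier, safe="")
    return f"{base_url.rstrip('/')}/{ident}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Request the top-left 1024x1024 pixel region, scaled to 512 pixels wide.
detail_url = iiif_image_url(
    "https://example.org/iiif",      # hypothetical server and prefix
    "ms 12/f1",                      # hypothetical image identifier
    region="0,0,1024,1024",
    size="512,",
)
```

Changing only the `region` and `size` segments is how a viewer asks for just the part of the image currently on screen, which is what makes deep zoom feel fast.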
The Presentation API returns contextual and descriptive information about images, such as how an image fits in with a collection or compound object, or annotations and properties to help the viewer understand the origin of the image. The Presentation API retrieves metadata stored as “manifests” that are often expressed as JSON for Linked Data, or JSON-LD.13 Image servers such as Loris may only provide the ability to work with the Image API; Presentation API data and metadata can be stored on any server and image viewers such as Mirador can be configured to retrieve presentation API data.14
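To give a feel for what a manifest contains, here is a minimal sketch that walks a tiny, made-up manifest and lists each page label with the image service it points to. The structure shown (sequences, canvases, images, resource, service) assumes version 2 of the Presentation API; version 3 organizes the same information under `items` instead. Every URL and label in the sample is hypothetical.

```python
import json

# A heavily trimmed, hypothetical Presentation API 2.x manifest.
SAMPLE_MANIFEST = """
{
  "@context": "http://iiif.io/api/presentation/2/context.json",
  "@id": "https://example.org/iiif/book1/manifest",
  "@type": "sc:Manifest",
  "label": "Hypothetical manuscript",
  "sequences": [{
    "@type": "sc:Sequence",
    "canvases": [
      {"@id": "https://example.org/iiif/book1/canvas/1", "@type": "sc:Canvas",
       "label": "f. 1r",
       "images": [{"@type": "oa:Annotation", "resource": {
          "@id": "https://example.org/iiif/images/book1-f1r/full/max/0/default.jpg",
          "service": {"@id": "https://example.org/iiif/images/book1-f1r"}}}]},
      {"@id": "https://example.org/iiif/book1/canvas/2", "@type": "sc:Canvas",
       "label": "f. 1v",
       "images": [{"@type": "oa:Annotation", "resource": {
          "@id": "https://example.org/iiif/images/book1-f1v/full/max/0/default.jpg",
          "service": {"@id": "https://example.org/iiif/images/book1-f1v"}}}]}
    ]
  }]
}
"""


def list_canvases(manifest):
    """Return (canvas label, image service base URI) pairs from a 2.x manifest."""
    pages = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                service = image.get("resource", {}).get("service", {})
                pages.append((canvas.get("label"), service.get("@id")))
    return pages


pages = list_canvases(json.loads(SAMPLE_MANIFEST))
```

Each service URI in the result is the base that an Image API request would be built on, which is how a viewer connects the descriptive manifest to the actual pixels, even when the two live on different servers.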
Why would you need a IIIF Image Server or Viewer?
IIIF servers and their APIs are particularly suited for use by cultural heritage organizations. The ability to use APIs to render high resolution images in the browser efficiently is essential for collections like medieval manuscripts that have very fine details that lower-quality image rendering might obscure. Digital humanities, art, and history scholars who need access to high quality images for their research would be able to zoom, pan and analyze images very closely. This sort of an analysis can also facilitate collaborative editing of metadata – for example, a separate viewing client could be set up specifically to enable scholars to add metadata, annotations, or translations to documents without necessarily publishing the enhanced data to other repositories.
In this demo, the user can consult a number of manuscripts, held by different institutions, in the same interface. In particular, there are several manuscripts from Stanford and Yale, as well as the first example from Gallica and served by Biblissima (BnF Français 1728)….
It is important to note that the images displayed in the viewer do not leave their original repositories; this is one of the fundamental principles of the IIIF initiative. All data (images and associated metadata) remain in their respective repositories and the institutions responsible for them maintain full control over what they choose to share. 15.
The approach described by Biblissima represents the increasing shift toward designing repositories to guide users toward linked or related information that may not be actually held by the repository. While I can certainly anticipate some problems with this approach for some archival collections – injecting objects from other collections might skew the authentic representation of some collections, even if the objects are directly related to each other – this approach might work well to help represent provenance for collections that have been broken up across multiple institutions. Without this kind of architecture, researchers would have to visit and keep track of multiple repositories that contain similar collections or associated objects. Manuscript collections are particularly suited to this kind of approach, where a single manuscript may have been separated into individual leaves that can be found in multiple institutions worldwide – these manuscripts can be digitally re-assembled without requiring institutions to transfer copies of files to multiple repositories.
One challenge we are running into in exploring IIIF is how to incorporate this technology into existing legacy applications that host high resolution images (for example, ContentDM and DSpace). We wouldn’t necessarily want to build a separate IIIF image server – it would be ideal if we could continue storing our high res images on our existing repositories and pull them together with a IIIF viewer such as Mirador. There is a Python-based translator to enable ContentDM to serve up images using the IIIF standard16, but I’ve found it difficult to find case studies or step-by-step implementation and troubleshooting information (if you have set up IIIF with ContentDM, I’d love to know about your experience!). To my knowledge, there is not an existing way to integrate IIIF with DSpace (but again, I would love to stand corrected if there is something out there). Because IIIF is such a new standard, and legacy applications were not necessarily built to enable this kind of content distribution, it may be some time before legacy digital asset management applications integrate IIIF easily and seamlessly. Apart from these applications serving up content for use with IIIF viewers, embedding IIIF viewer capabilities into existing applications would be another challenge.
Finally, another challenge is discovering IIIF repositories from which to pull images and content. Libraries looking to explore supporting IIIF viewers will certainly need to collaborate with content experts, such as archivists, historians, digital humanities and/or art scholars, who may be familiar with external repositories and sources of IIIF content that would be relevant to building coherent collections for IIIF viewers. Viewers are manually configured to pull in content from repositories, and so any library wanting to support a IIIF viewer will need to locate sources of content and configure the viewer to pull in that content.
Undertaking support for IIIF servers and viewers is fundamentally not a trivial project, but can be a way for libraries to potentially expand the visibility and findability of their own high-resolution digital collections (by exposing content through a IIIF-compatible server) or enable their users to find content related to their collections (by supporting a IIIF viewer). While my library hasn’t determined what exactly our role will be in supporting IIIF technology, we will definitely be taking information learned from this experience to shape our exploration of emerging digital asset management systems, such as Hydra and Islandora.
TL;DR WebSockets allows the server to push up-to-date information to the browser without the browser making a new request. Watch the videos below to see the cool things WebSockets enables.
You are on a Web page. You click on a link and you wait for a new page to load. Then you click on another link and wait again. It may only be a second or a few seconds before the new page loads after each click, but it still feels like it takes way too long for each page to load. The browser always has to make a request and the server gives a response. This client-server architecture is part of what has made the Web such a success, but it is also a limitation of how HTTP works. Browser request, server response, browser request, server response….
But what if you need a page to provide up-to-the-moment information? Reloading the page for new information is not very efficient. What if you need to create a chat interface or to collaborate on a document in real-time? HTTP alone does not work so well in these cases. When a server gets updated information, HTTP provides no mechanism to push that message to clients that need it. This is a problem because you want to get information about a change in chat or a document as soon as it happens. Any kind of lag can disrupt the flow of the conversation or slow down the editing process.
One workaround is polling, in which the browser repeatedly asks the server whether anything has changed. Polling has been implemented in many different ways, but all polling methods still have some queuing latency. Queuing latency is the time a message has to wait on the server before it can be delivered to the client. Until recently there has not been a standardized, widely implemented way for the server to send messages to a browser client as soon as an event happens. The server would always have to sit on the information until the client made a request. But there are a couple of standards that do allow the server to send messages to the browser without having to wait for the client to make a new request.
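To make queuing latency concrete, here is a small worked sketch in Python. It assumes the simplest form of polling, with the client asking at fixed intervals starting at time zero, and ignores network delay entirely. The function names are made up for the illustration.

```python
import math


def polling_latency(event_time, poll_interval):
    """Seconds an event waits on the server until the next poll collects it.

    Assumes polls happen at 0, poll_interval, 2 * poll_interval, and so on.
    """
    next_poll = math.ceil(event_time / poll_interval) * poll_interval
    return next_poll - event_time


def average_latency(event_times, poll_interval):
    """Mean queuing latency over a list of event times."""
    waits = [polling_latency(t, poll_interval) for t in event_times]
    return sum(waits) / len(waits)


# A chat message arriving 3 seconds in, with a 5 second poll, waits 2 seconds.
wait = polling_latency(3, 5)
# Events spread across the interval wait about half the interval on average.
typical = average_latency([1, 2, 3, 4], 5)
```

Shortening the poll interval reduces the wait but multiplies the number of requests, most of which return nothing new. When the server can push, the wait in this simplified model drops to zero.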
WebSockets allows for full-duplex communication between the client and the server. The client does not have to open up a new connection to send a message to the server which saves on some overhead. When the server has new data it does not have to wait for a request from the client and can send messages immediately to the client over the same connection. Client and server can even be sending messages to each other at the same time. WebSockets is a better option for applications like chat or collaborative editing because the communication channel is bidirectional and always open. While there are other kinds of latency involved here, WebSockets solves the problem of queuing latency. Removing this latency concern is what is meant by WebSockets being a real-time technology. Current browsers have good support for WebSockets.
Using WebSockets solves some real problems on the Web, but how might libraries, archives, and museums use them? I am going to share details of a couple applications from my work at NCSU Libraries.
Digital Collections Now!
When Google Analytics first turned on real-time reporting it was mesmerizing. I could see what resources on the NCSU Libraries’ Rare and Unique Digital Collections site were being viewed at the exact moment they were being viewed. Or rather I could view the URL for the resource being viewed. I happened to notice that there would sometimes be multiple people viewing the same resource at the same time. This gave me some hint that today someone’s social share or forum post was getting a lot of click throughs right now. Or sometimes there would be a story in the news and we had an image of one of the people involved. I could then follow up and see examples of where we were being effective with search engine optimization.
The Rare & Unique site has a lot of visual resources like photographs and architectural drawings. I wanted to see the actual images that were being viewed. The problem, though, was that Google Analytics does not have an easy way to click through from a URL to the resource on your site. I would have to retype the URL, copy and paste the part of the URL path, or do a search for the resource identifier. I just wanted to see the images now. (OK, this first use case was admittedly driven by one of the great virtues of a programmer–laziness.)
My first attempt at this was to create a page that would show the resources which had been viewed most frequently in the past day and past week. To enable this functionality, I added some custom logging that is saved to a database. Every view of every resource would just get a little tick mark that would be tallied up occasionally. These pages showing the popular resources of the moment are then regenerated every hour.
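The logging and tallying can be sketched in a few lines. This is not the code used on the Rare & Unique site; it is a minimal illustration using Python's built-in SQLite support, with a hypothetical table layout, hypothetical resource identifiers, and timestamps as plain integers.

```python
import sqlite3


def open_log():
    """Create an in-memory database with one row per resource view."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE views (resource_id TEXT NOT NULL, viewed_at INTEGER NOT NULL)"
    )
    return db


def log_view(db, resource_id, viewed_at):
    """Record a single 'tick mark' for a resource."""
    db.execute(
        "INSERT INTO views (resource_id, viewed_at) VALUES (?, ?)",
        (resource_id, viewed_at),
    )


def most_viewed(db, since, limit=10):
    """Tally views at or after `since`, most viewed first."""
    rows = db.execute(
        "SELECT resource_id, COUNT(*) AS n FROM views WHERE viewed_at >= ? "
        "GROUP BY resource_id ORDER BY n DESC, resource_id LIMIT ?",
        (since, limit),
    )
    return rows.fetchall()


db = open_log()
log_view(db, "ua023_001", 50)      # too old to count in the tally below
log_view(db, "ua023_001", 120)
log_view(db, "mc00240_017", 130)
log_view(db, "ua023_001", 140)
popular = most_viewed(db, since=100)
```

Running the tally once an hour with `since` set to a day or a week ago would produce the two lists of popular resources described above.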
Now that I had this logging in place I set about to make it really real-time. I wanted to see the actual images being viewed at that moment by a real user. I wanted to serve up a single page and have it be updated in real-time with what is being viewed. And this is where the persistent communication channel of WebSockets came in. WebSockets allows the server to immediately send these updates to the page to be displayed.
I also used WebSockets to create interactive interfaces on the Hunt Library video walls. The Hunt Library has five large video walls created with Christie MicroTiles. These very large displays each have their own affordances based on the technologies in the space and the architecture. The Art Wall is above the single service point just inside the entrance of the library and is visible from outside the doors on that level. The Commons Wall is in front of a set of stairs that also function as coliseum-like seating. The Game Lab is within a closed space and already set up with various game consoles.
Listen to Wikipedia
When I saw and heard the visualization and sonification Listen to Wikipedia, I thought it would be perfect for the iPearl Immersion Theater. Listen to Wikipedia visualizes and sonifies data from the stream of edits on Wikipedia. The size of the bubbles is determined by the size of the change to an entry, and the sound changes in pitch based on the size of the edit. Green circles show edits from unregistered contributors, and purple circles mark edits performed by automated bots. (These automated bots are sometimes used to integrate library data into Wikipedia.) A bell signals an addition to an entry. A string pluck is a subtraction. New users are announced with a string swell.
The original Listen to Wikipedia (L2W) is a good example of the use of WebSockets for real-time displays. Wikipedia publishes all edits for every language into IRC channels. A bot called wikimon monitors each of the Wikipedia IRC channels and watches for edits. The bot then forwards the information about the edits over WebSockets to the browser clients on the Listen to Wikipedia page. The browser then takes those WebSocket messages and uses the data to create the visualization and sonification.
As you walk into the Hunt Library almost all traffic goes past the iPearl Immersion Theater. The one feature that made this space perfect for Listen to Wikipedia was that it has sound and, depending on your tastes, L2W can create pleasant ambient sounds1. I began by adjusting the CSS styling so that the page would fit the large display. Besides setting the width and height, I adjusted the size of the fonts. I added some text to a panel on the right explaining what folks are seeing and hearing. On the left is now text asking passersby to interact with the wall and the list of languages currently being watched for updates.
One feature of the original L2W that we wanted to keep was the ability to change which languages are being monitored and visualized. Each language can individually be turned off and on. During peak times the English Wikipedia alone can sound cacophonous. An active bot can make lots of edits, all of roughly similar size. You can also turn off or on changes to Wikidata, which collects structured data that can support Wikipedia entries. Having only a few of the less frequently edited languages on can result in moments of silence punctuated by a single little dot and small bell sound.
We wanted to keep the ability to change the experience and actually get a feel for the torrent or trickle of Wikipedia edits and allow folks to explore what that might mean. We currently have no input device for directly interacting with the Immersion Theater wall. For L2W the solution was to allow folks to bring their own devices to act as a remote control. We encourage passersby to interact with the wall with a prominent message. On the wall we show the URL to the remote control. We also display a QR code version of the URL. To prevent someone in New Zealand from controlling the Hunt Library wall in Raleigh, NC, we use a short-lived, three-character token.
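A short-lived token of this kind could look something like the following sketch. This is an illustration of the general idea, not the code running on the wall: the class name, the 60 second lifetime, and the alphabet of lowercase letters and digits are all assumptions.

```python
import secrets
import string
import time

ALPHABET = string.ascii_lowercase + string.digits  # assumed character set
TOKEN_LIFETIME = 60  # seconds; an assumed value


class TokenIssuer:
    """Hand out a three-character token that expires after a short time."""

    def __init__(self, lifetime=TOKEN_LIFETIME, clock=time.monotonic):
        self.lifetime = lifetime
        self.clock = clock  # injectable so the expiry can be tested
        self.token = None
        self.expires_at = 0.0

    def current(self):
        """Return the live token, generating a new one if it has expired."""
        now = self.clock()
        if self.token is None or now >= self.expires_at:
            self.token = "".join(secrets.choice(ALPHABET) for _ in range(3))
            self.expires_at = now + self.lifetime
        return self.token

    def is_valid(self, candidate):
        """Accept a token only if it matches and has not yet expired."""
        return (
            self.token is not None
            and self.clock() < self.expires_at
            and secrets.compare_digest(candidate, self.token)
        )
```

The wall would show the value of `current()` in its URL and QR code, and the server would call `is_valid()` before accepting commands from a remote control. Three characters are easy to guess in isolation, so the protection comes mostly from the token changing before anyone outside the room can see it.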
Because we were uncertain how best to allow a visitor to kick off an interaction, we included both a URL and QR code. They each have slightly different URLs so that we can track use. We were surprised to find that most of the interactions began with scanning the QR code. Currently 78% of interactions begin with the QR code. We suspect that we could increase the number of visitors interacting with the wall if there were other simpler ways to begin the interaction. For bring-your-own-device remote controls we are interested in how we might use technologies like Bluetooth Low Energy within the building for a variety of interactions with the surroundings and our services.
The remote control Web page is a list of big checkboxes next to each of the languages. Clicking on one of the languages turns its stream on or off on the wall (connects or disconnects one of the WebSockets channels the wall is listening on). The change happens almost immediately with the wall showing a message and removing or adding the name of the language from a side panel. We wanted this to be at least as quick as the remote control on your TV at home.
The quick interaction is possible because of WebSockets. Both the browser page on the wall and the remote control client listen on another WebSockets channel for such messages. This means that as soon as the remote control sends a message to the server it can be sent immediately to the wall and the change reflected. If the wall were using polling to get changes, then there would potentially be more latency before a change registered on the wall. The remote control client also uses WebSockets to listen on a channel waiting for updates. This allows feedback to be displayed to the user once the change has actually been made. This feedback loop communication happens over WebSockets.
Having the remote control listen for messages from the server also serves another purpose. If more than one person enters the space to control the wall, what is the correct way to handle that situation? If there are two users, how do you accurately represent the current state on the wall for both users? Maybe once the first user begins controlling the wall it locks out other users. This would work, but then how long do you lock others out? It could be frustrating for a user to have launched their QR code reader, lined up the QR code in their camera, and scanned it only to find that they are locked out and unable to control the wall. What I chose to do instead was to have every message of every change go via WebSockets to every connected remote control. In this way it is easy to keep the remote controls synchronized. Every change on one remote control is quickly reflected on every other remote control instance. This prevents most cases where the remote controls might get out of sync. While there is still the possibility of a race condition, it becomes less likely with the real-time connection and is harmless. Besides not having to lock anyone out, it also seems like a lot more fun to notice that others are controlling things as well–maybe it even makes the experience a bit more social. (Although, can you imagine how awful it would be if everyone had their own TV remote at home?)
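The broadcast-to-everyone approach can be sketched without any networking at all. In the Python illustration below, plain lists stand in for the WebSocket connections, and the class and method names are hypothetical. The point is only the shape of the logic: one shared state on the server, and every change sent to every connected client, including the one that made it.

```python
class Hub:
    """Keep one shared on/off state per language and broadcast every change."""

    def __init__(self, languages):
        self.state = {lang: True for lang in languages}
        self.inboxes = []  # one list per connected client (stand-in for a socket)

    def connect(self):
        """Register a client and send it the full current state first."""
        inbox = [("state", dict(self.state))]
        self.inboxes.append(inbox)
        return inbox

    def toggle(self, language):
        """Flip one language and push the change to every connected client."""
        self.state[language] = not self.state[language]
        message = ("toggle", language, self.state[language])
        for inbox in self.inboxes:
            inbox.append(message)
        return message


hub = Hub(["en", "de"])
wall = hub.connect()
remote_a = hub.connect()
remote_b = hub.connect()
hub.toggle("de")  # as if remote_a had unchecked German
```

After the toggle, the wall and both remotes have received the same message, so each remote can redraw its checkboxes from what the server says rather than from what its own user last clicked. A client that joins late starts from the full state, which is what keeps a newly arrived visitor in sync.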
I also thought it was important for something like an interactive exhibit around Wikipedia data to provide the user some way to read the entries. From the remote control the user can get to a page which lists the same stream of edits that are shown on the wall. The page shows the title for the most recently edited entry at the top of the page and pushes others down the page. The titles link to the current revision for that page. This page just listens to the same WebSockets channels as the wall does, so the changes appear on the wall and remote control at the same time. Sometimes the stream of edits can be so fast that it is impossible to click on an interesting entry. A button allows the user to pause the stream. When an intriguing title appears on the wall or there is a large edit to a page, the viewer can pause the stream, find the title, and click through to the article.
The reaction from students and visitors has been fun to watch. The enthusiasm has had unexpected consequences. For instance one day we were testing L2W on the wall and noting what adjustments we would want to make to the design. A student came in and sat down to watch. At one point they opened up their laptop and deleted a large portion of a Wikipedia article just to see how large the bubble on the wall would be. Fortunately the edit was quickly reverted.
We have also seen the L2W exhibit pop up on social media. This Instagram video was posted with the comment, “Reasons why I should come to the library more often. #huntlibrary.”
This is people editing–Oh, someone just edited Home Alone–editing Wikipedia in this exact moment.
The Web is now a better development platform for creating real-time and interactive interfaces. WebSockets provides the means for sending real-time messages between servers, browser clients, and other devices. This opens up new possibilities for what libraries, archives, and museums can do to provide up to the moment data feeds and to create engaging interactive interfaces using Web technologies.
Last month we got the long-awaited ruling in favor of Google in the Authors Guild vs. Google Books case, which by now has been analyzed extensively. Ultimately the judge in the case decided that Google’s digitization was transformative and thus constituted fair use. See InfoDocket for detailed coverage of the decision.
The Google Books project was part of the Google mission to index all the information available, and as such could never have taken place without libraries, which hold all those books. While most, if not all, the librarians I know use Google Books in their work, there has always been a sense that the project should not have been started by a commercial enterprise using the intellectual resources of libraries, but should have been started by libraries themselves working together. Yet libraries are often forced to be more conservative about digitization than we might otherwise be due to rules designed to protect the college or university from litigation. This ruling has made it seem as though we could afford to be less cautious. As Eric Hellman points out, the decision seems to imply that with copyright the ends are the important part, not the means. “In Judge Chin’s analysis, copyright is concerned only with the ends, not the means. Copyright seems not to be concerned with what happens inside the black box.” 1 As long as the end use of the books was fair, which was deemed to be the case, the initial digitization was not a problem.
Looking at this from the perspective of a repository manager, I want to address a few of the theoretical and logistical issues behind such a conclusion for libraries.
What does this mean for digitization at libraries?
At the beginning of 2013 I took over an ongoing digitization project, and as a first-time manager of a large-scale long-term project, I learned a lot about the processes involved in such a project. The project I work with is extremely small-scale compared with many such projects, but even at this scale the project is expensive and time-consuming. What makes it worth it is that long-buried works of scholarship are finally being used and read, sometimes for reasons we do not quite understand. That gets at the heart of the Google Books decision—digitizing books in library stacks and making them more widely available does contribute to education and useful arts.
There are many issues that we need to address, however. Some of the most important ones are what access can and should be provided to what works, and making mass digitization more available to smaller and international cultural heritage institutions. Google Books could succeed because it had the financial and computing resources of Google matched with the cultural resources of the participating research libraries. This problem is international in scope. I encourage you to read this essay by Amelia Sanz, in which she argues that digitization efforts so far have been inherently unequal and a reflection of colonialism. 2 But is there a practical way of approaching this desire to make books available to a wider audience?
There are several separate issues in providing access. Books that are in the public domain are unquestionably fine to digitize, though differences in international copyright law make it difficult to determine what can be provided to whom. As Amelia Sanz points out, Google can only digitize Spanish works prior to 1870 in Spain, but may digitize the complete work in the United States. The complete work is not available to Spanish researchers, but it is available in full to US researchers.
That aside, there are several reasons why it is useful to digitize works still unquestionably under copyright. One of the major reasons is textual corpus analysis–you need to have every word of many texts available to draw conclusions about use of words and phrases in those texts. Google Books ngram viewer is one such tool that comes out of mass digitization. Searching for phrases in Google and finding that phrase as a snippet in a book is an important way to find information in books that might otherwise be ignored in favor of online sources. Some argue that this means that those books will not be purchased when they might have otherwise been, but it is equally possible that this leads to greater discovery and more purchases, which research into music piracy suggests may be the case.
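To make the corpus analysis point concrete, here is a minimal JavaScript sketch of counting word n-grams in a text. It only illustrates the general technique and is not how the Google Books ngram viewer is implemented; the function name countNgrams and the sample sentence are invented for this example, and the simple tokenizer assumes unaccented English text.

```javascript
// Count every run of n consecutive words (an "n-gram") in a text.
// Assumption: a word is a run of ASCII letters, digits or apostrophes.
function countNgrams(text, n) {
  const words = text.toLowerCase().match(/[a-z0-9']+/g) || [];
  const counts = new Map();
  for (let i = 0; i + n <= words.length; i++) {
    const gram = words.slice(i, i + n).join(" ");
    counts.set(gram, (counts.get(gram) || 0) + 1);
  }
  return counts;
}

// Hypothetical sample text, used only to show the counts.
const sample = "The library scanned the book and the library shared the book.";
const bigrams = countNgrams(sample, 2);
console.log(bigrams.get("the library")); // 2
console.log(bigrams.get("the book"));    // 2
```

Counts like these only mean something when every word of the text is available, which is why a snippet or sample chapter is not enough for this kind of research.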
Another reason to digitize works still under copyright is to highlight the work of marginalized communities, though in that case it is imperative to work with those communities to ensure that the digitization is not exploitative. Many orphan works, for which a rights-holder cannot be located, fall under this, and I know from some volunteer work that I have done that small cultural heritage institutions are eager to digitize material that represents the cultural and intellectual output of their communities.
In all the above cases, it is crucial to put into place mechanisms for ensuring that works under copyright are not abused. Google Books uses an algorithm that makes it impossible to read an entire book, which is probably beyond the abilities of most institutions. (If anyone has an idea for how to do this, I would love to hear it.) A simpler and more practical way to limit access is to make only a chapter or sample of a book available for public use, which many publishers already allow. For instance, Oxford University Press allows up to 10% of a work (within certain limits) on personal websites or institutional repositories. (That is, of course, assuming you can get permission from the author.) Many institutions maintain “dark archives”, which are digitized and (usually) indexed archives of material inaccessible to the public, whether institutional or research information. For instance, the US Department of Energy Office of Scientific and Technical Information maintains a dark archive index of technical reports comprising the equivalent of 6 million pages, which makes it possible to quickly find relevant information.
In any case where an institution makes the decision to digitize and make available the full text of in-copyright materials for reasons they determine are valid, there are a few additional steps that institutions should take. Institutions should research rights-holders or at least make it widely known to potential rights-holders that a project is taking place. The Orphan Works project at the University of Michigan is an example of such a project, though it has been fraught with controversy. Another important step is to have a very good policy for taking down material when a rights-holder asks–it should be clear to the rights-holder whether any copies of the work will be maintained and for what purposes (for instance archival or textual analysis purposes).
Digitizing, Curating, Storing, Oh My!
The above considerations are only useful when it is even possible for institutions without the resources of Google to start a digitization program. There are many examples of DIY digitization by individuals, for instance see Public Collectors, which is a listing of collections held by individuals open for public access–much of it digitized by passionate individuals. Marc Fischer, the curator of Public Collectors, also digitizes important and obscure works and posts them on his site, which he funds himself. Realistically, the entire internet contains examples of digitization of various kinds and various legal statuses. Most of this takes place on cheap and widely available equipment such as flatbed scanners. But it is possible to build an overhead book scanner for large-scale digitization with individual parts and at a reasonable cost. For instance, the DIY Book Scanning project provides instructions and free software for creating a book scanner. As they say on the site, all the process involves is to “[p]oint a camera at a book and take pictures of each page. You might build a special rig to do it. Process those pictures with our free programs. Enjoy reading on the device of your choice.”
“Processing the pictures” is a key problem to solve. Turning images into PDF documents is one thing, but providing high quality optical character recognition is extremely challenging. Free tools such as FreeOCR make it possible to do OCR from image or PDF files, but this takes processing power and results vary widely, particularly if the scan quality is lower. Even expensive tools like Adobe Acrobat or ABBYY FineReader have the same problems. Karen Coyle points out that uncorrected OCR text may be sufficient for searching and corpus analysis, but does not provide a faithful reproduction of the text and thus cannot, for instance, provide access to visually impaired persons. 3 This is a problem well known in the digital humanities world, and one solved by projects such as Project Gutenberg with the help of dedicated volunteer distributed proofreaders. Additionally, a great deal of material clearly in the public domain is in manuscript form or has text that modern OCR cannot recognize. In that case, crowdsourcing transcriptions is the only financially viable way for institutions to make text of the material available. 4 Examples of successful projects using volunteer transcribers or proofreaders include Ancient Lives to transcribe ancient papyrus, What’s on the Menu at the New York Public Library, and DIYHistory at the University of Iowa libraries. (The latter has provided step by step instructions for building your own version using open source tools.)
So now you’ve built your low-cost DIY book scanner, and put together a suite of open source tools to help you process your collections for free. Now what? The whole landscape of storing and preserving digital files is far beyond the scope of this post, but the cost of accomplishing this is probably the highest of anything other than staffing a digitization project, and it is here where Google clearly has the advantage. The Internet Archive is a potential solution to storing public domain texts (though they are not immune to disaster), but if you are making in-copyright works available in any capacity you will most likely have to take the risk on your own servers. I am not a lawyer, but I have never rented server space that would allow copyrighted materials to be posted.
Conclusion: Is it Worth It?
Obviously from this post I am in favor of taking on digitization projects of both public domain and copyrighted materials when the motivations are good and the policies are well thought out. From this perspective, I think the Google Books decision was a good thing for libraries and for providing greater access to library collections. Libraries should be smart about what types of materials to digitize, but there are more possibilities for large-scale digitization, and by providing more access, the research community can determine what is useful to them.
If you have managed a DIY book scanning project, please let me know in the comments, and I can add links to your project.
Hellman, Eric. “Google Books and Black-Box Copyright Jurisprudence.” Go To Hellman, November 18, 2013. http://go-to-hellman.blogspot.com/2013/11/google-books-and-black-box-copyright.html.
Sanz, Amelia. “Digital Humanities or Hypercolonial Studies?” Responsible Innovation in ICT (June 26, 2013). http://responsible-innovation.org.uk/torrii/resource-detail/1249#_ftnref13.
For more on this, see Ben Brumfield’s work on crowdsourced transcription, for example Brumfield, Ben W. “Collaborative Manuscript Transcription: ‘The Landscape of Crowdsourcing and Transcription’ at Duke University.” Collaborative Manuscript Transcription, November 23, 2013. http://manuscripttranscription.blogspot.com/2013/11/the-landscape-of-crowdsourcing-and.html.
I am in love with Isotope. It’s not often that you hear someone profess their love for a jQuery library (unless it’s this), but there it is. I want to display everything in animated grids.
I also love Views Isotope, a Drupal 7 module that enabled me to create a dynamic image gallery for our school’s Year in Review. This module (paired with a few others) is instrumental in building our new digital library.
In this blog post, I will walk you through how we created the Year in Review page, and how we plan to extrapolate the design to our collection views in the Knowlton Digital Library. This post assumes you have some basic knowledge of Drupal, including an understanding of content types, taxonomy terms and how to install a module.
Year in Review Project
Our Year in Review project began over the summer, when our communications team expressed an interest in displaying the news stories from throughout the school year in an online, interactive display. The designer on our team showed me several examples of card-like interfaces, emphasizing the importance of ease and clean graphics. After some digging, I found Isotope, which appeared to be the exact solution we needed. Isotope, according to its website, assists in creating “intelligent, dynamic layouts that can’t be achieved with CSS alone.” This jQuery library provides for the display of items in a masonry or grid-type layout, augmented by filters and sorting options that move the items around the page.
At first, I was unsure we could make this library work with Drupal, the content management system we employ for our main web site and our digital library. Fortunately I soon learned – as with many things in Drupal – there’s a module for that. The Views Isotope module provides just the functionality we needed, with some tweaking, of course.
We set out to display a grid of images, each representing a news story from the year. We wanted to allow users to filter those news stories based on each of the sections in our school: Architecture, Landscape Architecture and City and Regional Planning. News stories might be relevant to one, two or all three disciplines. The user can see the news story title by hovering over the image, and read more about the news story by clicking on the corresponding item in the grid.
Views Isotope Basics
Views Isotope is installed in the same way as other Drupal modules. There is an example in the module and there are also videos linked from the main module page to help you implement this in Views. (I found this video particularly helpful.)
You must have the following modules installed to use Views Isotope:
You also need to install the Isotope jQuery library. It is important to note that Isotope is only free for non-commercial projects. To install the library, download the package from the Isotope GitHub repository. Unzip the package and copy the whole directory into your libraries directory. Within your Drupal installation, this should be in the /sites/all/libraries folder. Once the module and the library are both installed, you’re ready to start.
If you have used Drupal, you have likely used Views. It is a very common way to query the underlying database in order to display content. The Views Isotope module provides additional View types: Isotope Grid, Isotope Filter Block and Isotope Sort Block. These three view types combine to provide one display. In my case, I have not yet implemented the Sort Block, so I won’t discuss it in detail here.
To build a new view, go to Structure > Views > Add a new view. In our specific example, we’ll talk about the steps in more detail. However, there are a few important tenets of using Views Isotope, regardless of your setup:
There is a grid. The View type Isotope Grid powers the main display.
The field on which we want to filter is included in the query that builds the grid, but a CSS class is applied which hides the filters from the grid display and shows them only as filters.
The Isotope Filter Block drives the filter display. Again, a CSS class is applied to the fields in the query to assign the appropriate display and functionality, instead of using default classes provided by Views.
Frequently in Drupal, we are filtering on taxonomy terms. It is important that when we display these items we do not link to the taxonomy term page, so that a click on a term filters the results instead of taking the user away from the page.
With those basic tenets in mind, let’s look at the specific process of building the Year in Review.
Building the Year in Review
Armed with the Views Isotope functionality, I started with our existing Digital Library Drupal 7 instance and one content type, Item. Items are our primary content type and contain many, many fields, but here are the important ones for the Year in Review:
Title: text field containing the headline of the article
Description: text field containing the shortened article body
File: File field containing an image from the article
Item Class: A reference to a taxonomy term indicating if the item is from the school archives
Discipline: Another term reference field which ties the article to one or more of our disciplines: Architecture, Landscape Architecture or City and Regional Planning
Showcase: Boolean field which flags the article for inclusion in the Year in Review
The last field was essential so that the communications team liaison could curate the page. There are more news articles in our school archives than we necessarily want to show in the Year in Review, and the showcase flag solves this problem.
In building our Views, we first wanted to pull all of the Items which have the following characteristics:
Item Class: School Archives
So, we build a new View. While logged in as administrator, we click on Structure, Views then Add a New View. We want to show Content of type Item, and display an Isotope Grid of fields. We do not want to use a pager. In this demo, I’m going to build a Page View, but a Block works as well (as we will see later). So my settings appear as follows:
Click on Continue & edit. For the Year in Review we next needed to add our filters – for Item Class and Showcase. Depending on your implementation, you may not need to filter the results, but likely you will want to narrow the results slightly. Next to Filter Criteria, click on Add.
I first searched for Item Class, then clicked on Apply.
Next, I need to select a value for Item Class and click on Apply.
I repeated the process with the Showcase field.
If you click Update Preview at the bottom of the View edit screen, you’ll see that much of the formatting is already done with just those steps.
Note that the formatting in the image above is helped along by some CSS. To style the grid elements, the Views Isotope module contains its own CSS in the module folder ([drupal_install]/sites/all/modules/views_isotope). You can move forward with this default display if it works for your site. Or, you can override this in the site’s theme files, which is what I’ve done above. In my theme CSS file, I have applied the following styling to the class “isotope-element”
I put the above code in my CSS file associated with my theme, and it overrides the default Views Isotope styling. “isotope-element” is the class applied to the div which contains all the fields being displayed for each item. Let’s add a few more items and see how the rendered HTML looks.
First, I want to add an image. In my case, all of my files are fields of type File, and I handle the rendering through Media based on file type. But you could use any image field, also.
I use the Rendered File Formatter and select the Grid View Mode, which applies an Image Style to the file, resizing it to 180 x 140. Clicking Update Preview again shows that the image has been added to each item.
This is closer, but in our specific example, we want to hide the title until the user hovers over the item. So, we need to add some CSS to the title field.
In my CSS file, I have the following:
background: none repeat scroll 0 0 #4D4D4F;
Note the opacity is 0 – which means the div is transparent, allowing the image to show through. Then, I added a hover style which just changes the opacity to mostly cover the image:
Now, if we update preview, we should see the changes.
The last thing we need to do is add the Discipline field for each item so that we can filter.
There are two very important things here. First, we want to make sure that the field is not formatted as a link to the term, so we select Plain text as the Formatter.
Second, we need to apply a CSS class here as well, so that the Discipline fields show in filters, not in the grid. To do that, check the Customize field HTML and select the DIV element. Then, select Create a class and enter “isotope-filter”. Also, uncheck “Apply default classes.” Click Apply.
Using Firebug, I can now look at the generated HTML from this View and see that the isotope-element <div> contains all the fields for each item, though the isotope-filter class loads Discipline as hidden.
You might also notice that the data-category for this element is assigned as landscape-architecture, which is our Discipline term for this item. This data-category will drive the filters.
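As a rough sketch of the idea (not the Views Isotope source, and not Isotope’s own code), the filtering can be modeled in plain JavaScript: each grid item carries one or more category slugs, and clicking a filter keeps only the items that carry the selected slug. The names toSlug and filterItems, the sample stories, and the exact slug convention are assumptions made for illustration; the real module works on DOM elements and their attributes.

```javascript
// Hypothetical model of the filtering idea; not the Views Isotope source.

// Turn a term name into a class-style slug, e.g. "Landscape Architecture"
// becomes "landscape-architecture". (Assumed convention for illustration.)
function toSlug(termName) {
  return termName
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Keep only the items whose category list contains the selected slug.
// A filter of "*" means "show everything".
function filterItems(items, selectedSlug) {
  if (selectedSlug === "*") return items.slice();
  return items.filter((item) => item.categories.includes(selectedSlug));
}

// Invented sample data standing in for the grid items.
const stories = [
  { title: "Studio review", categories: ["architecture"] },
  { title: "Park design award", categories: ["landscape-architecture"] },
  { title: "Joint lecture", categories: ["architecture", "city-and-regional-planning"] },
];

const visible = filterItems(stories, toSlug("City and Regional Planning"));
console.log(visible.map((story) => story.title)); // only "Joint lecture" remains
```

An item tagged with two disciplines shows up under either filter, which matches the requirement that a news story can belong to one, two or all three disciplines.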
So, let’s save our View by clicking Save at the top and move on to create our filter block. Create a new view, but this time create a block which displays taxonomy terms of type Discipline. Then, click on Continue & Edit.
The first thing we want to do is adjust the view so that the default row wrappers are not applied. Note: this is the part I ALWAYS forget, and then when my filters don’t work it takes me forever to track it down.
Click on Settings next to Fields.
Uncheck the Provide default field wrapper elements. Click Apply.
Next, we do not want the fields to be links to term pages, because a user click should filter the results, not link back to the term. So, click on the term name to edit that field. Uncheck the box next to “Link this field to its taxonomy term page”. Click on Apply.
Save the view.
The last thing is to make the block appear on the page with the grid. In practice, Drupal administrators would use Panels or Context to accomplish this (we use Context), but it can also be done using the Blocks menu.
So, go to Structure, then click on Blocks. Find our Isotope-Filter Demo block. Because it’s a View, the title will begin with “View:”
Click Configure. Set block settings so that the Filter appears only on the appropriate Grid page, in the region which is appropriate for your theme. Click save.
Now, let’s visit our /isotope-grid-demo page. We should see both the grid and the filter list.
It’s worth noting that here, too, I have customized the CSS. If we look at the rendered HTML using Firebug, we can see that the filter list is in a div with class “isotope-options” and the list itself has a class of “isotope-filters”.
I have overridden the CSS for these classes to remove the background from the filters and change the list-style-type to none, but you can obviously make whatever changes you want. When I click on one of the filters, it shows me only the news stories for that Discipline. Here, I’ve clicked on City and Regional Planning.
So, how do we plan to use this in our digital library going forward? So far, we have mostly used the grid without the filters, such as in one of our Work pages. This shows the metadata related to a given work, along with all the items tied to that work. Eventually, each of the taxonomy terms in the metadata will be a link. The following grids are all created with blocks instead of pages, so that I can use Context to override the default term or node display.
However, in our recently implemented Collection view, we allow users to filter the items based on their type: image, video or document. Here, you see an example of one of our lecture collections, with the videos and the poster in the same grid, until the user filters for one or the other.
There are two obstacles to using this feature in a more widespread manner throughout the site. First, I have only recently figured out how to implement multiple filter options. For example, we might want to filter our news stories by Discipline and Semester. To do this, we rewrite the sorting fields in our Grid display so that they all display in one field. Then, we create two Filter blocks, one for each set of terms. Implementing this across the site so that users can sort by say, item type and vocabulary term, will make it more useful to us.
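The logic behind combining two filter groups can be sketched as follows. This is a hypothetical model, not our production code or the module’s: it assumes each grid item carries the slugs from both vocabularies as classes, and it joins the active choice from each group into one selector, relying on the fact that a CSS selector like ".a.b" matches only elements with both classes. The function names, the group names and the semester slugs are invented for the example.

```javascript
// Hypothetical model of combining two filter groups; not taken from the module.

// Build one combined selector from the active choice in each group.
// Groups set to "*" (show all) contribute nothing.
function combineSelectors(activeByGroup) {
  const parts = Object.values(activeByGroup)
    .filter((slug) => slug !== "*")
    .map((slug) => "." + slug);
  return parts.length > 0 ? parts.join("") : "*";
}

// An item matches when it carries every class named in the selector.
function matchesSelector(item, selector) {
  if (selector === "*") return true;
  const required = selector.split(".").filter(Boolean);
  return required.every((cls) => item.classes.includes(cls));
}

// Invented sample data: each item has a discipline slug and a semester slug.
const newsItems = [
  { title: "Fall studio", classes: ["architecture", "autumn-2013"] },
  { title: "Spring studio", classes: ["architecture", "spring-2014"] },
  { title: "Fall planning forum", classes: ["city-and-regional-planning", "autumn-2013"] },
];

const selector = combineSelectors({ discipline: "architecture", semester: "autumn-2013" });
const shown = newsItems.filter((item) => matchesSelector(item, selector));
console.log(selector, shown.map((item) => item.title));
```

Keeping the active choice per group in one object means that clicking a term in one Filter block replaces only that group’s entry and leaves the other group’s selection in place.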
Second, we have several Views that might return upwards of 500 items. Loading all of the image files for this result set is costly, especially when you add in the additional overhead of a full image loading in the background for a Colorbox overlay and Drupal performance issues. The filters will not work across pages, so if I use a pager, I will only filter the items on the page I’m viewing. I believe this can be fixed somehow using Infinite Scroll (as described in several ways here), but I have not tried yet.
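To show why a pager and a client-side filter do not mix well, here is a small illustrative model in plain JavaScript. It is not Drupal or Isotope code; the result set, the page size of 50 and the function names are all assumptions. The point is only that the filter can act on nothing but the items the browser has already received.

```javascript
// Illustrative only: models why a client-side filter misses items on other pages.

// Return the slice of results that a pager would send for one page (0-based).
function pageOf(results, pageIndex, pageSize) {
  const start = pageIndex * pageSize;
  return results.slice(start, start + pageSize);
}

// Client-side filter over whatever items the browser has received.
function filterByType(loadedItems, type) {
  return loadedItems.filter((item) => item.type === type);
}

// Build 500 hypothetical results: every fifth one is a video, the rest images.
const results = Array.from({ length: 500 }, (_, i) => ({
  id: i + 1,
  type: i % 5 === 0 ? "video" : "image",
}));

const firstPage = pageOf(results, 0, 50);
console.log(filterByType(firstPage, "video").length); // 10 videos visible on page one
console.log(filterByType(results, "video").length);   // 100 videos in the whole result set
```

In this model a user filtering for videos on the first page would see a tenth of what the collection actually holds, which is the gap an approach like Infinite Scroll would need to close.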
With these two advanced options, there are many options for improving the digital library interface. I am especially interested in how to use multiple filters on a set of search results returned from a SOLR index.
What other extensions might be useful? Let us know what you think in the comments.
In this blog post, I’ll take you through the development of the “serendipity machine”, from the convening of the team to the selection and development of the tool. The week turned out to be an intense learning experience for me, so along the way, I will share some of my own fortunate discoveries.
(Note: this is a pretty detailed play-by-play of the process. If you’re more interested in the result, please see the RRCHNM news items on both our process and our product, or play with Serendip-o-matic itself.)
Approximately thirty people applied to be part of One Week | One Tool (OWOT), an Institute for Advanced Topics in the Digital Humanities, sponsored by the National Endowment for the Humanities. Twelve were selected and we arrive on Sunday, July 28, 2013 and convene in the Well, the watering hole at the Mason Inn.
Tom Scheinfeldt (@foundhistory), the RRCHNM director-at-large who organized OWOT, delivers the pre-week pep talk and discusses how we will measure success. The development of the tool is important, but so is the learning experience for the twelve assembled scholars. It’s about the product, but also about the process. We are encouraged to learn from each other, to “hitch our wagon” to another smart person in the room and figure out something new.
As for the product, the goal is to build something that is used. This means that defining and targeting the audience is essential.
The tweeting began before we arrived, but typing starts in earnest at this meeting and the #owot hashtag is populated with our own perspectives and feedback from the outside. Feedback, as it turns out, will be the priority for Day 1.
@DoughertyJack: “One Week One Tool team wants feedback on which digital tool to build.”
Mentors from RRCHNM take the morning to explain some of the basic tenets of what we’re about to do. Sharon Leon talks about the importance of defining the project: “A project without an end is not a project.” Fortunately, the one week timeline solves this problem for us initially, but there’s the question of what happens after this week?
Patrick Murray-John takes us through some of the finer points of developing in a collaborative environment. Sheila Brennan discusses outreach and audience, and continues to emphasize the point from the night before: the audience definition is key. She also says the sentence that, as we’ll see, would need to be my mantra for the rest of the project: “Being willing to make concrete decisions is the only way you’re going to get through this week.”
All of the advice seems spot-on and I find myself nodding my head. But we have no tool yet, and so how to apply specifics is still really hazy. The tool is the piece of the puzzle that we need.
We start with an open brainstorming session, which results in a filled whiteboard of words and concepts. We debate audience, we debate feasibility, we debate openness. Debate about openness brings us back to the conversation about audience – for whom are we being open? There’s a lot of conversation but at the end, we essentially have just a word cloud associated with projects in our heads.
So, we then take those ideas and try to express them in the following format: X tool addresses Y need for Z audience. I am sitting closest to the whiteboards so I do a lot of the scribing for this second part and have a few observations:
there are pet projects in the room – some folks came with good ideas and are planning to argue for them
our audience for each tool is really similar; as a team we are targeting “researchers”, though there seems to be some debate on how inclusive that term is. Are we including students in general? Teachers? What designates “research”? It seems to depend on the proposed tool.
the problem or need is often hard to articulate. “It would be cool” is not going to cut it with this crowd, but there are some cases where we’re struggling to define why we want to do something.
A few group members begin taking the rows and creating usable descriptions and titles for the projects in a Google Doc, as we want to restrict public viewing while still sharing within the group. We discuss several platforms for sharing our list with the world, and land on IdeaScale. We want voters to be able to vote AND comment on ideas, and IdeaScale seems to fit the bill. We adjourn from the Center and head back to the hotel with one thing left to do: articulate these ideas to the world using IdeaScale and get some feedback.
The problem here, of course, is that everyone wants to make sure that their idea is communicated effectively and we need to agree on public descriptions for the projects. Finally, it seems like there’s a light at the end of the tunnel…until we hit another snag. IdeaScale requires a login to vote or comment and there’s understandable resistance around the table to that idea. For a moment, it feels like we’re back to square one, or at least square five. Team members begin researching alternatives but nothing is perfect, we’ve already finished dinner and need the votes by 10am tomorrow. So we stick with IdeaScale.
And, not for the last time this week, I reflect on Sheila’s comment, “being willing to make concrete decisions is the only way you’re going to get through this week.” When new information, such as the login requirement, challenges the concrete decision you made, how do you decide whether or not to revisit the decision? How do you decide that with twelve people?
I head to bed exhausted, wondering about how many votes we’re going to get, and worried about tomorrow: are we going to make a decision?
@briancroxall: “We’ve got a lot of generous people in the #owot room who are willing to kill their own ideas.”
It turns out that I need not have worried. In the winnowing from 11 choices down to 2, many members of the team are willing to say, “my tool can be done later” or “that one can be done better outside this project.” Approximately 100 people weighed in on the IdeaScale site, and those votes are helpful as we weigh each idea. Scott Kleinman leads us in a discussion about feasibility for implementation and commitment in the room and the choices begin to fall away. At the end, there are four, but after a few rounds of voting we’re down to two with equal votes that must be differentiated. After a little more discussion, Tom proposes a voting system that allows folks to weight their votes in terms of commitment and the Serendipity project wins out. The drafted idea description reads:
“A serendipitous discovery tool for researchers that takes information from your personal collection (such as a Zotero citation library or a CSV file) and delivers content (from online libraries or collections like DPLA or Europeana) similar to it, which can then be visualized and manipulated.”
We decide to keep our project a secret until our launch and we break for lunch before assigning teams. (Meanwhile, #owot hashtag follower Sherman Dorn decides to create an alternative list of ideas – One Week Better Tools – which provides some necessary laughs over the next couple of days).
After lunch, it’s time to break out responsibilities. Mia Ridge steps up, though, and suggests that we first establish a shared understanding of the tool. She sketches on one of the whiteboards the image which would guide our development over the next few days.
This was a takeaway moment for me. I frequently sketch out my projects, but I’m afraid the thinking often gets pushed out in favor of the doing when I’m running low on time. Mia’s suggestion that we take the time despite being against the clock probably saved us lots of hours and headaches later in the project. We needed to aim as a group, so our efforts would fire in the same direction. The tool really takes shape in this conversation, and some of the tasks are already starting to become really clear. (We are also still indulging our obsession with mustaches at this time, as you may notice.)
Tom leads the discussion of teams. He recommends three: a project management team, a design/dev team and an outreach team. The project managers should be selected first, and they can select the rest of the teams. The project management discussion is difficult; there’s an abundance of qualified people in the room. From my perspective, it makes sense to have the project managers be folks who can step in and pinch hit as things get hectic, but we also need our strongest technical folks on the dev team. In the end, Brian Croxall and I are selected to be the project management team.
We decide to ask the remaining team members where they would like to be and see where our numbers end up. The numbers turn out great: 7 for design/dev and 3 for outreach, with two design/dev team members slated to help with outreach needs as necessary.
The teams hit the ground running and begin prodding the components of the idea. The theme of the afternoon is determining the feasibility of this “serendipity engine” we’ve elected to build. Mia Ridge, leader of the design/dev team, runs a quick skills audit and gets down to the business of selecting programming languages, frameworks and strategies for the week. They choose to work in Python with the Django framework. Isotope, a jQuery plugin I use in my own development, is selected to drive the results page. A private GitHub repository is set up under a code name. (Beyond Isotope, HTML and CSS, I’m a little out of my element here, so for more technical details, please visit the public repository’s wiki.) The outreach team lead, Jack Dougherty, brainstorms with his team on overall outreach needs and high priority tasks. The Google document from yesterday becomes a Google Drive folder, with shells for press releases, a contact list for marketing and work plans for both teams.
This is the first point where I realize that I am going to have to adjust to a lack of hands on work. I do my best when I’m working a keyboard: making lists, solving problems with code, etc. As one of the project managers, my job is much less on the keyboard and much more about managing people and process.
When the teams come back together to report out, there’s a lot of getting each side up to speed, and afterwards our mentors advise us that the meetings have to be shorter. We’re already at the end of day 2, though both teams will be working into the night on their work plans, and Brian and I still need to set the schedule for tomorrow.
We’re past the point where we can have a lot of discussion, except for maybe about the name.
@briancroxall: Prepare for more radio silence from #owot today as people put their heads down and write/code.
@DoughertyJack: At #owot we considered 120 different names for our tool and FINALLY selected number 121 as the winner. Stay tuned for Friday launch!
Wednesday is tough. We have to come up with a name, and all that exploration from yesterday needs to be a prototype by the end of the day. We are still hammering out the language we use in talking to each other and there’s some middle ground to be found on terminology. One example is the use of the word “standup” in our schedule. “Standup” means something very specific to developers familiar with the Agile development process whereas I just mean, “short update meeting.” Our approach to dealing with these issues is to identify the confusion and quickly agree on language we all understand.
I spend most of the day with the outreach team. We have set a deadline for presenting names at lunchtime and are hoping the whole team can vote after lunch. This schedule turns out to be folly as the name takes most of the day and we have to adjust our meeting times accordingly. As project managers, Brian and I are canceling meetings (because folks are on a roll, we haven’t met a deadline, etc) whenever we can, but we have to balance this with keeping the whole team informed.
Camping out in a living room type space in RRCHNM, spread out among couches and looking at a Google Doc being edited on a big-screen TV, the outreach team and various interested parties spend most of the day brainstorming names. We take breaks to work on the process press release and other essential tasks, but the name is the thing for the moment. We need a name to start working on branding and logos. Product press releases need to be completed, the dev team needs a named target and of course, swag must be ordered.
It is in this process, however, that an Aha! moment occurs for me. We have been discussing names for a long time and folks are getting punchy. The dev team lead and our designer, Amy Papaelias, have joined the outreach team along with most of our CHNM mentors. I want to revisit something dev team member Eli Rose said earlier in the day. To paraphrase, Eli said that he liked the idea that the tool automated or mechanized the concept of surprise. So I repeat Eli’s concept to the group and it isn’t long after that that Mia says, “what about Serendip-o-matic?” The group awards the name with head nods and “I like that”s and after running it by developers and dealing with our reservations (eg, hyphens, really?), history is made.
As relieved as I am to finally have a name, the bigger takeaway for me here is in the role of the manager. I am not responsible for the inspiration for the name or the name itself, but instead repeating the concept to the right combination of people at a time when the team was stuck. The project managers can create an opportunity for the brilliant folks on the team to make connections. This thought serves as a consolation to me as I continue to struggle without concrete tasks.
Meanwhile, on the other side of the building, the rest of the dev team is pushing to finish code. We see a working prototype at the end of the day, and folks are feeling good, but it’s been a long day. So we go to dinner as a team, and leave the work behind for a couple of hours, though Amy is furiously sketching at various moments throughout the meal as she tries to develop a look and feel for this newly named thing.
On the way home from dinner, I think, “there’s only two days left.” All of a sudden it feels like we haven’t gotten anywhere.
The decision to add the Flickr API to our work in order to access the Flickr Commons is made with the dev team, based on the feeling that we have enough time and the images located there enhance our search results and expand our coverage of subject areas and geographic locations.
We also spend today addressing issues. The work of both teams overlaps in some key areas. In the afternoon, Brian and I realize that we have mishandled some of the communication regarding language on the front page and both teams are working on the text. We scramble to unify the approaches and make sure that efforts are not wasted.
This is another learning moment for me. I keep flashing on Sheila’s words from Monday, and worry that our concrete decision making process is suffering from “too many cooks in the kitchen.” Everyone on this team has a stake in the success of this project and we have lots of smart people with valid opinions. But everyone can’t vote on everything and we are spending too much time getting consensus now, with a mere twenty-four hours to go. As a project manager, part of my job is to start streamlining and making executive decisions, but I am struggling with how to do that.
As we prepare to leave the center at 6pm, things are feeling disconnected. This day has flown by. Both teams are overwhelmed by what has to get done before tomorrow and despite hard work throughout the day, we’re trying to get a dev server and production server up and running. As we regroup at the Inn, the dev team heads upstairs to a quiet space to work and eat and the outreach team sets up in the lobby.
The outreach team continues to work on documentation and release strategy, and Brian and I continue to step in where we can. Everyone is working until midnight or later, but feeling much better about our status than we did at 6pm.
@raypalin: If I were to wait one minute, I could say launch is today. Herculean effort in final push by my #owot colleagues. Outstanding, inspiring.
The final tasks are upon us. Scott Williams moves on from his development responsibilities to facilitate user testing, which was forced to slide from Thursday due to our server problems. Amanda Visconti works to get the interactive results screen finalized. Ray Palin hones our list of press contacts and works with Amy to get the swag design in place. Amrys Williams collaborates with the outreach team and then Sheila to publish the product press release. Both the dev and outreach teams triage and fix and tweak and defer issues as we move towards our 1pm “code chill”, a point at which we’re hoping to have the code in a fairly stable state.
We are still making too many decisions with too many people, and I find myself weighing not only the options but how attached people are to either option. Several choices are made because they reflect the path of least resistance. The time to argue is through and I trust the team’s opinions even when I don’t agree.
We end up running a little behind and the code freeze scheduled for 2pm slides to 2:15. But at this point we know: we’re going live at 3:15pm.
The code goes live and the broadcast starts but my jitters do not subside…until I hear my teammates cheering in the hangout. Serendip-o-matic is live.
At 8am on Day 6, Serendip-o-matic had its first pull request and later in the day, a fourth API – Trove of Australia – was integrated. As I drafted this blog post on Day 7, I received email after email generated by the active issue queue and the tweet stream at #owot is still being populated. On Day 9, the developers continue to fix issues and we are all thinking about long term strategy. We are brainstorming ways to share our experience and help other teams achieve similar results.
I found One Week | One Tool incredibly challenging and therefore a highly rewarding experience. My major challenge lay in shifting my mindset from that of someone hammering on a keyboard in a one-person shop to that of a project manager for a twelve-person team. I write for this blog because I like to build things and share how I built them, but I have never experienced the building from this angle before. The tight timeline ensured that we would not have time to go back and agonize over decisions, so it was a bit like living in a project management accelerator. We had to recognize issues, fix them and move on quickly, so as not to derail the project.
However, even in those times when I became more acutely aware of the clock, I never doubted that we would make it. The entire team is so talented; I never lost my faith that a product would emerge. And, it’s an application that I will use, for inspiration and for making fortunate discoveries.
All too often, libraries are forced to start from scratch each time we roll out a new technology. To avoid this problem, librarians should consider middleware as their technology prototyping pipeline. Middleware is a software layer that delivers data in standard XML or JSON formats from a variety of sources. The sources could be distributed across multiple third-party databases, reside in HTML, or be available as other XML-encoded data living on another internet domain. Middleware layers help integrate these data into more robust and flexible data sources that serve as alternatives to turnkey products from vendors. A number of the turnkey services we rely on do not provide APIs, access to the underlying data, or the relevant data librarians need to offer the compelling, twenty-first century services that our students and faculty require. Middleware design is the process of architecting your own library API.
The right middleware design approach will empower libraries to move data between products and services and deliver it to users where they need it. In the example below, the library does not have access to an API of room reservation data, so a library’s data is held captive. With strategic application of middleware design, the library is free to take the data and reformat it in a mobile- or tablet-friendly format, greatly improving access and allowing new services for our users.
Problem: This standalone website allows a student to look up and book available group study rooms in the library. It is a self-service product. But what if we wanted to advertise and show only the currently available study rooms on our website as a data feed? What if we needed *only* currently bookable room data to port into a mobile app? As the website stands, we cannot readily access this data other than on a desktop computer.
Library and Information Science foundations can be used to extend library data into new interfaces and platforms. Half of digital librarianship is really just the description of data using metadata schemes (usually encoded in XML but, more recently, in JSON as well). Before I can talk about establishing a prototyping pipeline, I need to unpack XML. Then I’ll talk about how, building on these foundations, we can create the XML feeds we need to extend library data anywhere and everywhere.
A short primer on XML:
Many others have sung the virtues of XML. I personally like a good clean XML feed because, to me, it means extensibility. XML supports all kinds of system efficiency and data independence. Those are the features I like. Those who develop metadata standards and work on digital library development cite the following virtues of XML (from McDonough, 2008):
XML helps ensure platform (and perhaps more critically vendor) independence;
XML provides the multilingual character support critical to the handling of library materials;
XML’s extensibility and modularity allow libraries to customize its application within their own operating environments;
XML helps minimize software development costs by allowing libraries to leverage existing, open source development tools;
XML, through virtue of being an open standard which enables descriptive markup, may assist in the long-term preservation of electronic materials; and perhaps most importantly
XML provides a technological basis for interoperability of both content and metadata across library systems.
Now that you’re considering drinking the XML Kool-Aid, think about what might be possible if you could pull any library data you wanted (in the form of XML) into any other system, like an iPhone or iPad app. To do that you’d need a RESTful API. RESTful APIs are vitally important for extending library data across systems. You can create your own library API with a middleware layer.
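To make the idea of a small library API concrete, here is a minimal sketch of an HTTP endpoint that answers a GET request with XML. It is not the Jersey code discussed in this post: it uses only the JDK’s built-in `HttpServer`, and the `/rooms/available` path, the class name and the hard-coded XML are all assumptions for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class TinyLibraryApi {

    /** Hard-coded stand-in for data the middleware would gather from real sources. */
    static String availableRoomsXml() {
        return "<rooms><room>Room 106</room></rooms>";
    }

    /** Starts an HTTP server that answers GET /rooms/available (hypothetical path) with XML. */
    static HttpServer start(int port) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
            server.createContext("/rooms/available", exchange -> {
                if (!"GET".equals(exchange.getRequestMethod())) {
                    exchange.sendResponseHeaders(405, -1); // method not allowed, no body
                    exchange.close();
                    return;
                }
                byte[] body = availableRoomsXml().getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().set("Content-Type", "application/xml; charset=utf-8");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Fetches a URL and returns the body as text (used here only to exercise the endpoint). */
    static String fetch(String url) {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = start(0); // port 0 lets the operating system pick a free port
        String url = "http://localhost:" + server.getAddress().getPort() + "/rooms/available";
        System.out.println(url + " -> " + fetch(url));
        server.stop(0);
    }
}
```

The point of the sketch is the shape of the thing: one URL, one HTTP verb, one predictable XML response that any other system can request. A framework such as Jersey takes care of the routing and serialization plumbing that is written out by hand here.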
One layer I’ve learned in the past couple months is a Tomcat/Jersey stack that allows you to pull in data from multiple sources, and then serialize that data to XML. A recent XML feed that was developed this way is an “available now” XML feed of group rooms in the library that are available in the next hour and can be booked immediately. For this example I pull in an additional Java package to the Jersey program — an HTML parser, Jsoup.
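The “available now” rule itself is simple enough to sketch in a few lines of plain Java. This is an illustration of the filtering logic only, assuming the scraped page has already been turned into a list of time slots; the `Slot` class, its fields and the room names are hypothetical and are not taken from the real feed.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

final class AvailableNow {

    /** One bookable slot for a group study room (hypothetical model). */
    static final class Slot {
        final String room;
        final LocalDateTime start;
        final boolean booked;

        Slot(String room, LocalDateTime start, boolean booked) {
            this.room = room;
            this.start = start;
            this.booked = booked;
        }
    }

    /** Returns the names of rooms with an unbooked slot starting within the next hour. */
    static List<String> availableWithinHour(List<Slot> slots, LocalDateTime now) {
        LocalDateTime cutoff = now.plus(Duration.ofHours(1));
        List<String> rooms = new ArrayList<>();
        for (Slot s : slots) {
            // A slot counts if it starts at or after "now" and before the one-hour cutoff.
            boolean startsSoon = !s.start.isBefore(now) && s.start.isBefore(cutoff);
            if (!s.booked && startsSoon && !rooms.contains(s.room)) {
                rooms.add(s.room);
            }
        }
        return rooms;
    }

    public static void main(String[] args) {
        LocalDateTime now = LocalDateTime.of(2013, 9, 3, 14, 0); // fixed time so the demo is repeatable
        List<Slot> slots = List.of(
            new Slot("Room 106", now.plusMinutes(30), false),
            new Slot("Room 108", now.plusMinutes(30), true),
            new Slot("Room 220", now.plusHours(3), false));
        System.out.println(availableWithinHour(slots, now)); // prints [Room 106]
    }
}
```

Passing `now` in as a parameter, rather than reading the clock inside the method, keeps the rule easy to test against a fixed time.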
Jersey is implemented on a Tomcat server. Tomcat can run a number of Java based applications but it is essentially a webserver that we are running Jersey from. It also bears noting that Apache Tomcat is not the same as the Apache webserver that many of us know and use for serving HTML pages.
To serialize a Java data object to XML, Jersey uses a standard MVC architecture: our data model is the model, the new library web page or mobile app that we display is the view, and the Jersey resource file is the controller. In essence, the Jersey output is a set of three Java programs that make up the MVC. The data that Jersey pulls in from the website is HTML. Since the HTML needs to be parsed and then pulled into a data object, we use another Java library called jsoup. With the tutelage of the research programmers in the library, I followed this tutorial on creating a RESTful API with Jersey, which explains the programming annotations needed for creating the web service. These are rather simple to implement once you have your developer environment set up. For this project I worked entirely in Eclipse, since it can also simulate running a Tomcat server on your local machine.
Once you have that feed modeled and serving XML data from the page, you are able to pull that into a new system or interface. Using Apple’s Dashcode I was able to model a prototype of what the room reservation feed might look like in an iPhone app:
Middleware is the digital library prototyping pipeline. It is a profound tool in the digital library toolkit, since it is the foundation for new services, initiatives, and the extension of library data. There are some areas that I breezed over in this post, like how to program the HTML screen scraping necessary to pull data in a webpage into a Java data object. I’ll cover screen scraping with Jersey and Jsoup in my next post. I’ll also submit that having access to the underlying database that powered the room reservation website would have been preferable. We could have imported a Java package that acts as a database connector from Jersey to the XML serializer, but alas, as often happens in the wild, we could not get direct database access to the underlying data in the page. One final thought on the approach used here: the software consists of open source Java tools, so it is free to download and use for your rapid prototyping library needs.
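For readers who have not seen object-to-XML serialization before, here is a rough sketch of the step using only classes that ship with the JDK. In a Jersey project this work is typically driven by annotations on the model class rather than written by hand, so treat this as an illustration of the idea and not as the code behind the room feed; the class name and the `rooms`, `room` and `status` names are assumptions.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class RoomFeedXml {

    /** Serializes room names to a small XML document; element names are illustrative. */
    static String toXml(List<String> roomNames) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("rooms");
            doc.appendChild(root);
            for (String name : roomNames) {
                Element room = doc.createElement("room");
                room.setAttribute("status", "available");
                room.setTextContent(name); // special characters such as & are escaped on output
                root.appendChild(room);
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize room feed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(List.of("Room 106", "Study Room A & B")));
    }
}
```

Building the document through the DOM API, instead of concatenating strings, means the serializer handles escaping, which matters as soon as a room name contains an ampersand or an angle bracket.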