All too often, libraries are forced to start from scratch each time we roll out a new technology. To avoid this problem librarians should consider middleware as their technology prototyping pipeline. Middleware is a software layer that delivers data in standard XML or JSON formats from a variety of sources. The sources could be distributed in multiple third party databases, residing in HTML, or available as other XML encoded data living on another internet domain. Middleware layers help to integrate these data into more robust and flexible alternative data sources to turnkey products from vendors. A number of turnkey services we rely on do not provide APIs, access to the underlying data or the relevant data librarians need to offer compelling, twenty-first century services our students and faculties require. Middleware design is the process of architecting your own library API.
The right middleware design approach will empower libraries to move data between products and services and deliver it to users where they need it. In the example below, the library does not have access to an API of room reservation data, so a library’s data is held captive. With strategic application of middleware design, the library is free to take the data and reformat it in mobile or tablet-friendly format, greatly improving access and allowing new services for our users.
Problem: This stand alone website allows a student to look up and book available group study rooms in the library. It is a self-service product. But what if we wanted to advertise and show only the currently available study rooms into our website as a data feed? What if we needed *only* currently bookable room data to port into a mobile app? As the website stands we cannot readily access this data other than on a desktop computer.
Library and Information Science foundations can be utilized to extend library data into new interfaces and platforms. Half of digital librarianship is really just the description of data using metadata schemes (usually encoded in XML, but more recently, JSON as well.) Before I can really talk about establishing a prototyping pipeline I need to unpack XML and then I’m going to talk about how building on these foundations we can create the XML feeds we need to extend library data anywhere and everywhere.
A short primer on XML:
Many others have sung the virtues of XML. I personally like a good clean XML feed because to me, it means extensibility. XML supports all kinds of system efficency and data independence. Those are the features I like. Those who develop metadata standards and work with digital library development cite the following when promulgating XML virtues from McDonough, 2008 :
- XML helps ensure platform (and perhaps more critically vendor) independence;
- XML provides the multilingual character support critical to the handling of library materials;
- XML’s extensibility and modularity allow libraries to customize its application within their own operating environments;
- XML helps minimize software development costs by allowing libraries to leverage existing, open source development tools;
- XML, through virtue of being an open standard which enables descriptive markup, may assist in the long-term preservation of electronic materials; and perhaps most importantly
- XML provides a technological basis for interoperability of both content and metadata across library systems.
Now that you’re considering drinking the XML kool-aide, think about what might be possible if you could pull any library data you wanted (in the form of XML) into any other system — like an iPhone or iPad app. To do that you’d need a RESTful API. RESTful APIs are vitally important for extending library data across systems. You can create your own library API with a middleware layer.
One layer I’ve learned in the past couple months is a Tomcat/Jersey stack that allows you to pull in data from multiple sources, and then serialize that data to XML. A recent XML feed that was developed this way is an “available now” XML feed of group rooms in the library that are available in the next hour and can be booked immediately. For this example I pull in an additional Java package to the Jersey program – an HTML parser, Jsoup.
Jersey is implemented on a Tomcat server. Tomcat can run a number of Java based applications but it is essentially a webserver that we are running Jersey from. It also bears noting that Apache Tomcat is not the same as the Apache webserver that many of us know and use for serving HTML pages.
In order to serialize a Java data object to XML Jersey uses a standard MVC architecture, where our data model is the model, the new library web page/mobile app that we display is the view and the Jersey resource file is the controller. In essence the Jersery output is a set of three Java programs that comprise the MVC. The data that Jersey is pulling in from the website is HTML. Since the HTML needs to be parsed and then pulled into a data object, we use another Java library called jsoup. Along with the tutelage of the research programmers in the library, I followed this tutorial on creating a RESTful API with Jersey, that explains the programming annotations needed for creating the web-service, which are rather simple to implement once you have your developer environment set up — for this project I worked entirely in Eclipse since it can also simulate running a Tomcat server on your local machine.
An example of the feed is below (abbreviated):
<models> <model> <date>1/27/2013</date> <endTime>5:00 PM</endTime> <roomName>Collaboration Room 01 - Undergraduate Library</roomName> <startTime>4:00 PM</startTime> </model> <model> <date>1/27/2013</date> <endTime>5:00 PM</endTime> <roomName>Collaboration Room 02 - Undergraduate Library</roomName> <startTime>4:00 PM</startTime> </model> ...
Once you have that feed modeled and serving XML data from the page you are able to pull that into a new system/ interface. Using Apple’s Dashcode I was able to model a prototype of what the room reservation feed might look like in an iPhone app:
Middleware is the digital library prototyping pipeline, it is a profound tool in the digital library toolkit since it is the foundation for new services, initiatives, and the extension of library data. There are some areas that I breezed over in this post, like how to program the HTML screen scraping necessary to pull data in a webpage into a Java data object. I’ll cover screen scraping with Jersey and Jsoup in my next post. I’ll also submit that having access to the underlying database that powered the room reservation website would have been preferable. We could have imported a Java package that acts as a database connector from Jersey to the XML serializer — but alas, as often happens in the wild we also could not get direct database acces to the underlying data in the page. One final thought on the approach used here: the software are open source Java tools — so they are free to download and utilize for your rapid prototyping library needs.
McDonough, Jerome. “Structural Metadata and the Social Limitation of Interoperability: A Sociotechnical View of XML and Digital Library Standards Development.” Presented at Balisage: The Markup Conference 2008, Montréal, Canada, August 12 – 15, 2008. In Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). doi:10.4242/BalisageVol1.McDonough01.
Two years after the initial meeting for the Digital Public Library of America, another major planning and update meeting took place in Chicago at DPLA Midwest. At this meeting the steering committee handed the project over to the inaugural board and everyone who has been working on the project talked about what had happened over the past few years and the ambitious timetable to launch in April 2013.
In August I wrote about the DPLA and had many unanswered questioned. Luckily I had the opportunity to attend the meeting and participate heavily in the backchanel (both virtual and physical). This post is a report of what happened at the general meeting (I was not able to attend the workstream meetings the day before). This is a followup to my last post about the Digital Public Library of America–then I felt like an observer, but the great thing about this project is how easy it is to become a participant.
Looking Back and Ahead
The day started with a welcome from John Palfrey, who reported that through the livestream and mailing lists there were over a thousand active participants in the process. The project seemed two years ago (and still does) seem to him “completely ambitious and almost crazy,” but actually is working out. He emphasized that everything is still “wet clay” and a participatory process, but everything is headed to April 2013 for the public launch with initial version of the service and a fair amount of content being available. We will come back a bit later to exactly what that content is and from what sources it will come.
In this welcome, Palfrey introduced several themes that the day revolved around–that the project is still moldable despite the structure that seems to be there (the “wet clay”), and that it is still completely participatory even though the project will recruit an Executive Director and has a new board. One of the roles of the board will be to ensure that participation remains broad. The credentials of the board and the steering committee are impressive; but they cannot get the project going without a lot of additional support, both financial and otherwise.
The rest of the day was organized to talk about supporting the DPLA, reporting on several of the “hubs” that will make up the first part of the content available, the inaugural board, and the technical and platform components of the DPLA. The complete day, including tweets and photos was captured in a live blog. While much of interest took place that day, I want to focus on the content and the technical implementation as described during the day.
Content: What will be in the DPLA?
Emily Gore started in September of this year as the Director of Content, and has been working since then to get the plans in motion for the initial content in the DPLA. She has been working with seven exisiting state or regional digital libraries as so-called “Service Hubs” and “Content Hubs” to take the steps to begin aggregating metadata that will be harvested for the DPLA and get people to the content. The April 2013 launch will feature exhibits showcasing some of this content–topics include civil rights, prohibition, Native Americans, and a joint presentation with Europeana about immigration.
The idea of these “hubs” is that there are already many large digital libraries with material, staff, and expertise available–as Gore put it, we all have our metadata geeks already who love massaging metadata to make it work together. Dan Cohen (director of the Roy Rosenzweig Center for History and New Media at George Mason University) gave the analogy in his blog of the local institutions having ponds of content, which then are fed into the lake of the service hubs, and then finally into the ocean of the DPLA. The service hubs will offer a full menu of standardized digital services to local institutions, including digitization, metadata consultation, data aggregation, storage services, community outreach, and exhibit building. These collaborations are crucial for several reasons. First, they mean that great content that is already available will finally be widely accessible to the country at large–it’s on the web, but often not findable or portable. Regional content hubs will be able to work with their regions more effectively than any central DPLA staff, which simply will not have the staff to deal with one-to-one relationships with all the potential institutions who have content. The pilot service hubs are Mountain West, Massachusetts, Digital Library of Georgia, Kentucky, Minnesota, Oregon, and South Carolina. The digital hubs project has a two year timeline and $1 million in funding, but for next April they will prepare metadata and content previews for harvest, harvest existing metadata to make it available for launch, and develop exhibitions. After that, the project will move on to new digitization and metadata, aggregation, new services, new partners, and targeted community engagement.
Representatives from two of the service hubs spoke about the projects and collections, which was the best view into what types of content we can expect to see next April. Mary Molinaro from Kentucky gave a presentation called “Kentucky Digital Library: More than just tobacco, bourbon, and horse racing.” She described their earliest digitization efforts as “very boutique–every pixel was perfect”, but it wasn’t cost effective or scalable. They then moved on to a system of mass digitization through automating everything they could and tweaking workflows for volume. Their developers met with developers from Florida and ended up using DAITSS and Blacklight to manage the repository. They are now at the point where they were able to scan 300,000 pages in the last year, and are reaching out to other libraries and archives around the state to offer them “the on-ramp to the DPLA”. She also highlighted what they are doing with oral history search and transcription with the Oral History Metadata Synchronizer and showed some historical newspapers.
Jim Butler from the Minnesota Digital Library spoke about the content in that collection from an educational and outreach point of view. They do a lot of outreach to to local historical societies and libraries and other cultural organizations to find out what collections they have and digitize them, which is the model that all the service hubs will follow. One of the important projects that he highlighted was an effort to create curricular guides to facilitate educator use of the material–the example he showed was A Foot in Two Worlds: Indian Boarding Schools in Minnesota, which has modules to be used in K-12 education. He showed many other examples of material that would be available through the DPLA, including Native American history and cultural materials and images of small town life in 19th and 20th century Minnesota. Their next steps are to work on state/region wide digital library metadata aggregation, major new digitization efforts, and community-sourced digital documentation, particularly in terms of Somali and Hmong communities self-documentation.
Followup comments during the question portion of these presentations emphasized that the goal of having big pockets of content is to work with those smaller pockets of content. This is a pilot business model test case to see how aggregating all these types of content together actually works. It is important to remember that for now, the DPLA is not ingesting any content, only metadata. All the content will remain in the repositories at each content hubs.
An additional component is that all the metadata in the DPLA will be licensed with a CC0 (public domain) license only. This will set the tone that the DPLA is for sharing and reusing metadata and content. It is owned by everyone. This generated some discussion over lunch and via Twitter about what that actually would mean for libraries and if it would cause tension to release material under a public domain license that for-profit entities could repackage and sell back to libraries and schools. Most people that I spoke to felt this was a risk worth taking. Of course, future content in the DPLA will be there under whatever copyright or license terms the rightsholder allows. Presumably most if not all of it will be material in the public domain, but it was suggested, for instance, that authors could bequeath their copyrights to the DPLA or set up a public domain license through something like unglue.it. Either way, libraries and educators should share all the materials they create around DPLA content, and by doing so will mean less duplicate effort.
Technology: How will the DPLA work?
Jeff Licht, a member of the technical development advisory board, spoke about the technical side of the DPLA. The architecture for the system (PDF overview) will have at its core a metadata repository aggregated from various sources described above. An ingester will bring in the metadata in usable form from the service hubs that will have already cleaned up the data, and then an API will expose the content and allow access to front ends or apps. There will also be functions to export the metadata for analysis that cannot easily be done through the API. The metadata schema (PDF) types that they collect will be item, collection, contributor, event.
One of the important points that raised a lot of discussion was that while they have contracted with iFactory to have a front end available by April, this front end doesn’t have more priority or access to the API than something developed by someone else. In fact, while someone could go to dp.la to access content, the planners right now see the DPLA “brand” as sublimated to other points of access such as local public libraries or apps using the content. Again, the CC0 license makes this possible.
The initial front end prototype is due for December, and the new API is due in early November for the Appfest (see below for details). There will be an iterative process between the API and front end between December and March before the April launch, with of course lots of other technical details to sort out. One of the things they need to work on is a good method for sharing contributed modules and code, which hopefully will be done in the next few weeks.
Anyone can participate in this process. You can follow the Dev Portal on the DPLA wiki and the Technical Aspects workstream to participate in decision making. Attending the Appfest hackathon at the Chattanooga Public Library on November 8 and 9 will be a great way to spend time with a group creating an application that will use the metadata available from the hubs (the new API will be completed before the Appfest). This is the time to ask questions and make sure that nothing is being overlooked.
Conclusion: Looking ahead to April 2013
John Palfrey closed the day with reminding everyone that April is just the start, and not to be disappointed with what they see then. If April delivers everything promised during the DPLA Midwest meeting, then it will be a remarkable achievement–but as Doran Weber from the Sloan Foundation pointed out, the DPLA has so far met every one of its milestones on time and on budget.
I found the meeting to be inspirational about the future for libraries to cross boundaries and build exciting new collections. I still have many unanswered questions, but as everyone throughout the day understands, this will be a platform on which we can build and imagine.