I have a confession: XSLT (eXtensible Stylesheet Language Transformations) is one of those library technologies I’ve dreaded learning. While it’s come up several times in my career, I’ve always managed to avoid it. I’m not normally like this—I truly enjoy learning new skills, especially programming languages or coding tools, so much so that I’ll find myself diving into tutorials or books right after returning home from work.
Why is XSLT so abhorrent? I have more opinions about programming languages than I do expertise. Here are some opinions: programming languages should be readable. If they have shorthands, they should be elegant & intuitive. They shouldn’t need much boilerplate. And they should be recognizable to a degree; for better or worse, many programming languages share one of a very small number of lineages which means that recognizing fundamental constructs like functions & arrays isn’t too difficult if you can identify the family a language belongs to.
XSLT’s family is XML. An XSLT script is, in fact, valid XML. And XML is not a programming language, it’s a markup language that’s used for virtually everything. While I understand XML’s ubiquity, I’m not a fan of it in general & I think other serializations make more sense in many situations. As an example, take a list of elements in XML:
```xml
<array>
  <element>One</element>
  <element>Two</element>
  <element>Three</element>
</array>
```
versus my favored format, JSON:
"array": [ "one", "two", "three" ]
versus the even cleaner YAML:
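```yaml
array:
  - one
  - two
  - three
```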
Background: IR OAI OMG
Alright, now that you’ve read entirely too much about What Eric Thinks of XSLT, why was he forced to learn it? And why did he start speaking in the third person all of a sudden? He should switch back to first person.
My institution wants to start exposing our digital collections more. While the human-facing web presence of our institutional repository is growing, we also need to start publishing our metadata in a machine-readable format. This will allow our collections to be consumed by large aggregators, specifically Calisphere, Worldcat, & DPLA.1 Luckily, libraries already have a well-established standard to turn to for these needs, OAI-PMH. OAI-PMH lets us expose our repository metadata in XML in a way that allows a harvesting application to periodically fetch batches of records, adding new ones & updating ones that have changed.
Right away, there are challenges to exposing our EQUELLA repository’s metadata. We use MODS in our repository, while OAI expects you to use Dublin Core. Luckily, these are two common formats & there’s a lot of information on how to map information between them.2 Unfortunately, our MODS schema is heavily modified. Decisions were made to add or alter elements based on local needs. What’s worse, in order to make certain user-facing fields easier to use, our repository software makes us insert wrapper elements into our metadata schema. All of this combines to make our MODS-to-DC mapping utterly unique & more complicated than usual.
When I went to inspect our repository’s OAI implementation out of the box, I was greeted by records like this one:
```xml
<record>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:dc="http://purl.org/dc/elements/1.1/"
      xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
      <dc:title>Expectations</dc:title>
      <dc:description>Explored the experience of pregnant and parenting teenagers via an installation and a symposium, "Cribs, Classrooms, and Communities: The Teen Pregnancy Controversy."</dc:description>
    </oai_dc:dc>
  </metadata>
</record>
```
The default settings were showing only a Dublin Core title & description for each item, which was certainly not bad for no configuration at all, but not desirable. We heavily invested in designing our repository’s upload form & cataloging. We want that work to shine through in the metadata we provide to external services. Now, after applying some very basic XSLT which I’ll cover in the remainder of this post, our OAI endpoint looks quite a bit better. There are numerous fields represented for each record, though our original records are still a bit richer in information.
A Metadata Transformation Language
Let’s talk first about what XSLT is before we talk about how to use it. XSLT is unique in that it’s specifically designed to transform XML documents. This doesn’t necessarily mean it was designed for mapping from one metadata schema to another, as XML is used for more than just metadata & XSLT can do more than just shuffle around values, but it does mean that the language is uniquely suited to that task. XSLT allows us to change certain elements in the original document, alter text, & add or drop pieces of information.
In this post, we’ll specifically look at converting the Library of Congress’ MODS schema to Dublin Core. LOC has provided a handy map between the two which illustrates the complexity of the task. A few things that we need to address:
- All data is going to need a new field; there isn’t really a way to simply “leave” an element in its place
- MODS is hierarchical, meaning some elements have child elements, while Dublin Core is flat: every element sits at the top level & bears no children
- Sometimes multiple MODS elements will collapse into a single Dublin Core field, e.g. subject & classification all file under DC’s subject
- Inversely, a MODS name/namePart element might map to either the DC Creator or Contributor fields depending upon the role of the person being referred to (captured in the role/roleTerm child of the name element)
- Both schemas have their own slightly differing vocabulary of resource types, stored in MODS’ typeOfResource & DC’s Type, so what one schema considers a “sound recording-nonmusical” the other considers merely “Sound”, for instance
I see mapping between metadata schemas as a subset of data wrangling, & data wrangling is one of the most common aspects of my Systems Librarian position: I take CSV reports from our student data system & map them into a weird MARC-like format for the Millennium ILS to ingest, I take course lists & turn them into a controlled vocabulary of sorts in our repository, & of course I convert our repository’s MODS into OAI DC. All of these procedures are duct-taped together with custom scripts filled with comments detailing bizarre data behaviors. XSLT is one of the few languages designed for this type of work. While there are software packages like OpenRefine or Stanford’s Data Wrangler, sometimes the power & flexibility of a programming language is preferable. It’s disappointing to me that the only prominent choice is also XML-based.
To try out the examples below, you’ll need an XML document to use as input (preferably MODS, as that’s what the examples target) & an XSLT processor. On Mac OS X, the built-in program
xsltproc does the trick. You can run a document through a transformation using the following syntax on the command line:
> xsltproc --output output.xml mapping.xsl input.xml
For Linux, you can install
xsltproc as part of the libxml package. Windows users can install the Saxon XSLT processor. Most web browsers support XSLT processing, too. If you add a line like
<?xml-stylesheet type='text/xsl' href='name-of-stylesheet.xsl'?> up at the top of an XML file, the browser will automatically run the XML through your stylesheet & present you with the results.
Pretty much all XSLT starts with some boilerplate that looks like this:
<?xml version="1.0" encoding="UTF-8"?> <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> <xsl:template match="@*|node()"> <oai_dc:dc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> ... </oai_dc> </xsl:template> </xsl:stylesheet>
You start an XML document, open up an
xsl:stylesheet element with a certain namespace, & define a template inside which you’re going to apply to the incoming document. The
match attribute can differ depending on your objective—here we’re selecting all data in the input document, attributes & nodes—but that’s the basic setup. You then declare the root element of your output document inside the template, which here is the
<oai_dc:dc> tag. What matters is what goes on inside the template, represented in the example with an ellipsis, since that gives us the ability to map elements to new locations.
Confession: I haven’t taken the time to fully learn the usage of the
xsl:template element. You can use multiple of them within an
xsl:stylesheet, they can target different elements in the origin document with their
match attributes, & you can call them later with
xsl:apply-template(s). This lets you modularize your stylesheet into several smaller, focused templates. But I won’t discuss them further, in the hopes that showing other features provides enough detail to communicate the substance of XSLT.
Let’s look at another example, which simply copies the MODS’
identifier element to Dublin Core’s
dc:identifier, skipping everything else in the input document. Going forward, I will leave off the XML prolog
<?xml …?> & all of the wrapping
oai_dc:dc elements for brevity’s sake.
```xml
<dc:identifier>
  <xsl:value-of select="mods/identifier" />
</dc:identifier>
```
For a MODS document with identifier “10881088”, the full output of this stylesheet is:
<?xml version="1.0"?> <oai_dc:dc xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:dc="http://purl.org/dc/elements/1.1/" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd"> <dc:identifier>10881088</dc:identifier> </oai_dc:dc>
We can see that the root element of the document has been transformed into
oai_dc:dc & our identifier is present inside a
dc:identifier element. So two things from our template above happened: first of all, anything that’s not an
xsl: prefixed element is mostly output on the other end in exactly the same form. This goes for text as well as markup language tags, such as our
oai_dc:dc opening & closing tags which are recreated exactly. Inside
dc:identifier, something else happens: the
xsl:value-of element selects an element from the origin document & replaces itself with that value. The key piece is the
select attribute, which accepts an XPath query it uses on the input.
XPath is another entire language you need to know to use XSLT to transform XML documents. Luckily, if you’ve worked much with XML at all, you are probably familiar with its basics. XPath lets us query a document by providing hierarchical paths such that “mods/identifier” matches an “identifier” element which is the child of a “mods” element. The XPath queries in this post won’t be more sophisticated than that.
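A few more XPath patterns for orientation (only the first appears in our stylesheets; the rest are generic illustrations of the syntax):

```
mods/identifier           an identifier element that is a child of mods
mods/titleInfo/title      a title nested two levels down
mods/name[1]/namePart     the namePart of only the first name element
//abstract                any abstract, anywhere in the document
```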
Let’s do the same thing but map several elements to different places. The example below is starting to resemble a more fully-fledged stylesheet which could actually be of use.
```xml
<dc:identifier>
  <xsl:value-of select="mods/identifier" />
</dc:identifier>
<dc:title>
  <xsl:value-of select="mods/titleInfo/title" />
</dc:title>
<dc:description>
  <xsl:value-of select="mods/abstract" />
</dc:description>
```
We’ve now mapped the identifier, abstract, & title from MODS into their appropriate Dublin Core locations. Yay! While this may seem straightforward, it is worth noting we took the MODS titleInfo/title element out of its parent titleInfo element & placed it in the top level of Dublin Core’s flat schema, handling one of the mapping complexities we noted earlier.
However, what happens when we pass a document lacking a title or abstract through our stylesheet? Say we start with input.xml:
```xml
<mods>
  <identifier>10881088</identifier>
</mods>
```
On the command line, we run:
> xsltproc --output output.xml transform.xsl input.xml
We get the resulting output.xml:
```xml
<oai_dc:dc>
  <dc:identifier>10881088</dc:identifier>
  <dc:title></dc:title>
  <dc:description></dc:description>
</oai_dc:dc>
```
Even though the xsl:value-of elements didn’t find any content & thus returned empty strings, we were still telling our transformation to produce dc:title & dc:description fields. One way to work around this is using conditional if statements to first check if a field has any text in it before returning its corresponding element in the output schema. If statements in general are extremely useful during transformations, so let’s take a look at some.
```xml
<dc:identifier>
  <xsl:value-of select="mods/identifier" />
</dc:identifier>
<xsl:if test="mods/titleInfo/title != ''">
  <dc:title>
    <xsl:value-of select="mods/titleInfo/title" />
  </dc:title>
</xsl:if>
```
This transformation does our standard mods/identifier->dc:identifier mapping (hopefully every document has an identifier…), but it uses the
xsl:if element to first test if mods/titleInfo/title has any content before outputting anything. The
xsl:if syntax is
<xsl:if test="condition"> where the “condition” can be one many standard comparisons that returns a “true” or “false” value: equal to “=”, not equal to “!=”, greater than “\>”, & less than “\<“. Note that, since we’re working with XML, we cannot use the greater or less than signs so we must use their XML character entity forms.
Hyper Advanced (not really)
We can get a long way with just
xsl:if, & our innate verve, but let’s learn a few more useful XSLT constructs. Remember earlier when we noticed that, depending on the person’s role, the “namePart” value in MODS might be mapped to one of two DC elements, Creator or Contributor? We can’t actually handle that situation with what we know, because we’ve only mapped singular & not repeating elements. Our “select” & “test” attributes will only select the first matching element in the origin MODS document. We need some kind of loop that lets us iterate over repeated elements while also testing their values.
Below, we use
xsl:for-each to loop over a selected element, & then we apply two tests to each element to determine if it’s referring to an author or an editor.
```xml
<xsl:for-each select="mods/name">
  <xsl:if test="contains(role/roleTerm, 'author')">
    <dc:creator>
      <xsl:value-of select="namePart" />
    </dc:creator>
  </xsl:if>
  <xsl:if test="contains(role/roleTerm, 'editor')">
    <dc:contributor>
      <xsl:value-of select="namePart" />
    </dc:contributor>
  </xsl:if>
</xsl:for-each>
```
xsl:for-each in combination with
xsl:if accomplished one of the more onerous data mapping tasks we’re faced with in a rather elegant manner. Inside our
test attributes we also see something new: we’re using a
contains() function. XSLT has many functions which can be used inside certain attributes. The
contains() syntax is straightforward: it accepts two parameters, an XPath & a string of text, returning “true” if the text is found in the value of the XPath or false otherwise.
Alright, now let’s tackle something truly intimidating: how could we possibly handle mapping the two “resource type” vocabularies to one another? We can see now that a very long series of
xsl:if statements inside a
xsl:for-each loop should get the job done. But there’s a slightly nicer method available in xsl:choose:
```xml
<xsl:for-each select="mods/typeOfResource">
  <xsl:variable name="text" select="text()" />
  <xsl:choose>
    <xsl:when test="$text = 'cartographic material'">
      <dc:type>Image</dc:type>
    </xsl:when>
    <xsl:when test="$text = 'software'">
      <dc:type>Software</dc:type>
    </xsl:when>
    <!-- covers 3 values: sound recording, sound recording-musical, sound recording-nonmusical -->
    <xsl:when test="starts-with($text, 'sound recording')">
      <dc:type>Sound</dc:type>
    </xsl:when>
    <xsl:when test="$text = 'three dimensional object'">
      <dc:type>Image</dc:type>
    </xsl:when>
  </xsl:choose>
</xsl:for-each>
```
There are a few little tricks in here, but the general frame should be apparent: we loop over all the
typeOfResource elements and, when they match one of the tests in our
xsl:when element, the text or XSLT commands inside the when block are produced. This lets us set up a nice crosswalk between the MODS & DC type vocabularies. It’s also a bit faster than a series of “if” statements, since the
xsl:choose block will exit as soon as a test returns true, while all if statements would execute even if only the very first one was necessary. Finally, it’s also possible to provide an
xsl:otherwise block at the end of a series of
xsl:when statements which is used as a fallback: if none of the “when” tests were true, the fallback is output. If there were overarching “base type” or “unknown type” values in Dublin Core then this would be useful.
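Sketched with the same $text variable as the example above, a fallback might look like this (the “Text” default is a hypothetical choice for illustration, not part of our actual mapping):

```xml
<xsl:choose>
  <xsl:when test="$text = 'software'">
    <dc:type>Software</dc:type>
  </xsl:when>
  <xsl:otherwise>
    <!-- none of the when tests matched; emit a hypothetical default -->
    <dc:type>Text</dc:type>
  </xsl:otherwise>
</xsl:choose>
```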
The two other new pieces we saw in the above XSLT snippet: we used
xsl:variable to store the text value of the
typeOfResource node, obtained with the
text() function, & then referred to it later by attaching a dollar sign sigil to the variable’s name, much like BASH, Perl, & PHP do with their variables. Looking up the value of the node only once & then using the stand-in variable is another minor speed optimization. We also used the XSLT function
starts-with to test three MODS type values at once. How did I know that function exists? I Googled it, like a professional software developer. The W3Schools reference on functions is a thorough overview. In general, if you find yourself thinking “I bet there’s a handy shortcut function that would make this less painful…” then you should search for one.
After much stumbling, followed by trial & error, followed by more stumbling, our OAI endpoint is looking much better thanks to my XSLT stylesheet. We’re publishing format, creator, contributor, type, & rights information in Dublin Core. I’m certain that my examples here, & the code in my final script, don’t follow XSLT best practices. I’m quite shaky on some of the fundamental mechanics of XSLT, like
xsl:template 3. Nonetheless, I hope this post gives a broad overview of the technology & its application. Go forth & transform!
- Incidentally, I’m also trying to get our collections better indexed by search engines like Google & Google Scholar using Schema.org & other metadata embedded in HTML. That’s a very different beast though & not strongly related to the XSLT work that I am discussing here. ↩
- I referred to the Library of Congress’ guide on the MODS site when constructing our crosswalk. ↩
- I suspect my stylesheets would need far fewer xsl:if statements if I used templates effectively & not just as a boilerplate wrapper element. ↩
Updated 1/20/14: I made a couple changes based on feedback from Scott Young, who also suggested readers check out Linked Data Publishing for Libraries, Archives, and Museums: What Is the Next Step? and The like economy: Social buttons and the data-intensive web for further information.
Now that I’ve driven our blog’s SEO through the roof, it’s time to get nerdy. Social media doesn’t have to be all about memes and Katy Perry. Nope, it can be about metadata, too. Isn’t that wonderful?
In the dark olden days, when a piece of your web content was distributed on a social network, it was impossible to control how it would be presented. At its most simplistic, the destination network would take your URL, turn it into a hyperlink, and be done. There was no preview, no indication of what lay behind that HTTP request. If you used a mature social network, like Facebook, the service might scan through the URL, find some sample text, and offer that as a preview. If you were lucky, it’d find an image and offer that to your potential audience, too. Armed with a slight textual or visual preview, presumably users would be more likely to explore your content.
Nowadays social networks have mostly solved this problem. They try to make external content as appealing as possible. They want you to click on shared links, and thus they want shared links to be more than blue, underlined letters. Images and movies are springing up where there used to be only a stock-ticker timeline of text. This post will explore a couple popular methods of enhancing your web content’s appearance on social networks using embedded metadata.
Anyone who has written a little HTML has probably already set their web page up to be appealing when seen through a third-party service. In the administrative <head> of your websites, where various metadata and external resources like stylesheets lie, there’s probably a line like this:
<meta type="description" content="Innumerable cats followed by robots followed by more cats.">
That’s a self-closing <meta> tag, which the W3C says is for “various kinds of metadata that cannot be expressed using the title, base, link, style, and script elements.” In practice, this means that how <meta> tags are used is up to the whims of applications that consume HTML. Web browsers, for instance, do not display the “description” which we’ve added.
So what’s the point of <meta name="description">? Well, search engines use them. Search engine companies like Google have automated programs called bots that continuously crawl the web, downloading the contents of each web page and processing them. Google takes your description seriously and often uses it as the teaser text that appears alongside a search result. So this invisible, pretty much useless HTML element is now defining how our site appears when viewed through a massively popular third party. Now, if you don’t add a description, Google still shows some preview text. But it will guess; it will crawl the text of your page and display what it thinks are the most meaningful snippets. The result is often dismal: text that’s visually hidden using CSS might be chosen, meaningless strings of navigation links might appear, even this blog shows the lede from the latest post rather than a general description of what we’re about.
Here’s another made-up example, based on a search result that has no <meta name="description">:
Library ». Special Collections ». Reference ». Mutton Soup: More Adventures of Johnny Mutton by James Proimos — Why do people lie? Do gender and personality …
Pieces of navigation text, characters which are probably visual aids, and titles from a book carousel all appear as disconnected text.
Without further ado, let’s see how we might further enhance a web page using Twitter Cards, a special schema for <meta> tags. Here’s markup for this blog post:
<meta name="twitter:card" content="summary" > <meta name="twitter:creator" content="@phette23" > <meta name="twitter:site" content="@ALA_ACRL" > <meta name="twitter:title" content="Gearing Up Your Sites for Sharing with Twitter & Facebook Meta Tags" > <meta name="twitter:description" content="Add Twitter Cards and Facebook's Open Graph to your website's meta tags for fun and profit." > <meta name="twitter:image" content="http://acrl.ala.org/techconnect/wp-content/uploads/2014/01/msu-twitter-card.png" >
The overall idea should be apparent: the name attribute defines which metadata field you’re using while the content attribute fills in the value for that field. The “twitter:” prefix is a kind of namespacing, which may be familiar if you’ve worked with XML. It basically serves to say “hey, use the Twitter schema to interpret this field’s meaning.” This is useful because, as we’ve already seen, some metadata fields might collide: if Twitter just used “description” as a field name, another different service might do the same, and both applications would get confused when parsing a site catering to both. Namespacing solves this problem.
Let’s walk through the various fields we’ve used:
- twitter:card is the type of card, of which there are eight, each with its own rationale and set of fields. The most pertinent ones center around displaying an image, gallery, app, or video alongside your content but you can read their documentation to discover all the types.
- twitter:creator is the Twitter account of the creator of the content. This could be very useful: just because someone shares this post, it doesn’t mean they know my Twitter handle or have to properly attribute me. Now, when someone clicks a link to content with @phette23 as the “twitter:creator,” the expanded view automatically shows my username whether it’s included in the tweet text or not. That saves some precious characters, too!
- twitter:site is optional and can be a larger organization. Twitter’s own example makes more sense here: when an NYT journalist writes a story, the story can be associated both with the author’s account as well as the publication’s.
- twitter:title is the title of the content.
- twitter:description is a brief description.
- twitter:image is the image that appears alongside the shared content in an expanded view.
The big payoff is that, when someone selects a tweet in their timeline, they’ll suddenly get a much richer, expanded view with affiliated accounts, context, and an image. Here’s a real, live example courtesy of Scott Young and Montana State University:
Can you spot where the “twitter:site,” “twitter:description,” and “twitter:image” appear? What was just a bland URL has exposed its underlying resource in a much fuller manner.
Finally, you can run your page through a validator to ensure that you marked up the tags properly.
Let’s move on to Facebook, the web’s aging but still dominant social network. Facebook does a sensational job of picking images out of a shared URL, but problems can still occur. The day before I drafted this post, I happened to spot this issue:
I’m not sure what happened here, but it appears as if Facebook picked a blown up, pixelated version of the Google logo for the link, which has nothing to do with Google. If I’m scanning through my feed, I might expect to see smiling students for this news item, but an indistinct, jumbled mess does little to attract my attention and I move right along.
Let’s fix that. Credit where it’s due, David Walsh’s post on Facebook meta tags is a great starting place and where I learned about them. Here’s an example:
<meta property="og:image" content="http://24.media.tumblr.com/0fc9023daa303558d036ecd63fd2c24e/tumblr_mjedslIPPH1qbyxr0o1_500.gif" > <meta property="og:title" content="Gearing Up Your Sites for Sharing with Twitter & Facebook Meta Tags"> <meta property="og:url" content="http://acrl.ala.org/techconnect/"> <meta property="og:site_name" content="ACRL Tech Connect"> <meta property="og:type" content="article"> <meta property="og:description" content="How to use meta tags to control how web content appears when shared on social networks" >
Most of this is pretty explanatory. Facebook calls this the “Open Graph Protocol,” thus all the “og:” prefixes which are namespacing these fields just like “twitter:” did above. The markup is straightforward: the property attribute defines what metadata field you’re talking about while the content attribute fills in the value for that field. Here’s a quick listing of the Open Graph fields and what you need to know:
- og:image is an associated image. This is perhaps the most important field depending on your goals, since it gives you a chance to put your most eye-catching image alongside your content.
- og:title is the title of the work at hand; notice how that means this particular blog post, not something larger (like the blog itself or ACRL).
- og:description is a one or two sentence description, similar to a typical <meta name="description">.
- og:url is the canonical URL of the item you’re sharing. This may not make much sense, but much web content is accessible at multiple URLs these days. Consider this post: if you’re reading shortly after publication, it might be on the Tech Connect home page. A few weeks later, it might be at http://acrl.ala.org/techconnect/?paged=2. But it will always be at http://acrl.ala.org/techconnect/?p=4062. That’s the canonical URL and the one we want associated with it on Facebook.
- og:site_name is the larger website upon which a piece of content lives, so that’s the Tech Connect blog in our example. How far to go with this is subjective: is ACRL actually the “site” here? It’s mostly up to how you want content to appear on Facebook, not according to some indisputable web ontology.
- og:type lets you categorize the type of content. The Open Graph has a list of types, which is limited to music, video, article, book, profile, and website at the time of this writing, but it also contains logic for defining your own types. The Open Graph standard notes that “[w]hen the community agrees on the schema for a type, it is added to the list of global types.” David Walsh’s example uses a type of “blog” which, as far as I can tell, is not one of the standard types.
There’s a lot more to the Open Graph Protocol on their official website. We’ve covered it before, too: “Real World Semantic Web? Facebook’s Open Graph Protocol” goes beyond <meta> tags and talks more about the concept of linked data. Facebook’s own Open Graph documentation goes even further, which might be particularly useful if you want to associate your web content with a Facebook app or page. In terms of additional fields, the og:video and og:audio properties associate further media with your web content. For the most part though, simply using og:image and a few other metadata items gives you far greater control over your content’s appearance without too much added complexity. Furthermore, Facebook prefers content with Open Graph metadata; since Facebook has to guess how to format sites that lack OG information, those get downranked in the newsfeed.
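A hedged sketch of those media properties (the URLs here are placeholders, not real files):

```html
<meta property="og:video" content="http://example.com/campus-tour.mp4">
<meta property="og:audio" content="http://example.com/lecture-recording.mp3">
```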
While the Open Graph is publicized as a generic way for an object to be represented in any social graph, in practice it’s just for Facebook. However, since the protocol is public and widely used, I wouldn’t be surprised to see more startup social networks piggyback off of Open Graph rather than roll their own meta tags like Twitter did. In fact, I found a StackOverflow answer which suggests that Pinterest and Google+ do exactly this, opportunistically using particular pieces of Open Graph metadata (specifically og:image). So you may get pretty good bang for your buck with the Open Graph as opposed to Twitter’s more idiomatic “cards” which seem to be far too focused on Twitter and particular types of content to be generally useful for other applications.
If you read the StackOverflow link above, you’ll note that Google+ actually prefers Schema.org metadata over og:image, but will fall back to og:image if that’s not available. Google’s “Rich Snippets” technology is very similar to these social networks. Essentially, you tag certain pieces of your page with metadata so that Google can optimize how you’re displayed in search results, precisely paralleling the way social services optimize your shared content’s presentation. Schema.org, microformats, and linked data at large are a huge topic worth several posts of their own, so I won’t go into them too much.
As I note below in the Value Proposal section, it’s worth considering where your audience is and which approach will yield the most return on investment. A simple litmus test: where are your website’s referrals coming from? Organic Google search? Then Schema.org and rich snippets makes sense. Facebook or Pinterest? Hello, Open Graph. Twitter? Time to card it up. In particular, you may detect patterns wherein certain types of content warrant customized approaches. It’s sensible that rich media content like images and videos would be shared heavily on social networks, but perhaps not inspire too much attention from Google’s text-based search engine. On the other hand, textual content may be precisely the opposite: there’s no flashy way to preview it on Twitter or Facebook, so don’t bother with enhancing that sharing vector, but consider how something like Schema.org could increase its exposure to search engines and other linked data applications.
One thing I feel I should note: <meta> tags, because they’re invisible and in the <head>, are pretty easy to neglect. Most content management systems don’t make it easy to alter <meta> tags, presumably because they think they know better than you or that it’s too niche for the majority of content authors. Remember when I mentioned that this blog doesn’t have a <meta name="description">? That’s because WordPress doesn’t make it easy to edit <meta> tags, particularly on a page-by-page basis. Neither does Drupal. Instead, these frameworks tend to configure a few helpful <meta> tags for you but don’t infer a description and certainly won’t fill in Open Graph details for you. Luckily, the advantage of these CMSs is their extensibility, and there are Open Graph and Twitter Cards extensions for WordPress. Hat-tip to Michael Schofield, who informed me that the WP SEO plugin does both. Drupal has a Metatag module which makes editing said tags easier, but doesn’t have anything specifically catering to Twitter or Facebook that I’ve found. One could edit Drupal’s node templates, however, inserting an og:title field on every page with PHP code like
<meta property="og:title" content="<?php print $node->title; ?>">.
Secondly, <meta> tags are a rather flawed solution because each web page has only one <head> and thus one set of <meta>data associated with it. Consider this blog again: when you visit the home page, the last few posts are presented. Each post has its own URL, images, and topic; even the authors are distinct. Yet we can’t embed a slew of <meta> tags in the body of each post; we only have one <head> to work with, so at best we could place a series of generic information from the blog at large. This is maybe a bit of a false problem; users sharing the blog home page probably don’t want social sites to tease content from any given post. But it seems problematic as the web becomes more and more modular. There are so many interfaces that present not one self-contained piece of content but collections; Meghan’s recent post on a tiled, Pinterest-like digital library display comes to mind. The blunt simplicity of <meta> tags is showing here. This is where more robust linked data technologies come in, since they don’t necessarily rely on a single HTML element but can use attributes of tags instead (e.g. Schema.org uses the presence of an itemscope attribute, on any tag, to determine where an object begins and ends in markup).
For my library, it just so happens there isn’t a lot of value in Twitter Cards. Twitter lets you search for URLs in tweets just like any other text; I put in my library’s fully qualified domain name and three tweets came up, two of which were from yours truly and the third was a link pointing to a syllabus PDF. People just aren’t sharing our sites very much and that’s rather predictable. We’re a small college without a huge social media presence and don’t have a unique digital collection. Picking which image appears when the library home page is shared that one time is an inefficient use of time.
How much you get out of this social metadata depends on how much your library’s web properties are shared on the web. For some, I imagine that’s a great deal and controlling how content appears could be very valuable. Do you share lots of unique digital collections through social media, or link them on pertinent Wikipedia pages? Are you actively engaged in a social media archiving or content creation project, like NCSU’s #HuntLibrary project on Instagram? Then investing time in optimizing how social networks understand your content is logical.
All too often, libraries are forced to start from scratch each time we roll out a new technology. To avoid this problem, librarians should consider middleware as their technology prototyping pipeline. Middleware is a software layer that delivers data in standard XML or JSON formats from a variety of sources. The sources could be distributed in multiple third-party databases, residing in HTML, or available as other XML-encoded data living on another internet domain. Middleware layers help integrate these data into alternatives that are more robust and flexible than turnkey products from vendors. A number of the turnkey services we rely on do not provide APIs, access to the underlying data, or the relevant data librarians need to offer the compelling, twenty-first-century services our students and faculty require. Middleware design is the process of architecting your own library API.
The right middleware design approach will empower libraries to move data between products and services and deliver it to users where they need it. In the example below, the library does not have access to an API of room reservation data, so a library’s data is held captive. With strategic application of middleware design, the library is free to take the data and reformat it in mobile or tablet-friendly format, greatly improving access and allowing new services for our users.
Problem: This stand-alone website allows a student to look up and book available group study rooms in the library. It is a self-service product. But what if we wanted to advertise and show only the currently available study rooms on our website as a data feed? What if we needed *only* currently bookable room data to port into a mobile app? As the website stands, we cannot readily access this data other than on a desktop computer.
Library and Information Science foundations can be utilized to extend library data into new interfaces and platforms. Half of digital librarianship is really just the description of data using metadata schemes (usually encoded in XML but, more recently, in JSON as well). Before I can really talk about establishing a prototyping pipeline, I need to unpack XML; then I’ll talk about how, building on these foundations, we can create the XML feeds we need to extend library data anywhere and everywhere.
A short primer on XML:
Many others have sung the virtues of XML. I personally like a good clean XML feed because, to me, it means extensibility. XML supports all kinds of system efficiency and data independence. Those are the features I like. Those who develop metadata standards and work with digital library development cite the following virtues of XML (from McDonough, 2008):
- XML helps ensure platform (and perhaps more critically vendor) independence;
- XML provides the multilingual character support critical to the handling of library materials;
- XML’s extensibility and modularity allow libraries to customize its application within their own operating environments;
- XML helps minimize software development costs by allowing libraries to leverage existing, open source development tools;
- XML, by virtue of being an open standard which enables descriptive markup, may assist in the long-term preservation of electronic materials; and perhaps most importantly
- XML provides a technological basis for interoperability of both content and metadata across library systems.
Now that you’re considering drinking the XML Kool-Aid, think about what might be possible if you could pull any library data you wanted (in the form of XML) into any other system — like an iPhone or iPad app. To do that you’d need a RESTful API. RESTful APIs are vitally important for extending library data across systems. You can create your own library API with a middleware layer.
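My actual stack is Jersey on Tomcat (described below), but the core idea — an HTTP endpoint that answers requests with XML — can be sketched with nothing but the JDK’s built-in com.sun.net.httpserver. This is a minimal stand-in, not the real service: the class name, the /rooms/available path, and the canned feed string are all hypothetical placeholders.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class MiniRoomApi {
    // Canned feed for illustration; a real middleware layer would
    // build this XML from live room-reservation data.
    static final String FEED =
        "<models><model><roomName>Collaboration Room 01</roomName></model></models>";

    // Start a tiny web service; port 0 asks the OS for any free port.
    public static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/rooms/available", exchange -> {
            byte[] body = FEED.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/xml");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws Exception {
        start(8080);
        System.out.println("Serving /rooms/available on port 8080");
    }
}
```

Any client — a mobile app, a widget on the library home page — can now GET that URL and consume the XML. Jersey does the same job with far less plumbing, using annotations instead of hand-written handlers.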
One stack I’ve learned in the past couple of months is Tomcat/Jersey, which allows you to pull in data from multiple sources and then serialize that data to XML. A recent XML feed developed this way is an “available now” feed of group rooms in the library that are available in the next hour and can be booked immediately. For this example I pull an additional Java package into the Jersey program — an HTML parser, jsoup.
Jersey is implemented on a Tomcat server. Tomcat can run a number of Java-based applications, but it is essentially a web server that we are running Jersey from. It also bears noting that Apache Tomcat is not the same as the Apache web server that many of us know and use for serving HTML pages.
In order to serialize a Java data object to XML, Jersey uses a standard MVC architecture, where our data model is the model, the new library web page/mobile app that we display is the view, and the Jersey resource file is the controller. In essence, the Jersey output is a set of three Java programs that comprise the MVC. The data that Jersey is pulling in from the website is HTML. Since the HTML needs to be parsed and then pulled into a data object, we use another Java library called jsoup. With the tutelage of the research programmers in the library, I followed this tutorial on creating a RESTful API with Jersey, which explains the annotations needed to create the web service — rather simple to implement once you have your development environment set up. For this project I worked entirely in Eclipse, since it can also run a Tomcat server on your local machine.
An example of the feed is below (abbreviated):
<models>
  <model>
    <date>1/27/2013</date>
    <endTime>5:00 PM</endTime>
    <roomName>Collaboration Room 01 - Undergraduate Library</roomName>
    <startTime>4:00 PM</startTime>
  </model>
  <model>
    <date>1/27/2013</date>
    <endTime>5:00 PM</endTime>
    <roomName>Collaboration Room 02 - Undergraduate Library</roomName>
    <startTime>4:00 PM</startTime>
  </model>
  ...
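In the Jersey stack that serialization happens automatically once the model class is annotated, but it can help to see the step in isolation. Below is a minimal sketch that builds the same kind of feed using only the JDK’s built-in DOM and Transformer APIs — this is not the actual project code; the Room class and its fields are stand-ins mirroring the feed above.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class RoomFeed {
    // Plain data holder standing in for the Jersey model class.
    static class Room {
        final String date, startTime, endTime, roomName;
        Room(String date, String startTime, String endTime, String roomName) {
            this.date = date; this.startTime = startTime;
            this.endTime = endTime; this.roomName = roomName;
        }
    }

    // Serialize a list of rooms to the <models><model>... feed shape.
    static String buildFeed(java.util.List<Room> rooms) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        Element models = doc.createElement("models");
        doc.appendChild(models);
        for (Room r : rooms) {
            Element model = doc.createElement("model");
            appendField(doc, model, "date", r.date);
            appendField(doc, model, "endTime", r.endTime);
            appendField(doc, model, "roomName", r.roomName);
            appendField(doc, model, "startTime", r.startTime);
            models.appendChild(model);
        }
        StringWriter out = new StringWriter();
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.INDENT, "yes"); // pretty-print
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    static void appendField(Document doc, Element parent, String name, String value) {
        Element e = doc.createElement(name);
        e.setTextContent(value);
        parent.appendChild(e);
    }

    public static void main(String[] args) throws Exception {
        java.util.List<Room> rooms = java.util.Arrays.asList(
            new Room("1/27/2013", "4:00 PM", "5:00 PM",
                     "Collaboration Room 01 - Undergraduate Library"));
        System.out.println(buildFeed(rooms));
    }
}
```

With Jersey and JAXB you get this for free by annotating the model class and returning it from a resource method; the sketch just shows what is happening under the hood.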
Once you have that feed modeled and serving XML data, you are able to pull it into a new system/interface. Using Apple’s Dashcode, I was able to model a prototype of what the room reservation feed might look like in an iPhone app:
Middleware is the digital library prototyping pipeline. It is a profound tool in the digital library toolkit, since it is the foundation for new services, initiatives, and the extension of library data. There are some areas that I breezed over in this post, like how to program the HTML screen scraping necessary to pull data from a web page into a Java data object. I’ll cover screen scraping with Jersey and jsoup in my next post. I’ll also submit that having access to the underlying database that powered the room reservation website would have been preferable — we could have imported a Java package that acts as a database connector from Jersey to the XML serializer — but alas, as often happens in the wild, we could not get direct database access to the underlying data in the page. One final thought on the approach used here: the software consists of open source Java tools, so they are free to download and utilize for your rapid library prototyping needs.
McDonough, Jerome. “Structural Metadata and the Social Limitation of Interoperability: A Sociotechnical View of XML and Digital Library Standards Development.” Presented at Balisage: The Markup Conference 2008, Montréal, Canada, August 12 – 15, 2008. In Proceedings of Balisage: The Markup Conference 2008. Balisage Series on Markup Technologies, vol. 1 (2008). doi:10.4242/BalisageVol1.McDonough01.
Two years after the initial meeting for the Digital Public Library of America, another major planning and update meeting took place in Chicago at DPLA Midwest. At this meeting the steering committee handed the project over to the inaugural board and everyone who has been working on the project talked about what had happened over the past few years and the ambitious timetable to launch in April 2013.
In August I wrote about the DPLA and had many unanswered questions. Luckily I had the opportunity to attend the meeting and participate heavily in the backchannel (both virtual and physical). This post is a report of what happened at the general meeting (I was not able to attend the workstream meetings the day before). It is a follow-up to my last post about the Digital Public Library of America — then I felt like an observer, but the great thing about this project is how easy it is to become a participant.
Looking Back and Ahead
The day started with a welcome from John Palfrey, who reported that through the livestream and mailing lists there were over a thousand active participants in the process. Two years ago the project seemed to him (and still does) “completely ambitious and almost crazy,” but it is actually working out. He emphasized that everything is still “wet clay” and a participatory process, but everything is headed toward April 2013 for the public launch, with an initial version of the service and a fair amount of content being available. We will come back a bit later to exactly what that content is and from what sources it will come.
In this welcome, Palfrey introduced several themes that the day revolved around–that the project is still moldable despite the structure that seems to be there (the “wet clay”), and that it is still completely participatory even though the project will recruit an Executive Director and has a new board. One of the roles of the board will be to ensure that participation remains broad. The credentials of the board and the steering committee are impressive; but they cannot get the project going without a lot of additional support, both financial and otherwise.
The rest of the day was organized to talk about supporting the DPLA, reporting on several of the “hubs” that will make up the first part of the content available, the inaugural board, and the technical and platform components of the DPLA. The complete day, including tweets and photos, was captured in a live blog. While much of interest took place that day, I want to focus on the content and the technical implementation as described during the day.
Content: What will be in the DPLA?
Emily Gore started in September of this year as the Director of Content, and has been working since then to get the plans in motion for the initial content in the DPLA. She has been working with seven existing state or regional digital libraries as so-called “Service Hubs” and “Content Hubs” to begin aggregating metadata that will be harvested for the DPLA and get people to the content. The April 2013 launch will feature exhibits showcasing some of this content — topics include civil rights, prohibition, Native Americans, and a joint presentation with Europeana about immigration.
The idea of these “hubs” is that there are already many large digital libraries with material, staff, and expertise available–as Gore put it, we all have our metadata geeks already who love massaging metadata to make it work together. Dan Cohen (director of the Roy Rosenzweig Center for History and New Media at George Mason University) gave the analogy in his blog of the local institutions having ponds of content, which then are fed into the lake of the service hubs, and then finally into the ocean of the DPLA. The service hubs will offer a full menu of standardized digital services to local institutions, including digitization, metadata consultation, data aggregation, storage services, community outreach, and exhibit building. These collaborations are crucial for several reasons. First, they mean that great content that is already available will finally be widely accessible to the country at large–it’s on the web, but often not findable or portable. Regional content hubs will be able to work with their regions more effectively than any central DPLA staff, which simply will not have the staff to deal with one-to-one relationships with all the potential institutions who have content. The pilot service hubs are Mountain West, Massachusetts, Digital Library of Georgia, Kentucky, Minnesota, Oregon, and South Carolina. The digital hubs project has a two year timeline and $1 million in funding, but for next April they will prepare metadata and content previews for harvest, harvest existing metadata to make it available for launch, and develop exhibitions. After that, the project will move on to new digitization and metadata, aggregation, new services, new partners, and targeted community engagement.
Representatives from two of the service hubs spoke about the projects and collections, which was the best view into what types of content we can expect to see next April. Mary Molinaro from Kentucky gave a presentation called “Kentucky Digital Library: More than just tobacco, bourbon, and horse racing.” She described their earliest digitization efforts as “very boutique–every pixel was perfect”, but it wasn’t cost effective or scalable. They then moved on to a system of mass digitization through automating everything they could and tweaking workflows for volume. Their developers met with developers from Florida and ended up using DAITSS and Blacklight to manage the repository. They are now at the point where they were able to scan 300,000 pages in the last year, and are reaching out to other libraries and archives around the state to offer them “the on-ramp to the DPLA”. She also highlighted what they are doing with oral history search and transcription with the Oral History Metadata Synchronizer and showed some historical newspapers.
Jim Butler from the Minnesota Digital Library spoke about the content in that collection from an educational and outreach point of view. They do a lot of outreach to local historical societies, libraries, and other cultural organizations to find out what collections they have and digitize them, which is the model that all the service hubs will follow. One of the important projects that he highlighted was an effort to create curricular guides to facilitate educator use of the material — the example he showed was A Foot in Two Worlds: Indian Boarding Schools in Minnesota, which has modules to be used in K-12 education. He showed many other examples of material that would be available through the DPLA, including Native American history and cultural materials and images of small town life in 19th and 20th century Minnesota. Their next steps are to work on state/region-wide digital library metadata aggregation, major new digitization efforts, and community-sourced digital documentation, particularly in terms of Somali and Hmong communities’ self-documentation.
Follow-up comments during the question portion of these presentations emphasized that the goal of having big pockets of content is to work with those smaller pockets of content. This is a pilot business model test case to see how aggregating all these types of content together actually works. It is important to remember that for now, the DPLA is not ingesting any content, only metadata. All the content will remain in the repositories at each content hub.
An additional component is that all the metadata in the DPLA will be licensed with a CC0 (public domain) license only. This will set the tone that the DPLA is for sharing and reusing metadata and content. It is owned by everyone. This generated some discussion over lunch and via Twitter about what that actually would mean for libraries and if it would cause tension to release material under a public domain license that for-profit entities could repackage and sell back to libraries and schools. Most people that I spoke to felt this was a risk worth taking. Of course, future content in the DPLA will be there under whatever copyright or license terms the rightsholder allows. Presumably most if not all of it will be material in the public domain, but it was suggested, for instance, that authors could bequeath their copyrights to the DPLA or set up a public domain license through something like unglue.it. Either way, libraries and educators should share all the materials they create around DPLA content, and by doing so will mean less duplicate effort.
Technology: How will the DPLA work?
Jeff Licht, a member of the technical development advisory board, spoke about the technical side of the DPLA. The architecture for the system (PDF overview) will have at its core a metadata repository aggregated from the various sources described above. An ingester will bring in the metadata in usable form from the service hubs, which will have already cleaned up the data, and then an API will expose the content and allow access to front ends or apps. There will also be functions to export the metadata for analysis that cannot easily be done through the API. The metadata schema (PDF) types that they collect will be item, collection, contributor, and event.
One of the important points that raised a lot of discussion was that while they have contracted with iFactory to have a front end available by April, this front end doesn’t have more priority or access to the API than something developed by someone else. In fact, while someone could go to dp.la to access content, the planners right now see the DPLA “brand” as sublimated to other points of access such as local public libraries or apps using the content. Again, the CC0 license makes this possible.
The initial front end prototype is due for December, and the new API is due in early November for the Appfest (see below for details). There will be an iterative process between the API and front end between December and March before the April launch, with of course lots of other technical details to sort out. One of the things they need to work on is a good method for sharing contributed modules and code, which hopefully will be done in the next few weeks.
Anyone can participate in this process. You can follow the Dev Portal on the DPLA wiki and the Technical Aspects workstream to participate in decision making. Attending the Appfest hackathon at the Chattanooga Public Library on November 8 and 9 will be a great way to spend time with a group creating an application that will use the metadata available from the hubs (the new API will be completed before the Appfest). This is the time to ask questions and make sure that nothing is being overlooked.
Conclusion: Looking ahead to April 2013
John Palfrey closed the day by reminding everyone that April is just the start, and not to be disappointed with what they see then. If April delivers everything promised during the DPLA Midwest meeting, then it will be a remarkable achievement — but as Doron Weber from the Sloan Foundation pointed out, the DPLA has so far met every one of its milestones on time and on budget.
I found the meeting to be inspirational about the future for libraries to cross boundaries and build exciting new collections. I still have many unanswered questions, but, as everyone throughout the day made clear, this will be a platform on which we can build and imagine.