A Novice’s Intro to XSLT
Posted: June 14, 2016 | Author: Eric Phetteplace | Filed under: coding, digital libraries, metadata
I have a confession: XSLT (eXtensible Stylesheet Language Transformations) is one of those library technologies I’ve dreaded learning. While it’s come up several times in my career, I’ve always managed to avoid it. I’m not normally like this—I truly enjoy learning new skills, especially programming languages or coding tools, so much so that I’ll find myself diving into tutorials or books right after returning home from work.
Why is XSLT so abhorrent? I have more opinions about programming languages than I do expertise. Here are some opinions: programming languages should be readable. If they have shorthands, they should be elegant & intuitive. They shouldn’t need much boilerplate. And they should be recognizable to a degree; for better or worse, many programming languages share one of a very small number of lineages which means that recognizing fundamental constructs like functions & arrays isn’t too difficult if you can identify the family a language belongs to.
XSLT’s family is XML. An XSLT script is, in fact, valid XML. And XML is not a programming language, it’s a markup language that’s used for virtually everything. While I understand XML’s ubiquity, I’m not a fan of it in general & I think other serializations make more sense in many situations. As an example, take a list of elements in XML:
<array> <element>One</element> <element>Two</element> <element>Three</element> </array>
versus my favored format, JSON:
"array": [ "one", "two", "three" ]
versus the even cleaner YAML:
Background: IR OAI OMG
Alright, now that you’ve read entirely too much about What Eric Thinks of XSLT, why was he forced to learn it? And why did he start speaking in the third person all of a sudden? He should switch back to first person.
My institution wants to start exposing our digital collections more. While the human-facing web presence of our institutional repository is growing, we also need to start publishing our metadata in a machine-readable format. This will allow our collections to be consumed by large aggregators, specifically Calisphere, Worldcat, & DPLA.1 Luckily, libraries already have a well-established standard to turn to for these needs, OAI-PMH. OAI-PMH lets us expose our repository metadata in XML in a way that allows a harvesting application to periodically fetch batches of records, adding new ones & updating ones that have changed.
Right away, there are challenges to exposing our EQUELLA repository’s metadata. We use MODS in our repository, while OAI expects you to use Dublin Core. Luckily, these are two common formats & there’s a lot of information on how to map information between them.2 Unfortunately, our MODS schema is heavily modified. Decisions were made to add or alter elements based on local needs. What’s worse, in order to make certain user-facing fields easier to use, our repository software makes us insert wrapper elements into our metadata schema. All of this combines to make our MODS-to-DC mapping utterly unique & more complicated than usual.
When I went to inspect our repository’s OAI implementation out of the box, I was greeted by records like this one:
<record>
  <metadata>
    <oai_dc:dc
        xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
      <dc:title>Expectations</dc:title>
      <dc:description>Explored the experience of pregnant and parenting teenagers via an installation and a symposium, "Cribs, Classrooms, and Communities: The Teen Pregnancy Controversy."</dc:description>
    </oai_dc:dc>
  </metadata>
</record>
The default settings were showing only a Dublin Core title & description for each item, which was certainly not bad for no configuration at all, but not desirable. We heavily invested in designing our repository’s upload form & cataloging. We want that work to shine through in the metadata we provide to external services. Now, after applying some very basic XSLT which I’ll cover in the remainder of this post, our OAI endpoint looks quite a bit better. There are numerous fields represented for each record, though our original records are still a bit richer in information.
A Metadata Transformation Language
Let’s talk first about what XSLT is before we talk about how to use it. XSLT is unique in that it’s specifically designed to transform XML documents. This doesn’t necessarily mean it was designed for mapping from one metadata schema to another, as XML is used for more than just metadata & XSLT can do more than just shuffle around values, but it does mean that the language is uniquely suited to that task. XSLT allows us to change certain elements in the original document, alter text, & add or drop pieces of information.
In this post, we’ll specifically look at converting the Library of Congress’ MODS schema to Dublin Core. LOC has provided a handy map between the two which illustrates the complexity of the task. A few things that we need to address:
- All data is going to need a new field; there isn’t really a way to simply “leave” an element in its place
- MODS is hierarchical, meaning some elements have child elements, while Dublin Core is flat with only a top level bearing no children
- Sometimes multiple MODS elements will collapse into a single Dublin Core field, e.g. subject & classification both file under DC’s Subject
- Inversely, a MODS name/namePart element might map to either the DC Creator or Contributor fields depending upon the role of the person being referred to (captured in the role/roleTerm child of the name element)
- Both schemas have their own slightly differing vocabulary of resource types, stored in MODS’ typeOfResource & DC’s Type, so what one schema considers a “sound recording-nonmusical” the other considers merely “Sound”, for instance
I see mapping between metadata schemas as a subset of data wrangling, & data wrangling is one of the most common aspects of my Systems Librarian position: I take CSV reports from our student data system & map them into a weird MARC-like format for the Millennium ILS to ingest, I take course lists & turn them into a controlled vocabulary of sorts in our repository, & of course I convert our repository’s MODS into OAI DC. All of these procedures are duct-taped together with custom scripts filled with comments detailing bizarre data behaviors. XSLT is one of the few languages designed for this type of work. While there are software packages like OpenRefine or Stanford’s Data Wrangler, sometimes the power & flexibility of a programming language is preferable. It’s disappointing to me that the only prominent choice is also XML-based.
To try out the examples below, you’ll need an XML document to use as input (preferably MODS, as that’s what the examples target) & an XSLT processor. On Mac OS X, the built-in program xsltproc does the trick. You can run a document through a transformation using the following syntax on the command line:
> xsltproc --output output.xml mapping.xsl input.xml
For Linux, you can install xsltproc as part of the libxslt package. Windows users can install the Saxon XSLT processor. Most web browsers support XSLT processing, too. If you add a line like <?xml-stylesheet type='text/xsl' href='name-of-stylesheet.xsl'?> up at the top of an XML file, the browser will automatically run the XML through your stylesheet & present you with the results.
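As a minimal sketch of the browser approach, an input file might look like the one below. The filename mapping.xsl is borrowed from the command above, & the record content is purely illustrative rather than a real repository record.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- The stylesheet processing instruction sits in the prolog,
     after the XML declaration & before the root element -->
<?xml-stylesheet type="text/xsl" href="mapping.xsl"?>
<mods>
  <identifier>10881088</identifier>
  <titleInfo>
    <title>Expectations</title>
  </titleInfo>
</mods>
```

Some browsers may refuse to load a stylesheet straight from the local filesystem, so you might need to serve both files from a local web server to see the result.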
Pretty much all XSLT starts with some boilerplate that looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="@*|node()">
    <oai_dc:dc
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
      ...
    </oai_dc:dc>
  </xsl:template>
</xsl:stylesheet>
You start an XML document, open up an xsl:stylesheet element with a certain namespace, & define a template inside which you’re going to apply to the incoming document. The match attribute can differ depending on your objective (here we’re selecting all data in the input document, attributes & nodes), but that’s the basic setup. You then declare the root element of your output document inside the template, which here is the <oai_dc:dc> tag. What matters is what goes on inside the template, represented in the example with an ellipsis, since that gives us the ability to map elements to new locations.
Confession: I haven’t taken the time to fully learn the usage of the xsl:template element. You can use several of them within an xsl:stylesheet, they can target different elements in the origin document with their match attributes, & you can invoke them later with xsl:apply-templates. This lets you modularize your stylesheet into several smaller, focused templates. But I won’t discuss them further, in the hopes that showing other features provides enough detail to communicate the substance of XSLT.
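For the curious, here is a minimal sketch of what that modular style might look like. It is not taken from the stylesheet described in this post. It assumes an input whose root element is mods with no namespace, & it leaves out the xsi:schemaLocation attribute for brevity.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">

  <!-- Entry point: build the wrapper once, then hand selected
       descendants of mods to whichever templates match them -->
  <xsl:template match="/mods">
    <oai_dc:dc>
      <xsl:apply-templates select="identifier | titleInfo/title" />
    </oai_dc:dc>
  </xsl:template>

  <!-- One small template per source element -->
  <xsl:template match="identifier">
    <dc:identifier><xsl:value-of select="." /></dc:identifier>
  </xsl:template>

  <xsl:template match="titleInfo/title">
    <dc:title><xsl:value-of select="." /></dc:title>
  </xsl:template>

</xsl:stylesheet>
```

Because a template only fires when a matching element is selected, a record with no title element at all simply produces no dc:title in the output.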
Let’s look at another example, which simply copies the MODS’ identifier element to Dublin Core’s dc:identifier, skipping everything else in the input document. Going forward, I will leave off the XML prolog <?xml …?> & all of the wrapping oai_dc:dc elements for brevity’s sake.
<dc:identifier> <xsl:value-of select="mods/identifier" /> </dc:identifier>
For a MODS document with identifier “10881088”, the full output of this stylesheet is:
<?xml version="1.0"?>
<oai_dc:dc
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
  <dc:identifier>10881088</dc:identifier>
</oai_dc:dc>
We can see that the root element of the document has been transformed into oai_dc:dc & our identifier is present inside a dc:identifier element. So two things from our template above happened: first of all, anything that’s not an xsl: prefixed element is mostly output on the other end in exactly the same form. This goes for text as well as markup language tags, such as our oai_dc:dc opening & closing tags which are recreated exactly. Inside dc:identifier, something else happens: the xsl:value-of element selects an element from the origin document & replaces itself with that value. The key piece is the select attribute, which accepts an XPath query it uses on the input.
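Since the wrapper is being left off for brevity, here is a sketch of how the identifier fragment might be assembled into one complete stylesheet. One assumption to flag: this version matches the document root (“/”) instead of the @*|node() pattern from the boilerplate, so that the relative path mods/identifier resolves against an input whose root element is mods. If your input wraps mods in another element, the path would need adjusting.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: match the document root, so "mods/identifier"
       is evaluated from the very top of the input document -->
  <xsl:template match="/">
    <oai_dc:dc
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
      <!-- Literal elements are copied to the output as written;
           xsl:value-of is replaced by the selected text -->
      <dc:identifier>
        <xsl:value-of select="mods/identifier" />
      </dc:identifier>
    </oai_dc:dc>
  </xsl:template>

</xsl:stylesheet>
```

Saved as mapping.xsl, it can be run with the xsltproc command shown earlier.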
XPath is another entire language you need to know to use XSLT to transform XML documents. Luckily, if you’ve worked much with XML at all, you are probably familiar with its basics. XPath lets us query a document by providing hierarchical paths such that “mods/identifier” matches an “identifier” element which is the child of a “mods” element. The XPath queries in this post won’t be more sophisticated than that.
Let’s do the same thing but map several elements to different places. The example below is starting to resemble a more fully-fledged stylesheet which could actually be of use.
<dc:identifier>
  <xsl:value-of select="mods/identifier" />
</dc:identifier>
<dc:title>
  <xsl:value-of select="mods/titleInfo/title" />
</dc:title>
<dc:description>
  <xsl:value-of select="mods/abstract" />
</dc:description>
We’ve now mapped the identifier, abstract, & title from MODS into their appropriate Dublin Core locations. Yay! While this may seem straightforward, it is worth noting we took the MODS’ titleInfo/title element from out of its parent titleInfo element & placed it in the top-level of Dublin Core’s flat schema, handling one of the mapping complexities we’d noted earlier.
However, what happens when we pass a document lacking a title or abstract through our stylesheet? Say we start with input.xml:
<mods> <identifier>10881088</identifier> </mods>
On the command line, we run:
> xsltproc --output output.xml transform.xsl input.xml
We get the resulting output.xml:
<oai_dc:dc>
  <dc:identifier>10881088</dc:identifier>
  <dc:title></dc:title>
  <dc:description></dc:description>
</oai_dc:dc>
Our output now contains empty dc:title & dc:description elements. Even though the xsl:value-of element didn’t find any content & thus returned an empty string, we were still telling our transformation to produce dc:title & dc:description fields. One way to work around this is using conditional if statements to first check if a field has any text in it before returning its corresponding element in the output schema. If statements in general are extremely useful during transformations, so let’s take a look at some.
<dc:identifier>
  <xsl:value-of select="mods/identifier" />
</dc:identifier>
<xsl:if test="mods/titleInfo/title != ''">
  <dc:title>
    <xsl:value-of select="mods/titleInfo/title" />
  </dc:title>
</xsl:if>
This transformation does our standard mods/identifier->dc:identifier mapping (hopefully every document has an identifier…), but it uses the xsl:if element to first test if mods/titleInfo/title has any content before outputting a dc:title. The xsl:if syntax is <xsl:if test="condition"> where the “condition” can be one of many standard comparisons that return a “true” or “false” value: equal to “=”, not equal to “!=”, greater than “&gt;”, & less than “&lt;”. Note that, since we’re working with XML, a literal “<” is not allowed inside an attribute value, so it must be written in its XML character entity form; “>” is technically permitted there but is conventionally escaped as well.
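To show an entity-escaped comparison in context, here is a sketch that applies the same guard to the abstract. It is a variation, not the test used in my stylesheet: it measures the text after normalize-space() trims it, so an abstract containing only whitespace is also skipped. It assumes a root mods element with no namespace & omits xsi:schemaLocation for brevity.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">

  <xsl:template match="/">
    <oai_dc:dc>
      <!-- "greater than" is written as the entity &gt; inside the attribute;
           a missing or whitespace-only abstract has a trimmed length of 0 -->
      <xsl:if test="string-length(normalize-space(mods/abstract)) &gt; 0">
        <dc:description>
          <xsl:value-of select="mods/abstract" />
        </dc:description>
      </xsl:if>
    </oai_dc:dc>
  </xsl:template>

</xsl:stylesheet>
```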
Hyper Advanced (not really)
We can get a long way with just xsl:if & our innate verve, but let’s learn a few more useful XSLT constructs. Remember earlier when we noticed that, depending on the person’s role, the “namePart” value in MODS might be mapped to one of two DC elements, Creator or Contributor? We can’t actually handle that situation with what we know, because we’ve only mapped singular & not repeating elements. In XSLT 1.0, xsl:value-of only outputs the first element its “select” attribute matches in the origin MODS document. We need some kind of loop that lets us iterate over repeated elements while also testing their values.
Below, we use xsl:for-each to loop over a selected element, & then we apply two tests to each element to determine if it’s referring to an author or an editor.
<xsl:for-each select="mods/name">
  <xsl:if test="contains(role/roleTerm, 'author')">
    <dc:creator>
      <xsl:value-of select="namePart" />
    </dc:creator>
  </xsl:if>
  <xsl:if test="contains(role/roleTerm, 'editor')">
    <dc:contributor>
      <xsl:value-of select="namePart" />
    </dc:contributor>
  </xsl:if>
</xsl:for-each>
xsl:for-each in combination with xsl:if accomplished one of the more onerous data mapping tasks we’re faced with in a rather elegant manner. Inside our test attributes we also see something new: we’re using a contains() function. XSLT has many functions which can be used inside certain attributes. The contains() syntax is straightforward: it accepts two parameters, an XPath & a string of text, returning “true” if the text is found in the value of the XPath or “false” otherwise.
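One caveat worth sketching: contains() does a substring match, & when handed several roleTerm elements it only inspects the first one. If the role terms in your data are exact values, comparing with “=” is a possible alternative, since that comparison is true when any roleTerm under the name equals the string. The version below assumes exact, lowercase role values & a root mods element with no namespace, which may not hold for every repository.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">

  <xsl:template match="/">
    <oai_dc:dc>
      <xsl:for-each select="mods/name">
        <!-- "=" against a set of nodes is true if ANY roleTerm equals
             the string, so names with several roles are still caught -->
        <xsl:if test="role/roleTerm = 'author'">
          <dc:creator><xsl:value-of select="namePart" /></dc:creator>
        </xsl:if>
        <xsl:if test="role/roleTerm = 'editor'">
          <dc:contributor><xsl:value-of select="namePart" /></dc:contributor>
        </xsl:if>
      </xsl:for-each>
    </oai_dc:dc>
  </xsl:template>

</xsl:stylesheet>
```

The trade-off is that exact matching will miss free-text values such as “primary author”, which the contains() version would catch.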
Alright, now let’s tackle something truly intimidating: how could we possibly handle mapping the two “resource type” vocabularies to one another? We can see now that a very long series of xsl:if statements inside an xsl:for-each loop should get the job done. But there’s a slightly nicer method available in xsl:choose:
<xsl:for-each select="mods/typeOfResource">
  <xsl:variable name="text" select="text()" />
  <xsl:choose>
    <xsl:when test="$text = 'cartographic material'">
      <dc:type>Image</dc:type>
    </xsl:when>
    <xsl:when test="$text = 'software'">
      <dc:type>Software</dc:type>
    </xsl:when>
    <!-- covers 3 values: sound recording, sound recording-musical, sound recording-nonmusical -->
    <xsl:when test="starts-with($text, 'sound recording')">
      <dc:type>Sound</dc:type>
    </xsl:when>
    <xsl:when test="$text = 'three dimensional object'">
      <dc:type>Image</dc:type>
    </xsl:when>
  </xsl:choose>
</xsl:for-each>
There are a few little tricks in here, but the general frame should be apparent: we loop over all the typeOfResource elements and, when they match one of the tests in our xsl:when element, the text or XSLT commands inside the when block are produced. This lets us set up a nice crosswalk between the MODS & DC type vocabularies. It’s also a bit faster than a series of “if” statements, since the xsl:choose block will exit as soon as a test returns true, while all if statements would execute even if only the very first one was necessary. Finally, it’s also possible to provide an xsl:otherwise block at the end of a series of xsl:when statements which is used as a fallback: if none of the “when” tests were true, the fallback is output. If there were overarching “base type” or “unknown type” values in Dublin Core then this would be useful.
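As a rough sketch of the fallback, the stylesheet below adds an xsl:otherwise that passes any unmapped MODS value through unchanged. That pass-through behavior is a hypothetical policy chosen for illustration, not what my stylesheet does; whether unmapped values belong in the output is a local decision. It assumes a root mods element with no namespace & trims the crosswalk to two cases.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">

  <xsl:template match="/">
    <oai_dc:dc>
      <xsl:for-each select="mods/typeOfResource">
        <xsl:variable name="text" select="text()" />
        <xsl:choose>
          <xsl:when test="$text = 'software'">
            <dc:type>Software</dc:type>
          </xsl:when>
          <xsl:when test="starts-with($text, 'sound recording')">
            <dc:type>Sound</dc:type>
          </xsl:when>
          <!-- Hypothetical fallback: runs only when no xsl:when matched,
               & copies the original MODS value through as-is -->
          <xsl:otherwise>
            <dc:type><xsl:value-of select="$text" /></dc:type>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each>
    </oai_dc:dc>
  </xsl:template>

</xsl:stylesheet>
```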
The two other new pieces we saw in the above XSLT snippet: we used xsl:variable to store the text value of the typeOfResource node, obtained with the text() node test, & then referred to it later by attaching a dollar sign sigil to the variable’s name, much like BASH, Perl, & PHP do with their variables. Looking up the value of the node only once & then using the stand-in variable is another minor speed optimization. We also used the XSLT function starts-with() to test three MODS type values at once. How did I know that function exists? I Googled it, like a professional software developer. The W3Schools reference on functions is a thorough overview. In general, if you find yourself thinking “I bet there’s a handy shortcut function that would make this less painful…” then you should search for one.
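As an example of that kind of search paying off: XSLT 1.0 has no lower-case function, but translate() & normalize-space() can be combined to make a test tolerant of capitalization & stray whitespace. The sketch below assumes type values that might arrive as “Sound Recording” or with padding around them; the variable names upper, lower, & text are illustrative, & the input is again assumed to be a root mods element with no namespace.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">

  <!-- Character tables used by translate() to lowercase ASCII letters -->
  <xsl:variable name="upper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />
  <xsl:variable name="lower" select="'abcdefghijklmnopqrstuvwxyz'" />

  <xsl:template match="/">
    <oai_dc:dc>
      <xsl:for-each select="mods/typeOfResource">
        <!-- Trim & collapse whitespace, then map each uppercase
             letter to its lowercase counterpart -->
        <xsl:variable name="text"
            select="translate(normalize-space(.), $upper, $lower)" />
        <xsl:if test="starts-with($text, 'sound recording')">
          <dc:type>Sound</dc:type>
        </xsl:if>
      </xsl:for-each>
    </oai_dc:dc>
  </xsl:template>

</xsl:stylesheet>
```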
After much stumbling, followed by trial & error, followed by more stumbling, our OAI endpoint is looking much better thanks to my XSLT stylesheet. We’re publishing format, creator, contributor, type, & rights information in Dublin Core. I’m certain that my examples here, & the code in my final script, don’t follow XSLT best practices. I’m quite shaky on some of the fundamental mechanics of XSLT, like xsl:template.3 Nonetheless, I hope this post gives a broad overview of the technology & its application. Go forth & transform!
1. Incidentally, I’m also trying to get our collections better indexed by search engines like Google & Google Scholar using Schema.org & other metadata embedded in HTML. That’s a very different beast though & not strongly related to the XSLT work that I am discussing here.
2. I referred to the Library of Congress’ guide on the MODS site when constructing our crosswalk.
3. I suspect my stylesheets would need far fewer xsl:if statements if I used templates effectively & not just as a boilerplate wrapper element.