Removing the Truthiness from Google

A decade ago, Stephen Colbert introduced the concept of “truthiness”: a fact that is so because it feels right “from the gut.” When we search for information online, we are always up against the risk that the creator of a page is someone who, like Stephen Colbert’s character, doesn’t trust books because “they’re all fact, no heart.”1 Since sites with questionable or outright false facts that “feel right” often end up at the top of Google search results, librarians teach students how to evaluate online sources for accuracy, relevancy, and so on rather than just trusting the top result. But what if there were a way to ensure that truthiness was removed, and only sites with true information appeared at the top of the results?

This idea is what underlies a new Google algorithm called Knowledge-Based Trust (KBT).2 Google’s original founding principles and the PageRank algorithm were based on academic citation practices: loosely summarized, pages linked to by many other pages are more likely to be useful than those with fewer links. The content of the page, while it needs to match the search query, is less crucial to its ranking than outside factors, an approach known as an exogenous model. KBT, by contrast, is an endogenous model that relies on the actual content of the page. Ranking is based on the probability that the page is accurate, and therefore more trustworthy. This is designed to address the problem of sites with high PageRank scores that aren’t accurate, either because their truthiness quotient is high or because they have gamed the system by scraping content and applying misleading SEO. On the other side, pages with great information that aren’t very popular may be buried.

“Wait a second,” you are now asking yourself, “Google now determines what is true?” The answer is: sort of, but of course it’s not as simple as that. Let’s look at the paper in detail, and then come back to the philosophical questions.

Digging Into the KBT

First, this paper is technical, but the basic idea is fairly straightforward. The model extracts facts from a web source, evaluates whether those facts are true, and then evaluates whether the source as a whole is accurate, refining both determinations in an iterative process. Of course, verifying those determinations is essential to ensuring that the algorithms are working correctly, and the paper describes ways of checking the extracted facts for accuracy.

The extractors are described more fully in an earlier version of this work, Knowledge Vault (KV), which was designed to fill in large-scale knowledge bases such as Freebase by extracting facts from web sources using techniques like natural language processing of text followed by machine learning, parsing of HTML DOM trees and HTML tables, and reading human-annotated pages with schema.org metadata. The extractors themselves can perform poorly in creating these triples, however, and such extraction errors are more common than the facts themselves being wrong, so sites may be unfairly flagged as inaccurate. The KBT project aims to introduce an algorithm for determining which type of error is present, to judge sites with many or with few facts accurately, and to test these assumptions against known facts using real-world data.

The specific example given in the paper is the birthplace of President Barack Obama. The extractor would determine a subject, predicate, object triple from a web source and match these strings to a knowledge base such as Freebase. This can lead to a number of errors: computationally determining the truth is hugely difficult even when the semantics are straightforward (which, as we all know, they rarely are). For this example, it’s possible to check data pulled from the web against the known value in Freebase, recording a 1 if the extractor pulled the true value and a 0 if it did not. These results can then be charted in a two-dimensional or three-dimensional matrix that helps show the probability of a given extractor working, as well as whether the value pulled by the extractor was true or not.
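
As a rough sketch of the data structures involved (the property names and the knownValues lookup are my own stand-ins, not the paper’s), an extracted triple and its 1-or-0 correctness check might look like this:

// An extracted (subject, predicate, object) triple plus its source.
var extracted = {
    subject: 'Barack Obama',
    predicate: 'born_in',
    object: 'Honolulu',
    source: 'example.com/obama-bio'
};

// Stand-in for looking up the known value in Freebase.
var knownValues = { 'Barack Obama|born_in': 'Honolulu' };

var extractorCorrect =
    knownValues[extracted.subject + '|' + extracted.predicate] === extracted.object ? 1 : 0;
console.log(extractorCorrect); // 1 if the extractor pulled the true value, 0 if not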

The authors go on to examine two models for computing the data, single-layer and multi-layer. The single-layer model, which looks at each web source and its facts separately, is easier to work with using standard techniques but is limited because it doesn’t take extraction errors into account. The multi-layer model is more complex to analyze, but takes the extraction errors into account along with the truth errors. I am not qualified to comment on the algorithm’s math in detail, but essentially it computes the probability of accuracy for each variable in turn, ultimately arriving at an equation that estimates how accurate a source is, weighted by the likelihood that the source contains those facts. There are additional considerations for precision and recall, as well as for the confidence levels returned by extractors.
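
To make the iterative idea concrete, here is a toy JavaScript sketch. It is emphatically not Google’s actual math (which models extraction errors separately and operates at enormous scale); it only illustrates the single-layer-style loop of estimating source accuracy from agreement with currently believed values, and vice versa. The sources and claims are invented.

// Toy illustration of iterative trust estimation (not the paper's actual model).
// Each claim is a (source, fact, value) assertion; everything here is invented.
var claims = [
    { source: 'site-a.example', fact: 'BarackObama:birthplace', value: 'Honolulu' },
    { source: 'site-b.example', fact: 'BarackObama:birthplace', value: 'Honolulu' },
    { source: 'site-c.example', fact: 'BarackObama:birthplace', value: 'Kenya' }
];

// Start every source with the same prior accuracy.
var accuracy = {};
claims.forEach(function (c) { accuracy[c.source] = 0.8; });

for (var i = 0; i < 10; i++) {
    // 1. Score each candidate value by the accuracy of the sources asserting it.
    var scores = {};
    claims.forEach(function (c) {
        var key = c.fact + '=' + c.value;
        scores[key] = (scores[key] || 0) + accuracy[c.source];
    });

    // 2. Take the best-supported value as the currently believed truth for each fact.
    var believed = {};
    Object.keys(scores).forEach(function (key) {
        var fact = key.split('=')[0];
        if (!(fact in believed) || scores[key] > scores[believed[fact]]) {
            believed[fact] = key;
        }
    });

    // 3. Re-estimate each source's accuracy as the share of its claims
    //    that agree with the believed values.
    var agree = {}, total = {};
    claims.forEach(function (c) {
        total[c.source] = (total[c.source] || 0) + 1;
        if (believed[c.fact] === c.fact + '=' + c.value) {
            agree[c.source] = (agree[c.source] || 0) + 1;
        }
    });
    Object.keys(total).forEach(function (s) {
        accuracy[s] = (agree[s] || 0) / total[s];
    });
}

console.log(accuracy); // site-a and site-b end up near 1, site-c near 0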

Lastly, they consider how to split up large sources to avoid computational bottlenecks, and how to merge sources with few facts so they aren’t penalized, without accidentally combining unrelated sources. Their experimental results determined that PageRank and KBT are generally orthogonal, but with a few outliers. In some cases, a site has a low PageRank but a high KBT. They manually verified the top three predicates with high extraction accuracy scores for web sources with a high KBT to check what was happening. 85% of these sources were trustworthy, without extraction errors and with predicates related to the topic of the page, but only 23% of them had PageRank scores over 0.5. In other cases, sources had a low KBT but a high PageRank; these included celebrity gossip sites and forums such as Yahoo Answers. Yes, indeed, Google computer scientists finally have definitive proof that Yahoo Answers tends to be inaccurate.

The conclusion of the article, with its list of future improvements, reads like the learning outcomes for any basic information literacy workshop. The algorithm would need to be able to identify the main topic of a website and filter out unrelated facts, to understand which triples are trivial, to better comprehend what counts as a fact, and to correctly remove sites with data scraped from other sources. That said, for what it does, this is a much more sophisticated model than anything else out there, and it at least demonstrates that it is possible to computationally determine the accuracy of a web source.

What is Truth, Anyway?

Despite the promise of this model, there are clearly many potential problems, of which I’ll mention just a few. The source for this exercise, Freebase, is currently in read-only mode as its data migrates to Wikidata. Google is apparently dropping Freebase to focus on its Knowledge Graph, which is partially Freebase/Wikidata content and partially schema.org data.3 One interesting wrinkle is that much of the Freebase content cites Wikipedia as a source, which means there are currently circular citations that must be properly sourced before they will be accepted as facts. We already know that Wikipedia suffers from a lack of diversity in contributors and topic coverage, so a focus on content from Wikipedia has the danger of reducing the sources of information against which the KBT could check triples.

Beyond that, most of human knowledge and understanding is difficult to fit into triples. While surely no one would search Google for “What is love?” or similar and expect to get a factual answer, there are plenty of less extreme examples that are unclear. For instance, how does this account for controversial topics? Say, “anthropogenic global warming is real” vs. “global warming is real, but it’s not anthropogenic.” 97% of scientists agree with the former, but what if you are looking for what the other 3% are saying?

And we might question whether it’s a good idea to trust an algorithm’s definition of what is true. As Bess Sadler and Chris Bourg remind us, algorithms are not neutral, and may ignore large parts of human experience, particularly that of groups underrepresented in computer science and technology. Librarians should have a role in reducing that ignorance by supporting “inclusion, plurality, participation and transparency.”4 Given the limitations of what is available to the KBT, it seems unlikely that this algorithm would markedly reduce this inequity, though I could see it becoming possible if Wikidata were seeded with more information about diverse groups.

Librarians, take note: this algorithm is still under development and most likely won’t be appearing in our Google results any time in the near future. And even once it does, we need to ensure that we are still paying attention to nuance, edge cases, and our own sense of truthiness–and more importantly, truth–as we evaluate web sources.

  1. http://thecolbertreport.cc.com/videos/63ite2/the-word—truthiness.
  2. Dong, X. et al. “Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources”. Proceedings of the VLDB Endowment, 2015. Retrieved from http://arxiv.org/abs/1502.03519
  3. https://www.wikidata.org/wiki/Help:FAQ/Freebase
  4. Sadler, Bess and Chris Bourg, “Feminism and the Future of Library Discovery.” Code4Lib Journal 28, April 2015.

Best Practices for Hacking Third-Party Sites

While customizing vendor web services is not the most glamorous task, it’s something almost every library does. Whether we have full access to a templating system, as with LibGuides 2, or merely the ability to insert an HTML header or footer, as on many database platforms, we are tested by platform limitations and a desire to make our organization’s fractured web presence cohesive and usable.

What does customizing a vendor site look like? Let’s look at one example before going into best practices. Many libraries subscribe to EBSCO databases, which have a corresponding administrative side, “EBSCOadmin.” Electronic Resources and Web Librarians commonly have credentials for these admin sites. When we sign into EBSCOadmin, there are numerous configuration options for our database subscriptions, including a “branding” tab under the “Customize Services” section.

While EBSCO’s branding options include specifying the primary and secondary colors of their databases, there’s also a “bottom branding” section which allows us to inject custom HTML. Branding colors can be important, but this post focuses on effectively injecting markup onto vendor web pages. The steps for doing so in EBSCOadmin are numerous and not informative for any other system, but the point is that when given custom HTML access one can make many modifications, from inserting text on the page, to adding an entirely new stylesheet, to modifying user interface behavior with JavaScript. Below, I’ve turned footer links orange and written a message to my browser’s JavaScript console using the custom HTML options in EBSCOadmin.

[Screenshot: customized EBSCO database]
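
As a rough illustration, the script portion of such a bottom-branding snippet might contain something like the following; the footer selector and the console message are placeholders of my own, not EBSCO’s actual markup:

document.addEventListener('DOMContentLoaded', function () {
    // recolor the footer links; '#footer a' is a hypothetical selector
    var links = document.querySelectorAll('#footer a');
    for (var i = 0; i < links.length; i++) {
        links[i].style.color = 'orange';
    }
    // leave a note in the browser's JavaScript console
    console.log('Custom library branding loaded.');
});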

These opportunities for customization come in many flavors. We might have access only to a section of HTML in the header or footer of a page. We might be customizing the appearance of our link resolver, subscription databases, or catalog. Regardless, there are a few best practices which can aid us in making modifications that are effective.

General Best Practices

Ditch best practices when they become obstacles

It’s too tempting; I have to start this post about best practices by noting their inherent limitations. When we’re working with a site designed by someone else, the quality of our own code is restricted by decisions they made for unknown reasons. Commonly-spouted wisdom—reduce HTTP requests! don’t use eval! ID selectors should be avoided!—may be unusable or even counter-productive.

To note but one shining example: CSS specificity. If you’ve worked long enough with CSS then you know that it’s easy to back yourself into a corner by using overly powerful selectors like IDs or—the horror—inline style attributes. These methods of applying CSS have high specificity, which means that CSS written later in a stylesheet or loaded later in the HTML document might not override them as anticipated, a seeming contradiction of the “cascade” part of CSS. The hydrogen bomb of specificity is the !important modifier, which automatically overrides anything but another !important later in the page’s styles.

So it’s best practice to avoid inline style attributes, ID selectors, and especially !important. Except that, when hacking on vendor sites, they’re often necessary. What if we need to override an inline style? Suddenly, !important looks necessary. So let’s not get caught up following rules written for people in greener pastures; we’re already in the swamp, and throwing some mud around may be called for.

There are dozens of other examples that come to mind. For instance, in serving content from a vendor site where we have no server-side control, we may be forced to violate web performance best practices such as sending assets with caching headers and utilizing compression. While minifying code is another performance best practice, for small customizations it adds little but obfuscates our work for other staff. Keeping a small script or style tweak human-readable might be more prudent. Overall, understanding why certain practices are recommended, and when it’s appropriate to sacrifice them, can aid our decision-making.

Test. Test. Test. When you’re done testing, test again

Whenever we’re creating an experience on the web it’s good to test. To test with Chrome, with Firefox, with Internet Explorer. To test on an iPhone, a Galaxy S4, a Chromebook. To test on our university’s wired network, on wireless, on 3G. Our users are vast; they contain multitudes. We try to represent their experiences as best as possible in the testing environment, knowing that we won’t emulate every possibility.

Testing is important, sure. But when hacking a third-party site, the variance is more than doubled. The vendor has likely done their own testing. They’ve likely introduced their own hacks that work around issues with specific browsers, devices, or connectivity conditions. They may be using server-side device detection to send subtly different versions of the site to different users; they may not offer the same functionality in all situations. All of these circumstances mean that testing is vitally important and unending. We will never cover enough ground to be sure our hacks are foolproof, but we had better try, or they won’t work at all.

Analytics and error reporting

Speaking of testing, how will we know when something goes wrong? Surely, our users will send us a detailed error report, complete with screenshots and the full schematics of every piece of hardware and software involved. After all, they do not have lives or obligations of their own. They exist merely to make our code more error-proof.

If, however, for some odd reason someone does not report an error, we may still want to know that one occurred. It’s good to set up unobtrusive analytics that record errors or other measures of interaction. Did we revamp a form to add additional validation? Try tracking what proportion of visitors successfully submit the form, how often the validation is violated, how often users submit invalid data multiple times in a row, and how often our code encounters an error. There are some intriguing client-side error reporting services out there that can catch JavaScript errors and detail them for our perusal later. But even a little work with events in Google Analytics can log errors, successes, and everything in between. With the mere information that problems are occurring, we may be able to identify patterns, focus our testing, and ultimately improve our customizations and end-user experience.
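
As a minimal sketch, assuming the page already loads Google Analytics’ analytics.js (the global ga function) and that the form selector below is a placeholder of our own, error and success logging could look something like this:

// Log uncaught JavaScript errors as Google Analytics events.
window.addEventListener('error', function (event) {
    if (typeof ga === 'function') {
        ga('send', 'event', 'JS Errors', event.message || 'unknown error');
    }
});

// Log successful submissions of a form we've customized ('#request-form' is hypothetical).
var form = document.querySelector('#request-form');
if (form) {
    form.addEventListener('submit', function () {
        if (typeof ga === 'function') {
            ga('send', 'event', 'Request Form', 'submitted');
        }
    });
}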

Know when to cut your losses

Some aspects of a vendor site are difficult to customize. I don’t want to say impossible, since one can do an awful lot with only a single <script> tag to work with, but unfeasible. Sometimes it’s best to know when sinking more time and effort into a customization isn’t worth it.

For instance, our repository has a “hierarchy browse” feature which allows us to present filtered subsets of items to users. We often get requests to customize the hierarchies for specific departments or purposes—can we change the default sort, can we hide certain info here but not there, can we use grid instead of list-based results? We probably can, because the hierarchy browse allows us to inject arbitrary custom HTML at the top of each section. But the interface for doing so is a bit clumsy and would need to be repeated everywhere a customization is made, sometimes across dozens of places simply to cover a single department’s work. So while many of these change requests are technically possible, they’re unwise. Updates would be difficult and impossible to automate, virtually ensuring errors are introduced over time as I forget to update one section or make a manual mistake somewhere. Instead, I can focus on customizing the site-wide theme to fix other, potentially larger issues with more maintainable solutions.

A good alternative to tricky and unmaintainable customizations is to submit a feature request to the vendor. Some vendors have specific sites where we can submit ideas for new features and put our support behind others’ ideas. For instance, the Innovative Users Group hosts an annual vote where members can select their most desired enhancement requests. Remember that vendors want to make a better product after all; our feedback is valued. Even if there’s no formal system for submitting feature requests, a simple email to our sales representative or customer support can help.

CSS Best Practices

While the above section spoke to general advice, CSS and JavaScript have a few specific peculiarities to keep in mind while working within a hostile host environment.

Don’t write brittle, overly-specific selectors

There are two unifying characteristics of hacking on third-party sites: 1) we’re unfamiliar with the underlying logic of why the site is constructed in a particular way and 2) everything is subject to change without notice. Both of these make targeting HTML elements, whether with CSS or JavaScript, challenging. We want our selectors to be as flexible as possible, able to withstand as much change as possible without breaking. Say we have the following list of helpful tools in a sidebar:

<div id="tools">
    <ul>
        <li><span class="icon icon-hat"></span><a href="#">Email a Librarian</a></li>
        <li><span class="icon icon-turtle"></span><a href="#">Citations</a></li>
        <li><span class="icon icon-unicorn"></span><a href="#">Catalog</a></li>
    </ul>
</div>

We can modify the icons listed with a selector like #tools > ul > li > span.icon.icon-hat. But many small changes could break this style: a wrapper layer injected in between the #tools div and the unordered list, a switch from an unordered to an ordered list, or a move from <span>s for icons to another tag such as <i>. Instead, a selector like #tools .icon.icon-hat assumes that little will stay the same; it expects there to be icons inside the #tools section, but doesn’t care about anything in between. Some assumptions have to stay; that’s the nature of customizing someone else’s site, but it’s a pretty safe bet that the icon classes will remain.

In general, sibling and child selectors make for poor choices for vendor sites. We’re suddenly relying not just on tags, classes, and IDs to stay the same, but also the particular order that elements appear in. I’d also argue that pseudo-selectors like :first-child, :last-child, and :nth-child() are dangerous for the same reason.

Avoid positioning if possible

Positioning and layout can be tricky to get right on a vendor site. Unless we’re confident in our tests and have covered all the edge cases, try to avoid properties like position and float. In my experience, many poorly structured vendor sites employ ad hoc box-sizing measurements, float-based layout, and lack a grid system. These are all a recipe for weird interconnections between disparate parts—we try to give a call-out box a bit more padding and end up sending the secondary navigation flying a thousand pixels to the right offscreen.

display: none is your friend

display: none is easily my most frequently used CSS property when I customize vendor sites. Can’t turn off a feature in the admin options? Hide it from the interface entirely. A particular feature is broken on mobile? Hide it. A feature is of niche appeal and adds more clutter than it’s worth? Hide it. The footer? Yeah, it’s a useless advertisement, let’s get rid of it. display: none is great but remember it does affect a site’s layout; the hidden element will collapse and no longer take up space, so be careful when hiding structural elements that are presented as menus or columns.

Attribute selectors are excellent

Attribute selectors, which enable us to target an element by the value of any of its HTML attributes, are incredibly powerful. They aren’t very common, so here’s a quick refresher on what they look like. Say we have the following HTML element:

<a href="http://example.com" title="the best site, seriously" target="_blank">

This is an anchor tag with three attributes: href, title, and target. Attribute selectors allow us to target an element by whether it has an attribute or an attribute with a particular value, like so:

/* applies to <a> tags with a "target" attribute */
a[target] {
    background: red;
}
/* applies to <a> tags with an "href" that begin with "http://"
this is a great way to style links pointed at external websites
or one particular external website! */
a[href^="http://"] {
    cursor: help;
}
/* applies to <a> tags with the text "best" anywhere in their "title" attribute */
a[title*="best"] {
    font-variant: small-caps;
}

Why is this useful among the many ways we can select elements in CSS? Vendor sites often aren’t anticipating all the customizations we want to employ; they may not provide handy class and ID styling hooks where we need them. Or, as noted above, the structure of the document may be subject to change either over time or across different pieces of the site. Attribute selectors can help mitigate this by making style bindings more explicit. Instead of saying “change the background icon for some random span inside a list inside a div”, we can say “change the background icon for the link that points at our citation management tool”.

If that’s unclear, let me give another example from our institutional repository. While we have the ability to list custom links in the main left-hand navigation of our site, we cannot control the icons that appear with them. What’s worse, there are virtually no styling hooks available; we have an unadorned anchor tag to work with. But that turns out to be plenty for a selector of the form a[href$=hierarchy] to target all <a>s with an href ending in “hierarchy”; suddenly we can define icon styles based on the URLs we’re pointing at, which is exactly what we want to base them on anyway.

Attribute selectors are brittle in their own ways—when our URLs change, these icons will break. But they’re a handy tool to have.

JavaScript Best Practices

Avoid the global scope

JavaScript has a notorious problem with global variables. By default, all variables lacking the var keyword are made global. Furthermore, variables outside the scope of any function will also be global. Global variables are considered harmful because they too easily allow unrelated pieces of code to interact; when everything’s sharing the same namespace, the chance that common names like i for index or count are used in two conflicting contexts increases greatly.

To avoid polluting the global scope with our own code, we wrap our entire script customizations in an immediately-invoked function expression (IIFE):

(function() {
    // do stuff here 
}())

Wrapping our code in this hideous-looking construction gives it its own scope, so we can define variables without fear of overwriting ones in the global scope. As a bonus, our code still has access to global variables like window and navigator. However, global variables defined by the vendor site itself are best avoided; it is possible they will change or are subject to strange conditions that we can’t determine. Again, the fewer assumptions our code makes about how the vendor’s site works, the more resilient it will be.

Avoid calling vendor-provided functions

Oftentimes the vendor site itself will put important functions in the global scope, functions like submitForm or validate whose intention seems quite obvious. We may even be able to reverse engineer their code a bit, determining what parameters we should pass to these functions. But we must not succumb to the temptation to actually reference their code within our own!

Even if we have a decent handle on the vendor’s current code, it is far too subject to change. Instead, we should seek to add or modify site functionality in a more macro-like way; instead of calling vendor functions in our code, we can automate interactions with the user interface. For instance, say the “save” button is in an inconvenient place on a form and has the following code:

<button type="submit" class="btn btn-primary" onclick="submitForm(0)">Save</button>

We can see that the button saves the form by calling the submitForm function when it’s clicked, passing a value of 0. Maybe we even figure out that 0 means “no errors” whereas 1 means “error”.1 So we could create another button somewhere which calls this same submitForm function. But so many changes could break our code: if the meaning of the “0” changes, if the function name changes, or if something else happens when the save button is clicked that’s not evident in the markup. Instead, we can have our new button trigger the click event on the original save button, exactly as a user interacting with the site would. In this way, our new save button should emulate the behavior of the old one through many types of changes.
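
Here is a hedged sketch of that approach; the selector and placement are assumptions for illustration, not any particular vendor’s markup:

// Find the vendor's save button without touching its submitForm() logic.
var vendorSave = document.querySelector('button[onclick^="submitForm"]');

if (vendorSave && vendorSave.form) {
    // Add our own button at the top of the same form...
    var ourSave = document.createElement('button');
    ourSave.type = 'button';
    ourSave.textContent = 'Save';
    ourSave.addEventListener('click', function () {
        // ...and simply re-trigger a click on the original control,
        // exactly as a user would.
        vendorSave.click();
    });
    vendorSave.form.insertBefore(ourSave, vendorSave.form.firstChild);
}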

{{Insert Your Best Practices Here}}

Web-savvy librarians of the world, what are the practices you stick to when modifying your LibGuides, catalog, discovery layer, databases, etc.? It’s actually been a while since I did customization outside of my college’s IR, so the ideas in this post are more opinion than practice. If you have your own techniques—or disagree with the ones in this post!—we’d love to hear about them in the comments.

Notes

  1. True story, I reverse engineered a vendor form where this appeared to be the case.

Educating Your Campus about Predatory Publishers

The recent publication of Monica Berger and Jill Cirasella’s piece in College and Research Libraries News, “Beyond Beall’s List: Better understanding predatory publishers,” is a reminder that the issue of “predatory publishers” continues to require attention from those working in scholarly communication. Berger and Cirasella have done an exemplary job of laying out some of the issues with Beall’s list, and they call on librarians to be able “to describe the beast, its implications, and its limitations—neither understating nor overstating its size and danger.”

At my institution, academic deans have identified “predatory” journals as an area of concern, and I am sure similar conversations are happening at other institutions. Here’s how I’ve “described the beast” at my institution, along with models for services we all can provide, whether as subject librarians or scholarly communication librarians.

What is a Predatory Publisher? And Why Does the Dean Care?

The concept of predatory publishers became much more widely known in 2013 with the publication of an open access sting by John Bohannon in Science, which I covered in this post. As a recap, Bohannon created a fake but initially believable poor-quality scientific article and submitted it to open access journals. He found that the majority of journals accepted the poor-quality paper, 45% of which were included in the Directory of Open Access Journals. At the time of publication in October 2013, the response to this article was explosive in the scholarly communications world, and it seems that more than a year later the reaction continues to spread. Late in the fall semester of 2014, library administration asked me to prepare a guide about predatory publishers, due to concern among the deans that unscrupulous publishers might be taking advantage of faculty. This was a topic I’d been educating faculty about on an ad hoc basis for years, but I hadn’t realized we needed to address it more systematically. That has all changed, with senior library administration now doing regular presentations about predatory publishers to faculty.

If we are to be advocates of open access, we need to focus on the positive impact that open access has rather than dwell for too long on its downsides. We also need faculty to be clear on their own goals for making their work open access so that they can make more informed choices. Librarians have limited bandwidth with faculty on this topic, and so focusing education on self-archiving articles (otherwise known as green open access) or on choosing no-fee (also known as gold) open access journals is a better way to achieve advocacy goals than suggesting faculty choose only from a certain set of gold open access journals. Unless we are offering money to pay article fees, we also don’t have much say about where faculty choose to publish. Education about how to choose a journal and a license responsibly is what we should focus on, even if it diverges from certain ideals (see Meredith Farkas on choosing Creative Commons licenses).

Understanding the Needs and Preparing the Material

As I mentioned, my library administration asked for a guide that they could use in presentations and share with faculty. In preparing this guide, I worked with our library’s Scholarly Communications committee (of which I am co-chair) to determine the format and content.

We decided that adding this material to our existing Open Access research guide would be the best move, since it was already up and we had shared the URL widely. We have a robust series of Open Access Week events (which I wrote about last fall), and this seemed the ideal place to continue engaging people. That said, we determined that the guide needed an overhaul to make it clearer that open access is an ongoing area of concern, not a once-a-year event. Since faculty are not always immediately thinking of making work open access but rather of the mechanics of publishing, I preferred to start with the title “Publishing Your Own Work”.

To describe its features a bit more, I wanted to start from the mindset of self-archiving work to make it open access, with a description of our repository and Peter Suber’s useful guide to making one’s own work open access. I then continued with an explanation of article publication fees, since I often get questions along those lines. Such fees are not unique to open access journals, and they do not imply that a journal will accept anything for a fee, which was a fear I heard more than once during Open Access Week last year. Only then did I discuss the concept of predatory journals, with the hope that a basic understanding of the publication process would allay fears. I then presented a list of steps for researching a journal. I thought these steps were more common sense than anything, but after conversations with faculty and administration, I realized that my intuition about what type of journal I am dealing with comes from daily practice and experience. For people new to the topic, I tried to break the research down into easy steps that help them figure out where a journal falls on the continuum from outright scam to legitimate but new or unusual. It was also important to me to emphasize self-archiving as a strategy no matter the journal’s publication model.

Lastly, while most academic libraries have a model of liaison librarians engaging in scholarly communications activities, the person who spends every day working on these issues is likely to be more versed in emerging trends. So it is important to work with liaisons to help them research journals and to identify quality open access journals in their disciplines. We plan to add this information to the guide in a future version.

Taking it on the Road

We felt that in-person instruction on these matters with faculty was a crucial next step, particularly for people who publish in traditional journals but want to make their work available. Traditional journals’ copyright transfer agreements can be predatory, even if we don’t think about it in those terms. Taking inspiration from the ACRL Scholarly Communications Roadshow I attended a few years ago, I decided to take the curriculum from that program and offer it to faculty and graduate students. We read through three publication agreements as a group, and then discussed how open the publishers were to reuse of material, or whether they mentioned it at all. We then included a section on addenda to contracts for negotiation about additional rights.

The first workshop received modest attendance but included some thoughtful conversations, and we have promised to run it again. Some people may never have read their agreements closely, and never realized that they were doing something illegal, or at least not specifically allowed, when, for instance, sharing an article they wrote with their students. That concrete realization is more likely to spur action than more abstract arguments about the benefits of open access.

Escaping the Predator Metaphor

If I could go back, I would get rid of the concept of “predator” attached to open access journals. Let’s call them instead unscrupulous entrants into an emerging business model. That’s not as catchy, but it explains why this has happened. I would argue, personally, that the hybrid gold journals from large publishers are just as predatory, as they capitalize on funding requirements to make articles open access with high fees. They too are trying new business models, and those may not be tenable either. As I said above, choosing a journal with eyes wide open and understanding all the ramifications of different publication models is the only way forward. To suggest that faculty are innocently waiting to be pounced on by predators is to deny their agency and their ability to make choices about their own work. There may be days when that metaphor seems apt, but I think overall this is a damaging mentality for librarians interested in promoting new models of scholarly communication. I hope we can provide better resources and programming to escape it, as well as to help administration understand how to choose which open access initiatives to fund.

In the comments I’d like to hear more suggestions about how to escape the “predator” metaphor, as well as your own techniques for educating faculty on your campus.

GIS and Geospatial Data Tools

I was recently appointed the geography subject librarian for my library, which was mildly terrifying considering that I do not have a background in geography. But I was assigned the subject because of my interest in data visualization, and since my appointment I’ve learned a few things about the exciting opportunities to integrate Geographic Information Systems (GIS) and geospatial visualization tools into information literacy instruction and library services generally. A little bit of knowledge about GIS and geospatial visualization goes a long way, and is useful across a variety of disciplines, including the social sciences, business, the humanities, and environmental studies and sciences. If you are into open data (who isn’t?) and you like maps and/or data visualization (who doesn’t?!), then it’s definitely worth learning about some tools and resources for working with geospatial information.

About GIS and Geospatial Data

Geographic Information Systems, or GIS, are software tools that enable visualizing and interpreting data (social, demographic, economic, political, topographic, spatial, natural resources, etc.) using maps and geospatial data. Often data is visualized using layers, where a base map (containing, for example, a political map of a city) or set of tiles is overlaid with shapes, data points, or choropleth shading. For example, in the map linked in the notes, a map of districts in Tokyo is overlaid with data points representing the number of seniors living in each area.1

You may be familiar with Google Earth, which has many features similar to a GIS (though it is arguably not really a GIS, since it lacks the data analysis and query tools typically found in fully featured GIS software). You can download a free Pro version of Google Earth that enables you to import GIS data. GIS data can appear in a variety of formats, and while there isn’t space here to go into each of them, a few common formats you might come across include Shapefiles, KML, and GeoJSON.2 Shapefiles, as the name suggests, represent shapes (e.g., polygons) as layers of vector data that can be visualized in GIS programs and Google Earth Pro. You may also come across KML (Keyhole Markup Language) files, an XML-based standard for representing geographic data that is commonly used with Google Earth and Google Maps. GeoJSON is another format for representing geospatial information and is ideal for use with web services. The various formats of GIS and geospatial data deserve a full post of their own, and I plan to write a follow-up post exploring some of these formats and how they are used in greater detail.
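
For a taste of what GeoJSON looks like, here is a minimal “Feature” for a single point, written as a JavaScript object literal; the coordinates, which GeoJSON lists as [longitude, latitude], are approximately Tokyo Station:

var tokyoStation = {
    type: 'Feature',
    geometry: {
        type: 'Point',
        coordinates: [139.767, 35.681] // [longitude, latitude]
    },
    properties: {
        name: 'Tokyo Station'
    }
};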

GIS/Geospatial Visualization Tools

ArcGIS (ESRI)

ArcGIS is arguably the industry standard for GIS software, and its maker, ESRI, publishes manuals and guides for GIS students and practitioners. There are a few different ArcGIS products: ArcGIS for Desktop, ArcGIS Online, and ArcGIS Server. Personally I am only familiar with ArcGIS Online, but you can do some pretty cool things with a totally free account, like creating this map of where drones can and cannot fly in the United States.3

ArcGIS can be very powerful and is particularly useful for complex geospatial datasets and visualizations (particularly visualizations that might require multiple layers of data or topographic/geologic data). A note about signing up with ArcGIS Online: you don’t actually need to sign up for a “free trial” to explore the software; you can just create a free account that, as I understand it, is not limited to a trial period. Not all features may be available in the completely free account.

CartoDB

CartoDB is both an open source application and a freemium cloud service that can be used to make some pretty amazing geospatial visualizations that can be embedded in web pages, like this choropleth that visualizes the amount of various kinds of pollution across Los Angeles.4

CartoDB’s aesthetics are really strong, and the default map settings tend to be pretty gorgeous. It also leverages Torque to enable animations (which is what’s behind the heatmap animation of this map showing Twitter activity related to Ferguson, MO over time).5 CartoDB can import Shapefiles, GeoJSON, and .csv files, and has a robust SQL API (built on PostgreSQL) that can be used to import and export data. CartoDB also has its own JavaScript library (CartoDB.js) that can be leveraged for building attractive custom apps.
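
As a rough sketch of what querying the SQL API looks like (the account and table names are placeholders, and I’m using the browser’s fetch() for brevity), a request for rows and geometry might be:

// Query the CartoDB SQL API; YOUR_ACCOUNT and my_table are placeholders.
var query = 'SELECT name, the_geom FROM my_table LIMIT 10';
fetch('https://YOUR_ACCOUNT.cartodb.com/api/v2/sql?q=' + encodeURIComponent(query))
    .then(function (response) { return response.json(); })
    .then(function (data) {
        console.log(data.rows); // an array of result rows
    });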

More JavaScript Libraries

In addition to CartoDB.js mentioned above, there are lots of other flexible JavaScript libraries on the scene that can be leveraged for mapping and geospatial visualization:

  • OpenLayers – OpenLayers enables pulling in “tile” layers as base maps from a variety of sources, as well as parsing of vector data in a wide range of formats, such as GeoJSON and KML.
  • Leaflet.js – A fairly user-friendly and lightweight library used for creating basic interactive, mobile-friendly maps. In my opinion, Leaflet is a good library to get started with if you’re just jumping into geospatial visualization; see the short sketch after this list.
  • D3.js – Everyone’s favorite JavaScript charting library also has some geospatial visualization features for certain kinds of maps, such as this choropleth example.
  • Mapbox – Mapbox.js is a JavaScript API library built on top of Leaflet.js, but Mapbox also offers a suite of tools for more extensive mapping and geospatial visualization needs.
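
To give a sense of how little code a basic layered map takes, here is a minimal Leaflet sketch; it assumes Leaflet’s CSS and JS are already loaded and that the page has a <div id="map"> element, and the tile URL is the standard OpenStreetMap template:

// Create a map centered on Tokyo and add an OpenStreetMap base layer.
var map = L.map('map').setView([35.68, 139.77], 11);

L.tileLayer('http://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
    attribution: '&copy; OpenStreetMap contributors'
}).addTo(map);

// Overlay a GeoJSON layer (a single invented point feature here).
L.geoJson({
    type: 'Feature',
    geometry: { type: 'Point', coordinates: [139.767, 35.681] },
    properties: { name: 'Tokyo Station' }
}).addTo(map);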

Open Geospatial Data

Librarians wanting to integrate geospatial data visualization and GIS into interdisciplinary instruction can take advantage of open data sets that are increasingly available online. Sui (2014) notes that increasingly large data sets are being released freely and openly on the web, which is an exciting trend for GIS and open data enthusiasts. However, Sui also notes that the mere fact that data is legally released and made accessible “does not necessarily mean that data is usable (unless one has the technical expertise); thus they are not actually used at all.”6  Libraries could play a crucial role in helping users understand and interpret public data by integrating data visualization into information literacy instruction.

Some popular places to find open data that could be used in geospatial visualization include:

  • Data.gov – Since 2009, Data.gov has published thousands of public open datasets, including datasets containing geographic and geospatial information. As of this month, you can now open geospatial data files directly in CartoDB (requires a CartoDB account) to start making visualizations. There isn’t a huge amount of geospatial data available currently, but Data.gov will hopefully benefit from initiatives like Project Open Data, which was launched in 2013 by the White House and is designed to accelerate the publishing of open data sets by government agencies.
  • Google Public Data Explorer – This is a somewhat small set of public data that Google has gathered from other open data repositories (such as Eurostat) that can be directly visualized using Google charting tools.  For example, you could create a visualization of European population change by country using data available through the Public Data Explorer.  While the currently available data is pretty limited, Google has prepared a kind of open data metadata standard (Data Set Publishing Language, or DSPL) that might increase the availability of data through the explorer if the standard takes off.
  • publicdata.eu – The destination for Europe’s public open data. A nice feature of publicdata.eu is the ability to filter down to datasets that contain Shapefiles (.shp files), which can be directly imported into GIS software or Google Earth Pro.
  • OpenStreetMap (OSM) –  Open, crowdsourced street map data that can be downloaded or referenced to create basemaps or other geospatial visualizations that rely on transportation networks (roads, railways, walking paths, etc.).  OpenStreetMap data are open, so for those who would prefer to make applications that are based entirely on open data (rather than commercial solutions), OSM can be combined with JavaScript libraries like Leaflet.js for fully open geospatial applications.

GIS and Geospatial Visualization In the Library

I feel like I’ve only really scratched the surface with the possibilities for libraries to get involved with GIS and geospatial data.  Libraries are doing really exciting things with these technologies, whether it’s creating new ways of interacting with historical maps, lending GPS units, curating and preserving geospatial data, exploring geospatial linked data possibilities with GeoSPARQL or integrating GIS or geospatial visualization into information literacy / instruction programs.  For more ideas about integrating GIS and geospatial visualization into library instruction and services, check out these guides:

(EDIT 4/13) Also be sure to check out ALA’s Map and Geospatial Information Round Table (MAGIRT).  Thanks to Paige Andrew and Kathy Weimer for pointing out this awesome resource in the comments.

If you’re working on something awesome related to geospatial data in your library and would be interested in writing about it for ACRL TechConnect, contact me on Twitter @lpmagnuson or drop me a line in the comments!

Notes

  1. AtlasPublisher. Tokyo Senior Population. https://www.arcgis.com/home/webmap/viewer.html?webmap=6990a8c5e87b42ee80701cf985383d5d.  (Note:  Apologies if you have trouble seeing or zooming in on embedded visualizations in this post; the interaction behavior of these embedded iframes can be a little unpredictable if your cursor gets near them.  It’s definitely a drawback of embedding these interactive visualizations as iframes.)
  2. The Open Geospatial Consortium is an organization that gathers and shares information about geographic and geospatial data formats, and details about a variety of geospatial file formats and standards can be found on its website:  http://www.opengeospatial.org/.
  3. ESRI. A Nation of Drones. http://story.maps.arcgis.com/apps/MapSeries/?appid=79798a56715c4df183448cc5b7e1b999
  4. Lauder, Thomas Suh (2014). Pollution Burdens. http://graphics.latimes.com/responsivemap-pollution-burdens/.
  5. YMMV, but the performance of map animations that use Torque seems to be a little tricky, especially when embedded in an iFrame.  I tried to embed the Ferguson Twitter map into this post (because it is really cool looking), and it really slowed down page loading, and the script seemed to get stuck at times.
  6. Sui, Daniel. “Opportunities and Impediments for Open GIS.” Transactions in GIS, 18.1 (2014): 1-24.