A Foray into Publishing Open Data on GitHub

While we’ve written about using GitHub for publishing before, in this post I will explore publishing data on GitHub, as opposed to a presentation or academic paper. There are a few services where one can publish research data—FigShare comes to mind—but I wanted to try GitHub because I’m already familiar with the service, it seems suitable for publishing data alongside the scripts used to obtain and process it, and its focus on version control makes it particularly apt for publishing a work in progress. However, even with free services like GitHub available, open data still has hurdles to overcome. How can I, a lowly librarian with no grant funding or experience in this area, publish an open data set such that others can locate and reuse it? Let’s find out.

Background

As Lauren introduced in her last post, we here at ACRL Tech Connect are performing research into coding in libraries: how people learn to code, what learning resources they use, and what languages they use. As part of this research, I wanted to compare what our survey respondents reported with a bulk analysis of GitHub repositories under library organizations. The Code4Lib wiki has an excellent page listing many library GitHub accounts, and GitHub has a nice API that reports, among many other things, the various languages used in a project. Those two sources of information seemed like a perfect match, so I wrote a few scripts to mash them together.
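
For illustration, here is a minimal sketch of the kind of mashup involved; it is not my actual script. It assumes a hand-picked list of organization names copied from the Code4Lib wiki page and uses GitHub’s repository listing and languages endpoints (unauthenticated requests are rate-limited and unpaginated here, so a real script would handle both):

# a sketch, not the actual script; organization names are examples only
from collections import Counter
import requests

ORGS = ["nypl", "dpla"]  # in reality, copied from the Code4Lib wiki page

totals = Counter()
for org in ORGS:
    # list an organization's public repositories
    repos = requests.get("https://api.github.com/orgs/{}/repos".format(org)).json()
    for repo in repos:
        # the languages endpoint returns e.g. {"Python": 51734, "Ruby": 2010},
        # where each number is bytes of code written in that language
        languages = requests.get(repo["languages_url"]).json()
        totals.update(languages)

for language, num_bytes in totals.most_common(10):
    print(language, num_bytes)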

Publishing the scripts that extract and analyze data is important. One cannot trust the results of a single scientific experiment or a poor sample set. Providing the programs used to collect data aims to allow reproducibility, so future researchers can verify or build upon prior results. While we perhaps think of science as being quite established by now, data and reproducibility are major issues in most fields. Ask any data librarian and they’ll tell you: managing the preservation and distribution of research data is not a solved or simple problem. Furthermore, every so often another meta-research study will show that only some dismal percentage of experiments can be replicated.1

My own data is not so valuable. No cure for a debilitating disease rests on the number of bytes of Standard ML in your university’s GitHub account. But on principle I want my results to be repeatable and, what’s more, if someone does find an error in my scripts or data, I want it corrected. Even if my initial conclusions are off, someone might be able to construct a stronger study using my work as a basis.

Step #1: Obtain a DOI

As the first step of publishing my data, I wanted to obtain a Digital Object Identifier. Sure, putting my work up on GitHub gives me a URL I can reference, but leaving it at that adds a lot of uncertainty. What if I change my GitHub username, which is contained in the URL? What if I transfer the repository to a new owner? What if GitHub goes out of business? While none of these are likely scenarios, they’re still worth guarding against. DOI providers essentially stick to a pact that their identifiers will continue to work in perpetuity. While that’s not always the case, I feel like grabbing a DOI is still The Right Thing To Do for publishers at the present moment.

We can use Zenodo to secure a DOI. GitHub already has a fine guide named Making Your Code Citable, but I’ll lightly outline the process here.

First, we create a Zenodo account reusing our GitHub credentials. Zenodo will list out our repositories and we can click the On button next to one to ready it for publishing. This button establishes a “web hook” between events happening in that GitHub repository and Zenodo; when we go to publish a release, Zenodo will be aware of it.

This was the only step that tripped me up a bit. GitHub’s “releases” are not a part of the git version control system, they’re an added feature of the hosting environment. But in my mind they’re identical to git’s “tags” that one uses to label particular points in a repository’s history. Indeed, when we push a tag up to GitHub, it’ll show up on our repository’s releases page. But it appears tags are not technically releases, or don’t trigger the right web hook, because when I pushed a typical “v1.0.0” tag to GitHub, Zenodo didn’t notice. Instead, I had to go to my releases page, Draft a new release, and then Choose an existing tag to associate the version tag with a GitHub release. The title and description entered at this stage are available later in Zenodo.
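
If pushing a tag and then drafting a release by hand gets tedious, the release can also be created programmatically, which fires the same web hook. Below is a rough sketch using GitHub’s releases API, where the user, repository, and token are hypothetical placeholders:

# hypothetical placeholders: USER, REPO, and the OAuth token
import requests

response = requests.post(
    "https://api.github.com/repos/USER/REPO/releases",
    headers={"Authorization": "token YOUR_OAUTH_TOKEN"},
    json={
        "tag_name": "v1.0.0",  # the existing git tag to promote to a release
        "name": "v1.0.0",  # release title, which Zenodo picks up
        "body": "First public version of the data and scripts.",  # description
    },
)
print(response.status_code)  # 201 indicates the release was created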

The final step is back in Zenodo, where we can mint a DOI and describe our project further. We have a powerful set of fields for describing our project in Zenodo, including type (e.g. data set, software, presentation, publication), publication date, list of authors, open-ended description, list of keywords, access rights, license, funding agency, alternative identifiers (e.g. PubMed ID), and more. Zenodo also has a “communities” feature where we can deposit our research in a collection with a disciplinary focus; I put my data in the “Library and Information Science” group.

Step #2: Document the Data’s Schema

Obtaining a DOI is fine and all, but I also wanted to document my data more thoroughly. While it’s not a complicated data set, I’m familiar with the challenges that an unknown data schema presents for end users. All too often at work, I’m forced to revise data processing routines because a new outlier appears. There’ll be a string of text where I’m expecting only integers, a blank entry in what I thought was a required field, or an ID that doesn’t conform to the anticipated pattern (punctuation appears in a barcode! a random letter prefixes an otherwise numeric ID!).

To make our data’s structure clear, we can use the Data Package standard from the Open Knowledge Foundation (OKFN), specifically the Tabular Data Package subset which was designed for the CSV (comma-separated values) format.2 Documenting our data is straightforward with these standards; we place a “datapackage.json” file alongside our data files and fill in a few fields. Here’s an example:

{
    "name": "libs-github-api", // must be URL-friendly, e.g. no spaces 
    "description": "library GitHub projects",
    "license": "CC0 Public Domain",
    "keywords": ["libraries", "programming languages"],
    "resources": [ // list of files 
        {
            "name": "summary",
            "path": "data/summary.csv", // UNIX-style path relative to datapackage.json 
            "format": "csv",
            "mediatype": "text/csv",
            "schema": { // outline of fields within this file 
                "fields": [
                    {
                        "name": "language",
                        "type": "string", /* from a controlled list of data types
                        could also be integer, number, date, etc. */
                        "description": "name of the programming language",
                        "constraints": {
                            "required": true,
                            "unique": true
                        }
                    }
                    // our schema would list a few more fields here… 
                ]
            }
        }
    ]
}

Note that the comments above aren’t valid in JSON; I include them simply to provide some inline explanation.

While it requires a little reading to figure out how to fill out datapackage.json fields, many are self-evident. The appeal of the standard becomes apparent in the schema section: we can tell consumers what types to expect from our data and other particularities of a given CSV column. Does a column contain empty values? Then required will be absent or explicitly set to false. Does a column contain both integers and text? Then the type “string” warns consumers not to anticipate only numeric values. What’s more, we can provide a regular expression in the pattern constraint which specifies exactly how a field may be formatted. Even strange barcodes with surprise punctuation could be documented precisely.
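
For instance, a hypothetical barcode field could declare its exact shape with a pattern constraint; the field and regular expression below are illustrative rather than taken from my actual schema:

{
    "name": "barcode",
    "type": "string",
    "description": "item barcode",
    "constraints": {
        "required": true,
        "pattern": "^[0-9]{14}$"
    }
}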

I would say there’s a lot more to the Data Package standards, but the truth is they’re elegant and concise. One can read all three (Data Package, Tabular Data Package, and JSON Table Schema) in a matter of minutes, look at an instructive example or two, and be ready to reveal their data’s structure in a standardized way. There is great depth available in the way one describes individual resources and their data schema.

Why spend all this time with a data package when we’ve already done something similar with Zenodo? The data package documentation solves a couple of problems. First of all, packaging up our CSV alongside structured data about its nature addresses findability. There’s tons of open data out there; the issue is that it can be scattered and difficult to find. If someone is looking for statistics on programming language usage, how would they go about finding my data? Searching GitHub will be challenging; the keywords one uses (“programming language”, the ambiguous “libraries”, etc.) will likely retrieve many repositories which don’t contain open data, and GitHub, while it does have a decent advanced search form, doesn’t have the facets to make retrieving a particular data set straightforward. One cannot, for instance, filter search results by a repository’s license or the format (CSV, JSON) of the data contained therein.

Data packages address the issue of findability by providing for the possibility of a registry that aggregates all the data sets it knows about. Once a datapackage.json appears, information like whether the format is CSV or JSON, what the license is, who created it, and what subject keywords relate to a repository suddenly becomes clear. The Open Knowledge Foundation already has a strong proof-of-concept registry, albeit one that lists only around a hundred data sets.

Since Data Package is an open standard, any third party can easily parse its metadata and provide search facets based on the fields that are present. This is how the standard addresses a second issue: machine readability. Documenting data sets is good, necessary even, but it often only helps humans. I can write a five-page paper meticulously detailing my data’s collection methodology and structure, but that’s asking researchers to do a lot of reading. Now consider that their research might be on a grand scale; imagine if they needed to read a hundred five-page papers describing ad hoc data schemas!
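
To demonstrate how little effort machine readability demands of a consumer, here is a short sketch that pulls facet-worthy fields out of a datapackage.json using nothing but Python’s standard library; the field names come straight from the example earlier in this post:

import json

with open("datapackage.json") as f:
    package = json.load(f)

# facet-worthy metadata is immediately available as structured data
print("license:", package.get("license"))
print("keywords:", ", ".join(package.get("keywords", [])))
for resource in package.get("resources", []):
    print("file:", resource["path"], "format:", resource.get("format"))
    for field in resource["schema"]["fields"]:
        print("  column:", field["name"], "is of type", field["type"])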

Instead, creating a machine readable description lets my data be processed quickly by a specially designed program. As a somewhat trivial example, I already used the OKFN’s Data Package Validator to ensure my schema documentation met their standard. As a more interesting use case, the OKFN also defines an optional “views” section of the data package standard which allows applications to automatically create charts from our data.
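
For what it’s worth, that validation step can also happen locally. Here is a sketch that assumes OKFN’s datapackage Python library is installed; the exact method names may differ between versions:

# assumes OKFN's datapackage library (pip install datapackage); API may vary
import datapackage

dp = datapackage.DataPackage("datapackage.json")
try:
    dp.validate()  # raises an exception if the descriptor is invalid
    print("datapackage.json conforms to the standard")
except Exception as error:
    print("validation failed:", error)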

Reflection

While I’m glad that tools like Zenodo and standards like Data Package exist for publishing data, there’s still a lot of work to be done in this arena. Every time I make a new release on GitHub, which arguably should happen with even minor changes to my data or scripts, I have to refill the extensive Zenodo form. Zenodo also doesn’t detect the GitHub repository’s license, which is hardly blameworthy given that the information isn’t present as structured data but as mere text in a readme file. However, when publishing a new version of the same underlying data, Zenodo doesn’t fill in the license or other information from its own previous items.

There’s a ton of efficiency left on the table in the data publishing process this post describes. Specifically, an integration between Zenodo and the datapackage.json metadata would alleviate a number of problems. Rather than repeatedly filling out a form in Zenodo, one could simply ensure changes were reflected in the datapackage.json and publish a new version on GitHub. Many fields between the two are redundant, though each also has its unique value; Zenodo asks for typical academic publishing information (e.g. publication type, links to prior versions) while Data Package asks for a data schema.

As a final area of concern, the open-ended “license” field is eventually going to limit the utility of the machine readable information in a Data Package.3 Perhaps this is my inner librarian unnecessarily freaking out here, but uncontrolled fields which affect resource reuse are bad news. Defaulting to authors specifying an arbitrary string of text as a license is precisely the problem that the Digital Public Library of America and other large digital libraries are facing, as their corpora contain thousands of different rights statements.4 Zenodo provides a substantial list of licenses to choose from, but then does a poor job of automatically detecting one even if hints are available via GitHub or a previous incarnation of the publication. GitHub itself should probably make licenses for repositories required and controlled, as I could see that being a vital facet in their advanced search as well as interesting data to expose to researchers via their API.

  1. I read something about this a month or two ago, but wasn’t able to relocate the source. Scouring the web, there’s a Washington Post article from January on the phenomenon of irreproducible research, which in turn points to a PLoS Med article from 2005, “Why Most Published Research Findings Are False”. Other studies along these lines are “A Survey on Data Reproducibility in Cancer Research Provides Insights into Our Limited Ability to Translate Findings from the Laboratory to the Clinic” in PLoS ONE and “Drug development: Raise standards for preclinical cancer research” in Nature.
  2. I might have been able to explore another intriguing project, Research Objects, which has the apt tagline “enabling reproducible, transparent research.” However, the Data Package standards were so easy to find and follow conceptually that I chose them.
  3. And to be fair, I did see one example where the licenses JSON property was specified as an array of objects containing a license name and URL, which might be easier to consume in a script depending on what’s available at the URL.
  4. Aside: I don’t mean to argue that arbitrary license strings should be prohibited, because no controlled vocabulary is going to enumerate all possible choices. But there’s a lot of good work being done to make licenses easier to specify—think of Creative Commons with their composable, versioned licenses which can be referred to by URL. Defaulting to a controlled list of license types or at least pointing to a preferred vocabulary would help here.
