Analyzing Usage Logs with OpenRefine

Background

Like a lot of librarians, I have access to a lot of data, and sometimes no idea how to analyze it. When I learned about linked data and the ability to search against data sources with a piece of software called OpenRefine, I wondered if it would be possible to match our users’ discovery layer queries against the Library of Congress Subject Headings. From there I could use the linking in LCSH to find the Library of Congress Classification, and then get an overall picture of the subjects our users were searching for. As with many research projects, it didn’t really turn out like I anticipated, but it did open further areas of research.

At California State University, Fullerton, we use an open source application called Xerxes, developed by David Walker at the CSU Chancellor’s Office, in combination with the Summon API. Xerxes acts as an interface for any number of search tools, including Solr, federated search engines, and most of the major discovery service vendors. We call it the Basic Search, and it’s incredibly popular with students, with over 100,000 searches a month and growing. It’s also well-liked – in a survey, about 90% of users said they found what they were looking for. We have monthly files of our users’ queries, so I had all of the data I needed to go exploring with OpenRefine.

OpenRefine

OpenRefine is an open source tool that deals with data in a very different way than typical spreadsheets. It has been mentioned in TechConnect before, and Margaret Heller’s post, “A Librarian’s Guide to OpenRefine” provides an excellent summary and introduction. More resources are also available on Github.

One of the most powerful things OpenRefine does is to allow queries against open data sets through a function called reconciliation. In the open data world, reconciliation refers to matching the same concept among different data sets, although in this case we are matching unknown entities against “a well-known set of reference identifiers” (Re-using Cool URIs: Entity Reconciliation Against LOD Hubs).

Reconciling Against LCSH

In this case, we’re reconciling our discovery layer search queries with LCSH. This basically means it’s trying to match the entire user query (e.g. “artist” or “cost of assisted suicide”) against what’s included in the LCSH linked open data. According to the LCSH website this includes “all Library of Congress Subject Headings, free-floating subdivisions (topical and form), Genre/Form headings, Children’s (AC) headings, and validation strings* for which authority records have been created. The content includes a few name headings (personal and corporate), such as William Shakespeare, Jesus Christ, and Harvard University, and geographic headings that are added to LCSH as they are needed to establish subdivisions, provide a pattern for subdivision practice, or provide reference structure for other terms.”

I followed the directions at Free Your Metadata to get started. One note: the steps below apply to OpenRefine 2.5 and version 0.8 of the RDF extension; OpenRefine 2.6 requires version 0.9 of the RDF extension. You could also use LODRefine, which bundles some major extensions and which I hear is great, though I haven't personally tried it. The basic process shouldn't change too much.

(1) Import your data

OpenRefine has quite a few file type options, so your format is likely already supported.

 Screenshot of importing data

(2) Clean your data

In my case, this involves deduplicating by timestamp and removing leading and trailing whitespace. You can also remove stray punctuation, numbers, and even extremely short queries (<2 characters).
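
As a rough sketch of what that cleanup can look like in GREL (OpenRefine's expression language), the transforms and facet below could be applied to the query column; the exact expressions will depend on your data, so treat these as a starting point rather than a recipe.

  • Edit cells → Transform with value.trim() removes leading and trailing whitespace.
  • Edit cells → Transform with value.replace(/\s+/, " ") collapses runs of internal whitespace into a single space.
  • A custom text facet on length(value) > 2 separates out the extremely short queries so they can be removed.
  • To deduplicate, one common approach is to sort the rows, use Edit cells → Blank down, and then facet on blank cells to flag the duplicates for removal.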

(3) Add the RDF extension.

If you’ve done it correctly, you should see an RDF dropdown next to Freebase.

Screenshot of correctly installed RDF extension

(4) Decide which data you’d like to search on.

In this example, I've decided to use just queries that are four or fewer words long, and I removed duplicate search queries. (Xerxes handles facet clicks as if they were separate searches, so there are many duplicates. I usually don't remove duplicates, though, unless they happen at nearly the same time.) I've also experimented with limiting queries to 10 or 15 characters, but there were not many more matches with 15 characters than with 10, even though the data set was much larger. It depends on how much computing time you want to spend; it's really a personal choice. In this case, I chose 4 words because of my experience with 15 characters – longer does not necessarily translate into more matches. A cursory glance at LCSH left me with the impression that the vast majority of headings (not including subdivisions, since they'd be searched individually) were 4 words or less. This, of course, means that queries with more than 4 words go unused – more on that later.
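
As a sketch of this step in GREL, and assuming the search queries are in their own column, a word-count column can be added with Edit column → Add column based on this column and an expression along these lines (this mirrors the ngram-based approach in the screenshot that follows, though your exact expression may differ):

length(ngram(value, 1))

The expression splits each query into single-word ngrams and counts them; length(split(value, " ")) would work just as well. A numeric facet on the new column then limits the data set to queries of four words or fewer before reconciling.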

Screenshot of adding a column based on word count using ngrams

(5) Go!

Shows OpenRefine reconciling

(6) Now you have your queries that were reconciled against LCSH, so you can limit to just those.

Screenshot of limiting to reconciled queries

Finding LC Classification

First, you'll need to extract the cell.recon.match.id – the ID for the matched query, which in the case of LCSH is the URI of the concept.

Screenshot of using cell.recon.match.id to get URI of concept

At this point you can choose whether to grab the HTML or the JSON, and create a new column based on this one by fetching URLs. I’ve never been able to get the parseJson() function to work correctly with LC’s JSON outputs, so for both HTML and JSON I’ve just regexed the raw output to isolate the classification. For more on regex see Bohyun Kim’s previous TechConnect post, “Fear No Longer Regular Expressions.”
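
As a minimal sketch of those two steps, and assuming id.loc.gov serves HTML and JSON representations of each heading at the concept URI plus a file extension: first use Edit column → Add column based on this column on the reconciled column with the expression

cell.recon.match.id

to pull out the URI, and then use Edit column → Add column by fetching URLs on that new column with an expression such as

value + ".html"

(or value + ".json" for the JSON version) to retrieve the raw record that the regular expression below runs against.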

On the raw HTML, the easiest way to do it is to transform the cells or create a new column with:

replace(partition(value,/<li property="madsrdf:classification">(<[^>]+>)*([A-Z]{1,2})/)[1],/<li property="madsrdf:classification">(<[^>]+>)*([A-Z]{1,2})/,"$2")

Screenshot of using regex to get classification

You’ll note this will only pull out the first classification given, even if some have multiple classifications. That was a conscious choice for me, but obviously your needs may vary.

(Also, although I'm only concentrating on classification for this project, there's a huge amount of data that you could work with – take a look at an example URI, such as the one for Acting, to see all of the different fields.)

Once you have the classifications, you can export to Excel and create a pivot table to count the instances of each, and you get a pretty table.

Table of LC Classifications

Caveats & Further Explorations

As you can guess by the y-axis in the table above, the number of matches is a very small percentage of actual searches. First I limited to keyword searches (as opposed to title/subject), then of those only ones that were 4 or fewer words long (about 65% of keyword searches). Of those, only about 1000 of the 26000 queries matched, and resulted in about 360 actual LC Classifications. Most months I average around 500, but in this example I took out duplicates even if they were far apart in time, just to experiment.

One thing I haven’t done but am considering is allowing matches that aren’t 100%. From my example above, there are another 600 or so queries that matched at 50-99%. This could significantly increase the number of matches and thus give us more classifications to work with.

Some of this is related to the types of searches that students are doing (see Michael J DeMars’ and my presentation “Making Data Less Daunting” at Electronic Resources & Libraries 2014, which this article grew out of, for some crazy examples) and some to the way that LCSH is structured. I chose LCSH because I could get linked to the LC Classification and thus get a sense of the subjects, but I’m definitely open to ideas. If you know of a better linked data source, I’m all ears.

I must also note that this is a pretty inefficient way of matching against LCSH. If you know of a way I could download the entire set, I’m interested in investigating that way as well.

Another approach that I will explore is moving away from reconciliation with LCSH (which is really more appropriate for a controlled vocabulary) to named-entity extraction, which takes natural language inputs and tries to recognize or extract common concepts (name, place, etc). Here I would use it as a first step before trying to match against LCSH. Free Your Metadata has a new named-entity extraction extension for OpenRefine, so I’ll definitely explore that option.

Planned Research

In the end, although this is interesting, does it actually mean anything? My next step with this dataset is to take a subset of the search queries and assign classification numbers. Over the course of several months, I hope to see if what I’ve pulled in automatically resembles the hand-classified data, and then draw conclusions.

So far, most of the peaks are expected – psychology and nursing are quite strong departments. There are some surprises though – education has been consistently underrepresented, based both on our enrollment numbers and on word counts (see our presentation for one month's top word counts). Education students have a robust information literacy program. Does this mean that education students do complex searches that don't match LCSH? Do they mostly use subject databases? Once again, this is an area for future research, should these automatic results match the classifications I do by hand.

What do you think? I’d love to hear your feedback or suggestions.

About Our Guest Author

Jaclyn Bedoya has lived and worked on three continents, although currently she’s an ER Librarian at CSU Fullerton. It turns out that growing up in Southern California spoils you, and she’s happiest being back where there are 300 days of sunshine a year. Also Disneyland. Reach her @spamgirl on Twitter or jaclynbedoya@gmail.com


Getting Started with APIs

There has been a lot of discussion in the library community regarding the use of web service APIs over the past few years.  While APIs can be very powerful and provide awesome new ways to share, promote, manipulate and mashup your library’s data, getting started using APIs can be overwhelming.  This post is intended to provide a very basic overview of the technologies and terminology involved with web service APIs, and provides a brief example to get started using the Twitter API.

What is an API?

First, some definitions.  One of the steepest learning curves with APIs involves navigating the terminology, which unfortunately can be rather dense – but understanding a few key concepts makes a huge difference:

  • API stands for Application Programming Interface, which is a specification used by software components to communicate with each other.  If (when?) computers become self-aware, they could use APIs to retrieve information, tweet, post status updates, and essentially run most day-to-day functions for the machine uprising. There is no single API "standard," though one of the most common methods of interacting with APIs involves RESTful requests.
  • REST / RESTful APIs  - Discussions regarding APIs often make references to “REST” or “RESTful” architecture.  REST stands for Representational State Transfer, and you probably utilize RESTful requests every day when browsing the web. Web browsing is enabled by HTTP (Hypertext Transfer Protocol) – as in http://example.org.  The exchange of information that occurs when you browse the web uses a set of HTTP methods to retrieve information, submit web forms, etc.  APIs that use these common HTTP methods (sometimes referred to as HTTP verbs) are considered to be RESTful.  RESTful APIs are simply APIs that leverage the existing architecture of the web to enable communication between machines via HTTP methods.

HTTP Methods used by RESTful APIs

Most web service APIs you will encounter utilize the following core HTTP methods for creating, retrieving, updating, and deleting information through that web service.1  Not all APIs allow each method (at least without authentication), but some common methods for interacting with APIs include:

    • GET – You can think of GET as a way to "read" or retrieve information via an API.  GET is a good starting point for interacting with an API you are unfamiliar with.  Many APIs utilize GET, and GET requests can often be used without complex authentication.  A common example of a GET request that you've probably used when browsing the web is the use of query strings in URLs (e.g., www.example.org/search?query=ebooks).  A short sketch of a GET request in PHP follows this list.
    • POST – POST can be used to “write” data over the web.  You have probably generated  POST requests through your browser when submitting data on a web form or making a comment on a forum.  In an API context, POST can be used to request that an API’s server accept some data contained in the POST request – Tweets, status updates, and other data that is added to a web service often utilize the POST method.
    • PUT – PUT is similar to POST, but can be used to send data to a web service that can assign that data a unique uniform resource identifier (URI) such as a URL.  Like POST, it can be used to create and update information, but PUT (in a sense) is a little more aggressive. PUT requests are designed to interact with a specific URI and can replace an existing resource at that URI or create one if there isn’t one.
    • DELETE – DELETE, well, deletes – it removes information at the URI specified by the request.  For example, consider an API web service that could interact with your catalog records by barcode.2 During a weeding project, an application could be built with DELETE that would delete the catalog records as you scanned barcodes.3
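
To make the GET method above a little more concrete, here is a minimal sketch of a GET request made with PHP's cURL extension; the endpoint and query string are hypothetical placeholders rather than a real API.

<?php

// Minimal sketch of a RESTful GET request using PHP's cURL extension.
// The URL and query string below are hypothetical placeholders.
$url = 'https://www.example.org/api/search?query=ebooks';

$ch = curl_init($url);                          // open a cURL session for the URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response body as a string
$response = curl_exec($ch);                     // send the GET request
curl_close($ch);

echo $response;                                 // raw response body, typically JSON or XML

?>

The same basic pattern, with different options, underlies POST, PUT, and DELETE requests as well.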

Understanding API Authentication Methods

To me, one of the trickiest parts of getting started with APIs is understanding authentication. When an API is made available, the publishers of that API are essentially creating a door to their application's data.  This can be risky:  imagine opening that door up to bad people who might wish to maliciously manipulate or delete your data.  For that reason, APIs often require a key (or a set of keys) to access their data.

One helpful way to contextualize how an API is secured is to consider access in terms of identification, authentication, and authorization.4  Some APIs only want to know where the request is coming from (identification), while others require you to have a valid account (authentication) to access data.  Beyond authentication, an API may also want to ensure your account has permission to do certain functions (authorization).  For example, you may be an authenticated user of an API that allows you to make GET requests of data, but your account may still not be authorized to make POST, PUT, or DELETE requests.

Some common frameworks used by APIs to handle authentication and authorization include OAuth and WSKey:

  • OAuth - OAuth is a widely used open standard for authorizing access to HTTP services like APIs.5  If you have ever sent a tweet from an interface that's not Twitter (like sharing a photo directly from your mobile phone) you've utilized the OAuth framework.  Applications that already store authentication data in the form of user accounts (like Twitter and Google) can utilize their existing authentication structures to assign authorization for API access.  API Keys, Secrets, and Tokens can be assigned to authorized users, and those variables can be used by 3rd party applications without requiring the sharing of passwords with those 3rd parties.
  • WSKey (Web Services Key) – This is an example from OCLC that is conceptually very similar to OAuth.  If you have an OCLC account (either a worldcat.org or an oclc.org account) you can request key access.  Your authorization – in other words, what services and REST requests you are permitted to access – may be dependent upon your relationship with an OCLC member organization.

Keys, Secrets, Tokens?  HMAC?!

API authorization mechanisms often require multiple values in order to successfully interact with the API.  For example, with the Twitter API, you may be assigned an API Key and a corresponding Secret.  The topic of secret key authentication can be fairly complex,6 but fundamentally a Key and its corresponding Secret are used to authenticate requests in a secure, encrypted fashion that would be difficult for malicious third parties to guess or decrypt.  Multiple keys may be required to perform particular requests – for example, the Twitter API requires a key and secret to access the API itself, as well as a token and secret for OAuth authorization.

Probably the most important thing to remember about secrets is to keep them secret.  Do not share them or post them anywhere, and definitely do not store secret values in code uploaded to Github 7 (.gitignore – a method to exclude files from a git repository – is your friend here). 8  To that end, one strategy used by RESTful APIs to further secure secret key values is an HMAC header (hash-based message authentication code).  When requests are sent, HMAC uses your secret key to sign the request without actually passing the secret key value in the request itself. 9
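
As a rough illustration of the idea – not any particular API's signing scheme – PHP's built-in hash_hmac() function can sign a request string with a secret that never travels with the request; the base string and key values below are hypothetical.

<?php

// Minimal sketch of HMAC request signing; all values below are hypothetical.
// OAuth-style APIs build a "signature base string" from the HTTP method,
// the request URL, and the sorted request parameters.
$baseString = 'GET&https%3A%2F%2Fapi.example.org%2Fresource&count%3D1';
$signingKey = 'YOUR_API_SECRET' . '&' . 'YOUR_TOKEN_SECRET';

// Sign the base string with the secret and base64-encode the raw binary digest.
$signature = base64_encode(hash_hmac('sha1', $baseString, $signingKey, true));

echo $signature; // the signature travels with the request instead of the secret

?>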

Case Study:  The Twitter API

It's easier to understand how APIs work when you can see them in action.  To do these steps yourself, you'll need a Twitter account.  I strongly recommend creating a Twitter account separate from your personal or organizational accounts for initial experimentation with the API.  This code example is a very simple walkthrough, and does not cover securing your application's server (and thus securing the keys that may be stored on that server).  Anytime you authorize API access to a Twitter account, you may be exposing it to some level of vulnerability.  At the end of the walkthrough, I'll list the steps you would need to take if your account does get compromised.

1.  Activate a developer account

Visit dev.twitter.com and click the sign in area in the upper right corner.  Sign in with your Twitter account. Once signed in, click on your account icon (again in the upper right corner of the page) and then select the My Applications option from the drop-down menu.

Screenshot of the Twitter Developer Network login screen

2.  Get authorization

In the My applications area, click the Create New App button, and then fill out the required fields (Name, Description, and Website where the app code will be stored).  If you don’t have a web server, don’t worry, you can still get started testing out the API without actually writing any code.

3.  Get your keys

After you’ve created the application and are looking at its settings, click on the API Keys tab.  Here’s where you’ll get the values you need.  Your API Access Level is probably limited to read access only.  Click the “modify app permissions” link to set up read and write access, which will allow you to post through the API.  You will have to associate a mobile phone number with your Twitter account to get this level of authorization.

Screenshot of Twitter API options that allow for configuring API read and write access.

Scroll down and note that in addition to an API Key and Secret, you also have an access token associated with OAuth access.  This Token Key and Secret are required to authorize account activity associated with your Twitter user account.

4.  Test OAuth Access / Make a GET call

From the application API Key page, click the Test OAuth button.  This is a good way to get a sense of the API calls.  Leave the key values as they are on the page, and scroll down to the Request Settings Area.  Let’s do a call to return the most recent tweet from our account.

With the GET request checked, enter the following values:

Request URI:

  • https://api.twitter.com/1.1/statuses/user_timeline.json

Request Query (obviously replace yourtwitterhandle with… your actual Twitter handle):

  • screen_name=yourtwitterhandle&count=1

For example, my GET request looks like this:

Screenshot of the GET request setup screen for OAuth testing.

Click "See OAuth signature for this request".  On the next page, look for the cURL request.  You can copy and paste this into a terminal or console window to execute the GET request and see the response (there will be a lot more response text than what's posted here):

* SSLv3, TLS alert, Client hello (1):
[{"created_at":"Sun Apr 20 19:37:53 +0000 2014","id":457966401483845632,
"id_str":"457966401483845632",
"text":"Just Added: The Fault in Our Stars by John Green; 
2nd Floor PZ7.G8233 Fau 2012","

As you can see, the above response to my cURL request includes the text of my account’s last tweet:

Screenshot of the tweet returned by the API request
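
If you want to work with that response in code rather than just read it, a few lines of PHP will decode it; the snippet below assumes the full JSON response text has been captured in a variable rather than printed to the terminal.

<?php

// Minimal sketch of parsing the JSON returned by the GET request above.
// $response stands in for the full text returned by the cURL call.
$response = '[{"created_at":"Sun Apr 20 19:37:53 +0000 2014","id_str":"457966401483845632","text":"Just Added: The Fault in Our Stars by John Green; 2nd Floor PZ7.G8233 Fau 2012"}]';

$tweets = json_decode($response, true); // decode the JSON into a PHP array

echo $tweets[0]['text'];                // prints the text of the most recent tweet

?>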

What to do if your Twitter API Key or OAuth Security is Compromised

If your Twitter account suddenly starts tweeting out spammy “secrets to weight loss success” that you did not authorize (or other tweets that you didn’t write), your account has been compromised.  If you can still login with your username and password, it’s likely that your OAuth Keys have been compromised.  If you can’t log in, your account has probably been hacked.10  Your account can be compromised if you’ve authorized a third party app to tweet, but if your Twitter account has an active developer application on dev.twitter.com, it could be your own application’s key storage that’s been compromised.

Here are the immediate steps to take to stop the spam:

  1. Revoke access to third party apps under Settings –> Apps.  You may want to re-authorize them later – but you’ll probably want to reset the password for the third-party accounts that you had authorized.
  2. If you have generated API keys, log into dev.twitter.com and re-generate your API Keys and Secrets and your OAuth Keys and Secrets.  You’ll have to update any apps using the keys with the new key and secret information – but only if you have verified the server running the app hasn’t also been compromised.
  3. Reset your Twitter account password.11

5.  Taking it further:  Posting a new titles Twitter feed

So now you know a lot about the Twitter API – what now?  One way to take this further might involve writing an application to post new books that are added to your library’s collection.  Maybe you want to highlight a particular subject or collection – you can use some text output from your library catalog to post the title, author, and call number of new books.

The first step to such an application could involve creating an app that can post to the Twitter API.  If you have access to a  server that can run PHP, you can easily get started by downloading this incredibly helpful PHP wrapper.

Then in the same directory create two new files:

  • settings.php, which contains the following code (replace all the values in quotes with your actual Twitter API Key information):
<?php

$settings = array(
 'oauth_access_token' => "YOUR_ACCESS_TOKEN",
 'oauth_access_token_secret' => "YOUR_ACCESS_TOKEN_SECRET",
 'consumer_key' => "YOUR_API_KEY",
 'consumer_secret' => "YOUR_API_KEY_SECRET"
);

?>
  • and twitterpost.php, which has the following code, but swap out the values of ‘screen_name’ with your Twitter handle, and change the ‘status’ value if desired:
<?php

//call the PHP wrapper and your API values
require_once('TwitterAPIExchange.php');
include 'settings.php';

//define the request URL and REST request type
$url = "https://api.twitter.com/1.1/statuses/update.json";
$requestMethod = "POST";

//define your account and what you want to tweet
$postfields = array(
  'screen_name' => 'YOUR_TWITTER_HANDLE',
  'status' => 'This is my first API test post!'
);

//put it all together and build the request
$twitter = new TwitterAPIExchange($settings);
echo $twitter->buildOauth($url, $requestMethod)
->setPostfields($postfields)
->performRequest();

?>

Save the files and run the twitterpost.php page in your browser. Check the Twitter account referenced by the screen_name variable.  There should now be a new post with the contents of the ‘status’ value.

This is just a start – you would still need to get data out of your ILS and feed it to this application in some way – which brings me to one final point.

Is there an API for your ILS?  Should there be? (Answer:  Yes!)

Getting data out of traditional, legacy ILS systems can be a challenge.  Extending or adding on to traditional ILS software can be impossible (and in some cases may have been prohibited by license agreements).  One of the reasons for this might be that the architecture of such systems was designed for a world where the kind of data exchange facilitated by RESTful APIs didn’t yet exist.  However, there is definitely a major trend by ILS developers to move toward allowing access to library data within ILS systems via APIs.

It can be difficult to articulate exactly why this kind of access is necessary – especially when looking toward the future of rich functionality in emerging web-based library service platforms.  Why should we have to build custom applications using APIs – shouldn’t our ILS systems be built with all the functionality we need?

While libraries should certainly promote comprehensive and flexible architecture in the ILS solutions they purchase, there will almost certainly come a time when no matter how comprehensive your ILS is, you're going to wonder, "wouldn't it be nice if our system did X"?  Moreover, consider how your patrons might use your library's APIs; for example, integrating your library's web services into other apps and services they already use, or building their own applications with your library's web services. If you have web service API access to your data – bibliographic, circulation, acquisition data, etc. – you have the opportunity to meet those needs and to innovate collaboratively.  Without access to your data, you're limited to the development cycle of your ILS vendor, and it may be years before you see the functionality you really need to do something cool with your data.  (It may still be years before you can find the time to develop your own app with an API, but that's an entirely different problem.)

Examples of Library Applications built using APIs and ILS API Resources

Further Reading

Michel, Jason P. Web Service APIs and Libraries. Chicago, IL:  ALA Editions, 2013. Print.

Richardson, Leonard, and Michael Amundsen. RESTful Web APIs. Sebastopol, Calif.: O’Reilly, 2013.

 

About our Guest Author:

Lauren Magnuson is Systems & Emerging Technologies Librarian at California State University, Northridge and a Systems Coordinator for the Private Academic Library Network of Indiana (PALNI).  She can be reached at lauren.magnuson@csun.edu or on Twitter @lpmagnuson.

 

Notes

  1. create, retrieve, update, and delete is sometimes referred to by acronym: CRUD
  2. For example, via the OCLC Collection Management API: http://www.oclc.org/developer/develop/web-services/wms-collection-management-api.en.html
  3. For more detail on these and other HTTP verbs, http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
  4. https://blog.apigee.com/detail/do_you_need_api_keys_api_identity_vs._authorization
  5. Google, for example: https://developers.google.com/accounts/docs/OAuth2
  6. To learn a lot more about this, check out this web series: http://www.youtube.com/playlist?list=PLB4D701646DAF0817
  7. http://www.securityweek.com/github-search-makes-easy-discovery-encryption-keys-passwords-source-code
  8. Learn more about .gitignore here:  https://help.github.com/articles/ignoring-files
  9. A nice overview of HMAC is here: http://www.wolfe.id.au/2012/10/20/what-is-hmac-and-why-is-it-useful
  10. Here's what to do if your account is hacked and you can't log in:  https://support.twitter.com/articles/185703-my-account-has-been-hacked
  11. More information, and further steps you can take are here:  https://support.twitter.com/articles/31796-my-account-has-been-compromised

Lightweight Project Management Tools in the Real World

My life got extra complicated in the last few months. I gave birth to my first child in January, and in between the stress of a new baby, unexpected hospital visits, and the worst winter in 35 years, it was a trying time. While I was able to step back from many commitments during my 8 week maternity leave, I didn't want to be completely out of the loop, and since I would come back to three conferences back to back, I needed to be able to jump back in and monitor collaborative projects from wherever. All of us have times in our lives that are this hectic or even more so, but even in the regular busy thrum of our professional lives it's too easy to let ongoing commitments like committee work completely disappear from our mental landscapes, leaving only the nagging feeling that you are missing something.

There are various methods and tools to enhance productivity, which we’ve looked at before. Some basic collaboration tools such as Google Docs are always good to have any time you are working on a group project that builds into something like a presentation or report. But for committee work or every day work in a department, something more specialized can be even better. I want to look at some real-life examples of using lightweight project management tools to keep projects that you work on with others going strong—or not so strong, depending on how they are used. Over the past 4-5 months I’ve gotten experience using Trello for committee work and Asana for work projects. Both of them have some great features, but as always the implementation doesn’t depend entirely on the software’s functionality. Beyond my experience with these two implementations I’ll address a few other tools and my experience with effective usage of them.

Asana

Asana Screenshot

I have the great fortune of having an entire wall of my office painted with white board paint, and I use it to sketch out ideas and projects. For that to be useful, I need to physically be in the office. So before I went on maternity leave, I knew I needed to get all my projects at work organized in a way that I could give tasks I would normally do to others, as well as monitor what was happening on large on-going projects. I had used Asana before in another context, so I decided to give it a try for this purpose. Asana has projects, tasks, and due dates that anyone in a workspace can follow and assign. It's a pretty flexible system–the screenshot shows one potential way of setting it up, but we use different models for different projects, and there are many ideas out there. My favorite feature is project templates, which I use in another workspace that I share with my graduate assistant. This allows you to create a new project based on a standard series of steps, which means that she could create new projects while I was away based on the normal workflow we follow and I could work on them when I returned. All of this requires a very strict attention to keeping projects organized, however, and if you don't have an agreed upon system for naming and organizing tasks they can get out of hand very quickly.

We also use Asana as part of our help request system. We wanted to set up a system to track requests from all the library staff, not only for my maternity leave but in general. I looked at many different systems, but they were almost all too heavy-duty for what we needed, so I made our own very lightweight system using the Webform module in Drupal on our intranet. Staff submit requests through that form, which sends an email using a departmental email address to our Issue Tracking queue in Asana. Once the task is completed we explain the problem in an Asana comment (or just mark it completed if it's a routine request such as a new user account), and then send a reply to the requestor through the intranet. They can see all the requests they've made plus the replies through that system. The nice thing about doing it this way is that everything is in one place–trouble tickets become projects with tasks very easily.

Trello

Trello screenshot

Trello is designed to mimic the experience of using index cards or sticky notes on a wall to track ideas and figure out what is going on at a glance. This is particularly useful for ongoing work where you have multiple projects in a set of pipelines divvied up among various people. You can easily see how many ideas you have in the inception stage and how many are closer to completion, which can be a good motivator to move items along. Another use is to store detailed project ideas and notes and then sort them into lists once you figure out a structure.

Trello starts with a virtual board, which is divided into lists of cards. Trello cards can be assigned to specific people, and anyone can follow a card to get notifications. Clicking on a card brings up a whole set of additional options, including who is working on the project, attachments, due dates, color coding, and anything else you might want. The screenshot shows how the LITA Education Committee uses Trello to plan educational offerings. The white areas with small boxes indicate cards (we use one card per program/potential idea) that are active and assigned, the gray areas indicate cards which haven’t been touched in a while and so probably need followup. Not surprisingly, there are many more cards, many of which are inactive, at the beginning of the pipeline than at the end with programs already set up. This is a good visual reminder that we need to keep things moving along.

In this case I didn’t set up Trello, and I am not always the best user of it. Using this for committee work has been useful, but there are a few items to keep in mind for it to actually work to keep projects going. First, and this goes for everything, including analog cards or sticky notes, all the people working on the project need to check into it on a regular basis and use it consistently. One thing that I found was important to do to get it into a regular workflow was turn on email notifications. While it would be nice to stay out of email more, most of us are used to finding work show up there, and if you have a sane relationship to your inbox (i.e. you don’t use it to store work in progress), it can be helpful to know to log in to work on something. I haven’t used the mobile app yet, but that is another option for notifications.

Other Tools

While I have started using Asana and Trello more heavily recently, there are a number of other tools out there that you may need to use in your job or professional life. Here are a few:

Box

Many institutions have some sort of “cloud” file system now such as Box or Google Drive. My work uses Box, and I find it very useful for parts of projects where I need many people (but a slightly different set each time) to collaborate on completing a single task. I upload a spreadsheet that I need everyone to look at, use the information to do something, and then add additional information to the spreadsheet. This is a very common scenario that organizations often use a shared drive to accomplish, but there are a number of problems with that approach. If you’ve ever been confronted with the filename “Spring2014_report-Copy-Copy-DRAFT.xlsx” or not been able to open a file because someone else left it open on her desktop and went to lunch, you know what I mean. Instead of that, I upload the file to Box, and assign a task to the usernames of all the people I need to look at the document. They can use a tool called Box Edit to open the file in Excel and any changes they make are immediately saved back to the shared document, just as a Google Doc would do. They can then mark the task complete, and the system only sends email reminders to people who haven’t yet finished the task.

ALA Connect

This section is only relevant to people working on projects with an American Library Association group, whether a committee or interest group. Since this happens to most people working in academic libraries at some point, I think it’s worth considering. But if not,  skip to the conclusion. ALA Connect is the central repository for institutional memory and documents for work around ALA, including committees and interest groups. It can also be a good place to work on project collaboratively, but it takes some setup. As a committee chair, I freely admit that I need to organize my own ALA Connect page much better. My normal approach was to use an online document (so something editable by everyone) for each project and file each document under a subcommittee heading, but in practice I find it way too hard to find the right document to see what each subcommittee is working on. I am going to experiment with a new approach. I will create “groups” for each project, and use the Group Headings sidebar to organize these. If you’re on a committee and not the chair, you don’t have access to reorganize the sidebar or posts, but suggest this approach to your chair if you can’t find anything in “General News & Discussions”. Also, try to document the approach you’ve taken so future chairs will know what you did, and let other chairs know what works for your committee.

You also need to make a firm commitment as a chair to hold certain types of discussions on your committee mailing list, and certain discussions on ALA Connect, and then to document any pertinent mailing list discussions on ALA Connect. That way you won’t be unable to figure out where you are on the project because half your work is in email and half on ALA Connect. (This obviously goes for any other tool other than email as well).

 Conclusion

With all the tools above, you really have no excuse to be running projects through email, which is not very effective unless everyone you are working with is very strict with their email filing and reply times. (Hint: they aren’t—see above about a sane relationship with your inbox.) But any tool requires a good plan to understand how its strengths mesh with work you have to accomplish. If your project is to complete a document by a certain date, a combination of Google Docs or Box (or ALA Connect for ALA work) and automated reminders might be best. If you want to throw a lot of ideas around and then organize them, Trello or Asana might work. Since these are all free to try, explore a few tools before starting a big project to see what works for you and your collaborators. Once you pick one, dedicate a bit of time on a weekly or monthly basis to keeping your virtual workspace organized. If you find it’s no longer working, figure out why. Did the scope of your project change over time, and a different tool is now more effective? This can happen when you are planning to implement something and switch over from the implementation to ongoing work using the new system. Or maybe people have gotten complacent about checking in on work to do. Explore different types of notifications or mobile apps to reinvigorate your team.

I would love to hear about your own approach to lightweight project management with these tools or others in the comments.

 


One Shocking Tool Plus Two Simple Ideas That Will Forever Change How You Share Links

The Click Economy

The economy of the web runs on clicks and page views. The way web sites turn traffic into profit is complex, but I think we can get away with a broad gloss of the link economy as long as we acknowledge that greater underlying complexity exists. Basically speaking, traffic (measured in clicks, views, unique visitors, length of visit, etc.) leads to ad revenue. Web sites benefit when viewers click on links to their pages and when these viewers see and click on ads. The scale of the click economy is difficult to visualize. Direct benefits from a single click or page view are minuscule. Profits tend to be nonexistent or trivial on any scale smaller than unbelievably massive. This has the effect of making individual clicks relatively meaningless, but systems that can magnify clicks and aggregate them are extremely valuable. What this means for the individual person on the web is that unless we are Arianna Huffington, Sheryl Sandberg, Larry Page, or Mark Zuckerberg we probably aren't going to get rich off of clicks. However, we do have impact and our online reputations can significantly influence which articles and posts go viral. If we understand how the click economy works, we can use our reputation and influence responsibly. If we are linking to content we think is good and virtuous, then there is no problem with spreading "link juice" indiscriminately. However, if we want to draw someone's attention to content we object to, we can take steps to link responsibly and not have our outrage fuel profits for the content's author. 1 We've seen that links benefit the site's owners in two ways: directly through ad revenues and indirectly through "link juice" or the positive effect that inbound links have on search engine ranking and social network trend lists. If our goal is to link without benefiting the owner of the page we are linking to, we will need a separate technique for each of the two ways a web site benefits from links.

For two excellent pieces on the click economy, check out Robinson Meyer's Why are Upworthy Headlines Suddenly Everywhere?2 in the Atlantic Monthly and Clay Johnson's book The Information Diet, especially the New Journalists section of chapter three.3

Page Rank

Page Rank is the name of a key algorithm Google uses to rank web pages it returns. 4 It counts inbound links to a page and keeps track of the relative importance of the sites the links come from. A site's Page Rank score is a significant part of how Google decides to rank search results. 5 Search engines like Google recognize that there would be a massive problem if all inbound links were counted as votes for a site's quality. 6 Without some mechanism to communicate "I'm linking to this site as an example of awful thinking" there really would be no such thing as bad publicity, and a website with thousands of complaints and zero positive reviews would shoot to the top of search engine rankings. For example, every time a librarian used martinlutherking.org (a malicious propaganda site run by the white supremacist group Stormfront) as an example in a lesson about web site evaluation, the page would rise in Google's ranking and more people would find it in the course of natural searches for information on Dr. King. When linking to malicious content, we can avoid increasing its Page Rank score by adding the rel="nofollow" attribute to the anchor link tag. A normal link is written like this:

<a href="http://www.horriblesite.com/horriblecontent/" target="_blank">This is a horrible page.</a>

This link would add the referring page's reputation or "link juice" to the horrible site's Page Rank. To fix that, we need to add the rel="nofollow" attribute.

<a href="http://www.horriblesite.com/horriblecontent/" target="_blank" rel="nofollow">This is a horrible page.</a>

This addition communicates to the search engine that the link should not count as a vote for the site's value or reputation. Of course, not all linking takes place on web pages anymore. What happens if we want to share this link on Facebook or Twitter? Both Facebook and Twitter automatically add rel="nofollow" to their links (you can see this if you view page source), but we should not rely on that alone. Social networks aggregate links and provide their own link juice similarly to search engines. When sharing links on social networks, we'll want to employ a tool that keeps control of the link's power in our own hands. donotlink.com is a very interesting tool for this purpose.

donotlink.com

donotlink.com is a service that creates safe links that don't pass on any reputation or link juice. It is ideal for sharing links to sites we object to. On one level, it works similarly to a URL shortener like bit.ly or tinyurl.com. It creates a new URL customized for sharing on social networks. On deeper levels, it does some very clever stuff to make sure no link juice dribbles to the site being linked. They explain what, why, and how very well on their site. Basically speaking, donotlink.com passes the link through a new URL that uses javascript, a robots.txt file, and the nofollow and noindex link attributes to both ask search engines and social networks not to apply link juice and to make it structurally difficult to ignore these requests. 7 This makes donotlink.com's link masking service an excellent solution to the problem of web sites indirectly profiting from negative attention.
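
For context, those signals look something like the snippets below; this is a generic illustration of the techniques, not donotlink.com's actual code. A robots meta tag in a page's HTML asks search engines not to index the page or follow its links:

<meta name="robots" content="noindex, nofollow">

and a robots.txt file at the root of a site can ask crawlers to stay away from it entirely:

User-agent: *
Disallow: /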

Page Views & Traffic

All of the techniques listed above will deny a linked site the indirect benefits of link juice. They will not, however, deny the site the direct benefits from increased traffic or views and clicks on the page's advertisements. There are ways to share content without generating any traffic or advertising revenues, but these involve capturing the content and posting it somewhere else, so they raise ethical questions about respect for intellectual property. I suggest using them only with both caution and intentionality. A quick and easy way to direct traffic to content without benefiting the hosting site is to use a link to Google's cache of the page. If you can find a page in a Google search, clicking the green arrow next to the URL (see image) will give the option of viewing the cached page. Then just copy the full URL and share that link instead of the original. Viewers can read the text without giving the content page views. Not all pages are visible on Google, so the Wayback Machine from the Internet Archive is a great alternative. The Wayback Machine provides access to archived versions of web pages and also has a mechanism (see the image on the right) for adding new pages to the archive.

screengrab of google cache

Screengrab of Google Cache

screengrab of wayback machine

Caching a site at the wayback machine
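
Putting these ideas together, a link you share could point at an archived copy and still carry the nofollow attribute. The anchor tag below is illustrative: the Wayback Machine serves snapshots at URLs of the form web.archive.org/web/ followed by a timestamp (or year) and the archived address, so the exact URL will depend on when the page was captured.

<a href="https://web.archive.org/web/2014/http://www.horriblesite.com/horriblecontent/" target="_blank" rel="nofollow">This is a horrible page (archived copy).</a>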

Both of these solutions rely on external hosts and if the owner of the content is serious about erasing a page, there are processes for removing content from both Google’s cache and the Wayback Machine archives. To be certain of archiving content, the simplest solution is to capture a screenshot and share the image file. This gives you control over the image, but may be unwieldy for larger documents. In these cases saving as a PDF may be a useful workaround. (Personally, I prefer to use the Clearly browser plugin with Evernote, but I have a paid Evernote account and am already invested in the Evernote infrastructure.)

Summing up

In conclusion, there are a number of steps we can take when we want to be responsible with how we distribute link juice. If we want to share information without donating our online reputation to the information’s owner, we can use donotlink.com to generate a link that does not improve their search engine ranking. If we want to go a step further, we can link to a cached version of the page or share a screenshot.

Notes

  1. Using outrageous or objectionable content to generate web traffic is a black-hat SEO technique known as “evil hooks.” There is a lot of profit in “You won’t believe what this person said!” links.
  2. http://www.theatlantic.com/technology/archive/2013/12/why-are-upworthy-headlines-suddenly-everywhere/282048/
  3. The Information Diet, page 35-41
  4. https://en.wikipedia.org/wiki/PageRank
  5. Matt Cutts' How Search Works video.
  6. I’ve used this article http://www.nytimes.com/2010/11/28/business/28borker.html to explain this concept to my students. It is also referenced by donotlink.com in their documentation.
  7. javascript is slightly less transparent to search engines and social networks than is HTML, robots.txt is a file on a web server that tells search engine bots which pages to crawl (it works more like a no trespassing sign than a locked gate), noindex tells bots not to add the link to its index.

Higher ‘Professional’ Ed, Lifelong Learning to Stay Employed, Quantified Self, and Libraries

The 2014 Horizon Report is mostly a report on emerging technologies. Many academic librarians carefully read its Higher Ed edition issued every year to learn about the upcoming technology trends. But this year’s Horizon Report Higher Ed edition was interesting to me more in terms of how the current state of higher education is being reflected on the report than in terms of the technologies on the near-term (one-to-five year) horizon of adoption. Let’s take a look.

A. Higher Ed or Higher Professional Ed?

To me, the most useful section of this year’s Horizon Report was ‘Wicked Challenges.’ The significant backdrop behind the first challenge “Expanding Access” is the fact that the knowledge economy is making higher education more and more closely and directly serve the needs of the labor market. The report says, “a postsecondary education is becoming less of an option and more of an economic imperative. Universities that were once bastions for the elite need to re-examine their trajectories in light of these issues of access, and the concept of a credit-based degree is currently in question.” (p.30)

Many of today's students enter colleges and universities with a clear goal, i.e. obtaining a competitive edge and a better earning potential in the labor market. The result, already familiar to many of us, is grade and degree inflation and the emergence of higher ed institutions that pursue profit over even education itself. When the acquisition of skills takes precedence over intellectual inquiry for its own sake, higher education comes to resemble higher professional education or intensive vocational training. As the economy almost forces people to take up the practice of lifelong learning simply to stay employed, the friction between the traditional goal of higher education – intellectual pursuit for its own sake – and the changing expectation of higher education – a creative, adaptable, and flexible workforce – will only become more prominent.

Naturally, this socioeconomic background behind the expansion of postsecondary education raises the question of where its value lies. This is the second wicked challenge listed in the report, i.e. “Keeping Education Relevant.” The report says, “As online learning and free educational content become more pervasive, institutional stakeholders must address the question of what universities can provide that other approaches cannot, and rethink the value of higher education from a student’s perspective.” (p.32)

B. Lifelong Learning to Stay Employed

Today's economy and labor market strongly prefer employees who can be hired, retooled, or let go at the same pace as changes in technology, as technology becomes one of the greatest driving forces of the economy. Workers are expected to enter the job market with more complex skills than in the past, to be able to adjust themselves quickly as important skills at workplaces change, and increasingly to take the role of a creator/producer/entrepreneur in their thinking and work practices. Credit-based degree programs fall short in this regard. It is no surprise that the report selected "Agile Approaches to Change" and "Shift from Students as Consumers to Students as Creators" as its long-range and mid-range key trends.

A strong focus on creativity, productivity, entrepreneurship, and lifelong learning, however, puts a heavier burden on both sides of education, i.e. instructors and students (full-time, part-time, and professional). While positive in emphasizing students' active learning, the Flipped Classroom model, selected as one of the key trends in the Horizon Report, often means additional work for instructors. In this model, instructors not only have to prepare the study materials for students to go over before the class, such as lecture videos, but also need to plan active learning activities for students during the class time. The Flipped Classroom model also assumes that students are able to invest enough time outside the classroom to study.

The unfortunate side effect or consequence of this is that those who cannot afford to do so – for example, those who have to work multiple jobs or have many family obligations, etc. – will suffer and fall behind. Today's students and workers are now being asked to demonstrate their competencies with what they can produce beyond simply presenting the credit hours that they spent in the classroom. Probably as a result of this, a clear demarcation between work, learning, and personal life seems to be disappearing. "The E-Learning Predictions for 2014 Report" from EdTech Europe predicts that 'Learning Record Stores', which track, record, and quantify an individual's experiences and progress in both formal and informal learning, will emerge in step with the need for continuous learning required for today's job market. EdTech Europe also points out that learning is now being embedded in daily tasks and that we will see a significant increase in the availability and use of casual and informal learning apps both in education and in the workplace.

C. Quantified Self and Learning Analytics

Among the six emerging technologies in the 2014 Horizon Report Higher Education edition, ‘Quantified Self’ is by far the most interesting new trend. (Other technologies should be pretty familiar to those who have been following the Horizon Report every year, except maybe the 4D printing mentioned in the 3D printing section. If you are looking for the emerging technologies that are on a farther horizon of adoption, check out this article from the World Economic Forum’s Global Agenda Council on Emerging Technologies, which lists technologies such as screenless display and brain-computer interfaces.)

According to the report, "Quantified Self describes the phenomenon of consumers being able to closely track data that is relevant to their daily activities through the use of technology." (ACRL TechConnect has covered personal data monitoring and action analytics previously.) Quantified Self is enabled by wearable technology devices, such as Fitbit or Google Glass, and the Mobile Web. Wearable technology devices automatically collect personal data. Fitbit, for example, keeps track of one's own sleep patterns, steps taken, and calories burned. And the Mobile Web is the platform that can store and present such personal data directly transferred from those devices. Through these devices and the resulting personal data, we get to observe our own behavior in a much more extensive and detailed manner than ever before. Instead of deciding which part of our life to keep a record of, we can now let these devices collect almost all types of data about ourselves and then see which data would be of any use to us and whether any pattern emerges that we can perhaps utilize for the purpose of self-improvement.

Quantified Self is a notable trend not because it involves an unprecedented technology but because it gives us a glimpse of what our daily lives will be like in the near future, in which many of the emerging technologies that we are just getting used to right now – the mobile web, big data, wearable technology – will come together in full bloom. 'Learning Analytics,' which the Horizon Report calls "the educational application of 'big data'" (p.38) and which can be thought of as the application of Quantified Self in education, has already been making significant progress in higher education. By collecting and analyzing the data about student behavior in online courses, learning analytics aims at improving student engagement, providing more personalized learning experiences, detecting learning issues, and determining the behavior variables that are significant indicators of student performance.

While privacy is a natural concern for Quantified Self, it is to be noted that we ourselves often willingly participate in personal data monitoring through the gamified self-tracking apps that can be offensive in other contexts. In her article, “Gamifying the Quantified Self,” Jennifer Whitson writes:

Gamified self-tracking and participatory surveillance applications are seen and embraced as play because they are entered into freely, injecting the spirit of play into otherwise monotonous activities. These gamified self-improvement apps evoke a specific agency—that of an active subject choosing to expose and disclose their otherwise secret selves, selves that can only be made penetrable via the datastreams and algorithms which pin down and make this otherwise unreachable interiority amenable to being operated on and consciously manipulated by the user and shared with others. The fact that these tools are consumer monitoring devices run by corporations that create neoliberal, responsibilized subjectivities become less salient to the user because of this freedom to quit the game at any time. These gamified applications are playthings that can be abandoned at whim, especially if they fail to pleasure, entertain and amuse. In contrast, the case of gamified workplaces exemplifies an entirely different problematic. (p.173; emphasis my own and not by the author)

If libraries and higher education institutions become active in monitoring and collecting students’ learning behavior, the success of that kind of endeavor will depend on how well it creates and provides a sense of play that encourages students’ willing participation. It will also be important for such learning analytics projects to offer an opt-out at any time and to keep private data as confidential and anonymous as possible.

D. Back to Libraries

The changed format of this year’s Horizon Report, with its ‘Key Trends’ and ‘Significant Challenges,’ shows much more clearly the forces in play behind the emerging technologies to look out for in higher education. One big take-away from this report, I believe, is that in spite of doubts about the unique value of higher education, demand will keep increasing as students seek a competitive advantage in entering or re-entering the workforce. Another is that higher ed institutions will endeavor to create appropriate means and tools – such as competency-based assessments and badge systems – that allow students to acquire and demonstrate skills and experience, beyond credit-hour-based degrees, in a way that is appealing to future employers.

Considering that the pace of change in higher education tends to be slow, this can be an opportunity for academic libraries. Both instructors and students are under constant pressure to innovate and experiment in their teaching and learning processes. Instructors adopting the Flipped Classroom model may require a studio where they can record and produce their lecture videos. Students may need to compile portfolios to demonstrate their knowledge and skills for job interviews. Returning adult students may need to acquire habitual lifelong learning practices with help from librarians. Local employers and students may mutually benefit from a place where certain co-projects can be tried. As a neutral player on campus with tech-savvy librarians and knowledgeable staff, the library can create a place where the most palpable student needs that are yet to be satisfied by individual academic departments or student services are directly addressed. Maker labs, gamified learning or self-tracking modules, and a competency dashboard are all such examples. From the emerging technology trends in higher ed, we can see that learning activities in higher education and academic libraries will be more and more closely tied to the economic imperative of constant innovation.

Academic libraries may even go further and take up the role of leading the changes in higher education. In his blog post for Inside Higher Ed, Joshua Kim suggests exactly this and also nicely sums up the challenges that today’s higher education faces:

  • How do we increase postsecondary productivity while guarding against commodification?
  • How do we increase quality while increasing access?
  • How do we leverage technologies without sacrificing the human element essential for authentic learning?

How will academic libraries be able to lead the changes necessary for higher education to successfully meet these challenges? It is a question that will stay with academic libraries for many years to come.


My First Hackathon & WikipeDPLA

Almost two months ago, I attended my first hackathon during ALA’s Midwinter Meeting. Libhack was coordinated by the Library Code Year Interest Group. Much credit is due to coordinators Zach Coble, Emily Flynn, Jesse Saunders, and Chris Strauber. The University of Pennsylvania graciously hosted the event in their Van Pelt Library.

What’s a hackathon? It’s a short event, usually a day or two, wherein coders and other folks get together to produce software. Hackathons typically work on a particular problem, application, or API (a source of structured data). LibHack focused on APIs from two major library organizations: OCLC and the Digital Public Library of America (DPLA).

Impressions & Mixed Content

Since this was my first hackathon and the gritty details below may be less than relevant to all our readers, I will front-load my general impressions of Libhack rather than talk about the code I wrote. First of all, splitting the hackathon into two halves focused on different APIs and catering to different skill levels worked well. There were roughly equal numbers of participants in both the structured, beginner-oriented OCLC group and the more independent DPLA group.

Having representatives from both of the participating institutions was wonderful. While I didn’t take advantage of the attending DPLA staff as much as I should have, it was great to have a few people to answer questions. What’s more, I think DPLA benefited from hearing about developers’ experiences with their API. For instance, there are a few metadata fields in their API which might contain an array or a string depending upon the record. If an application assumes one or the other, chances are it breaks at some point and the programmer has to locate the error and write code that handles either data format.
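To give a concrete sense of the problem, here is a minimal sketch of defensively normalizing such a field before looping over it. The field name and record shapes below are made up for illustration, not a claim about the DPLA API’s actual schema.

// A minimal sketch of handling a metadata field that may hold either a
// string or an array, depending on the record. The "subject" field and the
// sample records are hypothetical, not the DPLA API's actual response shape.
function toArray(value) {
  if (Array.isArray(value)) { return value; }               // already an array
  if (value === undefined || value === null) { return []; } // missing field
  return [value];                                           // bare string (or other scalar)
}

// Two hypothetical records: one uses a string, the other an array.
var recordA = { subject: 'Omphaloskepsis' };
var recordB = { subject: ['Literalism', 'Navel-gazing'] };

console.log(toArray(recordA.subject)); // ['Omphaloskepsis']
console.log(toArray(recordB.subject)); // ['Literalism', 'Navel-gazing']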

Secondly, the DPLA API is currently available only over unencrypted HTTP. Thus due to the mixed content policies of web browsers it is difficult to call the HTTP API on HTTPS pages. For the many HTTP sites on the web this isn’t a concern, but I wanted to call the DPLA API from Wikipedia which only serves content over HTTPS. To work around this limitation, users have to manually override mixed content blocking in their browser, a major limitation for my project. DPLA already had plans to roll out an HTTPS API, but I think hearing from developers may influence its priority.

Learn You Some Lessons

Personally, I walked away from Libhack with a few lessons. First of all, I had to throw away my initial code before creating something useful. While I had a general idea in mind—somehow connect DPLA content related to a given Wikipedia page—I wasn’t sure what type of project I should create. I started writing a command-line tool in Python, envisioning a command that could be passed a Wikipedia URL or article title and return a list of related items in the DPLA. But after struggling with a pretty unsatisfying project for a couple hours, including a detour into investigating the MediaWiki API, I threw everything aside and took a totally different approach by building a client-side script meant to run in a web browser. In the end, I’m a lot happier with the outcome after my initial failure. I love the command line, but the appeal of such a tool would be niche at best. What I wrote has a far broader appeal.1

Secondly, I worked closely with Wikipedian Jake Orlowitz.2 While he isn’t a coder, his intimate knowledge of Wikipedia was invaluable for our end product. Whenever I had a question about Wikipedia’s inner workings or needed someone to bounce ideas off of, he was there. While I blindly started writing some JavaScript without a firm idea of how we could embed it onto Wikipedia pages, it was Jake who pointed me towards User Scripts and created an excellent installation tour.3 In other groups, I heard people discussing metadata, subject terms, and copyright. I think that having people of varied expertise in a group is advantageous when compared with a group solely composed of coders. Many hackathons explicitly state that non-programmers are welcome and with good reason; experts can outline goals, consider end-user interactions, and interpret API results. These are all invaluable contributions which are also hard to do with one’s face buried in a code editor.

While I did enjoy my hackathon experience, I was expecting a bit more structure and larger project groups. I arrived late, which doubtless didn’t help, but the DPLA groups were very fragmented. Some projects were only individuals, while others (like ours) were pairs. I had envisioned groups of at least four, where perhaps one person would compose plans and documentation, another would design a user interface, and the remainder would write back-end code. I can’t say that I was at all disappointed, but I could have benefited from the perspectives of a larger group.

What is WikipeDPLA?

So what did we build at Libhack anyway? As previously stated, we made a Wikipedia user script. I’ve dubbed it WikipeDPLA, though you can find it as FindDPLA on Wikipedia. Once installed, the script will query DPLA’s API on each article you visit, inserting related items towards the top.

WikipeDPLA in action

How does it work?

Here’s a step-by-step walkthrough of how WikipeDPLA works:

When you visit a Wikipedia article, it collects a few pieces of information about the article by copying text from the page’s HTML: the article’s title, any “X redirects here” notices, and the article’s categories.
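For a rough sense of what that collection step might look like, here is a sketch of scraping those three pieces of information from the page’s DOM. The selectors and text pattern are assumptions about Wikipedia’s markup rather than a quotation of WikipeDPLA’s actual source.

// A sketch of gathering the article title, "redirects here" notices, and
// categories from the page's HTML. The selectors and the hatnote text
// pattern are assumptions, not the script's actual code.
function collectArticleInfo() {
  var title = document.getElementById('firstHeading').textContent.trim();

  // Hatnotes like '"Foo" redirects here' sit near the top of the article body.
  var redirects = [];
  var hatnotes = document.querySelectorAll('.hatnote');
  for (var i = 0; i < hatnotes.length; i++) {
    var match = hatnotes[i].textContent.match(/"([^"]+)" redirects here/);
    if (match) { redirects.push(match[1]); }
  }

  // Category links appear in the box at the foot of each article.
  var categories = [];
  var links = document.querySelectorAll('#mw-normal-catlinks ul a');
  for (var j = 0; j < links.length; j++) {
    categories.push(links[j].textContent);
  }

  return { title: title, redirects: redirects, categories: categories };
}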

Next, WikipeDPLA constructs a DPLA query using the article’s title. Specifically, it constructs a JSONP query. JSONP is a means of working around the web’s same-origin policy, which normally prevents a script from reading data loaded from another domain. It works by including a script tag with a specially constructed URL containing a reference to one of your JavaScript functions:

<script src="//example.com/jsonp-api?q=search+term&callback=parseResponse"></script>

In responding to this request, the API plays a little trick; it doesn’t just return raw data, since that would be invalid JavaScript and thus cause a parsing error in the browser. Instead, it wraps the data in the function we’ve provided it. In the example above, that’s parseResponse:

parseResponse({
    "results": [
        {"title": "Searcher Searcherson",
        "id": 123123,
        "genre": "Omphaloskepsis"},
        {"title": "Terminated Term",
        "id": 321321,
        "genre": "Literalism"}
    ]
});

This is valid JavaScript; parseResponse receives an object which contains an array of search result records, each with some minimal metadata. This pattern has the handy feature that, as soon as our query results are available, they’re immediately passed to our callback function.
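If you have not used JSONP before, the mechanics may be easier to see in code. Here is a generic sketch of issuing a JSONP request by injecting a script tag; the endpoint and parameter names are placeholders, not DPLA’s real API.

// A generic JSONP helper: build a URL, append a script tag, and let the
// browser execute the wrapped response. The endpoint and parameters are placeholders.
function jsonp(url, params, callbackName) {
  var query = [];
  for (var key in params) {
    if (params.hasOwnProperty(key)) {
      query.push(encodeURIComponent(key) + '=' + encodeURIComponent(params[key]));
    }
  }
  query.push('callback=' + callbackName);

  var script = document.createElement('script');
  script.src = url + '?' + query.join('&');
  document.head.appendChild(script); // the browser fetches and runs the response
}

// The globally visible callback that the wrapped response will invoke.
function parseResponse(data) {
  console.log(data.results.length + ' results received');
}

jsonp('//example.com/jsonp-api', { q: 'search term' }, 'parseResponse');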

WikipeDPLA’s equivalent of parseResponse looks to see if there are any results. If the article’s title doesn’t return any results, then it’ll try again with any alternate titles culled from the article’s redirection notices. If those queries are also fruitless, it starts to go through the article’s categories.

Once we’ve guaranteed that we have some results from DPLA, we parse the API’s metadata into a simpler subset. This subset consists of the item’s title, a link to its content, and an “isImage” Boolean value noting whether or not the item is an image. With this simpler set of data in hand, we loop through our results to build a string of HTML which is then inserted onto the page. Voilà! DPLA search results in Wikipedia.
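As a rough illustration of that last step – and not the actual WikipeDPLA source – the simplify-and-insert pattern might look something like this, with made-up field names and an illustrative selector:

// A simplified sketch of the "parse, then insert" step. The result fields
// and the #content selector are stand-ins, not DPLA's or Wikipedia's actual names.
function simplify(results) {
  return results.map(function (item) {
    return {
      title: item.title,
      link: item.link,
      isImage: item.format === 'image' // hypothetical way of flagging images
    };
  });
}

function buildHtml(items) {
  var html = '<ul class="dpla-results">';
  items.forEach(function (item) {
    html += '<li><a href="' + item.link + '">' + item.title + '</a>' +
            (item.isImage ? ' (image)' : '') + '</li>';
  });
  return html + '</ul>';
}

// Insert the list near the top of the article body.
var items = simplify([{ title: 'Searcher Searcherson', link: '#', format: 'image' }]);
document.getElementById('content').insertAdjacentHTML('afterbegin', buildHtml(items));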

Honing

After putting the project together, I continued to refine it. I used the “isImage” Boolean to put a small image icon next to an item’s link. Then, after the hackathon, I noticed that my script was a nuisance if a user started reading a page anywhere other than at its start. For instance, if you start reading the Barack Obama article at the Presidency section, you will read for a moment and then suddenly be jarred as the DPLA results are inserted up top and push the rest of the article’s text down the page. In order to mitigate this behavior, we need to know if the top of the article is in view before inserting our results HTML. I used a jQuery visibility plug-in and an event listener on window scroll events to fix this.
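The exact plug-in code is beside the point, but the general shape of the fix looks something like the following sketch, written here with plain DOM APIs rather than the jQuery plug-in I used – an approximation, not the original code.

// Only insert the results when the top of the article is (or comes back) into
// view. A plain-DOM approximation of the jQuery-based fix.
function insertWhenTopVisible(container, html) {
  function topIsVisible() {
    return container.getBoundingClientRect().top >= 0;
  }
  function insert() {
    container.insertAdjacentHTML('afterbegin', html);
  }
  if (topIsVisible()) {
    insert();
  } else {
    window.addEventListener('scroll', function onScroll() {
      if (topIsVisible()) {
        window.removeEventListener('scroll', onScroll);
        insert(); // runs once, after the reader scrolls back to the top
      }
    });
  }
}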

Secondly, I was building a project with several targets: a user script for Wikipedia, a Grease/Tampermonkey user script4, and an (as yet inchoate) browser extension. To reuse the same basic JavaScript in these different contexts, I chose to use the make command. Make is a common program used for projects which have multiple platform targets. It has an elegantly simple design: when you run make foo inside of a directory, make looks in a file named “makefile” for a line labelled “foo:” and then executes the shell command on the subsequent line (which, in a real makefile, must be indented with a tab). So if I have the following makefile:

hello:
    echo 'hello world!'

bye:
    echo 'goodbye!'

clean:
    rm *.log *.cache

Inside the same directory as this makefile, the commands make hello, make bye, and make clean respectively would print “hello world!” to my terminal, print “goodbye!”, and delete all files ending in extension “log” or “cache”. This contrived example doesn’t help much, but in my project I can run something like make userscript and the Grease/Tampermonkey script is automatically produced by prepending some header text to the main WikipeDPLA script. Similarly, make push produces all the various platform targets and then pushes the results up to the GitHub repo, saving a significant amount of typing on the command line.

These bits of trivia about interface design and tooling allude to a more important idea: it’s vital to choose projects that help you learn, particularly in a low-stakes environment like a hackathon. No one expects greatness from a product duct taped together in a few hours, so seize the opportunity to practice rather than aim for perfection. I didn’t have to write a makefile, but I chose to spend time familiarizing myself with a useful tool.

What’s Next?

While I am quite happy with my work at Libhack, I do have plans for improvement. My main goal is to turn WikipeDPLA into a browser extension, for Chrome and perhaps Firefox. An extension offers a couple of advantages: it can avoid the mixed-content issue with DPLA’s HTTP-only API5 and it is available even for users who aren’t logged in to Wikipedia. It would also be nice to expand my approach to encompass other major digital library APIs, such as Europeana or Australia’s Trove.

And, of course, I want to attend more hackathons. Libhack was a very positive event for me, both in terms of learning and producing something useful, so I’m encouraged and hope other library conferences offer collaborative coding opportunities.

Other Projects

Readers should head over to LITA Blog where organizer Zach Coble has a report on libhack which details several other projects created at the Midwinter hackathon. Or you could just follow @HistoricalCats on Twitter.

Notes

  1. An aside related to learning to program: being a relatively new coder, I often think about advice I can give others looking to start coding. One common question is “what language should I learn first?” There’s one stock response: that it’s important not to worry too much about this choice, because learning the fundamentals of one language will enable you to learn others quickly. But that dodges the question, because what people want to hear is a proper noun like “Ruby” or “Python” or “JavaScript.” And JavaScript, despite not being nearly as user friendly as those other two options, is a great starting place because it lets you work on the web with little effort. All of this is to say: if I didn’t know JavaScript fairly well, I would not have been able to make something so useful.
  2. Shameless plug: Jake works on the Wikipedia Library, an interesting project that aims to connect Wikipedian researchers with source material, from subscription databases and open access repositories alike.
  3. User Scripts are pieces of JavaScript that a user can choose to insert whenever they are signed into and browsing Wikipedia. They’re similar to Greasemonkey user scripts, except the scripts only apply to Wikipedia. These scripts can do anything from customize the site’s appearance to insert new content, which is exactly what we did.
  4. Greasemonkey is the Firefox add-on for installing scripts that run on specified sites or pages; Tampermonkey is an analogous extension for Chrome.
  5. How’s that for acronyms?

Bash Scripting: automating repetitive command line tasks

Introduction

One of my current tasks is to develop workflows for digital preservation procedures. These workflows begin with the acquisition of files – either disk images or logical file transfers – both of which end up on a designated server. Once acquired, the images (or files) are checked for viruses. If clean, they are bagged using BagIt and then copied over to a different server for processing.1 This work is all done at the command line, and as you can imagine, it gets quite repetitive. It’s also a bit error-prone, since our file naming conventions include a 10-digit media ID number, which is easily mistyped. So once all the individual processes were worked out, I decided to automate things a bit by placing the commands into a single script. I should mention here that I’m no Linux whiz – I use it as needed, which is sometimes daily, sometimes not. This is the first time I’ve ever tried to tie commands together in a Bash script, but I figured previous programming experience would help.

Creating a Script

To get started, I placed all the virus check commands for disk images into a script. These commands are different than logical file virus checks since the disk has to be mounted to get a read. This is a pretty simple step – first add:

#!/bin/bash

as the first line of the file (this line should not be indented or have any other whitespace in front of it). This tells the kernel which interpreter to invoke – in this case, Bash. You could substitute the path to another interpreter, such as Python, for other types of scripts: #!/usr/bin/env python.

Next, I changed the file permissions to make the script executable:

chmod +x myscript

I separated the virus check commands so that I could test those out and make sure they were working as expected before delving into other script functions.

Here’s what my initial script looked like (comments are preceded by a #):

#!/bin/bash

#mount disk
sudo mount -o ro,loop,nodev,noexec,nosuid,noatime /mnt/transferfiles/2013_072/2013_072_DM0000000001/2013_072_DM0000000001.iso /mnt/iso

#check to verify mount
mountpoint -q /mnt/iso && echo "mounted" || echo "not mounted"

#call the Clam AV program to run the virus check
clamscan -r /mnt/iso > "/mnt/transferfiles/2013_072/2013_072_DM0000000001/2013_072_DM0000000001_scan_test.txt"

#unmount disk
sudo umount /mnt/iso

#check disk unmounted
mountpoint -q /mnt/iso && echo "mounted" || echo "not mounted"

All those options on the mount command? They give me the peace of mind that accessing the disk will in no way alter it (or affect the host server), thus preserving its authenticity as an archival object. You may also be wondering about the use of “&&” and “||”. These function as conditional AND and OR operators, respectively. So “&&” tells the shell to run the first command, AND if that’s successful, it will run the second command. Conversely, “||” tells the shell to run the first command OR, if that fails, run the second command. So the mount check command can be read as: check to see if the directory at /mnt/iso is a mountpoint. If the mount is successful, then echo “mounted.” If it’s not, echo “not mounted.” More on redirection.

Adding Variables

You may have noted that the script above only works on one disk image (2013_072_DM0000000001.iso), which isn’t very useful. I created variables for the accession number, the digital media number, and the file extension, since they all change depending on the disk image information. The file naming convention we use for disk images is consistent and strict. The top level directory is the accession number. Within that, each disk image acquired from that accession is stored within its own directory, named using its assigned number. The disk image is then named by a combination of the accession number and the disk image number. Yes, it’s repetitive, but it keeps track of where things came from and links to data we have stored elsewhere. Given that these disks may not be processed for 20 years, such redundancy is preferred.

At first I thought the accession number, digital media number, and extension variables would be best entered as arguments on the initial run command: type one line to run many commands. Each variable is separated by a space; the .iso at the end is the extension for an optical disk image file:

$ ./virus_check.sh 2013_072 DM0000000001 .iso

In Bash, arguments passed to a script are named $1 for the first, $2 for the second, and so on. This actually tripped me up for a day or so. I initially thought the $1, $2, etc. variable names used by the book I was referencing were for examples only, and that the first variables I referenced in the script would automatically map in order, so that if 2013_072 was the first argument and $accession was the first variable, then $accession = 2013_072 (much like when you pass a parameter into a Python function). Then I realized there was a reason that more than one reference book and/or site used the $1, $2, $3 system for variables passed in as command line arguments. I assigned each argument to its proper variable, and things were rolling again.

#!/bin/bash

#assign command line variables
accession=$1
digital_media=$2
extension=$3

#mount disk
sudo mount -o ro,loop,noatime /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} /mnt/iso

Note: variable names are often presented without curly braces; it’s recommended to place them in curly braces when they are adjacent to other strings.2

Reading Data

After testing the script a bit, I realized I  didn’t like passing the variables in via the command line. I kept making typos, and it was annoying not to have any data validation done in a more interactive fashion. I reconfigured the script to prompt the user for input:

read -p "Please enter the accession number" accession

read -p "Please enter the digital media number" digital_media

read -p "Please enter the file extension, include the preceding period" extension

After reviewing some test options, I decided to test that $accession and $digital_media were valid directories, and that the combo of all three variables was a valid file. This test seems more conclusive than simply testing whether or not the variables fit the naming criteria, but it does mean that if the data entered is invalid, the feedback given to the user is limited. I’m considering adding tests for naming criteria as well, so that the user knows when the error is due to a typo vs. a non-existent directory or file. I also didn’t want the code to simply quit when one of the variables is invalid – that’s not very user-friendly. I decided to ask the user for input until valid input was received.

read -p "Please enter the accession number" accession
until [ -d /mnt/transferfiles/${accession} ]; do
     read -p "Invalid. Please enter the accession number." accession
done

read -p "Please enter the digital media number" digital_media
until [ -d /mnt/transferfiles/${accession}/${accession}_${digital_media} ]; do
     read -p "Invalid. Please enter the digital media number." digital_media
done

read -p  "Please enter the file extension, include the preceding period" extension
until [ -e /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} ]; do
     read -p "Invalid. Please enter the file extension, including the preceding period" extension
done

Creating Functions

You may have noted that the command used to test if a disk is mounted or not is called twice. This is done on purpose, as it was a test I found helpful when running the virus checks; the virus check runs and outputs a report even if the disk isn’t mounted. Occasionally disks won’t mount for various reasons. In such cases, the resulting report will state that it scanned no data, which is confusing because the disk itself possibly could have contained no data. Testing if it’s mounted eliminates that confusion. The command is repeated after the disk has been unmounted, mainly because I found it easy to forget the unmount step, and testing helps reinforce good behavior. Given that the command is repeated twice, it makes sense to make it a function rather than duplicate it.

check_mount () {
     #checks to see if disk is mounted or not
     mountpoint -q /mnt/iso && echo "mounted" || echo "not mounted"
}

Lastly, I created a function for the input variables. I’m sure there are prettier, more concise ways of writing this function, but since it’s still being refined and I’m still learning Bash scripting, I decided to leave it for now. I did want it placed in its own function because I’m planning to add additional code that will notify me if the virus check is positive and exit the program, or, if it’s negative, bag the disk image and corresponding files and copy them over to another server where they’ll wait for further processing.

get_image () {
     #gets data from user, validates it    
     read -p "Please enter the accession number" accession
     until [ -d /mnt/transferfiles/${accession} ]; do
          read -p "Invalid. Please enter the accession number." accession
     done

     read -p "Please enter the digital media number" digital_media
     until [ -d /mnt/transferfiles/${accession}/${accession}_${digital_media} ]; do
          read -p "Invalid. Please enter the digital media number." digital_media
     done

     read -p  "Please enter the file extension, include the preceding period" extension
     until [ -e /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} ]; do
          read -p "Invalid. Please enter the file extension, including the preceding period" extension
     done
}

Resulting (but not final!) Script

#!/bin/bash
#takes accession number, digital media number, and extension as variables to mount a disk image and run a virus check

check_mount () {
     #checks to see if disk is mounted or not
     mountpoint -q /mnt/iso && echo "mounted" || echo "not mounted"
}

get_image () {
     #gets disk image data from user, validates info
     read -p "Please enter the accession number" accession
     until [ -d /mnt/transferfiles/${accession} ]; do
          read -p "Invalid. Please enter the accesion number." accession
     done

     read -p "Please enter the digital media number" digital_media
     until [ -d /mnt/transferfiles/${accession}/${accession}_${digital_media} ]; do
          read -p "Invalid. Please enter the digital media number." digital_media
     done

     read -p  "Please enter the file extension, include the preceding period" extension
     until [ -e /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} ]; do
          read -p "Invalid. Please enter the file extension, including the preceding period" extension
     done
}

get_image

#mount disk
sudo mount -o ro,loop,noatime /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} /mnt/iso

check_mount

#run virus check
sudo clamscan -r /mnt/iso > "/mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}_scan_test.txt"

#unmount disk
sudo umount /mnt/iso

check_mount

Conclusion

There’s a lot more I’d like to do with this script. In addition to what I’ve already mentioned, I’d love to enable it to run over a range of digital media numbers, since they often are sequential. It also doesn’t stop if the disk isn’t mounted, which is an issue. But I thought it served as a good example of how easy it is to take repetitive command line tasks and turn them into a script. Next time, I’ll write about the second phase of development, which will include combining this script with another one, virus scan reporting, bagging, and transfer to another server.

Suggested References

An A-Z Index of the Bash command line for Linux

The books I used, both are good for basic command line work, but they only include a small section for actual scripting:

Barrett, Daniel J. Linux Pocket Guide. O’Reilly Media, 2004.

Shotts, Jr., William E. The Linux Command Line: A Complete Introduction. No Starch Press, 2012.

The book I wished I used:

Robbins, Arnold and Beebe, Nelson H. F. Classic Shell Scripting. O’Reilly Media, 2005.

Notes

  1. Logical file transfers often arrive in bags, which are then validated and virus checked.
  2. Linux Pocket Guide

What is Node.js & why do I care?

At its simplest, Node.js is server-side JavaScript. JavaScript is a popular programming language, but it almost always runs inside a web browser. So JavaScript can, for instance, manipulate the contents of this page by being included inside <script> tags, but it doesn’t get to play around with the files on our computer or tell our server what HTML to send like the PHP that runs this WordPress blog.

Node is interesting for more than just being on the server side. It provides a new way of writing web servers while using an old UNIX philosophy. Hopefully, by the end of this post, you’ll see its potential and how it differentiates itself from other programming environments and web frameworks.

Hello, World

To start, let’s do some basic Node programming. Head over to nodejs.org and click Install.1 Once you’ve run the installer, a node executable will be available for you on the command line. Any script you pass to node will be interpreted and the results displayed. Let’s do the classic “hello world” example. Create a new file in a text editor, name it hello.js, and put the following on its only line:

console.log('Hello, world!');

If you’ve written JavaScript before, you may recognize this already. console.log is a common debugging method which prints strings to your browser’s JavaScript console. In Node, console.log will output to your terminal. To see that, open up a terminal (on Mac, you can use Terminal.app while on Windows both cmd.exe and PowerShell will work) and navigate to the folder where you put hello.js. Your terminal will likely open in your user’s home folder; you can change directories by typing cd followed by a space and the subdirectory you want to go inside. For instance, if I started at “C:\users\me” I could run cd Documents to enter “C:\users\me\Documents”. Below, we open a terminal, cd into the Documents folder, and run our script to see its results.

$ cd Documents
$ node hello.js
Hello, world!

That’s great and all, but it leaves a lot to be desired. Let’s do something a little more sophisticated; let’s write a web server which responds “Hello!” to any request sent to it. Open a new file up, name it server.js, and write this inside:

var http = require('http');
http.createServer(handleRequest).listen(8888);
function handleRequest (request, response) {
  response.end( 'Hello!' );
}

In our terminal, we can run node server.js and…nothing happens. Our prompt seems to hang, not outputting anything but also not letting us type another command. What gives? Well, Node is running a web server and it’s waiting for responses. Open up your web browser and navigate to “localhost:8888”; the exclamation “Hello!” should appear. In four lines of code, we just wrote an HTTP server. Sure, it’s the world’s dumbest server that only says “Hello!” over and over no matter what we request from it, but it’s still an achievement. If you’re the sort of person who gets giddy at how easy this was, then Node.js is for you.

Let’s walk through server.js line-by-line. First, we import the core HTTP library that comes with Node. The “require” function is a way of loading external modules into your script, similar to how the function of the same name does in Ruby or import in Python. The HTTP library gives us a handy “createServer” method which receives HTTP requests and passes them along to a callback function. On our 2nd line, we call createServer, pass it the function we want to handle incoming requests, and set it to listen for requests sent to port 8888. The choice of 8888 is arbitrary; we could choose any number over 1024, while operating systems often restrict the lower ports which are already in use by specific protocols. Finally, we define our handleRequest callback which will receive a request and response object for each HTTP request. Those objects have many useful properties and methods, but we simply called the response object’s end method which sends a response and optionally accepts some data to put into that response.

The use of callback functions is very common in Node. If you’ve written JavaScript for a web browser, you may recognize this style of programming; it’s the same as when you define an event listener which responds to mouse clicks, or assign a function to process the result of an AJAX request. The callback function doesn’t execute synchronously in the order you wrote it in your code; it waits for some “event” to occur, whether that event is a click or an AJAX request returning data.

In our HTTP server example, we also see a bit of what makes Node different from other server-side languages like PHP, Perl, Python, and Ruby. Those languages typically work with a web server, such as Apache, which passes certain requests over to the language and serves up whatever it returns. Node is a server: it gives you low-level access to the inner workings of protocols like HTTP and TCP. You don’t need to run Apache and have requests sent to Node; it handles them on its own.

Who cares?

Some of you are no doubt wondering: what exactly is the big deal? Why am I reading about this? Surely, the world has enough programming languages, and JavaScript is nothing new, even server-side JavaScript isn’t that new.2 There are already plenty of web servers out there. What need does Node.js fill?

To answer that, we must revisit the origins of Node. The best way to understand is to watch Ryan Dahl present on the impetus for creating Node. He says, essentially, that other programming frameworks are doing IO (input/output) wrong. IO comes in many forms: when you’re reading or writing to a file, when you’re querying databases, and when you’re receiving and sending HTTP requests. In all of these situations, your code asks for data…waits…and waits…and then, once it has the data, it manipulates it or performs some calculation, and then sends it somewhere else…and waits…and waits. Basically, because the code is constantly waiting for some IO operation, it spends most of its time sitting around rather than crunching digits like it wants to. IO operations are commonly the bottlenecks in programs, so we shouldn’t let our code just stop every time they perform one.
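To make the contrast concrete, here is a small sketch using Node’s core fs module; the file name is a placeholder.

// Contrasting blocking and non-blocking IO with Node's core "fs" module.
var fs = require('fs');

// Synchronous: the process sits idle until the whole file has been read.
var data = fs.readFileSync('notes.txt', 'utf8');
console.log('sync read finished:', data.length, 'characters');

// Asynchronous: hand over a callback and keep going; Node invokes the
// callback once the data is ready.
fs.readFile('notes.txt', 'utf8', function (err, contents) {
  if (err) { return console.error(err); }
  console.log('async read finished:', contents.length, 'characters');
});

console.log('this line prints before the async read completes');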

Node not only has a beneficial asynchronous programming model but has developed other advantages as well. Because lots of people already know how to write JavaScript, Node has caught on much more quickly than languages which are entirely new to developers. It reuses Google Chrome’s V8 as a JavaScript interpreter, giving it a big speed boost. Node’s package manager, NPM, is growing at a tremendous rate, far faster than its sibling package managers for Java, Ruby, and Python. NPM itself is a strong point of Node; it has learned from other package managers and has many excellent features. Finally, other programming languages were developed to be all-purpose tools. Node, while it does share the same all-purpose utility, is really intended for the web: it’s meant to write web servers and handle HTTP intelligently.

Node also follows many UNIX principles. Doug McIlroy succinctly summarized the UNIX philosophy as “Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.” NPM does a great job letting authors write small modules which work well together. This has been tough previously in JavaScript because web browsers have no “require” function; there’s no native way for modules to define and load their dependencies, which resulted in the popularity of large, complicated libraries.3 jQuery is a good example; it’s tremendously popular and it includes hundreds of functions in its API, while most sites that use it really only need a few. Large, complicated programs are more difficult to test, debug, and reason about, which is why UNIX avoided them.

Many Node modules also support streams that allow you to pipe data through a series of programs. This is analogous to how BASH and other shells let you pipe text from one command to another, with each command taking the output of the last as its input. To visualize this, see Stream Playground written by John Resig, creator of jQuery. Streams allow you to plug-in different functionality in when needed. This pseudocode shows how one might read a CSV from a server’s file system (the core “fs” library stands for “file system”), filter out certain rows, and send it over HTTP:

fs.createReadStream('spreadsheet.csv').pipe(filter).pipe(http);
// Want to compress the response? Just add another pipe.
fs.createReadStream('spreadsheet.csv').pipe(filter).pipe(compressor).pipe(http);

Streams have the advantage of limiting how much memory a program uses because only small portions of data are being operated on at once. Think of the difference between copying a million-line spreadsheet all at once or line-by-line; the second is less likely to crash or run into the limit of how much data the system clipboard can hold.
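As a slightly more concrete (though still simplified) version of the pseudocode above, the following sketch streams a file through gzip compression and out over HTTP using only core modules; the file name is a placeholder.

// Stream a file through gzip and into the HTTP response, chunk by chunk,
// without ever buffering the whole file in memory.
var http = require('http');
var fs = require('fs');
var zlib = require('zlib');

http.createServer(function (request, response) {
  response.writeHead(200, {
    'Content-Type': 'text/csv',
    'Content-Encoding': 'gzip'
  });
  fs.createReadStream('spreadsheet.csv') // placeholder file name
    .pipe(zlib.createGzip())             // compress each chunk as it passes through
    .pipe(response);                     // send it to the client as it's ready
}).listen(8888);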

Libraryland Examples

Node is still very new and there aren’t a lot of prominent examples of library usage. I’ll try to present a few, but I think Node is more worth knowing about as a major trend in web development.

Most amusingly, Ed Summers of the Library of Congress and Sean Hannan of Johns Hopkins University made a Cataloging Highscores page that presents original cataloging performed in WorldCat in a retro arcade-style display. This app uses the popular socket.io module that establishes a real-time connection between your browser and the server, a strength of Node. Any web service that needs to be continually updated is a prime candidate for Node.js: current news articles, social media streams, auto-complete suggestions as a user types in search terms, and chat reference all come to mind. In fact, SpringShare’s LibChat uses socket.io as well, though I can’t tell if it’s using Node on the server or PHP. A similar example of real-time updating, also by Ed Summers, is Wikistream which streams the dizzying number of edits happening on various Wikipedias through your browser.4

There was a lightning talk on Node at Code4Lib 2010 which mentions writing a connector to the popular Apache Solr search platform. Aaron Coburn’s proposed talk for Code4Lib 2014 mentions that Amherst is using Node to build the web front-end to their Fedora-based digital library.

Tools You Can Use

With the explosive growth of NPM, there are already tons of useful tools written in Node. While many of these are tools for writing web servers, like Express, some are command line programs you can use to accomplish a variety of tasks.

Yeoman is a scaffolding application that makes it easy to produce various web apps by giving you expert templates. You can install separate generators that produce templates for things like a Twitter Bootstrap site, a JavaScript bookmarklet, a mobile site, or a project using the Angular JavaScript MVC framework. Running yo angular to invoke the Angular generator gives you a lot more than just a base HTML file and some JavaScript libraries; it also provides a series of Grunt tasks for testing, running a development server, and building a site optimized for production. Grunt is another incredibly useful Node project, dubbed “the JavaScript task runner.” It lets you pick from hundreds of community plugins to automate tedious tasks like minifying and concatenating your scripts before deploying a website.
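To give a flavor of what a Grunt configuration looks like, here is a minimal, hypothetical Gruntfile that concatenates two scripts and minifies the result; the file paths are made up, and the plugins (grunt-contrib-concat and grunt-contrib-uglify) would be installed separately via NPM.

// Gruntfile.js – a minimal, hypothetical configuration.
module.exports = function (grunt) {
  grunt.initConfig({
    concat: {
      dist: {
        src: ['src/widgets.js', 'src/app.js'], // made-up paths
        dest: 'build/app.js'
      }
    },
    uglify: {
      dist: {
        src: 'build/app.js',
        dest: 'build/app.min.js'
      }
    }
  });

  grunt.loadNpmTasks('grunt-contrib-concat');
  grunt.loadNpmTasks('grunt-contrib-uglify');

  // Running plain `grunt` performs both steps in order.
  grunt.registerTask('default', ['concat', 'uglify']);
};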

Finally, another tool that I like is phantomas which is a Node project that works with PhantomJS to run a suite of performance tests on a site. It provides more detailed reports than any other performance tool I’ve used, telling you things like how many DOM queries ran and median latency of HTTP requests.

Learn More

Nodeschool.io features a growing number of lessons on using Node. Better yet, the lessons are actually written in Node, so you install them with NPM and verify your results on the command line. There are several topics, from basics to using streams to working with databases.

Nettuts+, always a good place for coding tutorials, has an introduction to Node which takes you from installation to coding a real-time server. If you want to learn about writing a real-time chat application with socket.io, they have a tutorial for that, too.

If you want a broad and thorough overview, there are a few introductory books on Node, with The Node Beginner Book offering several free chapters. O’Reilly’s Node for Front-End Developers is also a good starting point.

How to Node is a popular blog with articles on various topics, though some are too in-depth for beginners. I’d head here if you want to learn more on a specific topic, such as streams, or working with particular databases like MongoDB.

Finally, the Node API docs are a good place to go when you get stuck using a particular core module.

Notes

  1. If you use a package manager, such as Homebrew on Mac OS X or APT on Linux, Node is likely available within it. One caveat I have noticed is that the stock Debian/Ubuntu apt-get install nodejs is a few major versions behind; you may want to add Chris Lea’s PPA to get a current version. If you’re subject to the whims of your IT department, you may need to convince them to install Node for you, or talk to your sysadmin to get it on your server. Since it’s a rather new technology, don’t be surprised if you have to explain what it is and why you want to try it out.
  2. Previous projects, including Rhino from Mozilla and Narwhal, have let people use JavaScript outside the server. Node, however, has caught on far more than either of these projects, for some of the reasons outlined in this post.
  3. RequireJS is one project that’s trying to address this need. The ECMAScript standard that defines JavaScript is also working on native modules but they’re in draft form and it’ll be a long time before all browsers support them.
  4. If you’re curious, the code for both Cataloging Highscores and Wikistream are open source and available on GitHub.

An Incomplete Solution: Working with Drupal, Isotope, and the jQuery BBQ Plugin

I write a lot of “how-to” posts.  This is fine and I actually think it’s fun…until I have a month like this past one, in which I worry that I have no business telling anyone “how-to” do anything. In which I have written the following comment multiple times:

//MF - this is a bad way to do this.

In the last four weeks I have struggled mightily to make changes to the JavaScript in a Drupal module (the aforementioned Views Isotope).  I have felt lost, I have felt jubilation, I have sworn profusely at my computer and in the end, I have modifications that mostly work.  They basically do what we need.  But the changes I made perform differently in different browsers, are likely not efficient, and I’m not sure the code is included in the right place.  Lots of it might be referred to as “hacky.”

However, the work needed to be done and I did not have time to be perfect.  This needed to mostly work to show that it could work and so I could hand off the concepts and algorithms to someone else for use in another of our sites.  Since I have never coded in JavaScript or jQuery, I needed to learn the related libraries on the fly and try to relate them back to previous coding experiences.  I needed to push to build the house and just hope that we can paint and put on doors and locks later on.

I decided to write this post because I think that maybe it is important to share the narrative of “how we struggle” alongside the “how-to”.  I’ll describe the original problem I needed to tackle, my work towards a solution, and the remaining issues that exist.  There are portions of the solution that work fine and I utilize those portions to illustrate the original requirements.  But, as you will see, there is an assortment of unfinished details.  At the end of the post, I’ll give you the link to my solution and you can judge for yourself.

The Original Problem

We want to implement a new gallery to highlight student work on our department’s main website.  My primary area of responsibility is a different website – the digital library – which is the archival home for student work.  Given my experience in working with these items and also with Views Isotope, the task of developing a “proof of concept” solution fell to me.  I needed to implement some of the technically trickier things in our proposed solution for the gallery pages in order to prove that the features are feasible for the main website.  I decided to use the digital library for this development because

  • the digital library already has the appropriate infrastructure in place
  • both the digital library and our main website are based in Drupal 7
  • the “proof of concept” solution, once complete, could remain in the digital library as a browse feature for our historical collections of student work

The full requirements for the final gallery are outside of the scope of this post, but our problem area is modifying the Views Isotope module to do some advanced things.

First, I need to take the existing Views Isotope module and modify it to use hash history on the same page as multiple filters.  Hash history with Isotope is implemented using the jQuery BBQ plugin, as is demonstrated on the Isotope website. This essentially means that when a user clicks on a filter, a hash value is added to the URL and becomes a line in the browser’s history.  This allows one to click the back button to view the grid with the previous filters applied.

Our specific use case is the following: when viewing galleries of student work from our school, a user can filter the list of works by several filter options, such as degree or discipline (e.g., Undergraduate Architecture).  These filter options are powered by term references in Drupal, as we saw in my earlier Isotope post.  If the user clicks on an individual work to see details, clicking on the back button should return them to the already filtered list – they should not have to select the filters again.

Let’s take a look at how the end of the URL should progress.  If we start with:

.../student-work-archives

Then, if we select Architecture as our discipline, we should see the URL change to:

.../student-work-archives#filter=.architecture

Everything after the # is referred to as the hash.  If we then click on Undergraduate, the URL will change to:

.../student-work-archives#filter=.undergraduate.architecture

If we click our Back button, we should go back to

.../student-work-archives#filter=.architecture

With each move, the Isotope animation should fire and the items on the screen should only be visible if they contain term references to the vocabulary selected.

Further, the selected filter options should be marked as currently selected.  Items in one vocabulary which require an item in another vocabulary should be hidden or shown as appropriate.  For example, if a user selects Architecture, they should not be able to select PhD from the degree program, because we do not offer a PhD degree in Architecture and therefore there is no student work.  Here is an example of how the list of filters might look.

Filter list comparison

Once Architecture is selected, PhD is removed as an option.  Once Undergraduate is selected, our G1, G2 and G3 options are no longer available.

A good real-world example of the types of features we need can be seen at NPR’s Best Books of 2013 site.  The selected filter is marked, options that are no longer available are removed and the animation fires when the filter changes.  Further, when you click the Back button, you are taken back through your selections.

The Solution

It turns out that the jQuery BBQ plugin works quite nicely with Isotope, again as demonstrated on the Isotope website.  It also turns out that support for BBQ is included in Drupal core for the Overlay module.   So theoretically this should all play nicely together.

The existing views-isotope.js file handles filtering as the menu items are clicked.  The process is basically as follows:

  • When the document is ready
    • Identify the container of items we want to filter, as well as the class on the items and set up Isotope with those options.
    • Pre-select All in each set of filters.
    • Identify all of the possible filter options on the page.
    • If a filter item is clicked,
      • first, check to make sure it isn’t already selected, if so, escape
      • then remove the “selected” class from the currently selected option in this set
      • add the “selected” class to the current item
      • set up an “options” variable to hold the value of the selected filter(s)
      • check for other items in other filter sets with the selected class and add them all to the selected filter value
      • call Isotope with the selected filter value(s)

To add the filter to the URL we can use $.bbq.pushState, which will add “a ‘state’ into the browser history at the current position, setting location.hash and triggering any bound hashchange event callbacks (provided the new state is different than the previous state).”

$.bbq.pushState(options);

We then want to handle what’s in the hash when the browser back button is clicked, or if a user enters a URL with the hash value applied.  So we add an option for handling the hashchange event mentioned above.  Instead of calling Isotope from the link click function, we call it from the hashchange event portion.  Now our algorithm looks more like this, with the new items in bold (a rough code sketch of the same pattern follows the list):

  • Include the misc/jquery.ba-bbq.js for BBQ (I have to do this explicitly because I don’t use Overlay)
  • When the document is ready
    • identify the container of items we want to filter, as well as the class on the items and set up Isotope with those options.
    • Pre-select All in each set of filters.
    • Identify all of the possible filter options on the page.
    • If a filter item is clicked,
      •   first, check to make sure it isn’t already selected, if so, escape
      •  then remove the “selected” class from the currently selected option in this set
      • add the “selected” class to the current item
      • set up an “options” variable to hold the value of the selected filter(s)
      • check for other items in other filter sets with the selected class and add them all to the selected filter value
      • push “options” to URL and trigger hashchange (don’t call Isotope yet)
    • If a hashchange event is detected
      • create new “hashOptions” object according to what’s in the hash, using the deparam.fragment function from jQuery BBQ
      • manipulate css classes such as “not-available” (ie. If Architecture is selected, apply to PhD) and “selected” based on what’s in “hashOptions”
      • call Isotope with “hashOptions” as the parameter
    • trigger hashchange event to pick up anything that’s in the URL when the page loads
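Here is the rough code sketch promised above.  It is a simplified illustration of the pattern, not my actual modified views-isotope.js; the selectors, data attributes, and class names are made up.

// A simplified sketch of the click → pushState → hashchange → Isotope flow.
// Selectors, data attributes, and class names are illustrative only.
(function ($) {
  var $container = $('.isotope-container');

  // Clicking a filter pushes its value into the URL hash instead of
  // calling Isotope directly.
  $('.isotope-filter a').click(function (event) {
    event.preventDefault();
    $.bbq.pushState({ filter: $(this).data('filter') });
  });

  // All filtering happens here, whether triggered by a click, the back
  // button, or a URL that already contains a hash.
  $(window).bind('hashchange', function () {
    var hashOptions = $.deparam.fragment(); // e.g. { filter: '.architecture' }
    // ("selected" and "not-available" class manipulation would go here)
    $container.isotope({ filter: hashOptions.filter || '*' });
  }).trigger('hashchange'); // pick up anything already in the URL on page load
})(jQuery);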

I also updated any available pager links so that they link not just to the appropriate page, but also so that the filters are included in the link.  This is done by appending the hash value to the href attribute for each link with a class “pager”.
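That pager update is a small bit of jQuery, roughly along these lines, assuming the links carry a “pager” class as described:

// Carry the current filters over to the pager links by appending the hash.
$('a.pager').each(function () {
  var $link = $(this);
  // Strip any existing hash from the href, then append the current one.
  $link.attr('href', $link.attr('href').split('#')[0] + window.location.hash);
});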

And it works.  Sort of…

The Unfinished Details

Part of the solution described above only works in Chrome and – believe it or not – Internet Explorer.  In all browsers, clicking the back button works just as described above, as long as one is still on the page with the list of works.  However, when linking directly to the page with the filters included (as we are doing with the pager) or hitting back from a page that does not have the hash present (say, after visiting an individual item), it does not work in Firefox or Safari.  I think this may have to do with the deparam.fragment function, because that appears to be where it gets stuck, but so far I can’t track it down.  I could read window.location.hash directly, but I think that’s a security issue (what’s to stop someone from injecting something malicious after the hash?).

Also, in order to make sure the classes are applied correctly, it feels like I do a lot of “remove it from everywhere, then add it back”.  For example, if I select Architecture, PhD is then hidden from the degree list by assigning the class “not-available”.  When a user clicks on City and Regional Planning or All, I need that PhD to appear again.  Unfortunately, the All filter is handled differently – it is only present in the hash if no other options on the page are selected.  So, I remove “not-available” from all filter lists on hashchange and then reassign based on what’s in the hash.  It seems like it would be more efficient just to change the one I need, but I can’t figure it out.  Or maybe I should alter the way All is handled completely – I don’t know.

I also made these changes directly in the views-isotope.js, which is a big no-no.  What happens when the module is updated?  But, I have never written a custom module for Drupal which included JavaScript, so I’m not even sure how to do it in a way that makes sense.  I have only made changes to this one file.  Can I just override it somewhere?  I’m not sure.  Until I figure it out, we have a backup file and lots and lots of comments.

All of these details are symptoms of learning on the fly.  I studied computer science, so I understand conceptual things like how to loop through an array or return a value from a function, but that was more than ten years ago and my practical coding experience since then has not involved writing jQuery or JavaScript.  Much of what I built I pieced together from the Drupal documentation, the Views Isotope documentation and issue queue, the Isotope documentation, the jQuery BBQ documentation and numerous visits to the w3schools.com pages on jQuery and JavaScript.  I also frequently landed on the jQuery API Documentation page.

It is hard to have confidence in a solution when building while learning.  When I run into a snag, I have to consider whether or not the problem is the entire approach, as opposed to a syntax error or a limitation of the library.  Frequently, I find an answer to an issue I’m having, but have to look up something from the answer in order to understand it.  I worry that the code contains rookie mistakes – or even intermediate mistakes – which will bite us later, but it is difficult to do an exhaustive analysis of all the available resources.  Coding elegantly is an art which requires more than a basic understanding of how the pieces play together.

Inelegant code, however, can still help make progress.  To see the progress I have made, you can visit https://ksamedia.osu.edu/student-work-archives and play with the filters. This solution is good because it proves we can develop our features using Isotope, BBQ and Views Isotope.  The trick now is figuring out how to paint and put locks and doors on our newly built house, or possibly move a wall or two.