Visualizing DSpace Data with Google Fusion Tables & Viewshare

During my time as the Digital Resources Librarian at Kenyon College I had the opportunity to work with The Community Within collection, which explores black history in Knox County, Ohio.  At the beginning of the project, our goal for this collection was simple: to make a rich set of digitized materials publicly available through our DSpace repository, the Digital Resource Commons (DRC).  However, once the collection was published in the DRC, a new set of questions emerged. How do we drive people to the collection? Can we create more interesting interfaces or virtual exhibits for the collection? How do we tie it all together? To answer these questions, we started exploring the digital humanities landscape, looking for low cost tools we could integrate with our existing DSpace collections.  We started to think about the collection and associated metadata as a data set, which contained elements we could use to create a display different than the standard list of items.  We wanted to facilitate the discovery of individual items by displaying them to our users in different visual contexts, such as maps or timelines.

Two tools that emerged from this exploration were Google Fusion Tables, a Google product, and Viewshare, which is provided by the National Digital Information Infrastructure and Preservation Program (NDIIPP) at the Library of Congress.  Google Fusion Tables provides a platform for researchers to upload and share data sets, which can then be displayed in seven different visualization formats (such as map, scatter plot, and intensity map).  Various examples of the results can be seen in the gallery, which also illustrates the wide range of organizations using the tool, including academic research institutions, news organizations, and government agencies.  Viewshare, according to its website, “is a free platform for generating and customizing views (interactive maps, timelines, facets, tag clouds) that allow users to experience your digital collections.”  While it does many of the same things as Google Fusion Tables in allowing users to create visualizations of data sets, it is more specifically geared towards cultural heritage collections.

Both tools are freely available and allow users to import data from a variety of sources.  Because the tools are easy to use, it is possible to get started quickly in manipulating and sharing your data.  Each tool provides a space for the uploaded data and accompanying views, but also allows you to embed this information in other web locations.  In the case of The Community Within, we created an exhibit which links to materials about churches in the collection using an embedded Google Fusion Tables map display.

This blog entry will walk through how to successfully export and manipulate data from DSpace in order to take advantage of these tools, as well as how to embed the resulting interface components back into DSpace or other collection websites.

The How-To – DSpace and Google Fusion

1.  First, start with a DSpace collection.  Our example collection is a photo collection of art on the campus of The Ohio State University.  In the screenshot below, we are already logged in as a collection administrator.

Note: Click the images to see them at full size.

A DSpace Collection

2.  We need to export the metadata.  So, click on “Export Metadata” (under Context).  This will download a .csv file.

Save the csv file.

3.  When you open the .csv, you may notice that metadata added to the collection at different times, or through different workflows, shows up inconsistently (for example, the same field formatted differently from item to item).  We want to normalize these values before we send this file anywhere.

CSV data, pre-edit

Edited CSV data

4.  Save the file as a .csv file.  If you are given a choice, be sure to select a comma as the separator character.
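If you would rather script this cleanup than do it by hand in a spreadsheet, here is a minimal sketch of the idea in Python.  The file name and the dc.date.issued example are hypothetical (this is not part of DSpace itself); adjust the normalization to whatever inconsistencies your own export contains.

import csv

# Normalize one metadata column so every row uses the same form.
# "dc.date.issued" and the year-only normalization are example choices.
with open("collection_export.csv", newline="", encoding="utf-8") as infile:
    rows = list(csv.DictReader(infile))

for row in rows:
    date = row.get("dc.date.issued", "").strip()
    # e.g., collapse "1923-01-01T00:00:00Z" and "1923" into a single format
    row["dc.date.issued"] = date[:4] if date else ""

with open("collection_export_clean.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

The csv module writes comma-separated output by default, which also takes care of the separator concern in step 4.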

5.  Now, open Google Fusion Tables.  Go to drive.google.com; if you do not already use Google Drive (formerly Google Docs), you will need to log in with a Google account or sign up for one.

6.  Once you are logged in, click on Create > More > Fusion Table (experimental).
Select Create, Other, Fusion Table
7.  On the next screen, we’re going to select “From this computer”, then click on Browse to get to the csv we created above.  Once the file is in the Browse text box, click on Next.
Browse for file
8.  Check that your data looks ok, then click on Next again.  A common problem occurs here when your spreadsheet editor chooses a separator other than a comma.  Fixing it is easy enough: just click Back and indicate the correct separator character.
Check your data
9.  On the next screen describe your table, then click on Finish.
Describe your table, and click Finish
10.  We have a Fusion table.  Now, let’s create our visualization.  Click on Visualize > Map.

Click on Visualize, then Map

Because our collection already contained geocodes in the dc.coverage.spatial column, the map is created automatically.  However, if you would like to use a different column, you can change it by selecting the Location field at the top left of the map.  Google Fusion Tables can also create the map using an address instead of a latitude/longitude pair.  If the map is zoomed far out, zoom in before you get the embed code to make sure the zoom level is appropriate on your DSpace page.
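If you want to confirm ahead of time that the geocodes will map cleanly, a small sanity check like the sketch below can help.  It assumes the dc.coverage.spatial values are stored as "latitude,longitude" pairs in the cleaned .csv from the earlier sketch, which may not match your own collection.

import csv

# Flag any rows whose geocode does not parse as a latitude/longitude pair.
with open("collection_export_clean.csv", newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.DictReader(f), start=2):  # row 1 is the header
        value = row.get("dc.coverage.spatial", "").strip()
        try:
            lat, lng = (float(part) for part in value.split(","))
            if not (-90 <= lat <= 90 and -180 <= lng <= 180):
                raise ValueError
        except ValueError:
            print(f"Row {i}: check geocode value {value!r}")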

We have a map

11.  Now, let’s embed our map back in DSpace.  In Google Fusion, click on “Get embeddable link” at the top of the map.  In the dialog which comes up, copy the text in the field “Paste HTML to embed in a website” (Note: your table must be shared for this to work.  Google should prompt you to share the table if you try to get an embeddable link for an unshared table.  If not, just click on Share in your Fusion window and make the table public.)

Copy the link text
12.  Now, back in DSpace, click on Edit Collection.  Click in one of the HTML fields (I usually use Introductory Text) and paste the text you copied.

Paste the embed code

13.  Here’s a huge gotcha.  I have pasted the embed code below, exactly as copied from Google Fusion.  If you paste it just like this and click on Save, the Collection page will disappear because there is nothing between the tags.  We need to add something between the opening and closing <iframe></iframe> tags.  Usually, I use “This browser does not support frames.”

<iframe width="500" height="300" scrolling="no" frameborder="no" src="https://www.google.com/fusiontables/embedviz?viz=MAP&amp;q=select+col4+from+1Fqwl_ugZxBx3vCXLVEfnujSpYJa9F0IICVqHLYw&amp;h=false&amp;lat=40.00118408791957&amp;lng=-83.016412&amp;z=10&amp;t=1&amp;l=col4"></iframe>

The corrected version, with fallback text between the tags:

<iframe width="500" height="300" scrolling="no" frameborder="no" src="https://www.google.com/fusiontables/embedviz?viz=MAP&amp;q=select+col4+from+1Fqwl_ugZxBx3vCXLVEfnujSpYJa9F0IICVqHLYw&amp;h=false&amp;lat=40.00118408791957&amp;lng=-83.016412&amp;z=10&amp;t=1&amp;l=col4">This browser does not support frames.</iframe>

14.  Now, click on Save.  This will take you back to your collection homepage, which now has a map.
Embedded Map
15.  One last thing – that info window in the map is not really user friendly.  Let’s go back to Google Fusion and fix it.  Just click on “Configure info window” above the Fusion map.  This brings up a dialog which allows you to choose which fields you want to show, as well as modify the markup so that, for example, links display as links.
Modify the info window
16.  No need to re-embed; just head back to your DSpace page and click refresh.
Final embedded map
Done!  You can play with the settings at various points along the way to make the map smaller or larger.

The How-To – DSpace and Viewshare

We can complete the same process using Viewshare.  If you skipped to this section, go back and read steps 1-4 above.

Back?  Ok.  So we should have a .csv of exported metadata from our DSpace collection.

1.  Log into Viewshare.  You will have to request an account if you don’t have one.
2.  From the homepage, click on Upload Data.

Click on Upload Data

3.  There are a multitude of source options, but we’re going to use the .csv we created above, so we select “From a file on your computer.”

Select "from a file"
4.  Browse for the file, then click on Upload.

5.  In the Preview window, you can edit the field names to more user-friendly alternatives.  You can also use the check boxes under Enabled to include or exclude certain fields, and select field types so that data is formatted correctly (for example, links) and can be used for visualizations (for example, dates or locations).

Edit the data

6.  When you have finished editing, click on Save.  You will now see the dataset in your list of Data.  Click on Build to build a visualization.

Select Build

7.  You can pick any layout, but I usually pick the One Column for simplicity’s sake.

Select a layout

8.  The view will default to List, but really, we already have a list.  Let’s click on the Add a View tab to create another visualization.  For this example, we’re going to select Timeline.

Select a Timeline View

9. There are a variety of settings for this visualization.  Select the field which contains the date (in our case, we just have one date, so we leave End Date blank), decide how you want to color the timeline and what unit you want to use.  Timeline lens lets you decide what is included in the pop-up.  Click on Save (top right) when you are finished selecting values.

Select options for View

10.  We have created a timeline.  Now we need to embed it back in DSpace. Click on Embed in the top menu.

Now we have a timeline

11.  Copy the embed code.

Copy the embed code

12.  Again, back in DSpace, we will click on Edit Collection and paste the embed code into one of the HTML fields.  And, again, it is essential that there is some text between the tags.

Paste the embed code

Now we have an embedded timeline!

An embedded timeline

Depending on the space available on your DSpace homepage, you may want to adjust the top and bottom time bands so that the timeline displays more cleanly.

Of course, there are a few caveats.  For example, this approach works best with collections that are complete.  If items are still being added to the collection, the collection manager will need to build in a workflow to refresh the visualization from time to time.  This is done by re-exporting, re-uploading, and re-embedding.  Also, Google Fusion Tables is officially an “experimental” product.  It is important to keep your data elsewhere as well, and to be aware that your Fusion visualizations may not be permanent.

However, this solution provides an easy, code-free way to improve the user interface to a collection.  Similar approaches may also work using platforms not described here. For example, here’s a piece on using Viewshare with Omeka, another open source collection management system.  The goal is to let each tool do what it does best, then make the results play nicely together.  This is a free and relatively painless way to achieve that goal.

About our Guest Author: Meghan Frazer is the Digital Resources Curator for the Knowlton School of Architecture at The Ohio State University.  She manages the school archives as well as the KSA Digital Library, and spends lots of time wrangling Drupal for the digital library site. Her professional interests include digital content preservation and data visualization.  Before attending library school, Meghan worked in software quality assurance and training and has a bachelor’s degree in Computer Science.  You can send tweets in her direction using @meghanfrazer.

2012 Eyeo Festival

From Tuesday, June 5th through Friday, June 8th, 2012, 500 creatives from fields such as computer science, art, design, and data visualization gathered together to listen, converse, and participate in the second Eyeo Festival. Held at the Walker Art Center in Minneapolis, MN, the event organizers created an environment of learning, exchange, exploration, and fun. There were various workshops with some top names leading the way. Thoughtfully curated presentations throughout the day complemented keynotes held nightly in party-like atmospheres: Eyeo was an event not to be missed. Ranging from independent artists to the highest levels of innovative companies, Eyeo offered inspiration on many levels.

Why the Eyeo Festival?
As I began to think about what I experienced at the Eyeo Festival, I struggled to express exactly how impactful this event was for me and those I connected with. In a way, Eyeo is like TED, and in fact many presenters have given TED talks. Eyeo has a more targeted focus on art, design, data, and creative code, but it is also so much more than that. The festival kicked off with Zygotes, an interactive art and sound installation by Tangible Interaction; though the video is a poor substitute for actually being there, it still evokes a sense of wonder and possibility. I strongly encourage anyone who is drawn to design, data, art, or interaction, or who wants to express their creativity through code, to attend this outstanding creative event and follow the incredible people that make up the impressive speaker list.

I went to the Eyeo Festival because I like to seek out what professionals in other fields are doing. I like staying curious and stretching outside my comfort zone in big ways, surrounding myself with people doing things I don’t understand, and then trying to understand them. Over the years I’ve been to many library conferences and there are some amazing events with excellent programming but they are, understandably, very library-centric. So, to challenge myself, I decided to go to a conference where there would be some content related to libraries but that was not a library conference. There are many individuals and professions outside of libraries that care about many of the same values and initiatives we do, that work on similar kinds of problems, and have the same drive to make the world a better place. So why not talk to them, ask questions, learn, and see what their perspective is? How do they approach and solve problems? What is their process in creating? What is their perspective and attitude? What kind of communities are they part of and work with?

I was greatly inspired by the group of librarians who have attended the SXSWi Festival, a group that has grown over the years. There is now a rather large number of librarians speaking about and advocating for libraries on such an innovative and elevated platform. There is even a Facebook Group where professionals working in libraries, archives, and museums can connect with each other for encouragement, support, and collaborations in relation to SXSWi. Andrea Davis, Reference & Instruction Librarian at the Dudley Knox Library, Naval Postgraduate School in Monterey, CA, has been heavily involved in offering leadership in getting librarians to collaborate at SXSW. She states, “I’ve found it absolutely invigorating to get outside of library circles to learn from others, and to test the waters on what changes and effects are having on those not so intimately involved in libraries. Getting outside of library conferences keeps the blood flowing across tech, publishing, education. Insularity doesn’t do much for growth and learning.”

I’ve also been inspired by librarians who have been involved in the TED community, such as Janie Herman and her leadership with Princeton Public Library’s local TEDx, in addition to her participation in the TEDxSummit in Doha, Qatar. Additionally, Chrystie Hill, the Community Relations Director at OCLC, has given more than one TEDx talk about libraries. Seeing our library colleagues represent our profession in arenas broader than libraries is energizing and infectious.

Librarians having a seat at the table and a voice at two of the premier innovative gatherings in the world is powerful. This concept of librarians embedding themselves in communities outside of librarianship has been discussed in a number of articles including The Undergraduate Science Librarian and In the Library With the Lead Pipe.

Highlights
Rather than giving detailed, comprehensive coverage of Eyeo, I’ll offer a glimpse of a few presentations plus a number of resources so that you can see for yourself some of the amazing, collaborative work being done. Presenters’ names link to the full talks, which you can watch for yourself. Because a lot of the work being done is interactive and participatory in some way, I encourage you to seek these projects out and interact with them. The organizers are in the midst of processing a lot of videos and putting them up on the Eyeo Festival Vimeo channel; I highly recommend watching them and checking back for more.

Ben Fry
Ben Fry is principal of Fathom, a Boston-based design and data visualization firm, and co-initiator of the programming language Processing; his work in data visualization and design is worth delving into. In his Eyeo presentation, 3 Things, the project that most stood out was the digitization project Fathom produced for GE: http://fathom.info/latest/category/ge. Years of annual reports were beautifully digitized and incorporated into an interactive web application built from scratch. When faced with scanning issues, they built a tool that improved the scanned results.

Jer Thorp
Jer Thorp, data artist in residence for the New York Times and a former geneticist, has a far and wide range in working with data, art, and design. Thorp is one of the founders of the Eyeo Festival, and in his presentation Near/Far he discussed several data visualization projects with a focus on storytelling. Two main pieces stood out from Jer’s talk. The first was his encouragement to dive into data visualization; he even included 10-year-old Theodore Zaballos’ handmade visualization of The Iliad, which was rather impressive. The other was his focus on data visualization in the context of location, and on people owning their own data rather than a third party, which led into the Open Paths project he showcased. He has also presented to librarians at the Canadian library conference, Access 2011.

Jen Lowe
Jen Lowe was by far the standout of all the amazing Ignite Eyeo talks. She spoke about how people are intrinsically drawn to storytelling and the need for those working with data to focus on the stories their visualizations tell. She works for the Open Knowledge Foundation in addition to running Datatelling, and she has her library degree (she’s one of us!).

Jonathan Harris
Jonathan Harris gave one of the most personal and poignant presentations at Eyeo. In a retrospective of his work, Jonathan covered years of work interwoven with personal stories from his life. Jonathan is an artist and designer and his work life and personal life are rarely separated. Each project began with the initial intention and ended with a more critical inward examination from the artist. The presentation led to his most recent endeavor, the Cowbird project, where storytelling once again emerges strongly. In describing this project he focused on the idea that technology and software could be used for good, in a more human way, created by “social engineers” to build a community of storytellers. He describes Cowbird as “a community of storytellers working to build a public library of human experience.”

Additional people + projects to delve into:

Fernanda Viegas and Martin Wattenberg of the Google Big Picture data visualization group. Wind Map: http://hint.fm/wind/

Kyle McDonald: http://kylemcdonald.net/

Tahir Hemphill: http://tahirhemphill.com/ and his latest work, Hip Hop Word Count: http://staplecrops.com/index.php/hiphop_wordcount/

Julian Oliver: http://julianoliver.com/

Nicholas Felton of Facebook: http://feltron.com/

Aaron Koblin of the Google Data Arts Group: http://www.aaronkoblin.com/ and their latest project with the Tate Modern: http://www.exquisiteforest.com/

Local Projects: http://localprojects.net/

Oblong Industries: http://oblong.com/

Eyebeam Art + Technology Center: http://eyebeam.org/

What can libraries get from the Eyeo Festival?

Libraries and library work are everywhere at this conference. That this eclectic group of creative people was often thinking about and producing work similar to that of librarians is thrilling. There is incredible potential for libraries to embrace some of the concepts and problems in many of the presentations I saw and conversations I was part of. There are multiple ways that libraries could learn from, and perhaps participate in, this broader community and work across fields.

People love libraries, and these attendees were no exception. There were attendees from numerous private/corporate companies, newspapers, museums, government agencies, libraries, and more. I was not the only library professional in attendance, so I suspect those individuals might see the potential I see, which I also find really exciting. The drive behind every presenter and attendee was, above all, creativity in some form: the desire to make something and to communicate. The breadth of creativity and imagination that I saw reminded me of a quote from David Lankes in his keynote from the New England Library Association Annual Conference:

“What might kill our profession is not ebooks, Amazon or Google, but a lack of imagination. We must envision a bright future for librarians and the communities they serve, then fight to make that vision a reality. We need a new activist librarianship focused on solving the grand challenges of our communities. Without action we will kill librarianship.”

If librarianship is in need of more imagination and perhaps creativity too, there is a world of wonder out there in terms of resources to help us achieve this vision.

The Eyeo Festival is but one place where we can become inspired, learn, and dream, and then bring that experience back to our libraries and inject our own imagination, ideas, experimentation, and creativity into the work we do. Doing the most creative, imaginative library work we can will inspire our communities; I have seen it first hand. Eyeo personally taught me that I need to fail more, focus more, make more, and have more fun doing it all.

Getting Creative with Data Visualization: A Case Study

The Problem

At the NCSU Libraries, my colleagues and I in the Research and Information Services department do a fair bit of instruction, especially to classes from the university’s First Year Writing Program. Some new initiatives and outreach have significantly increased our instruction load, to the point where it was getting more difficult for us to effectively cover all the sessions that were requested due to practical limits of our schedules. By way of a solution, we wanted to train some of our grad assistants, who (at the time of this writing) are all library/information science students from that school down the road, in the dark arts of basic library instruction, to help spread the instruction burden out a little.

This would work great, but there’s a secondary problem: since UNC is a good 40-minute drive away, our grad assistants tend to have very rigid schedules, which are fixed well in advance — so we can’t just alter our grad assistants’ schedules on short notice to have them cover a class. Meanwhile, instruction scheduling is very haphazard, due to wide variation in how course slots are configured in the weekly calendar, so it can be hard to predict when instruction requests are likely to be scheduled. What we need is a technique to maximize the likelihood that a grad student’s standing schedule will overlap with the timing of instruction requests that we do get — before the requests come in.

Searching for a Solution – Bar graph-based analysis

The obvious solution was to try to figure out when during the day and week we provided library instruction most frequently. If we could figure this out, we could work with our grad students to get their schedules to coincide with these busy periods.

Luckily, we had some accrued data on our instructional activity from previous semesters. This seemed like the obvious starting point: look at when we taught previously and see what days and times of day were most popular. The data consisted of about 80 instruction sessions given over the course of the prior two semesters; data included date, day of week, session start time, and a few other tidbits. The data was basically scraped by hand from the instruction records we maintain for annual reports; my colleague Anne Burke did the dirty work of collecting and cleaning the data, as well as the initial analysis.

Anne’s first pass at analyzing the data was to look at each day of the week in terms of courses taught in the morning, afternoon, and evening.  A bit of hand-counting and spreadsheet magic produced this:

Instruction session count by day of week and time of day, Spring 2010-Fall 2011

This chart was somewhat helpful — certainly it’s clear that Monday, Tuesday and Thursday are our busiest days — but it doesn’t provide a lot of clarity regarding times of day that are hot for instruction.  Other than noting that Friday evening is a dead time (hardly a huge insight), we don’t really get a lot of new information on how the instruction sessions shake out throughout the week.
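A tally like this is also easy to script rather than count by hand, which would help with the data collection burden discussed later on.  The sketch below assumes hypothetical "Date" and "Start Time" columns in a CSV export of the instruction records; the column names and formats in the actual spreadsheet may differ.

import csv
from collections import Counter
from datetime import datetime

# Count instruction sessions by day of week and coarse time-of-day bucket.
def bucket(start):
    if start.hour < 12:
        return "morning"
    if start.hour < 17:
        return "afternoon"
    return "evening"

counts = Counter()
with open("instruction_sessions.csv", newline="") as f:
    for row in csv.DictReader(f):
        when = datetime.strptime(f"{row['Date']} {row['Start Time']}", "%Y-%m-%d %H:%M")
        counts[(when.strftime("%A"), bucket(when))] += 1

for (day, period), n in sorted(counts.items()):
    print(f"{day:<9} {period:<9} {n}")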

Let’s Get Visual – Heatmap-based visualization

The chart above gets the fundamentals right — since we’re designing weekly schedules for our grad assistants, it’s clear that the relevant dimensions are days of week and times of day. However, there are basically two problems with the stacked bar chart approach: (1) The resolution of the stacked bars — morning, afternoon and evening — is too coarse. We need to get more granular if we’re really going to see the times that are popular for instruction; (2) The stacked bar chart slices just don’t fit our mental model of a week. If we’re going to solve a calendaring problem, doesn’t it make a lot of sense to create a visualization that looks like a calendar?

What we need is a matrix — something where one dimension is the day of the week and the other dimension is the hour of the day (with proportional spacing) — just like a weekly planner. Then for any given hour, we need something to represent how “popular” that time slot is for instruction. It’d be great if we had some way for closely clustered but non-overlapping sessions to contribute “weight” to each other, since it’s not guaranteed that instruction session timing will coincide precisely.

When I thought about analyzing the data in these terms, the concept of a heatmap immediately came to mind. A heatmap is a tool commonly used to look for areas of density in spatial data. It’s often used for mapping click or eye-tracking data on websites, to develop an understanding of the areas of interest on the website. A heatmap’s density modeling works like this: each data point is mapped in two dimensions and displayed graphically as a circular “blob” with a small halo effect; in closely-packed data, the blobs overlap. Areas of overlap are drawn with more intense color, and the intensity effect is cumulative, so the regions with the most intense color correspond to the areas of highest density of points.
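A toy example makes the cumulative effect concrete.  This is not the ClickHeat code, just a sketch of the idea: each point deposits a small Gaussian halo on a grid, halos sum where they overlap, and a tight cluster therefore ends up with a higher peak than an isolated point.

import numpy as np

# Each data point adds a Gaussian "blob"; overlapping blobs accumulate.
def add_blob(grid, x, y, radius=5.0):
    ys, xs = np.indices(grid.shape)
    grid += np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * radius ** 2))

grid = np.zeros((50, 100))
for x, y in [(20, 25), (22, 26), (21, 24), (70, 10)]:  # three clustered points, one isolated
    add_blob(grid, x, y)

print(round(grid[25, 21], 2))  # near the cluster: roughly three times the intensity...
print(round(grid[10, 70], 2))  # ...of the lone point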

I had heatmaps on the brain since I had just used them extensively to analyze user interaction patterns with a touchscreen application that I had recently developed.

Heatmap example from my previous work, tracking touches on a touchscreen interface. The heatmap is overlaid onto an image of the interface.

Part of my motivation for using heatmaps to solve our scheduling problem was simply to use the tools I had at hand: it seemed that it would be a simple matter to convert the instruction data into a form that would be amenable to modeling with the heatmap software I had access to. But in a lot of ways, a heatmap was a perfect tool: with a proper arrangement of the data, the heatmap’s ability to model intensity would highlight the parts of each day where the most instruction occurred, without having to worry too much about the precise timing of instruction sessions.

The heatmap generation tool that I had was a slightly modified version of the Heatmap PHP class from LabsMedia’s ClickHeat, an open-source tool for website click tracking. My modified version of the heatmap package takes in an array of (x,y) ordered pairs, corresponding to the locations of the data points to be mapped, and outputs a PNG file of the generated heatmap.

So here was the plan: I would convert each instruction session in the data to a set of (x,y) coordinates, with one coordinate representing day of week and the other representing time of day. Feeding these coordinates into the heatmap software would, I hoped, create five colorful swatches, one for each day of the week. The brightest regions in the swatches would represent the busiest times of the corresponding days.

Arbitrarily, I selected the y-coordinate to represent the day of the week. So I decided that any Monday slot, for instance, would be represented by some small (but nonzero) y-coordinate, with Tuesday represented by some greater y-coordinate, etc., with the intervals between consecutive days of the week equal. The main concern in assigning these y-coordinates was for the generated swatches to be far enough apart so that the heatmap “halo” around one day of the week would not interfere with its neighbors — we’re treating the days of the week independently. Then it was a simple matter of mapping time of day to the x-coordinate in a proportional manner. The graphic below shows the output from this process.
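For illustration, here is roughly what that conversion looks like as a Python sketch.  The actual conversion was done for the modified PHP Heatmap class described above, so the specific spacing and pixel values here are just assumptions.

from datetime import datetime

# Map (day of week, start time) to an (x, y) point for the heatmap.
# Day of week selects a y "row", spaced far enough apart that halos
# from one day do not bleed into the next; time of day maps
# proportionally onto x.
DAY_SPACING = 60               # vertical pixels between rows (assumed)
DAY_START, DAY_END = 8, 19.5   # roughly 8am to 7:30pm
WIDTH = 600                    # horizontal pixels available (assumed)

def session_to_point(day_of_week, start_time):
    """day_of_week: 0=Monday .. 4=Friday; start_time: 'HH:MM', 24-hour."""
    t = datetime.strptime(start_time, "%H:%M")
    hours = t.hour + t.minute / 60
    x = round((hours - DAY_START) / (DAY_END - DAY_START) * WIDTH)
    y = (day_of_week + 1) * DAY_SPACING  # small but nonzero for Monday
    return x, y

print(session_to_point(1, "09:15"))  # a Tuesday 9:15am session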

Raw heatmap of instruction data as generated by heatmap software

In this graphic, days of the week are represented by the horizontal rows of blobs, with Monday as the first row and Friday as the last. The leftmost extent of each row corresponds to approximately 8am, while the rightmost extent is about 7:30pm. The key in the upper left indicates (more or less) the number of overlapping data points in a given location. A bit of labeling helps to clarify things:

Heatmap of instruction data, labeled with days of week and approximate indications of time of day.

Right away, we get a good sense of the shape of the instruction week. This presentation reinforces the findings of the earlier chart: that Monday, Tuesday, and Thursday are busiest, and that Friday afternoon is basically dead. But we do see a few other interesting tidbits, which are visible to us specifically through the use of the heatmap:

  • Monday, Tuesday and Thursday aren’t just busy, they’re consistently well-trafficked throughout the day.
  • Friday is really quite slow throughout.
  • There are a few interesting hotspots scattered here and there, notably first thing in the morning on Tuesday.
  • Wednesday is quite sparse overall, except for two or three prominent afternoon/evening times.
  • There is a block of late afternoon-early evening time-slots that are consistently busy in the first half of the week.

Using this information, we can take a much more informed approach to scheduling our graduate students, and hopefully be able to maximize their availability for instruction sessions.

“Better than it was before. Better. Stronger. Faster.” – Open questions and areas for improvement

As a proof of concept, this approach to analyzing our instruction data for the purposes of setting student schedules seems quite promising. We used our findings to inform our scheduling of graduate students this semester, but it’s hard to know whether our findings can even be validated: since this is the first semester where we’re actively assigning instruction to our graduate students, there’s no data available to compare this semester against, with respect to amount of grad student instruction performed. Nevertheless, it seems clear that knowledge of popular instruction times is a good guideline for grad student scheduling for this purpose.

There’s also plenty of work to be done as far as data collection and analysis is concerned. In particular:

  • Data curation by hand is burdensome and inefficient. If we can automate the data collection process at all, we’ll be in a much better position to repeat this type of analysis in future semesters.
  • The current data analysis completely ignores class session length, which is an important factor for scheduling (class times vary between 50 and 100 minutes).  This data is recorded in our instruction spreadsheet, but there aren’t any set guidelines on how it’s entered — librarians entering their instruction data tend to round to the nearest quarter- or half-hour increment at their own preference, so a 50-minute class is sometimes listed as “.75 hours” and other times as “1 hour”.  More accurate and consistent session time recording would allow us to reliably use session length in our analysis (a rough normalization sketch follows this list).
  • To make the best use of session length in the analysis, I’ll have to learn a little bit more about PHP’s image generation libraries. The current approach is basically a plug-in adaptation of ClickHeat’s existing Heatmap class, which is only designed to handle “point” data. To modify the code to treat sessions as little line segments corresponding to their duration (rather than points that correspond to their start times) would require using image processing methods that are currently beyond my ken.
  • A bit better knowledge of the image libraries would also allow me to add automatic labeling to the output file. You’ll notice the prominent use of “ish” to describe the hours dimension of the labeled heatmap above: this is because I had neither the inclination nor the patience to count pixels to determine where exactly the labels should go. With better knowledge of the image libraries I would be able to add graphical text labels directly to the generated heatmap, at precisely the correct location.
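Here is a rough sketch of the kind of normalization the session-length bullet describes, turning free-form entries into minutes.  The example strings are hypothetical illustrations of the sort of variation that shows up, not values from the actual spreadsheet.

import re

# Convert free-form session-length entries into minutes.
def to_minutes(raw):
    raw = raw.strip().lower()
    m = re.fullmatch(r"(\d*\.?\d+)\s*(hours?|hrs?|h)", raw)
    if m:
        return round(float(m.group(1)) * 60)
    m = re.fullmatch(r"(\d+)\s*(minutes?|mins?|m)", raw)
    if m:
        return int(m.group(1))
    if re.fullmatch(r"\d*\.?\d+", raw):
        # bare number: treat small values as hours, larger ones as minutes
        value = float(raw)
        return round(value * 60) if value <= 3 else round(value)
    raise ValueError(f"Unrecognized session length: {raw!r}")

for entry in [".75 hours", "1 hour", "50 min", "1.5"]:
    print(entry, "->", to_minutes(entry), "minutes")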

There are other fundamental questions that may be worth answering — or at least experimenting against — as well. For instance, in this analysis I used data about actual instruction sessions performed. But when lecturers request library sessions, they include two or three “preferred” dates, of which we pick the one that fits our librarian and room schedules best. For the purposes of analysis, it’s not entirely clear whether we should use the actual instruction data, which takes into account real space limitations but is also skewed by librarian availability; or whether we should look strictly at what lecturers are requesting, which might allow us to schedule our grad students in a way that could accommodate lecturers’ first choices better, but which might run us up against the library’s space limitations. In previous semesters, we didn’t store the data on the requests we received; this semester we’re doing that, so I’ll likely perform two analyses, one based on our actual instruction and one based on requests. Some insight might be gained by comparing the results of the two analyses, but it’s unclear what exactly the outcome will be.

Finally, it’s hard to predict how long-term trends in the data will affect our ability to plan for future semesters. It’s unclear whether prior semesters are a good indicator of future semesters, especially as lecturers move into and out of the First Year Writing Program, the source of the vast majority of our requests. We’ll get a better sense of this, presumably, as we perform more frequent analyses — it would also make sense to examine each semester separately to look for trends in instruction scheduling from semester to semester.

In any case, there’s plenty of experimenting left to do and plenty of improvements that we could make.

Reflections and Lessons Learned

There are a few big points that I took away from this experience.  A big one is simply that sometimes the right approach is a totally unexpected one.  You can gain some interesting insights if you don’t limit yourself to the tools that are most familiar for a particular problem.  Don’t be afraid to throw data at the wall and see what sticks.

Really, what we did in this case is not so different from creating separate histograms of instruction times for each day of the week, and comparing the histograms to each other. But using heatmaps gave us a couple of advantages over traditional histograms: first, our bin size is essentially infinitely narrow; because of the proximity effects of the heatmap calculation, nearby but non-overlapping data points still contribute weight to each other without us having to define bins as in a regular histogram. Second, histograms are typically drawn in two dimensions, which would make comparing them against each other rather a nuisance. In this case, our separate heatmap graphics for each day of the week are basically one-dimensional, which allows us to compare them side by side with little fuss. This technique could be used for side-by-side examinations of multiple sets of any histogram-like data for quick and intuitive at-a-glance comparison.

In particular, it’s important to remember — especially if your familiarity with heatmaps is already firmly entrenched in a spatial mapping context — that data doesn’t have to be spatial in order to be analyzed with heatmaps. This is really just an extension of the idea of graphical data analysis: A heatmap is just another way to look at arbitrary data represented graphically, not so different from a bar graph, pie chart, or scatter plot. Anything that you can express in two dimensions (or even just one), and where questions of frequency, density, proximity, etc., are relevant, can be analyzed using the heatmap approach.

A final point: as an analysis tool, the heatmap is really about getting a feel for how the data lies in aggregate, rather than getting a precise sense of where each point falls. Since the halo effect of a data point extends some distance away from the point, the limits of influence of that point on the final image are a bit fuzzy. If precision analysis is necessary, then heatmaps are not the right tool.

About our guest author: Andreas Orphanides is Librarian for Digital Technologies and Learning in the Research and Information Services department at NCSU Libraries. He holds an MSLS from UNC-Chapel Hill and a BA in mathematics from Oberlin College. His interests include instructional design, user interface development, devising technological solutions to problems in library instruction and public services, long walks on the beach, and kittens.