Getting Creative with Data Visualization: A Case Study
Posted: March 8, 2012 | Author: Andreas Orphanides | Filed under: data, library, library instruction, technology | Tags: data, data analysis, data visualization, heatmap, instruction, scheduling, visual
At the NCSU Libraries, my colleagues and I in the Research and Information Services department do a fair bit of instruction, especially to classes from the university’s First Year Writing Program. Some new initiatives and outreach have significantly increased our instruction load, to the point where it was getting more difficult for us to effectively cover all the sessions that were requested due to practical limits of our schedules. By way of a solution, we wanted to train some of our grad assistants, who (at the time of this writing) are all library/information science students from that school down the road, in the dark arts of basic library instruction, to help spread the instruction burden out a little.
This would work great, but there’s a secondary problem: since UNC is a good 40-minute drive away, our grad assistants tend to have very rigid schedules that are fixed well in advance, so we can’t just alter those schedules on short notice to have them cover a class. Meanwhile, instruction scheduling is very haphazard, due to wide variation in how course slots are configured in the weekly calendar, so it can be hard to predict when instruction requests are likely to be scheduled. What we needed was a technique to maximize the likelihood that a grad student’s standing schedule would overlap with the timing of the instruction requests we do get, before the requests come in.
Searching for a Solution – Bar graph-based analysis
The obvious solution was to try to figure out when during the day and week we provided library instruction most frequently. If we could figure this out, we could work with our grad students to get their schedules to coincide with these busy periods.
Luckily, we had some accrued data on our instructional activity from previous semesters. This seemed like the obvious starting point: look at when we taught previously and see what days and times of day were most popular. The data consisted of about 80 instruction sessions given over the course of the prior two semesters; data included date, day of week, session start time, and a few other tidbits. The data was basically scraped by hand from the instruction records we maintain for annual reports; my colleague Anne Burke did the dirty work of collecting and cleaning the data, as well as the initial analysis.
Anne’s first pass at analyzing the data was to look at each day of the week in terms of courses taught in the morning, afternoon, and evening. A bit of hand-counting and spreadsheet magic produced this:
This chart was somewhat helpful (certainly it’s clear that Monday, Tuesday and Thursday are our busiest days), but it doesn’t provide a lot of clarity regarding times of day that are hot for instruction. Other than noting that Friday evening is a dead time (hardly a huge insight), we don’t really get a lot of new information on how the instruction sessions shake out throughout the week.
Let’s Get Visual – Heatmap-based visualization
The chart above gets the fundamentals right — since we’re designing weekly schedules for our grad assistants, it’s clear that the relevant dimensions are days of week and times of day. However, there are basically two problems with the stacked bar chart approach: (1) The resolution of the stacked bars — morning, afternoon and evening — is too coarse. We need to get more granular if we’re really going to see the times that are popular for instruction; (2) The stacked bar chart slices just don’t fit our mental model of a week. If we’re going to solve a calendaring problem, doesn’t it make a lot of sense to create a visualization that looks like a calendar?
What we need is a matrix — something where one dimension is the day of the week and the other dimension is the hour of the day (with proportional spacing) — just like a weekly planner. Then for any given hour, we need something to represent how “popular” that time slot is for instruction. It’d be great if we had some way for closely clustered but non-overlapping sessions to contribute “weight” to each other, since it’s not guaranteed that instruction session timing will coincide precisely.
When I thought about analyzing the data in these terms, the concept of a heatmap immediately came to mind. A heatmap is a tool commonly used to look for areas of density in spatial data. It’s often used for mapping click or eye-tracking data on websites, to develop an understanding of the areas of interest on the website. A heatmap’s density modeling works like this: each data point is mapped in two dimensions and displayed graphically as a circular “blob” with a small halo effect; in closely-packed data, the blobs overlap. Areas of overlap are drawn with more intense color, and the intensity effect is cumulative, so the regions with the most intense color correspond to the areas of highest density of points.
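To make the cumulative "blob" idea concrete, here is a minimal Python sketch of the density accumulation. It is a simplified stand-in, not ClickHeat's actual algorithm: the function name `heat_grid`, the linear fall-off, the `radius` value, and the sample coordinates are all assumptions chosen for illustration.

```python
import math

def heat_grid(points, width, height, radius=3.0):
    """Accumulate a cone-shaped 'blob' around each (x, y) point on a grid."""
    grid = [[0.0] * width for _ in range(height)]
    for px, py in points:
        # Only visit cells inside the blob's bounding box.
        for y in range(max(0, int(py - radius)), min(height, int(py + radius) + 1)):
            for x in range(max(0, int(px - radius)), min(width, int(px + radius) + 1)):
                d = math.hypot(x - px, y - py)
                if d < radius:
                    # Full weight at the centre, fading linearly to zero at the edge.
                    grid[y][x] += 1.0 - d / radius
    return grid

# Two nearby points and one isolated point (made-up coordinates).
points = [(5, 5), (7, 5), (15, 5)]
grid = heat_grid(points, width=20, height=10)
```

The cell between the two nearby points ends up with a higher value than the centre of the isolated point, even though no point sits exactly there. That is the property that lets closely clustered sessions reinforce each other.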
I had heatmaps on the brain since I had just used them extensively to analyze user interaction patterns with a touchscreen application that I had recently developed.
Part of my motivation for using heatmaps to solve our scheduling problem was simply to use the tools I had at hand: it seemed that it would be a simple matter to convert the instruction data into a form that would be amenable to modeling with the heatmap software I had access to. But in a lot of ways, a heatmap was a perfect tool: with a proper arrangement of the data, the heatmap’s ability to model intensity would highlight the parts of each day where the most instruction occurred, without having to worry too much about the precise timing of instruction sessions.
The heatmap generation tool that I had was a slightly modified version of the Heatmap PHP class from LabsMedia’s ClickHeat, an open-source tool for website click tracking. My modified version of the heatmap package takes in an array of (x,y) ordered pairs, corresponding to the locations of the data points to be mapped, and outputs a PNG file of the generated heatmap.
So here was the plan: I would convert each instruction session in the data to a set of (x,y) coordinates, with one coordinate representing day of week and the other representing time of day. Feeding these coordinates into the heatmap software would, I hoped, create five colorful swatches, one for each day of the week. The brightest regions in the swatches would represent the busiest times of the corresponding days.
Arbitrarily, I selected the y-coordinate to represent the day of the week. So I decided that any Monday slot, for instance, would be represented by some small (but nonzero) y-coordinate, with Tuesday represented by some greater y-coordinate, etc., with the intervals between consecutive days of the week equal. The main concern in assigning these y-coordinates was for the generated swatches to be far enough apart so that the heatmap “halo” around one day of the week would not interfere with its neighbors — we’re treating the days of the week independently. Then it was a simple matter of mapping time of day to the x-coordinate in a proportional manner. The graphic below shows the output from this process.
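The coordinate mapping described above can be sketched in a few lines of Python. The original work used a modified PHP class, and the post does not give the actual layout values, so every constant here (`ROW_SPACING`, `ROW_OFFSET`, `PX_PER_MIN`, the 8:00am left edge) and the function name `session_to_xy` are hypothetical, as are the sample sessions.

```python
from datetime import time

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]

# Hypothetical layout constants; the real values are not given in the post.
ROW_SPACING = 60         # pixels between consecutive days, wide enough to keep halos apart
ROW_OFFSET = 40          # y of the Monday row (small but nonzero)
DAY_START_MIN = 8 * 60   # left edge of each row: 8:00am, in minutes after midnight
PX_PER_MIN = 1.0         # proportional time scale

def session_to_xy(day, start):
    """Map (day of week, start time) to an (x, y) pixel coordinate."""
    y = ROW_OFFSET + DAYS.index(day) * ROW_SPACING
    minutes = start.hour * 60 + start.minute
    x = round((minutes - DAY_START_MIN) * PX_PER_MIN)
    return (x, y)

# Made-up sessions, not taken from the real instruction data.
sessions = [("Mon", time(9, 30)), ("Tue", time(8, 0)), ("Thu", time(16, 45))]
coords = [session_to_xy(d, t) for d, t in sessions]
```

The resulting list of `(x, y)` pairs is the kind of input the heatmap generator expects. Equal row spacing keeps the days evenly separated, and a single pixels-per-minute scale keeps time of day proportional.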
In this graphic, days of the week are represented by the horizontal rows of blobs, with Monday as the first row and Friday as the last. The leftmost extent of each row corresponds to approximately 8am, while the rightmost extent is about 7:30pm. The key in the upper left indicates (more or less) the number of overlapping data points in a given location. A bit of labeling helps to clarify things:
Right away, we get a good sense of the shape of the instruction week. This presentation reinforces the findings of the earlier chart: that Monday, Tuesday, and Thursday are busiest, and that Friday afternoon is basically dead. But we do see a few other interesting tidbits, which are visible to us specifically through the use of the heatmap:
- Monday, Tuesday and Thursday aren’t just busy, they’re consistently well-trafficked throughout the day.
- Friday is really quite slow throughout.
- There are a few interesting hotspots scattered here and there, notably first thing in the morning on Tuesday.
- Wednesday is quite sparse overall, except for two or three prominent afternoon/evening times.
- There is a block of late afternoon-early evening time-slots that are consistently busy in the first half of the week.
Using this information, we can take a much more informed approach to scheduling our graduate students, and hopefully be able to maximize their availability for instruction sessions.
“Better than it was before. Better. Stronger. Faster.” – Open questions and areas for improvement
As a proof of concept, this approach to analyzing our instruction data for the purposes of setting student schedules seems quite promising. We used our findings to inform our scheduling of graduate students this semester, but it’s hard to know whether our findings can even be validated: since this is the first semester where we’re actively assigning instruction to our graduate students, there’s no data available to compare this semester against, with respect to amount of grad student instruction performed. Nevertheless, it seems clear that knowledge of popular instruction times is a good guideline for grad student scheduling for this purpose.
There’s also plenty of work to be done as far as data collection and analysis are concerned. In particular:
- Data curation by hand is burdensome and inefficient. If we can automate the data collection process at all, we’ll be in a much better position to repeat this type of analysis in future semesters.
- The current data analysis completely ignores class session length, which is an important factor for scheduling (class times vary between 50 and 100 minutes). This data is recorded in our instruction spreadsheet, but there aren’t any set guidelines on how it’s entered — librarians entering their instruction data tend to round to the nearest quarter- or half-hour increment at their own preference, so a 50-minute class is sometimes listed as “.75 hours” and other times as “1 hour”. More accurate and consistent session time recording would allow us to reliably use session length in our analysis.
- To make the best use of session length in the analysis, I’ll have to learn a little bit more about PHP’s image generation libraries. The current approach is basically a plug-in adaptation of ClickHeat’s existing Heatmap class, which is only designed to handle “point” data. To modify the code to treat sessions as little line segments corresponding to their duration (rather than points that correspond to their start times) would require using image processing methods that are currently beyond my ken.
- A bit better knowledge of the image libraries would also allow me to add automatic labeling to the output file. You’ll notice the prominent use of “ish” to describe the hours dimension of the labeled heatmap above: this is because I had neither the inclination nor the patience to count pixels to determine where exactly the labels should go. With better knowledge of the image libraries I would be able to add graphical text labels directly to the generated heatmap, at precisely the correct location.
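The idea of treating sessions as segments rather than points can be explored without any image library at all. The Python sketch below simply counts how many sessions cover each minute of a single day. It is a plain overlap count with no halo effect, and the function name `occupancy`, the 8am to 8pm window, and the sample sessions are assumptions made for illustration.

```python
def occupancy(sessions, day_start=8 * 60, day_end=20 * 60):
    """Count how many sessions cover each minute between day_start and day_end.

    sessions: list of (start_minute, length_minutes) pairs for a single day,
    with start_minute measured in minutes after midnight.
    """
    counts = [0] * (day_end - day_start)
    for start, length in sessions:
        # Clip each session to the displayed window.
        lo = max(start, day_start)
        hi = min(start + length, day_end)
        for minute in range(lo, hi):
            counts[minute - day_start] += 1
    return counts

# Made-up sessions for one day: 9:00 (50 min), 9:30 (75 min), 14:00 (100 min).
monday = [(9 * 60, 50), (9 * 60 + 30, 75), (14 * 60, 100)]
counts = occupancy(monday)

# First minute of the day (in minutes after midnight) with the highest overlap.
busiest = max(range(len(counts)), key=counts.__getitem__) + 8 * 60
```

Each row of such counts could then be rendered as a colored strip, one per day. This only becomes trustworthy once session lengths are recorded consistently, as noted in the list above.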
There are other fundamental questions that may be worth answering — or at least experimenting against — as well. For instance, in this analysis I used data about actual instruction sessions performed. But when lecturers request library sessions, they include two or three “preferred” dates, of which we pick the one that fits our librarian and room schedules best. For the purposes of analysis, it’s not entirely clear whether we should use the actual instruction data, which takes into account real space limitations but is also skewed by librarian availability; or whether we should look strictly at what lecturers are requesting, which might allow us to schedule our grad students in a way that could accommodate lecturers’ first choices better, but which might run us up against the library’s space limitations. In previous semesters, we didn’t store the data on the requests we received; this semester we’re doing that, so I’ll likely perform two analyses, one based on our actual instruction and one based on requests. Some insight might be gained by comparing the results of the two analyses, but it’s unclear what exactly the outcome will be.
Finally, it’s hard to predict how long-term trends in the data will affect our ability to plan for future semesters. It’s unclear whether prior semesters are a good indicator of future semesters, especially as lecturers move into and out of the First Year Writing Program, the source of the vast majority of our requests. We’ll get a better sense of this, presumably, as we perform more frequent analyses — it would also make sense to examine each semester separately to look for trends in instruction scheduling from semester to semester.
In any case, there’s plenty of experimenting left to do and plenty of improvements that we could make.
Reflections and Lessons Learned
There are a few big points that I took away from this experience. A big one is simply that sometimes the right approach is a totally unexpected one. You can gain some interesting insights if you don’t limit yourself to the tools that are most familiar for a particular problem. Don’t be afraid to throw data at the wall and see what sticks.
Really, what we did in this case is not so different from creating separate histograms of instruction times for each day of the week, and comparing the histograms to each other. But using heatmaps gave us a couple of advantages over traditional histograms: first, our bin size is essentially infinitely narrow; because of the proximity effects of the heatmap calculation, nearby but non-overlapping data points still contribute weight to each other without us having to define bins as in a regular histogram. Second, histograms are typically drawn in two dimensions, which would make comparing them against each other rather a nuisance. In this case, our separate heatmap graphics for each day of the week are basically one-dimensional, which allows us to compare them side by side with little fuss. This technique could be used for side-by-side examinations of multiple sets of any histogram-like data for quick and intuitive at-a-glance comparison.
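The difference between fixed bins and proximity weighting shows up in a small Python example. Two made-up sessions starting at 9:55 and 10:05 fall on opposite sides of an hourly bin edge. The helper names `histogram` and `smoothed`, the triangular weighting, and the 30-minute radius are illustrative assumptions rather than the method the heatmap software actually uses.

```python
def histogram(times, bin_width, start, end):
    """Fixed-width bin counts over [start, end), all values in minutes."""
    bins = [0] * ((end - start) // bin_width)
    for t in times:
        if start <= t < end:
            bins[(t - start) // bin_width] += 1
    return bins

def smoothed(times, at, radius):
    """Triangular weight at minute `at`: nearby points count, fading with distance."""
    return sum(max(0.0, 1.0 - abs(t - at) / radius) for t in times)

# Made-up start times in minutes after midnight: 9:55 and 10:05.
times = [595, 605]
bins = histogram(times, bin_width=60, start=480, end=720)   # hourly bins, 8am to noon
peak = smoothed(times, at=600, radius=30)                   # weight at 10:00
```

The histogram splits the pair into two separate bins with one session each, while the smoothed value at 10:00 is well above what a single session would contribute, so the cluster stays visible.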
In particular, it’s important to remember — especially if your familiarity with heatmaps is already firmly entrenched in a spatial mapping context — that data doesn’t have to be spatial in order to be analyzed with heatmaps. This is really just an extension of the idea of graphical data analysis: A heatmap is just another way to look at arbitrary data represented graphically, not so different from a bar graph, pie chart, or scatter plot. Anything that you can express in two dimensions (or even just one), and where questions of frequency, density, proximity, etc., are relevant, can be analyzed using the heatmap approach.
A final point: as an analysis tool, the heatmap is really about getting a feel for how the data lies in aggregate, rather than getting a precise sense of where each point falls. Since the halo effect of a data point extends some distance away from the point, the limits of influence of that point on the final image are a bit fuzzy. If precision analysis is necessary, then heatmaps are not the right tool.
About our guest author: Andreas Orphanides is Librarian for Digital Technologies and Learning in the Research and Information Services department at NCSU Libraries. He holds an MSLS from UNC-Chapel Hill and a BA in mathematics from Oberlin College. His interests include instructional design, user interface development, devising technological solutions to problems in library instruction and public services, long walks on the beach, and kittens.