Is the future of research voice controlled? It might be: when I originally had the idea for this post, my first instinct was to grab my phone and dictate my half-formed ideas into a note rather than type them out. Writing things down often makes ideas seem wrong, not at all what we were trying to say in our heads. (Maybe this instinct isn’t so new; as you may remember, Socrates had a similar one.) The idea came out of a few different talks at the national Code4Lib conference held in Los Angeles in March 2017, as well as a talk given by Chris Bourg. Among these presentations, the themes of machine learning, artificial intelligence, natural language processing, voice search, and virtual assistants intersect to give us a vision of what is coming. The future might look like a system that can parse imprecise human language, turn it into an appropriately structured search query against one or more databases while bearing other variables in mind, and return the correct results. Pieces of this exist already, of course, but I suspect that over the next few years we will be building or adapting tools to perform these functions. As we do, we should think about how to incorporate our values and skills as librarians into these tools along the way.
Natural Language Processing
I will not attempt to summarize natural language processing (NLP) here, except to say that speaking to a computer requires that the computer be able to understand what we are saying. Human (or natural) language is messy, full of nuance and context that takes people years to master, and even then it often leads to misunderstandings that can range from funny to deadly. Using a machine to understand and parse natural language requires complex techniques, but luckily there are a lot of tools that can make the job easier. For more details, you should review the NLP talks by Corey Harper and Nathan Lomeli at Code4Lib. Both talks showed that there is a great deal of complexity involved in NLP and that its usefulness is still relatively confined. Nathan Lomeli puts it like this: NLP can “cut strings, count beans, classify things, and correlate everything”. 1 Given a corpus, you can use NLP tools to figure out what certain words might be, how many of those words there are, and how they might connect to each other.
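The “cut strings” and “count beans” parts of Lomeli’s formulation are easy to try for yourself. Here is a minimal sketch, using only the Python standard library (the sample corpus is my own invented example, not from any of the talks):

```python
import re
from collections import Counter

def tokenize(text):
    """Cut strings: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

corpus = (
    "Librarians help patrons find information. "
    "Patrons ask librarians questions, and librarians answer."
)

tokens = tokenize(corpus)
counts = Counter(tokens)  # count beans

# The most common words in this tiny corpus
print(counts.most_common(3))
```

Classification and correlation require more machinery (part-of-speech taggers, word co-occurrence models, and so on), but they start from exactly this kind of token stream.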
Processing language to understand a textual corpus has a long history but is now relatively easy for anyone to do with the tools available. The easiest to start with is Voyant Tools, a project by Stéfan Sinclair and Geoffrey Rockwell that serves as a portal to a variety of NLP tools. You can feed it a corpus and get back all kinds of counts and correlations. For example, Franny Gaede and I used Voyant Tools to analyze social justice research websites to develop a social justice term corpus for a research project. While a certain level of human review is required for any such project, it’s easy to see how this technology could replace a lot of human-created language. This is already happening, in fact. A tool called Wordsmith can create convincing articles about finance, sports, technology, or really any field with a standard set of inputs and outputs in writing. If computers can write stories, they can also find stories.
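The core of that term-corpus exercise can be sketched in a few lines: given a term list, count how often each term appears in each site’s text. The terms and documents below are invented stand-ins, not the actual corpus from our project:

```python
import re
from collections import Counter

# A hypothetical term list, standing in for a social justice term corpus
terms = {"equity", "justice", "inclusion"}

documents = {
    "site_a": "Our center promotes equity and inclusion in research.",
    "site_b": "Justice and equity are core values; inclusion matters.",
}

def term_counts(text, terms):
    """Count occurrences of each term of interest in one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t in terms)

# One term-frequency profile per site
profile = {name: term_counts(text, terms) for name, text in documents.items()}
```

Voyant Tools does this (and much more) without any code, but the sketch shows how little machinery the basic counting requires.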
we would be wise to start thinking now about machines and algorithms as a new kind of patron — a patron that doesn’t replace human patrons, but has some different needs and might require a different set of skills and a different way of thinking about how our resources could be used. 2
One way in which we can start to address the needs of machines as patrons is by creating searches that work with them, which for now ultimately serves the needs of humans, but in the future could serve their own artificial intelligence purposes. Most people are familiar with the virtual assistants that have popped up on all platforms over the past few years. As an iOS and a Windows user, I am now constantly invited to speak to Siri or Cortana to search for answers to my questions or fix something in my schedule. While I’m perfectly happy to ask Siri to remind me to bring my laptop to work at 7:45 AM or to wake me up in 20 minutes, I get mixed results when I ask a more complex question. 3 Sometimes when I ask for the temperature on the surface of Jupiter I get the answer; other times I get today’s weather in a town called Jupiter. This is not too surprising, as asking “What is the temperature of Jupiter?” could mean a number of things. It’s on the human to specify which domain of knowledge the question refers to, which requires knowing exactly how to ask it. Computers cannot yet conduct a reference interview, since they cannot pick up on subtle hidden meanings or help with the struggle for the right words the way librarians do so well. But they can help with certain types of research tasks quite well, if you know how to ask the question. Eric Frierson (PPT) gave a demonstration of his project on voice-powered search in EBSCO using Alexa. In the presentation he demonstrates the Alexa “skills” he set up for people to ask Alexa for help: “do you have”, “the book”, “information about”, “an overview of”, “what I should read after”, and “books like”. There is a demonstration of what this looks like on YouTube. The results are useful when you say the correct thing in the correct order, and an active user would fairly quickly learn what to say, just as we learn how best to type a search query into various services.
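The pattern behind such skills is straightforward: match the spoken lead-in phrase to an intent, and treat the rest of the utterance as the search terms. The phrases below come from Frierson’s talk; the intent names and the routing approach are my own invented sketch, not how the EBSCO skill is actually implemented:

```python
# Hypothetical mapping from spoken lead-in phrases to search intents
INTENTS = [
    ("do you have", "availability_search"),
    ("the book", "title_search"),
    ("information about", "topic_search"),
    ("an overview of", "reference_search"),
    ("what i should read after", "readalike_search"),
    ("books like", "readalike_search"),
]

def route(utterance):
    """Return (intent, remainder) for a spoken query, or (None, text)."""
    text = utterance.lower().strip()
    for phrase, intent in INTENTS:
        if text.startswith(phrase):
            return intent, text[len(phrase):].strip()
    return None, text

print(route("Books like The Name of the Wind"))
```

This also makes the limitation visible: say the phrase slightly differently and the router falls through to `None`, just as the real skills only work when you say the correct thing in the correct order.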
Why ask a question of a computer rather than type it? For the reason I started this piece with, certainly: voice is right there, and it’s often easier to say what you mean than to write it. There are pragmatic reasons as well. If you find typing difficult, being able to speak makes life easier. When I was home with a newborn baby I really appreciated being able to dictate notes and ask Siri about the weather forecast and what time the doctor’s appointment was. Herein lies one of the many potential pitfalls of voice: who is listening to what you are saying? One recent news story puts this in perspective: Amazon agreed to turn over data from Alexa to police in a murder investigation after the suspect gave the OK. They refused to do so at first, and it remains an open question what legal status a conversation with a virtual assistant has. Nor is it entirely clear, when you speak to a device, where the data is being processed. So before we all rush out and write voice search tools for all our systems, it is useful to think about where that data lives and what its purpose is.
If we would protect a user’s search query by ensuring that our catalogs are encrypted (and let’s be honest, we aren’t there yet), how do we do the same for virtual search assistants in the library catalog? For Alexa, that’s built into creating an Alexa skill, since a basic requirement for the web service used is that it meet Amazon’s security requirements. But if this data is subject to subpoena, we would have to think about it in the same way we would any other search data on a third-party system. And we also have to recognize that these tools are created by these companies for commercial purposes, part of which is to gather data about people and sell things to them based on that data. Machine learning could eventually build on that to learn a lot more about people than they realize, a prospect the Amazon Echo Look recently brought up as a subject of debate. There are likely to be other services popping up in addition to those offered by Amazon, Google, Apple, and Microsoft. Before long, we might expect our vendors to be offering voice search in their interfaces, and we need to be aware of how that data is transmitted and where it is being processed. A recently formed group, the Voice Privacy Alliance, is developing some standards for this.
The invisibility of the result processing has another dark side. The biases inherent in the algorithms become even more hidden, as the first result becomes the “right” one. If Siri tells me the weather in Jupiter, that’s a minor inconvenience, but if Siri tells me that “Black girls” are something hypersexualized, as Safiya Noble has found that Google does, do I (or let’s say, a kid) necessarily know something has gone wrong? 4 Without human intervention and understanding, machines can perpetuate the worst side of humanity.
This comes back to Chris Bourg’s question: what happens to librarians when machines can read all the books and have a conversation with patrons about those books? Luckily for us, it is unlikely that artificial intelligence will ever be truly self-aware, with desires, metacognition, love, and a need for growth and adventure. Those qualities will continue to make librarians useful in creating vibrant and unique collections and communities. But we will need to fit that into a world where we are having conversations with our computers about those collections and communities.
Lomeli, Nathan. “Natural Language Processing: Parsing Through The Hype”. Code4Lib. Los Angeles, CA. March 7, 2017. ↩
“What Happens to Libraries and Librarians When Machines Can Read All the Books?” Feral Librarian, March 17, 2017. https://chrisbourg.wordpress.com/2017/03/16/what-happens-to-libraries-and-librarians-when-machines-can-read-all-the-books/. ↩
As a side issue, I don’t have a private office and I feel weird speaking to my computer when there are people around. ↩
Noble, Safiya Umoja. “Google Search: Hyper-Visibility as a Means of Rendering Black Women and Girls Invisible – InVisible Culture.” InVisible Culture: An Electronic Journal for Visual Culture, no. 19 (2013). http://ivc.lib.rochester.edu/google-search-hyper-visibility-as-a-means-of-rendering-black-women-and-girls-invisible/. ↩
A decade ago, Stephen Colbert introduced the concept of “truthiness”: a fact that was so because it felt right “from the gut.” When we search for information online, we are always up against the risk that the creator of a page is someone who, like Stephen Colbert’s character, doesn’t trust books, because “they’re all fact, no heart.”1 Since sites with questionable or outright false facts that “feel right” often end up at the top of Google search results, librarians teach students how to evaluate online sources for accuracy, relevancy, and so on, rather than just trusting the top result. But what if there were a way to ensure that truthiness was removed, and only sites with true information appeared at the top of the results?
This idea is what underlies a new Google algorithm called Knowledge-Based Trust (KBT). 2 Google’s original founding principles and the PageRank algorithm were based on academic citation practices: loosely summarized, pages linked to by a number of other pages are more likely to be useful than those with fewer links. The content of the page, while it needs to match the search query, is less crucial to its ranking than outside factors, which is otherwise known as an exogenous model. KBT, by contrast, is an endogenous model relying on the actual content of the page. Ranking is based on the probability that the page is accurate, and therefore more trustworthy. This is designed to address the problem of sites with high PageRank scores that aren’t accurate, either because their truthiness quotient is high, or because they have gamed the system by scraping content and applying misleading SEO. On the other side, pages with great information that aren’t very popular may be buried.
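To make the exogenous/endogenous distinction concrete, here is a toy version of the PageRank power iteration. It looks only at the link structure (the exogenous signal), so an accurate but unlinked page scores poorly no matter what it says. The link graph is invented for illustration:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank: `links` maps each page to the pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page keeps a small base rank...
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        # ...and passes the rest of its rank along its outgoing links.
        for page, outlinks in links.items():
            if not outlinks:
                continue
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new[target] += share
        rank = new
    return rank

# "hub" is heavily linked; "accurate" has great content but no inlinks.
links = {"a": ["hub"], "b": ["hub"], "c": ["hub"],
         "hub": ["a"], "accurate": []}
ranks = pagerank(links)
```

KBT would instead score `accurate` on the probability that its extracted facts are correct, which is exactly the burying problem the paper sets out to fix.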
“Wait a second,” you are now asking yourself, “Google now determines what is true?” The answer is: sort of, but of course it’s not as simple as that. Let’s look at the paper in detail, and then come back to the philosophical questions.
Digging Into the KBT
First, this paper is technical, but the basic information is fairly straightforward. This model is based on extracting facts from a web source, evaluating whether those facts are true or not, and then whether a source is accurate or not. This leads to a determination that the facts are correct in an iterative process. Of course, verifying that determination is essential to ensuring that all the algorithms are working correctly, and this paper describes ways of checking the extracted facts for accuracy.
The extractors are described more fully in an earlier version of this work, Knowledge Vault (KV), which was designed to fill in large-scale knowledge bases such as Freebase by extracting facts from a web source using techniques like Natural Language Processing of text files followed by machine learning, HTML DOM trees, HTML tables, and human processed pages with schema.org metadata. The extractors themselves can perform poorly in creating these triples, however, and this is more common than the facts being wrong, and so sites may be unfairly flagged as inaccurate. The KBT project aims to introduce an algorithm for determining what type of error is present, as well as how to judge sites with many or few facts accurately, and lastly to test their assumptions using real world data against known facts.
The specific example given in the paper is the birthplace of President Barack Obama. The extractor would determine a subject, predicate, object triple from a web source and match these strings to Freebase (for example). This can lead to a number of errors: computationally determining the truth is a huge problem even when the semantics are straightforward (which we all know they rarely are). For this example, it’s possible to check data from the web against the known value in Freebase, and mark each extraction 1 (for yes, it matches) or 0 (for no). This can then be charted in a two- or three-dimensional matrix that helps show the probability of a given extractor working, as well as whether the value pulled by the extractor was true or not.
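That 0/1 check can be sketched directly. The sources, extractor names, and extracted values below are all hypothetical; only the example fact (Obama’s birthplace) comes from the paper:

```python
# Known (Freebase-style) value for the example triple's object
known_value = "honolulu"

# Hypothetical extractions: (source, extractor) -> extracted value
extractions = {
    ("site1", "text_nlp"):   "honolulu",
    ("site1", "html_table"): "kenya",     # wrong: source or extractor error
    ("site2", "text_nlp"):   "honolulu",
    ("site2", "html_table"): "honolulu",
    ("site3", "text_nlp"):   "chicago",   # wrong value on the page itself
}

# The 0/1 matrix: did this (source, extractor) cell match the known value?
matrix = {cell: int(value == known_value)
          for cell, value in extractions.items()}

def extractor_accuracy(name):
    """Fraction of correct extractions for one extractor, where it fired."""
    cells = [v for (_, ex), v in matrix.items() if ex == name]
    return sum(cells) / len(cells)
```

Note what the matrix alone cannot tell you: whether a 0 means the page was wrong or the extractor garbled a correct page, which is precisely the distinction the KBT algorithm tries to tease apart.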
They go on to examine two models for computing the data, single-layer and multi-layer. The single-layer model, which looks at each web source and its facts separately, is easier to work with using standard techniques but is limited because it doesn’t take into account extraction errors. The multi-layer model is more complex to analyze, but takes the extraction errors into account along with the truth errors. I am not qualified to comment on the algorithm math in detail, but essentially it computes probability of accuracy for each variable in turn, ultimately arriving at an equation that estimates how accurate a source is, weighted by the likelihood that source contains those facts. There are additional considerations for precision and recall, as well as confidence levels returned by extractors.
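The flavor of the iterative estimation, in a much-simplified form (this is my own sketch in the spirit of the single-layer model, not the paper’s actual algorithm, and it ignores extraction errors entirely): believe facts claimed by accurate sources, and call sources accurate when their claims are believed.

```python
# Hypothetical source -> claimed value for one fact
claims = {
    "site1": "honolulu",
    "site2": "honolulu",
    "site3": "chicago",
}

accuracy = {s: 0.5 for s in claims}  # uninformative starting priors
for _ in range(10):
    # Belief in each value = normalized accuracy of sources claiming it
    votes = {}
    for source, value in claims.items():
        votes[value] = votes.get(value, 0.0) + accuracy[source]
    total = sum(votes.values())
    belief = {v: w / total for v, w in votes.items()}
    # Re-estimate each source's accuracy from the belief in its claim
    accuracy = {s: belief[v] for s, v in claims.items()}
```

After a few rounds the majority value and the sources carrying it reinforce each other, which is the basic fixed-point idea the paper develops with far more care (priors, extraction noise, precision and recall).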
Lastly, they consider how to split up large sources to avoid computational bottlenecks, as well as to merge sources with few facts in order to not penalize them but not accidentally combine unrelated sources. Their experimental results determined that generally PageRank and KBT are orthogonal, but with a few outliers. In some cases, the site has a low PageRank but a high KBT. They manually verified the top three predicates with high extraction accuracy scores for web sources with a high KBT to check what was happening. 85% of these sources were trustworthy without extraction errors and with predicates related to the topic of the page, but only 23% of these sources had PageRank scores over 0.5. In other cases, sources had a low KBT but high PageRank, which included sites such as celebrity gossip sites and forums such as Yahoo Answers. Yes, indeed, Google computer scientists finally have definitive proof that Yahoo Answers tends to be inaccurate.
The conclusion of the article with future improvements reads like the learning outcomes for any basic information literacy workshop. First, the algorithm would need to be able to tell the main topic of the website and filter out unrelated facts, to understand which triples are trivial, to have better comprehension of what is a fact, and to correctly remove sites with data scraped from other sources. That said, for what it does, this is a much more sophisticated model than anything else out there, and at least proves that there is a possibility to computationally determine the accuracy of a web source.
What is Truth, Anyway?
Despite the promise of this model there are clearly many potential problems, of which I’ll mention just a few. The source for this exercise, Freebase, is currently in read-only mode as its data migrates to Wikidata. Google is apparently dropping Freebase to focus on its Knowledge Graph, which is partially Freebase/Wikidata content and partially schema.org data 3. One interesting wrinkle is that much of Freebase content cites Wikipedia as a source, which means there are currently recursive citations that must be resolved before content can be accepted as fact. We already know that Wikipedia suffers from a lack of diversity in contributors and topic coverage, so a focus on content from Wikipedia has the danger of narrowing the sources of information against which KBT could check triples.
Beyond that, most of human knowledge and understanding is difficult to fit into triples. While surely no one would search Google for “What is love?” or similar and expect a factual answer, there are plenty of less extreme examples that are unclear. For instance, how does this account for controversial topics? E.g., “anthropogenic global warming is real” vs. “global warming is real, but it’s not anthropogenic.” 97% of scientists agree with the former, but what if you are looking for what the other 3% are saying?
And we might question whether it’s a good idea to trust an algorithm’s definition of what is true. As Bess Sadler and Chris Bourg remind us, algorithms are not neutral, and may ignore large parts of human experience, particularly from groups underrepresented in computer science and technology. Librarians should have a role in reducing that ignorance by supporting “inclusion, plurality, participation and transparency.” 4 Given the limitations of what is available to the KBT it seems unlikely that this algorithm would markedly reduce this inequity, though I could see how it could be possible if Wikidata could be seeded with more information about diverse groups.
Librarians take note, this algorithm is still under development, and most likely won’t be appearing in our Google results any time in the near future. And even once it does, we need to ensure that we are still paying attention to nuance, edge cases, and our own sense of truthiness–and more importantly, truth–as we evaluate web sources.
Imagine this scenario: you don’t normally have a whole lot to do at your job. It’s a complex job, sure, but day-to-day you’re spending most of your time monitoring a computer and typing in data. But one day, something goes wrong. The computer fails. You are suddenly asked to perform basic job functions that the computer normally takes care of for you, and you don’t really remember well how to do them. In the meantime, the computer is screaming at you about an error and asking for additional inputs. How well do you function?
The Glass Cage
In Nicholas Carr’s new book The Glass Cage, this scenario is the frightening result of malfunctions with airplanes, and in the cases he describes, result in crashes and massive loss of life. As librarians, we are thankfully not responsible on a daily basis for the lives of hundreds of people, but like pilots, we too have automated much of our work and depend on systems that we often have no control over. What happens when a database we rely on goes down–say, all OCLC services go down for a few hours in December when many students are trying to get a few last sources for their papers? Are we able to take over seamlessly from the machines in guiding students?
Carr is not against automation, nor indeed against technology in general, though this is a criticism frequently leveled at him. But he is against the uncritical abnegation of our faculties to technology companies. In his 2011 book The Shallows, he argues that offloading memory to the internet and apps makes us more shallow, distractible thinkers. While I didn’t buy all his arguments (after all, Socrates didn’t approve of offloading memory to writing since it would make us all shallow, distractible thinkers), it was thought-provoking. In The Glass Cage, he focuses on automation specifically, using autopilot technologies as the focal point; “the glass cage” is the name pilots use for cockpits, since they are surrounded by screens. Besides the danger of not knowing what to do when the automated systems fail, we create potentially more dangerous situations by not paying attention to the choices automated systems make. As Carr writes, “If we don’t understand the commercial, political, intellectual, and ethical motivations of the people writing our software, or the limitations inherent in automated data processing, we open ourselves to manipulation.” 1
We have automated many mundane functions of library operation that have no real effect, or a positive effect. For instance, no longer do students sign out books by writing their names on paper cards which are filed away in drawers. While some mourn for the lost history of who had out the book–or even the romance novel scenario of meeting the other person who checks out the same books–by tracking checkouts in a secure computerized system we can keep better track of where books are, as well as maintain privacy by not showing who has checked out each book. And when the checkout system goes down, it is easy to figure out how to keep things going in the interim. We can understand on an instinctual level how such a system works and what it does. Like a traditional computerized library catalog, we know more or less how data gets in the system, and how data gets out. We have more access points to the data, but it still follows its paper counterpart in creation and structure.
Over the past decade, however, we have moved away more and more from those traditional systems. We want to provide students with systems that align with their (and our) experience outside libraries. Discovery layers take traditional library data and transform it with indexes and algorithms to create a new, easier way to find research material. If traditional automated systems, like autopilot systems, removed the physical effort of moving between card catalogs, print indexes, and microfilm machines, these new systems remove much of the mental effort of determining where to search for a type of information and the particular skills needed to search the relevant database. That is surely a useful and good development. When one is immersed in a research question, the system shouldn’t get in the way.
That said, the nearly wholesale adoption of discovery systems provided by vendors leaves academic librarians in an awkward position. We can find a parallel in medicine. Carr relates the rush into electronic medical records (EMR) starting in 2004 with the Health Information Technology Adoption Initiative. This meant huge amounts of money available for digitizing records, as well as a huge windfall for health information companies. An early study by the RAND Corporation (funded in part by those health information companies) indicated enormous promise from electronic medical records to save money and improve care. 2 In actual fact, however, these systems did not do everything they were supposed to do. All the data that was supposed to be easy to share between providers was locked up in proprietary systems. 3 In addition, other studies showed that these systems did not merely substitute automated record-keeping for manual; they changed the way medicine was practiced. 4 EMR systems provide additional functions beyond note-taking, such as checklists and prompts with suggestions for questions and tests, which in turn create additional and more costly bills, test requests, and prescriptions. 5 The EMR systems change the dynamic between doctor and patient as well. The systems encourage the use of boilerplate text that lacks the personalized story of an individual patient, and the inability to flip through pages tends to diminish the long view of a patient’s entire medical history. 6 The presence of the computer in the room and the constant multitasking of typing notes into a computer mean that doctors cannot be fully present with the patient. 7 With the constant presence of the EMR and its checklists, warnings, and prompts, doctors lose the ability to gain intuition and new understandings that the EMR could never provide. 8
The reference librarian has an interaction with patrons that is not all that different from doctors with patients (though, as with pilots, the stakes are usually quite different). We work one on one with people on problems that are often undefined or misunderstood at the beginning of the interaction, and work towards a solution through conversation and cursory examinations of resources. We either provide the resource that solves the problem (e.g. the prescription), or make sure the patron has the tools available to solve the problem over time (e.g. diet and exercise recommendations). We need to read subtle cues of body language and tone of voice to see how things are going, and use instinctive knowledge to understand if there is a deeper but unexpressed problem. We need our tools at hand to work with patrons, but we need to be present and use our own experience and judgment in knowing the appropriate tool to use. That means that we have to understand how the tool we have works, and ideally have some way of controlling it. Unfortunately that has not always been the case with vendor discovery systems. We are at the mercy of the system, and reactions to this vary. Some people avoid it at all costs and won’t teach with the discovery system, which means that students are even less likely to use it, preferring the easier-to-reach, if less robust, Google search. Or, if students do use it, they may still miss out on the benefits of having academic librarians available: people who have spent years developing domain knowledge and learning the best resources available at the library, knowledge that can’t be replaced by an algorithm. Furthermore, the vendor platforms and content only interoperate to the extent the vendors are willing to work together, and many of them have a disincentive to do so, since they want their own index to come out on top.
Enter the ODI
Just as doctors may have given up some of their professional ability and autonomy to proprietary databases of patient information, academic librarians seem to have done something similar with discovery systems. But the NISO Open Discovery Initiative (ODI) has potential to make the black box more transparent. This group has been working for two years to develop a set of practices that aim to make some aspects of discovery even across providers, and so give customers and users more control in understanding what they are seeing and ensure that indexes are complete. The Recommended Practice addresses some (but not all) major concerns in discovery service platforms. Essentially it covers requirements for metadata that content providers must provide to discovery service providers and to libraries, as well as best practices for content providers and discovery service providers. The required core metadata is followed by the “enriched” content which is optional–keywords, abstract, and full text. (Though the ODI makes it clear that including these is important–one might argue that the abstract is essential). 9 Discovery service providers are in turn strongly encouraged to make the content their repositories hold clear to their customers, and the metadata required for this. Discovery service providers should follow suggested practices to ensure “fair linking”, specifically to not use business relationships as a ranking or ordering consideration, and allow libraries to set their own preferences about choice of providers and wording for links. ODI suggests a fairly simple set of usage statistics that should be provided and exactly what they should measure. 10
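A library could sanity-check what a content provider hands over with a simple validator along these lines. The field names below are illustrative stand-ins, not the actual field list from the ODI Recommended Practice; consult NISO RP-19 for the real requirements:

```python
# Illustrative stand-ins for ODI-style required core and optional
# "enriched" metadata fields (not the actual ODI field list).
CORE_FIELDS = {"title", "author", "date", "identifier", "provider"}
ENRICHED_FIELDS = {"keywords", "abstract", "full_text"}

def check_record(record):
    """Return (missing_core, present_enriched) for one metadata record."""
    missing = CORE_FIELDS - record.keys()
    enriched = ENRICHED_FIELDS & record.keys()
    return missing, enriched

record = {
    "title": "Example Article",
    "author": "A. Author",
    "date": "2014",
    "identifier": "doi:10.xxxx/xxxx",   # hypothetical identifier
    "provider": "VendorX",              # hypothetical provider
    "abstract": "An example abstract.",
}
missing, enriched = check_record(record)
```

Even a check this crude makes the ODI point tangible: a record can satisfy every core requirement and still lack the abstract or full text that actually makes it findable.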
While this all sets a good baseline, what is out of scope for ODI is equally important. It “does not address issues related to performance or features of the discovery services, as these are inherently business and design decisions guided by competitive market forces.” 11 Performance and features includes the user interface and experience, the relevancy ranking algorithms, APIs, specific mechanisms for fair linking, and data exchange (which is covered by other protocols). The last section of the Recommended Practice covers some of those in “Recommended Next Steps”. One of those that jumps out is the “on-demand lookup by discovery service users” 12, which suggests that users should be able to query the discovery service to determine “…whether or not a particular collection, journal, or book is included in the indexed content”13–seemingly the very goal of discovery in the first place.
“Automation of Intellect”
We know that many users only look at the first page of results for the resource they want. If we don’t know what results should be there, or how they get there, we are leaving users at the mercy of the tool. Disclosure of relevancy rankings is a major piece of transparency that ODI leaves out, and without understanding or controlling that piece of discovery, I think academic librarians are still caught in the trap of the glass cage–or become the chauffeur in the age of the self-driving car. This has been happening in all professional fields as machine learning algorithms and processing power to crunch big data sets improve. Medicine, finance, law, business, and information technology itself have been increasingly automated as software can run algorithms to analyze scenarios that in the past would require a senior practitioner. 14 So what’s the problem with this? If humans are fallible (and research shows that experts are equally if not more fallible), why let them touch anything? Carr argues that “what makes us smart is not our ability to pull facts from documents.…It’s our ability to make sense of things…” 15 We can grow to trust the automated system’s algorithms beyond our own experience and judgment, and lose the possibility of novel insights. 16
This is not to say that discovery systems do not solve major problems or that libraries should not use them. They do, and as much as practical libraries should make discovery as easy as possible. But as this ODI Recommended Practice makes clear, much remains a secret business decision for discovery service vendors, and thus something over which academic librarians can exercise control only through their dollars in choosing a platform and through their advocacy in working with vendors to ensure that they understand the system and that it does what they need.
Nicholas Carr, The Glass Cage: Automation and Us (New York: Norton, 2014), 208. ↩
John Oliver describes net neutrality as the most boring important issue. More than that, it’s a complex idea that can be difficult to understand without a strong grasp of the architecture of the internet, which is not at all intuitive. An additional barrier to having a measured response is that most of the public discussions about net neutrality conflate it with negotiations over peering agreements (more on that later) and ultimately rest in contracts with unknown terms. The hyperbole surrounding net neutrality may be useful in riling up public sentiment, but the truth seems far more subtle. I want to approach a definition and an understanding of the issues surrounding net neutrality, but this post will only scratch the surface. Despite the technical and legal complexities, this is something worth understanding, since as academic librarians our daily lives and work revolve around internet access for us and for our students.
The Communications Act of 1934 (PDF) created the FCC to regulate wire and radio communication. This classified phone companies and similar services as “common carriers”, which means that they are open to all equally. If internet service providers are classified in the same way, this ensures equal access, but for various reasons they are not considered common carriers, which was affirmed by the Supreme Court in 2005. The FCC is now seeking to use section 706 of the 1996 Telecommunications Act (PDF) to regulate internet service providers. Section 706 gave the FCC regulatory authority to expand broadband access, particularly to elementary and high schools, and this piece of it is included in the current rulemaking process.
The legal part of this is confusing to everyone, not least the FCC. We’ll return to that later. But for now, let’s turn our attention to the technical part of net neutrality, starting with one of the most visible spats.
A Tour Through the Internet
I am a Comcast customer for my home internet. Let’s say I want to watch Netflix. How do I get there from my home computer? First comes the traceroute that shows how the request from my computer travels over the physical lines that make up the internet.
Tracing route to netflix.com 1
over a maximum of 30 hops:

  1     1 ms   <1 ms   <1 ms  10.0.1.1
  2    24 ms   30 ms   37 ms  184.108.40.206
  3    43 ms   40 ms   29 ms  te-0-4-0-17-sur04.chicago302.il.chicago.comcast.
  4    20 ms   32 ms   36 ms  te-2-6-0-11-ar01.area4.il.chicago.comcast.net [6
  5    33 ms   30 ms   37 ms  he-3-14-0-0-cr01.350ecermak.il.ibone.comcast.net
  6    27 ms   34 ms   30 ms  pos-1-4-0-0-pe01.350ecermak.il.ibone.comcast.net
  7    30 ms   41 ms   54 ms  chp-edge-01.inet.qwest.net 5
  8     *       *       *     Request timed out.
  9    73 ms   69 ms   69 ms  220.127.116.11
 10    65 ms   77 ms   96 ms  te1-8.csrt-agg01.prod1.netflix.com 6
 11    80 ms   81 ms   74 ms  www.netflix.com 1
Hops 7-9. Now the request leaves Comcast and goes out to a Tier 1 internet provider, which owns cables that cross the country. In this case, the cables belong to CenturyLink (which recently purchased Qwest).
Why should Comcast ask Netflix to pay to transmit their data over Comcast’s networks? Understanding this requires a few additional concepts.
Peering is an important concept in the structure of the internet. Peering is a physical, hardware-to-hardware link between networks in internet exchanges, which are huge buildings filled with routers connected to each other.8 Companies and internet service providers can use internet exchange centers to plug their equipment together directly, and so make their connections faster and more reliable. For websites such as Facebook, which have an enormous amount of upload and download traffic, it’s well worth the effort for a small internet service provider to peer with Facebook.9 Facebook Peering is an example of a very open peering policy.
Peering relies on some equality of traffic, as the name implies. The various tiers of internet service providers you may have heard of are based on with whom they “peer”. Tier 1 ISPs are large enough that they all peer with each other, and thus form what is usually called the backbone of the internet.
Academic institutions created the internet originally–computer science departments at major universities literally had the switches in their buildings. In the US this was ARPANET, but a variety of networks at academic institutions existed throughout the world. Groups such as Internet2 allow educational, research, and government networks to connect and peer with each other and commercial entities (including Facebook, if the traceroute from my workstation is any indication). Smaller or isolated institutions may rely on a consumer ISP, and what bandwidth is available to them may be limited by geography.
The Last Mile
Consumers, by contrast, are really at the mercy of whatever company dominates in their neighborhoods. Consumers obviously do not have the resources to lay their own fiber optic cables directly to all the websites they use most frequently. They rely on an internet service provider to do the heavy lifting, just as most of us rely on utility companies to get electricity, water, and sewage service (though of course it’s quite possible to live off the grid to a certain extent on all those services depending on where you live). We also don’t build our own roads, and we expect that certain spaces are open for traveling through by anyone. This idea of roads open for all to get from the wider world to arterial streets to local neighborhoods is thus used as an analogy for the internet–if internet service providers (like phone companies) must be common carriers, this ensures the middle and last miles aren’t jammed.
When Peering Goes Bad
Think about how peering works: it requires a roughly equal amount of traffic being sent and received through the peered networks, or at least an amount of traffic both parties can agree to. This is the problem with Netflix. Unlike big companies such as Facebook, and especially Google, Netflix is not trying to build its own network. It relies on content delivery services and internet backbone providers to get content from its servers (all hosted on Amazon Web Services) to consumers. But Netflix only sends traffic; it doesn’t take traffic, and this asymmetry is the basis of most of the legal battles going on with the internet service providers that serve the “last mile”.
Netflix tried various arrangements, but ultimately negotiated with Comcast to pay for direct access to their last mile networks through internet exchanges, one of which is illustrated above in steps 4-6. This seems to be the most reasonable course of action for Netflix to get their outbound content over networks, since they really don’t have the ability to do settlement-free peering. Of course, Reed Hastings, the CEO of Netflix, didn’t see it that way. But for most cases, settlement-free peering is still the only way the internet can actually work, and while we may not see the agreements that make this happen, it won’t be going anywhere. In this case, Comcast was not offering Netflix paid prioritization of its content, it was negotiating for delivery of the content at all. This might seem equally wrong, but someone has to pay for the bandwidth, and why shouldn’t Netflix pay for it?
What Should We Do?
If companies want to connect with each other or build their own network connections, they can do so under whatever terms work best for them. The problem would be if certain companies used the same lines as everyone else but their packets got preferential treatment. The imperfect road analogy works well enough for these purposes. When a firetruck, police car, and ambulance race through traffic with sirens blaring, we are usually ok with the resulting traffic jam, since we can see that an emergency requires that speed. But how do we feel when we suspect a single police car has turned on its siren just to cut in line and get to lunch faster? Or a funeral procession blocks traffic? Or an elected official has a motorcade? Or a block party? These situations are regulated by government authorities, but we may or may not like that these uses of public ways are allowed to slow down our own travel. Going further, it is clearly illegal for a private company to block a public road and charge a high rate for faster travel, but imagine if no governmental agency had the power to regulate this. The FCC is attempting to make sure it has those regulatory powers.
That said, it doesn’t seem like anyone is actually planning to offer paid prioritization. Even Comcast claims “no company has had a stronger commitment to openness of the Internet…” and that it has no plans to offer such a service. I find it unlikely that we will face a situation that Barbara Stripling describes as “prioritizing Mickey Mouse and Jennifer Lawrence over William Shakespeare and Teddy Roosevelt.”
I certainly won’t advocate against treating ISPs as common carriers; my impression is that this is what the 1996 Telecommunications Act was trying to get at, though the legal issues are confounding. However, a larger problem facing libraries (not so much large academic libraries as smaller academic and public libraries) is the digital divide. If there’s no fiber optic line to a town, there isn’t going to be broadband access, and an internet service provider has no business incentive to build a line to a small town that may not generate much revenue. I think we need to remain vigilant about ensuring that everyone has access to the internet at all, let alone at a fast speed, and not get too sidetracked by theoretical future malfeasance by internet service providers. These points are included in the FCC’s proposal, but they are not receiving most of the attention, despite the fact that the FCC is given explicit regulatory authority to address them.
Yes, that was the controversial Summit much talked about on Twitter under the #libfuturesummit hashtag. This Summit and other similarly themed summits held close together in time – “The Future of Libraries Survival Summit” hosted by Information Today Inc. and “The Future of Libraries: Do We Have Five Years to Live?” hosted by Ken Haycock Associates Inc. and Dysart & Jones Associates – seemed to bring out the sentiment that Andy Woodworth aptly named ‘Library Future Fatigue.’ It was an impressive experience to see how actively librarians – both ALA members and non-members – provided real-time comments and feedback about these summits while I attended one of them in person. ALA is lucky to have such engaged members and librarians to work with.
A few days ago, ALA released the official Summit report.1 The report captured all the talks and many table discussions in great detail. In this post, I will focus on some of my thoughts and take-aways prompted by the talks and the table discussion at the Summit.
A. The Draw
Here is an interesting fact. The invitation to this Summit sat in my inbox for over a month because, from the subject line, I thought it was just another advertisement for a fee-based webinar or workshop. It was only after I got a second email from the ALA office asking about the first that I realized it was something different.
Three things drew me to this Summit: (a) I had never been at a formal event organized solely for discussion about the future of libraries, (b) the event was to include a good number of people from outside libraries, and (c) the overall size of the Summit would be kept relatively small.
For those curious, the Summit had 51 attendees plus 6 speakers and a dozen discussion table facilitators, all of whom fit into the Members’ Room in the Library of Congress. Of those 51 attendees, 9 were from the non-library sector, including the Knight Foundation, PBS, Rosen Publishing, and the Aspen Institute. Another 33 ranged from academic, public, school, federal, and corporate librarians to library consultants, museum and archive folks, an LIS professor, and library vendors. And then there were 3 ALA presidents (current, past, and president-elect) and 6 officers from ALA. You can see the list of participants here.
B. Two Words (or Phrases)
At the beginning of the Summit, the participants were asked to come up with two words or short phrases that capture what they think about libraries “from now on.” We wrote these on ribbons and put them right under our name tags. Then we were encouraged to keep or change them as we moved through the Summit.
Other phrases and words I saw from other participants included “From infrastructure to engagement,” “Sanctuary for learning,” “Universally accessible,” “Nimble and Flexible,” “From Missionary to Mercenary,” “Ideas into Action,” and “Here, Now.” The official report also lists some of the words that were most used by participants. If you choose your two words or phrases that capture what you think about libraries “from now on,” what would those be?
C. The Set-up
The Summit organizers filled the room with multiple round tables, and for the first day’s morning and afternoon sessions and the second day’s morning session, participants sat at the table matching the number assigned on the back of their name badges. This was a good method that let participants have discussions with different groups of people throughout the Summit.
As the Summit agenda shows, the Summit program started with a talk by a speaker. After that, participants were asked to personally reflect on the talk and then have a table discussion. This discussion was captured on the large poster-size papers by facilitators and collected by the event organizers. The papers on which we were asked to write our personal reflections were also collected in the same way along with all our ribbons on which we wrote those two words or phrases. These were probably used to produce the official Summit report.
One thing I liked about the set-up was that every participant sat at a round table, including the speakers and all three ALA presidents (past, current, and president-elect). Throughout the Summit, I had a chance to talk to Lorcan Dempsey from OCLC, Corinne Hill, the director of Chattanooga Public Library, Courtney Young, the ALA president-elect, and Thomas Frey, a well-known futurist at the DaVinci Institute, which was neat.
Also, what struck me most during the Summit was that those from outside the library world took the guiding questions and the ensuing discussion much more seriously than those of us inside it. Maybe we librarians are indeed suffering from ‘library future fatigue.’ And/or maybe outsiders have more trust in libraries as institutions than we librarians do because they are less familiar with our daily struggles and challenges in library operations. Either way, the Summit seemed to give them an opportunity to seriously consider the future of libraries. The desired impact is more policymakers, thought leaders, and industry leaders who are well informed about today’s libraries and who will articulate, support, and promote the significant work libraries do, to the benefit of society, in their own areas.
D. Talks, Table Discussion, and Some of My Thoughts and Take-aways
These were the talks given during the two days of the Summit:
“How to Think Like a Freak” – Stephen Dubner, Journalist
“What Are Libraries Good For?” – Joel Garreau, Journalist
“Education in the Future: Anywhere, Anytime” – Dr. Renu Khator, Chancellor and President at the University of Houston
“From an Internet of Things to a Library of Things” – Thomas Frey, Futurist
A Table Discussion of Choice:
Open – group decides the topic to discuss
Empowering individuals and families
Promoting literacy, particularly in children and youth
Building communities the library serves
Protecting and empowering access to information
Advancing research and scholarship at all levels
Preserving and/or creating cultural heritage
Supporting economic development and good government
“What Happened at the Summit?” – Joan Frye Williams, Library consultant
(0) Official Report, Liveblogging Posts, and Tweets
The most fascinating story in Dubner’s talk was that of Takeru “Kobi” Kobayashi, the hot dog eating contest champion from Japan. The secret of his success, said Dubner, was rethinking accepted but unchallenged artificial limits and redefining the problem. Kobayashi redefined the problem from ‘How can I eat more hot dogs?’ to ‘How can I eat one hot dog faster?’ and then removed artificial limits – widely accepted but unchallenged conventions – such as holding a hot dog in your hand and eating it from top to bottom. He experimented with breaking the hot dog into two pieces to feed himself faster with both hands. He further refined his technique by eating the frankfurter and the bun separately to make the eating even speedier.
So where can libraries apply this lesson? One example I can think of is the problem of low attendance at some library programs. What if we asked what barriers we can remove instead of asking what kind of program would draw more people? Chattanooga Public Library did exactly this. Recently, they targeted parents who would want to attend the library’s author talk and created an event that specifically addressed the child care issue. The library scheduled an evening story time for kids and fun activities for tweens and teens at the same time as the author talk. Then they invited parents to come to the library with their children, have the children participate in the library’s children’s programs, and enjoy the author talk without worrying about child care.
Another library service I came to learn about at my table was the Zip Books service at the Yolo County Library in California. What if libraries asked what the fastest way would be to deliver a book the library doesn’t own to a patron’s door, instead of asking how quickly the cataloging department can catalog a newly acquired book and get it ready for circulation? The Yolo County Library’s Zip Books service came from that kind of redefinition of the problem. When a library user requests a book the library doesn’t have and the request meets certain requirements, the Yolo County Library purchases the book from a bookseller and has it shipped directly to the patron’s home without processing it. Cataloging and processing are done when the book is returned to the library after its first use.
(2) What Can Happen to Higher Education
My favorite talk during the Summit was by Dr. Khator, because she had deep insight into higher education and I have been working at university libraries for a long time. Her two most interesting observations were the possibility of (a) decoupling content development from content delivery and (b) decoupling teaching from credentialing in higher education.
The upside of (a) is that a wonderful class created by a world-class scholar could be taught by other instructors at places where the person who originally developed the class is not available. The downside of (a) is, of course, the possibility of it being used as a cookie-cutter lowest baseline for quality control in higher education – the University of Phoenix was mentioned as an example of this by one of the participants at my table – instead of college and university students being exposed to classes developed and taught by their institutions’ own faculty members.
I have to admit that (b) was a completely mind-blowing idea to me. Imagine colleges and universities with no credentialing authority. Your degree would no longer be tied to a particular institution to which you were admitted and from which you graduated. Just consider what this may entail if it is ever realized. If both (a) and (b) take place at the same time, the impact would be even more significant. What kind of role could an academic library play in such a scenario?
(3) Futurizing Libraries
Joel Garreau observed that nowadays what drives the need for a physical trip is, more and more, face-to-face contact rather than anything else. He then pointed out that as technology allows more people to telework, people are flocking to smaller cities where they can have more meaningful contact with their communities. If this is indeed the case, libraries that make their space a catalyst for face-to-face contact in a community will prosper. The last speaker, Thomas Frey, spoke mostly about the Internet of Things (IoT).
While I think IoT is an important trend to note, for sure, what I most liked about Frey’s talk was his statement that the vision of the future we have today will change the decisions we make (toward that future). After Garreau’s talk, I had a chance to ask him a question about his somewhat idealized vision of the future, in which people live and work in small but closely connected communities in a society that is highly technological and collaborative. He called this ‘human evolution.’
But the reality we see today is, in my opinion, not so idyllic.3 The current economy is highly volatile. It no longer offers job security, consistently reduces the number of jobs, and returns stagnant or decreasing incomes to those whose skills are not in high demand in the era of the digital revolution.4 As a result, today’s college students, who are preparing to become tomorrow’s knowledge workers, perceive their education and their lives after it quite differently than their parents did.5
Garreau’s answer to my question was that this concern of mine may be coming from a kind of techno-determinism. While this may be a fair critique, I felt that his portrayal of the human evolution may be just as techno-deterministic. (To be fair, he mentioned that he does not make predictions and this is one of the future scenarios he sees.)
Regarding the Internet of Things, the main topic of Frey’s talk, I think the real barrier to implementing IoT on a large scale will be privacy and the proper protection of the massive amount of data that will result from the very many sensors that make IoT possible. After his talk, I had a chance to chat with him briefly about this. (There was no Q&A because Frey’s talk went over its allotted time.) He mentioned the possibility of some kind of international gathering on the scale of the Geneva Conventions to address the issue. While the likelihood of that is hard to assess, the idea seemed appropriate to the problem in question.
(4) What If…?
Some of the shiny things shown in the talk, whose value for library users may appear dubious and distant, prompted Eli Neiburger of the Ann Arbor District Library to ask what useful services libraries could offer now to provide the public with significant benefit. He wondered, for example, what it would be like if many libraries ran Tor exit nodes to help protect the privacy and anonymity of web traffic.
Just pause a minute and imagine the impact such a project by libraries could have on the privacy of the public. What if?
(5) Leadership and Sustainability
For the “Table Discussion of Choice” session, I opted for the “Open” table because I was curious about what other topics people were interested in. Two discussions in this session were most memorable to me. One was the great advice I got from Corinne Hill about leading people. A while ago, I read her interview, in which she commented that “the staff are just getting comfortable with making decisions.” As a relatively new manager, I have also found empowering my team members to be more autonomous decision makers a challenge. Corinne cautioned in particular that leaders should be careful not to be over-critical when staff take an initiative but make a bad decision. Being over-critical in that case can discourage staff from trying to make their own decisions in their areas of expertise, she said. Hearing how she relies on the different types of strengths in her staff to move her library in the direction of innovation was also illuminating. (Lorcan Dempsey, who was also at our table, mentioned “Birkman Quadrants,” a set of useful theoretical constructs, in relation to Corinne’s description. He also brought up the term ‘normcore’ at another session. I forget the exact context, but the term was interesting enough that I wrote it down.) We also talked for a while about current LIS education and how it is not sufficiently aligned with the skills needed in everyday library operations.
The other interesting discussion started with a question from Amy Garmer of the Aspen Institute about the sustainability of future libraries. (She has been working on a library-related project with various policy makers, and PLA has a program related to this project at the upcoming 2014 ALA Annual Conference, if you are interested.) One thought that always comes to my mind when I think about the future of libraries is that while in the past the difference between small and large libraries was mostly quantitative, in terms of how many books and other resources were available, in the present and future the difference is, and will be, more qualitative. What the New York Public Library offers its patrons (a whole suite of digital library products from NYPL Labs, for example) cannot be easily replicated by a small rural library. Needless to say, this has significant implications for the core mission of the library: equalizing the public’s access to information and knowledge. What can we do to close that gap? Or will different types of libraries have different strategies for the future, as Lorcan Dempsey asked at our table discussion? These two approaches are not incompatible and could be worked out at the same time.
(6) Nimble and Media-Savvy
In her Summit summary, Joan Frye Williams, who moved around to observe discussions at all the tables during the Summit, mentioned that one theme that surfaced was thinking about a library as a developing enterprise rather than a stable organization. This means that the modus operandi of a library should become more nimble and flexible to keep pace with the change its community goes through.
Another thread of discussion among the Summit participants was that not all library supporters have to be active users of library services. As long as those supporters know that the presence and services of libraries make their communities strong, libraries are in a good place. Libraries often make the mistake of trying to reach all of their potential patrons to convert them into active library users. While this is admirable, it is not always practical or beneficial to library operations. More needed and useful is a well-managed, strategic media relations effort that effectively publicizes the library’s services and programs and their benefits and impact to its community. (On a related note, one journalist at the Summit mentioned noticing that recent coverage of libraries has changed direction from “Are libraries going to be extinct?” to “No, libraries are not going to be extinct. And did you know libraries offer way more than books, such as…?”, which is fantastic.)
E. What Now? Library Futurizing vs. Library Grounding
What all the discussion at the Summit reminded me of was that, ultimately, the time and effort we spend trying to foresee what the future holds and raising concerns about it may be better directed at refining a positive vision of the desirable future for libraries and taking well-calculated, decisive actions toward realizing that vision.
Technology is just a tool. It can be used to free people to engage in more meaningful work and creative pursuits. Or it can be used to generate a large number of unemployed people, who must struggle to make ends meet and to retool themselves with the fast-changing skills the labor market demands, alongside those in the top 1 or 0.1% of very rich people. And we have the power to influence and determine which path we should and will be on by what we do now.
Certainly, there are trends we need to heed. For example, the shift of the economy toward a bigger role for entrepreneurship than ever before requires more education and support for entrepreneurship for students at universities and colleges. The growing tendency of businesses to look for potential employees based on their specific skill sets rather than their majors and grades has led universities and colleges to adopt digital badging systems (such as Purdue’s Passport) or other ways for their students to record and prove the job-related skills obtained during their study.
But when we talk about the future, many of us tend to assume that there are some inevitable trends that we either catch or miss and that those trends will determine what our future will be. We forget that what really determines our future is not trends but (i) what we intend to achieve in the future and (ii) the actions we take today to realize that intention. (Also, always reflect critically on whatever is trendy; you may be in for a surprise.7) The fact that people will no longer need to physically visit a library to check out books or access library resources does not automatically mean that the library of the future will cease to have a building. The question is whether we will let that be the case. Suppose we decide that we want the library to be, and stay, the vibrant hub for a community’s freedom of inquiry and right of access to human knowledge, no matter how much change takes place in the society around it. Realizing this vision IS within our power. We only reach the future by walking through the present.
For a short but well-written clear description of this phenomenon, see Brynjolfsson, Erik, and Andrew McAfee. Race against the Machine: How the Digital Revolution Is Accelerating Innovation, Driving Productivity, and Irreversibly Transforming Employment and the Economy. Lexington: Digital Frontier Press, 2012. ↩
Many of us have had conversations in the past few weeks about data collection because of the reports about the NSA’s PRISM program, but ever since April and the bombings at the Boston Marathon, there has been increased awareness of how much data is being collected about people in an attempt to track down suspects, or, increasingly, to stop potential terrorist events before they happen. A recent Nova episode about the manhunt for the Boston bombers showed one such example: the Domain Awareness System at the New York Police Department, which consists of live footage from almost every surveillance camera in New York City playing in one room, with the ability to search for features of individuals and even to detect people acting suspiciously. Add to that a demonstration of cutting-edge facial recognition software in development at Carnegie Mellon University, and reality seems to be moving ever closer to science fiction movies.
Librarians focused on technical projects love to collect data and make decisions based on it. We try hard to get data collection systems as close to real time as possible, and to collect and analyze as much data as we can. The idea of a series of cameras tracking exactly what our patrons are doing in the library in real time might seem very tempting. But as librarians, we value our patrons’ ability to access information with as much privacy as possible; like all professions, we treat the interactions we have with our patrons (just as we would clients, patients, congregants, or sources) with care and discretion (see Item 3 of the Code of Ethics of the American Library Association). I will not address the national conversation about privacy versus security in this post; I want to address the issue of data collection right where most of us live on a daily basis: inside analytics programs, spreadsheets, and server logs.
What kind of data do you collect?
Let’s start with an exercise. Write a list of all the statistical reports you are expected to provide your library–for most of us, it’s probably a very long list. Now, make a list of all the tools you use to collect the data for those statistics.
Here are a few potential examples:
Website visitors and user experience
Google Analytics or some other web analytics tool
Heat map tool
Electronic resource access reports
Electronic resources management application
Vendor reports (COUNTER and other)
Link resolver click-through report
Proxy server logs
How much is enough?
Think about what type of data you are collecting about your users with these tools. Some of it may be very private indeed. For instance, the heat map tool I’ve recently started using (Inspectlet) not only tracks clicks but actually records sessions as patrons use the website. This is fascinating information: we had, for instance, one session in which a patron opened the library website, clicked the Facebook icon on the page, and came back to the website nearly 7 hours later. It was fun to see that people really do visit the library’s Facebook page, but the question was immediately raised whether it was a visit from on campus. (It was–and it wouldn’t have taken long to figure out whether it was a staff machine and who was working that day and time.) IP addresses from off campus are very easy to track, sometimes down to the block–again, easy enough to tie to an individual. We collect IP addresses for abusive or spamming behavior and block users based on IP address all the time. But what about in this case? In the screen recordings I can see exactly what the user types in the search boxes for the catalog and discovery system. Luckily, Inspectlet allows you to obscure the last two octets of the IP address (which is legally required in some places), so you can collect less information. All similar tools should allow you the same ability.
Consider another case: proxy server logs. In the past when I did a lot of EZProxy troubleshooting, I found the logs extremely helpful in figuring out what went wrong when I got a report of trouble, particularly when it had occurred a day or two before. I could see the username, what time the user attempted to log in or succeeded in logging in, and which resources they accessed. Let’s say someone reported not being able to log in at midnight– I could check to see the failed logins at midnight, and then that username successfully logging in at 1:30 AM. That was a not infrequent occurrence, as usually people don’t think to write back and say they figured out what they did wrong! But I could also see everyone else’s logins and which articles they were reading, so I could tell (if I wanted) which grad students were keeping up with their readings or who was probably sharing their login with their friend or entire company. Where I currently work, we don’t keep the logs for more than a day, but I know a lot of people are out there holding on to EZProxy logs with the idea of doing “something” with them someday. Are you holding on to more than you really want to?
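A minimal sketch of that kind of log troubleshooting, assuming a simplified, hypothetical log format (real EZProxy logs are configurable and will look different):

```python
import re
from datetime import datetime

# Hypothetical, simplified log lines: "[timestamp] username STATUS"
LINE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<user>\S+)\s+(?P<status>LOGIN-OK|LOGIN-FAIL)")

def login_events(lines, username):
    """Yield (timestamp, status) for one user's login attempts."""
    for line in lines:
        m = LINE.match(line)
        if m and m.group("user") == username:
            yield datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S"), m.group("status")

log = [
    "[01/Oct/2012:00:02:11] jsmith LOGIN-FAIL",
    "[01/Oct/2012:01:30:45] jsmith LOGIN-OK",
    "[01/Oct/2012:01:31:02] adoe LOGIN-OK",
]
for ts, status in login_events(log, "jsmith"):
    print(ts.time(), status)
```

The same few lines that make midnight troubleshooting easy are exactly what makes retained logs a privacy liability: swap the username filter for a resource filter and you have a reading history.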
Let’s continue our exercise. Go through your list of tools, and make a list of all the potentially personally identifying information the tool collects, whether or not you use them. Are you surprised by anything? Make a plan to obscure unused pieces of data on a regular basis if it can’t be done automatically. Consider also what you can reasonably do with the data in your current job requirements, rather than future study possibilities. If you do think the data will be useful for a future study, make sure you are saving anonymized data sets unless it is absolutely necessary to have personally identifying information. In the latter case, you should clear your study in advance with your Institutional Review Board and follow a data management plan.
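For the anonymized data sets mentioned above, one common technique (sketched here with hypothetical field names) is to replace each username with a salted one-way hash, so rows can still be grouped per user without identifying anyone:

```python
import hashlib
import secrets

# Generate the salt once per export and then discard it; without it,
# the hashes cannot be reversed or re-linked to real usernames.
SALT = secrets.token_hex(16)

def pseudonymize(username: str) -> str:
    """Replace a username with a salted one-way hash."""
    return hashlib.sha256((SALT + username).encode()).hexdigest()[:12]

rows = [{"user": "jsmith", "db": "JSTOR"}, {"user": "jsmith", "db": "PubMed"}]
anon = [{"user": pseudonymize(r["user"]), "db": r["db"]} for r in rows]
```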
A privacy and data management policy should include at least these items:
A statement about what data you are collecting and why.
Where the data is stored and who has access to it.
A retention timeline.
What we can do with data
In all this I don’t mean to imply that we shouldn’t be collecting data at all. In both the examples I gave above, the data is extremely useful in improving the patron experience, even though it gives identifying details away. Not collecting data has trade-offs too. For years, libraries have not retained patrons’ borrowing records in order to protect their privacy. But now patrons who want an online record of what they’ve borrowed from the library must use third-party services with (most likely) much less stringent privacy policies than libraries have. By not keeping records of what users have checked out or read through databases, we are unable to offer them personalized, automated suggestions about what to read next. Anyone who uses Amazon regularly knows it will try to tempt you into purchases based on your past purchases or books whose previews you were reading–even if you would rather no one knew you were reading that book, and certainly don’t want suggestions based on it popping up while you are doing a collection development project at work, logged in on your personal account. In all the decisions we make about collecting or not collecting data, we have to weigh trade-offs like these. Is the service so important that the benefits of collecting the data outweigh the risks? Or is there another way to provide the service?
We can see some examples of this trade-off in two similar projects coming out of Harvard Library Labs. One, Library Hose, was a Twitter stream with the name of every book being checked out. The service ran for part of 2010 and has been suspended since September of that year. Besides bumping up against daily tweet limits, it was also a potential privacy violation–even if it was a fun idea (this blog post has some discussion about it). A newer project takes the opposite approach: books that a patron thinks are “awesome” can be returned to the Awesome Box at the circulation desk, and information about each book is collected on the Awesome Box website. This is a great tweak to the earlier project, since it advertises material that is now available rather than checked out, and people have to opt in by putting the item in the box.
In terms of personal recommendations, librarians have the advantage of being able to form close working relationships with faculty and students so they can make personal recommendations based on their knowledge of the person’s work and interests. But how to automate this without borrowing records? One example is a project that Ian Chan at California State University San Marcos has done to use student enrollment data to personalize the website based on a student’s field of study. (Slides). This provides a great deal of value for the students, who need to log in to check their course reserves and access articles from off campus anyway. This adds on top of that basic need a list of recommended resources for students, which they can choose to star as favorites.
Work to educate your patrons about privacy, particularly online privacy. ALA has a Choose Privacy Week, which is always the first week in May. The site for that event has a number of resources you might want to consult in planning programming. Academic librarians may find it easiest to approach college students in terms of their social media presence and future job hunting, but this is just an opening to larger conversations about data. When you ask patrons to use a third-party service (such as a social network) or recommend a service (such as a book recommendation site), make sure they are aware of what information they are sharing.
We all know that Google’s slogan is “Don’t be evil”, but it’s not always clear if they are sticking to that. Make sure that you are not being evil in your own data collection.
Cultivating Change in the Academy: 50+ Stories from the Digital Frontlines
This is a review of the ebook Cultivating Change in the Academy: 50+ Stories from the Digital Frontlines and also of the larger project that collected the stories that became the content of the ebook. The project collects discussions about how technology can be used to improve student success. Fifty practical examples of successful projects are the result. Academic librarians will find the book to be a highly useful addition to our reference or professional development collections. The stories collected in the ebook are valuable examples of innovative pedagogy and administration and are useful resources to librarians and faculty looking for technological innovations in the classroom. Even more valuable than the collected examples may be the model used to collect and publish them. Cultivating Change, especially in its introduction and epilogue, offers a model for getting like minds together on our campuses and sharing experiences from a diversity of campus perspectives. The results of interdisciplinary cooperation around technology and success make for interesting reading, but we can also follow their model to create our own interdisciplinary collaborations at home on our campuses. More details about the ongoing project are available on their community site. The ebook is available as a blog with comments and also as an .epub, .mobi, or .pdf file from the University of Minnesota Digital Conservancy.
Cultivating Change in the Academy: 50+ Stories from the Digital Frontlines 1
The stories that make up the ebook have been peer reviewed and organized into chapters on the following topics: Changing Pedagogies (teaching using the affordances of today’s technology), Creating Solutions (technology applied to specific problems), Providing Direction (technology applied to leadership and administration), and Extending Reach (technology employed to reach expanded audiences). The stories follow a semi-standard format that clearly lays out each project, including the problem addressed, methodology, results, and conclusions.
Section One: Changing Pedagogies
The opening chapter focuses on applications of academic technology in the classroom, specifically efforts to move instruction from memorization to problem solving and interactive coaching. These efforts are often described by the term “digital pedagogy” (for an explanation of digital pedagogy, see Brian Croxall’s elegant definition 2). I’m often critical of digital pedagogy efforts because they can confuse priorities and focus on the digital at the expense of the pedagogy. The stories in this section do not make this mistake and correctly focus on harnessing the affordances of technology (the things we can do now that were not previously possible) to achieve student success and foster learning.
One particularly impressive story, Web-Based Problem-Solving Coaches for Physics Students, explained how a physics course used digital tools to give students more detailed feedback on their work using the cognitive apprenticeship model. This solution encouraged the development of problem-solving skills and has the potential to scale better than classical lecture/lab course structures.
Section Two: Creating Solutions
This section focuses on using digital technology to present content to students outside of the classroom. Technology is extending the reach of the university beyond the limits of our campus spaces, and this section addresses how innovations can make distance education more effective. A common theme here is the concept of the flipped classroom. (See Salman Khan’s TED talk for a good description of flipping the classroom. 3) In a flipped classroom, the traditional structure–content presented to students in lectures during class time and creative work assigned as homework–is reversed: content is presented outside the classroom, and instructors lead students in creative projects during class time. Solutions listed in this section include podcasts, video podcasts, and screencasts. They also address synchronous and asynchronous methods of distance education and some theoretical approaches instructors can employ as they transition from primarily face-to-face instruction to more blended instruction environments.
Of special note is the story Creating Productive Presence: A Narrative, in which the instructor assesses the steps taken to provide a distance cohort with the appropriate levels of instructor intervention and student freedom. In face-to-face instruction, students can read the instructor’s body language and other non-verbal cues. Distance students, without these familiar cues, experienced anxiety in a text-only communication environment. Using delegates from student group projects and focus groups, the instructor was able to find an appropriate classroom presence, balanced between cold distance and micro-management of the group projects.
Section Three: Providing Direction
The focus of this section is on innovative new tools for administration and leadership and how administration can provide leadership and support for the embrace of disruptive technologies on campus. The stories here tie the overall effort to use technology to advance student success to accreditation, often a necessary step to motivate any campus to make uncomfortable changes. Data archives, the institutional repository, clickers (class polling systems), and project management tools fall under this general category.
Section Four: Extending Reach
The final section discusses ways technology can enable the university to reach wider audiences. Examples include moving courseware content to mobile platforms, using SMS messaging to gather research data, and using mobile devices to scale the collection of oral histories. Digital objects scale in ways that physical objects cannot, and these projects take advantage of this scale to expand the reach of the university.
The stories and practical experiences recorded in Cultivating Change in the Academy are valuable in their own right. The book is a great resource of ideas and shared experience for anyone looking for creative ways to leverage technology to achieve educational goals. For this reader, though, the real value of the project is the format used to create it. The book is full of valuable and interesting content. However, in the digital world, content isn’t king. As Cory Doctorow tells us:
Content isn’t king. If I sent you to a desert island and gave you the choice of taking your friends or your movies, you’d choose your friends — if you chose the movies, we’d call you a sociopath. Conversation is king. Content is just something to talk about.4
The process the University of Minnesota followed to generate conversation around technology and student success is detailed in a white paper. 5 After reading some of the stories in Cultivating Change, if you find yourself wishing similar conversations could take place on your campus, this is the road map the University of Minnesota followed. Before they could publish their stories, the University of Minnesota had to bring together faculty, staff, and administration to talk about employing innovative technological solutions to the project of increasing student success. In a time when conversation trumps content, a successful model for creating these kinds of conversations on our own campuses will also trump the written record of others’ conversations.
Mozilla and the National Science Foundation are sponsoring an open round of submissions for developers and app designers to create fiber-based gigabit apps. The detailed contest information is available over at Mozilla Ignite (https://mozillaignite.org/about/). Cash prizes to fund promising start-up ideas are being awarded, totaling $500,000 over three rounds of submissions. Note: this is just the start, and these are seed projects to garner interest and momentum in the area. A recent hackathon in Chattanooga lists out what some coders are envisioning for this space: http://colab.is/2012/hackers-think-big-at-hackanooga/
If you’re still puzzled after the video, you are not alone. One of the reasons for the contest is that network designers are not quite sure what immense levels of processing in the network and next generation transfer speeds will really mean.
Consider that best-case transfer speeds on a home network are somewhere along the lines of 10 megabits per second. There are of course variances in this speed across your home line (it may hover closer to 5 Mbps), but this is pretty much the standard that average subscribers can expect. A gigabit connection transfers data at 100 times that speed: 1,000 megabits per second. When a whole community is able to achieve 1,000 megabits upstream and downstream, you basically have no need for things like “streaming” video–the data pipes are that massive.
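The difference is easy to work out. A back-of-the-envelope sketch, using decimal units and ignoring protocol overhead (the 4 GB file size is just an assumed example):

```python
def transfer_seconds(size_gigabytes: float, speed_megabits_per_sec: float) -> float:
    """Idealized transfer time: gigabytes converted to megabits, divided by link speed."""
    megabits = size_gigabytes * 8 * 1000  # 1 GB = 8,000 megabits (decimal units)
    return megabits / speed_megabits_per_sec

movie_gb = 4  # assumed size of a DVD-quality film
print(transfer_seconds(movie_gb, 10))    # 3200.0 seconds (~53 minutes) at 10 Mbps
print(transfer_seconds(movie_gb, 1000))  # 32.0 seconds at gigabit speed
```

At 32 seconds for a full-length film, “downloading” and “streaming” stop being meaningfully different activities.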
One theory is that gigabit apps could provide public benefit, solve societal issues, and usher in the next generation of the Internet. Think of gigabit speed as the difference between getting water (Internet) through a straw and getting water (Internet) through a fire hose. The practical aim of this contest is to seed startups with ideas that will in some way impact healthcare (real-time health monitoring), the environment, and energy challenges. The local Champaign-Urbana municipal gigabit fiber cause is a noble one, as it will provide those in areas without access to broadband an awesome pipeline to the Internet. It is an intergovernmental partnership that aims to serve municipal needs, as well as pave the way for research and industry start-ups.
Here are some attributes that Mozilla Ignite Challenge lists as the possible affordances of fiber based gigabit speed apps:
As I read about the Mozilla Ignite open challenge, I wondered about the possibilities for libraries and as a thought experiment I list out here some ideas for library services that live on gigabit speed networks:
* Consider the video data you could provide access to–in libraries that are stewarding any kind of video, gigabit speeds would allow you to provide in-library viewing with few bottlenecks. Imagine a fiber-based gigabit-speed video app making all library video content available at once: every video in your collection playing simultaneously, to multiple clusters (grid videos) at multiple stations in the library. Without streaming.
* Consider sensors, sensor arrays, and fiber. One idea often promulgated for fiber-based gigabit networks is the ability to monitor large amounts of data in real time. Sensor networks installed around the library facility could help to throttle energy consumption in real time, making the building more energy efficient and less costly to maintain. Such a facilities-based app would mean savings in the facilities budget.
* Consider collaborations among libraries with fiber affordances. Libraries linked by fiber-based gigabit connections would be able to transfer large amounts of data in a fraction of the time it takes now. There are implications here for data curation infrastructure (http://smartech.gatech.edu/handle/1853/28513).
Another way to approach this problem is by asking: “What problem does gigabit speed networking solve?” One of the problems with the current web is the de facto latency. Your browser needs to request a page from a server which is then sent back to your client. We’ve gotten so accustomed to this latency that we expect it, but what if pages didn’t have to be requested? What if servers didn’t have to send pages? What if a zero latency web meant we needed a new architecture to take advantage of data possibilities?
Is your library poised to take advantage of increased data transfer? What apps do you want to get funding for?
Open access publication makes access to research free for the end reader, but in many fields it is not free for the author of the article. When I told a friend in a scientific field I was working on this article, he replied “Open access is something you can only do if you have a grant.” PeerJ, a scholarly publishing venture that started up over the summer, aims to change this and make open access publication much easier for everyone involved.
While the first publication isn’t expected until December, in this post I want to examine in greater detail the variation on the “gold” open-access business model that PeerJ states will make it financially viable 1, and the open peer review that will drive it. Both of these models are still very new in the world of scholarly publishing, and require new mindsets for everyone involved. Because PeerJ comes out of funding and leadership from Silicon Valley, it can more easily break from traditional scholarly publishing and experiment with innovative practices. 2
PeerJ is a platform that will host a scholarly journal called PeerJ and a pre-print server (similar to arXiv), both publishing biological and medical scientific research. Its founders are Peter Binfield (formerly of PLoS ONE) and Jason Hoyt (formerly of Mendeley), both of whom are familiar with disruptive models in academic publishing. The “J” in the title stands for Journal, but Jason Hoyt explains on the PeerJ blog that while the journal as such is no longer a necessary model for publication, we still hold on to it: “The journal is dead, but it’s nice to hold on to it for a little while.” 3 The project launched in June of this year, and while no major updates have been posted yet on the PeerJ website, they seem to be moving toward their goal of publishing in late 2012.
To submit a paper for consideration in PeerJ, authors must buy a “lifetime membership” starting at $99. (You can submit a paper without paying, but it costs more in the end to publish it). This would allow the author to publish one paper in the journal a year. The lifetime membership is only valid as long as you meet certain participation requirements, which at minimum is reviewing at least one article a year. Reviewing in this case can mean as little as posting a comment to a published article. Without that, the author might have to pay the $99 fee again (though as yet it is of course unclear how strictly PeerJ will enforce this rule). The idea behind this is to “incentivize” community participation, a practice that has met with limited success in other arenas. Each author on a paper, up to 12 authors, must pay the fee before the article can be published. The Scholarly Kitchen blog did some math and determined that for most lab setups, publication fees would come to about $1,124 4, which is equivalent to other similar open access journals. Of course, some of those researchers wouldn’t have to pay the fee again; for others, it might have to be paid again if they are unable to review other articles.
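The worst-case arithmetic is simple enough to sketch, assuming (hypothetically) that every author buys only the basic $99 membership; actual totals depend on which membership tiers authors choose, which is why the Scholarly Kitchen estimate differs slightly:

```python
BASIC_MEMBERSHIP = 99  # lowest-tier lifetime membership fee in dollars

def upfront_cost(num_authors: int) -> int:
    """First-paper cost if every author buys the basic membership;
    only the first 12 authors on a paper must be paying members."""
    return min(num_authors, 12) * BASIC_MEMBERSHIP

print(upfront_cost(12))  # 1188 -- in the neighborhood of the ~$1,124 estimate above
print(upfront_cost(3))   # 297 -- and $0 for later papers if members keep reviewing
```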
Peer Review: Should it be open?
PeerJ, as the name and the lifetime membership model imply, will certainly be peer-reviewed. But, keeping with its innovative practices, it will use open peer review, a relatively new model. Peter Binfield explained in this interview PeerJ’s thinking behind open peer review.
…we believe in open peer review. That means, first, reviewer names are revealed to authors, and second, that the history of the peer review process is made public upon publication. However, we are also aware that this is a new concept. Therefore, we are initially going to encourage, but not require, open peer review. Specifically, we will be adopting a policy similar to The EMBO Journal: reviewers will be permitted to reveal their identities to authors, and authors will be given the choice of placing the peer review and revision history online when they are published. In the case of EMBO, the uptake by authors for this latter aspect has been greater than 90%, so we expect it to be well received. 5
In single-blind peer review, the reviewers know the names of the author(s) of the article, but the author does not know who reviewed it. The reviewers can write whatever sorts of comments they want without the author being able to communicate with them. For obvious reasons, this lends itself to abuse: reviewers might reject articles by people they do not know or like, or tend to accept articles from people they do like. 6 Even people who are trying to be fair can accidentally fall prey to bias when they know the names of the submitters.
Double-blind peer review in theory takes away the ability of reviewers to abuse the system. A link that has been passed around library conference planning circles in the past few weeks is JSConf EU 2012, which managed to improve its ratio of female presenters by going to a double-blind system. Double blind is the gold standard of peer review for many scholarly journals. Of course, it is not a perfect system either. It can be hard to obscure the identity of a researcher in a small field in which everyone is working on unique topics. It is also a much lengthier process, with more steps involved in the review. To this end, it is less than ideal for breaking medical or technology research that needs to be made public as soon as possible.
In open peer review, the reviewers and the authors are known to each other. By allowing direct communication between reviewer and researcher, this speeds up the process of revisions and allows for greater clarity 7. While open peer review doesn’t appear to affect the quality of the reviews or the articles negatively, it does make it more difficult to find qualified reviewers to participate, and it might make a less well-known researcher more likely to accept the work of a senior colleague or well-known lab. 8
Given the experience of JSConf and a great deal of anecdotal evidence from women in technical fields, it seems likely that open peer review is open to the same potential abuse as single-blind peer review. While open peer review might allow a rejected author to challenge unfair rejections, this requires that the rejected author feel empowered enough in that community to speak up. Junior scholars who know they have been rejected by senior colleagues may not want to cause a scene that could affect future employment or publication opportunities. On the other hand, if they can get useful feedback directly from respected senior colleagues, that could make all the difference in crafting a stronger article and going forward with a research agenda. Therein lies the dilemma of open peer review.
Who pays for open access?
A related problem for junior scholars exists in open access funding models, at least in STEM publishing. As open access stands now, there are a few different models that are still being fleshed out. Green open access is free to the author and free to the reader: the author self-archives a version of the article, usually in a repository supported by grants, institutions, or scholarly societies. Gold open access is free to the end reader but carries a publication fee charged to the author(s).
This situation is very confusing for researchers, since when they are confronted with a gold open access journal they will have to be sure the journal is legitimate (Jeffrey Beall has a list of Predatory Open Access journals to aid in this) as well as secure funding for publication. While there are many schemes in place for paying publication fees, there are no well-defined practices in place that illustrate long-term viability. Often this is accomplished by grants for the research, but not always. The UK government recently approved a report that suggests that issuing “block grants” to institutions to pay these fees would ultimately cost less due to reduced library subscription fees. As one article suggests, the practice of “block grants” or other funding strategies are likely to not be advantageous to junior scholars or those in more marginal fields9. A large research grant for millions of dollars with the relatively small line item for publication fees for a well-known PI is one thing–what about the junior humanities scholar who has to scramble for a few thousand dollar research stipend? If an institution only gets so much money for publication fees, who gets the money?
By offering a $99 lifetime membership for the lowest level of publication, PeerJ offers hope to the junior scholar or graduate student to pursue projects on their own or with a few partners without worrying about how to pay for open access publication. Institutions could more readily afford to pay even $250 a year for highly productive researchers who were not doing peer review than the $1000+ publication fee for several articles a year. As above, some are skeptical that PeerJ can afford to publish at those rates, but if it is possible, that would help make open access more fair and equitable for everyone.
Open access with low-cost paid up front could be very advantageous to researchers and institutional bottom lines, but only if the quality of articles, peer reviews, and science is very good. It could provide a social model for publication that will take advantage of the web and the network effect for high quality reviewing and dissemination of information, but only if enough people participate. The network effect that made Wikipedia (for example) so successful relies on a high level of participation and engagement very early on to be successful [Davis]. A community has to build around the idea of PeerJ.
In almost the opposite method, but looking to achieve the same effect, this last week the Sponsoring Consortium for Open Access Publishing in Particle Physics (SCOAP3) announced that after years of negotiations they are set to convert publishing in that field to open access starting in 2014. 10 This means that researchers (and their labs) would not have to do anything special to publish open access and would do so by default in the twelve journals in which most particle physics articles are published. The fees for publication will be paid upfront by libraries and funding agencies.
So is it better to start a whole new platform, or to work within the existing system to create open access? If open (and, through a commenting system, ongoing) peer review makes for a lively and engaging network, and low-cost open access makes publication cheaper, then PeerJ could accomplish something extraordinary in scholarly publishing. Until then, it is encouraging that organizations are working from both sides.
Wennerås, Christine, and Agnes Wold. “Nepotism and sexism in peer-review.” Nature 387, no. 6631 (May 22, 1997): 341–3. ↩
For an ingenious way of demonstrating this, see Leek, Jeffrey T., Margaret A. Taub, and Fernando J. Pineda. “Cooperation Between Referees and Authors Increases Peer Review Accuracy.” PLoS ONE 6, no. 11 (November 9, 2011): e26895. ↩
Mainguy, Gaell, Mohammad R Motamedi, and Daniel Mietchen. “Peer Review—The Newcomers’ Perspective.” PLoS Biology 3, no. 9 (September 2005). http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1201308/. ↩
Crotty, David. “Are University Block Grants the Right Way to Fund Open Access Mandates?” The Scholarly Kitchen, September 13, 2012. http://scholarlykitchen.sspnet.org/2012/09/13/are-university-block-grants-the-right-way-to-fund-open-access-mandates/.↩
Van Noorden, Richard. “Open-access Deal for Particle Physics.” Nature 489, no. 7417 (September 24, 2012): 486–486. ↩
If you say “analytics” to most technology-savvy librarians, they think of Google Analytics or similar web analytics services. Many libraries are using such sophisticated data collection and analyses to improve the user experience on library-controlled sites. But the standard library analytics are retrospective: what have users done in the past? Have we designed our web platforms and pages successfully, and where do we need to change them?
Technology is enabling a different kind of future-oriented analytics. Action Analytics is evidence-based, combines data sets from different silos, and uses actions, performance, and data from the past to provide recommendations and actionable intelligence meant to influence future actions at both the institutional and the individual level. We’re familiar with these services in library-like contexts such as Amazon’s “customers who bought this item also bought” book recommendations and Netflix’s “other movies you might enjoy”.
Action Analytics in the Academic Library Landscape
It was a presentation by Mark David Milliron at Educause 2011, “Analytics Today: Getting Smarter About Emerging Technology, Diverse Students, and the Completion Challenge,” that made me think about the possibilities of the interventionist side of analytics for libraries. He described the complex dependencies between inter-generational poverty transmission, education as a disrupter, drop-out rates for first-generation college students, and other factors such as international competition and the job market. Then he moved on to the role of sophisticated analytics and data platforms, and spoke about how they can help individual students succeed by delivering the right resource at the right time to the right student. Where do these sorts of analytics fit into the academic library landscape?
If your library is like my library, the pressure to prove your value to strategic campus initiatives such as student success and retention is increasing. But assessing services with most analytics is past-oriented; how do we add the kind of library analytics that provide a useful intervention or recommendation? These analytics could be designed to help an individual student choose a database, or trigger a recommendation to dive deeper into reference services like chat reference or individual appointments. We need to design platforms and technology that can integrate data from various campus sources, do some predictive modeling, and deliver a timely text message to an English 101 student recommending databases for the first writing assignment, or suggest an individual research appointment with the appropriate subject specialist (and a link to the appointment scheduler) to every honors student a month into their thesis year.
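The recommendation side of such a system need not be exotic. A toy sketch of the co-occurrence logic behind “students who used this database also used…” suggestions, over hypothetical, already-anonymized session data (all names made up):

```python
from collections import Counter
from itertools import combinations

# Hypothetical anonymized sessions: sets of databases consulted together
sessions = [
    {"JSTOR", "MLA Bibliography"},
    {"JSTOR", "MLA Bibliography", "Project MUSE"},
    {"JSTOR", "PubMed"},
]

# Count how often each pair of databases shows up in the same session
co_occurs = Counter()
for s in sessions:
    for a, b in combinations(sorted(s), 2):
        co_occurs[(a, b)] += 1

def recommend(db: str, n: int = 2) -> list:
    """Rank the databases most often used in the same session as `db`."""
    scores = Counter()
    for (a, b), count in co_occurs.items():
        if a == db:
            scores[b] += count
        elif b == db:
            scores[a] += count
    return [name for name, _ in scores.most_common(n)]

print(recommend("JSTOR"))
```

Whether to build this at all, and on what data, is exactly the ethical question this post is raising.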
But should we? Are these sorts of interventions creepy and stalker-ish?* Would they be seen as an invasion of privacy? Does the use of data in this way collide with the profession’s ethical obligation and historical commitment to keep individual patrons’ reading, browsing, or viewing habits private?
Every librarian I’ve discussed this with felt the same unease. I’m left with a series of questions: Have technology and online data gathering changed the context and meaning of privacy in such fundamental ways that we need to take a long hard look at our assumptions, especially in the academic environment? (Short answer — yes.) Are there ways to manage opt-in and opt-out preferences for these sorts of services so these services are only offered to those who want them? And does that miss the point? Aren’t we trying to influence the students who are unaware of library services and how the library could help them succeed?
Furthermore, are we modeling our ideas of “creepiness” and our adamant rejection of any “intervention” on the face-to-face model of the past, one that involved a feeling of personal surveillance and possible social judgment by flesh-and-blood persons? The phone app Mobilyze helps those with clinical depression avoid known triggers by suggesting preventative measures. The software is highly personalized and combines all kinds of data collected by the phone with self-reported mood diaries. Researcher Colin Depp observes that participants felt that the impersonal advice delivered via technology was easier to act on than “say, getting advice from their mother.”**
While I am not suggesting in any way that libraries move away from face-to-face, personalized encounters at public service desks, is there room for another model for delivering assistance? A model that some students might find less intrusive, less invasive, and more effective — precisely because it is technological and impersonal? And given the struggle that some students have to succeed in school, and the staggering debt that most of them incur, where exactly are our moral imperatives in delivering academic services in an increasingly personalized, technology-infused, data-dependent environment?
Increasingly, health services, commercial entities, and technologies such as browsers and social networking environments that are deeply embedded in most people’s lives, use these sorts of action analytics to allow the remote monitoring of our aging parents, sell us things, and match us with potential dates. Some of these uses are for the benefit of the user; some are for the benefit of the data gatherer. The moment from the Milliron presentation that really stayed with me was the poignant question that a student in a focus group asked him: “Can you use information about me…to help me?”