Future? Libraries? What Now? – After the ALA Summit on the Future of Libraries

I attended the ALA Summit on the Future of Libraries a few weeks ago.

[Let's give it a minute for that to sink in.]

ALA President Barbara Stripling at the ALA Summit on the Future of Libraries at the Library of Congress. (Photo by the author)

Yes, that was the controversial Summit that was much talked about on Twitter under the #libfuturesummit hashtag. This Summit and other similarly themed summits held close together in time – “The Future of Libraries Survival Summit” hosted by Information Today Inc. and “The Future of Libraries: Do We Have Five Years to Live?” hosted by Ken Haycock Associates Inc. and Dysart & Jones Associates – seemed to have brought out the sentiment that Andy Woodworth aptly named ‘Library Future Fatigue.’ It was an impressive experience to see how active librarians – both ALA members and non-members – were in providing real-time comments and feedback about these summits while I was at one of them in person. I thought ALA was lucky to have such engaged members and librarians to work with.

A few days ago, ALA released the official Summit report.1 The report captured all the talks and many table discussions in great detail. In this post, I will focus on some of my thoughts and take-aways prompted by the talks and the table discussion at the Summit.

A. The Draw

Here is an interesting fact. The invitation to this Summit sat in my Inbox for over a month because, from the email subject, I thought it was just another advertisement for a fee-based webinar or workshop. It was only after I had gotten another email from the ALA office asking about the previous one that I realized it was something different.

What drew me to this Summit were these three things: (a) I had never been at a formal event organized just for a discussion about the future of libraries, (b) the event was to include a good number of people from outside of libraries, and (c) the overall size of the Summit would be kept relatively small.

For those curious, the Summit had 51 attendees plus 6 speakers and a dozen discussion table facilitators, all of whom fit into the Members’ Room in the Library of Congress. Out of those 51 attendees, 9 were from the non-library sector, such as the Knight Foundation, PBS, Rosen Publishing, and the Aspen Institute. Another 33 attendees ranged from academic, public, school, federal, and corporate librarians to library consultants, museum and archive folks, an LIS professor, and library vendors. And then there were 3 ALA presidents (current, past, and president-elect) and 6 officers from ALA. You can see the list of participants here.

B. Two Words (or Phrases)

At the beginning of the Summit, the participants were asked to come up with two words or short phrases that capture what they think about libraries “from now on.” We wrote these on ribbons and put them right under our name tags. Then we were encouraged to keep or change them as we moved through the Summit.

My two phrases were “Capital and Labor” and “Peer-to-Peer.” I kept those two until the end of the Summit and didn’t change them. I picked “Capital and Labor” because recently I have been thinking more about the socioeconomic background behind the expansion of post-secondary education (i.e. higher ed) and how it affects the changes in higher education and academic libraries.2 And of course, the fact that Thomas Piketty’s book, Capital in the Twenty-First Century, was being reviewed and discussed all over the mass media contributed to that choice of words as well. In my opinion, libraries “from now on” will be closely driven by the demands of capital and the labor market and asked to support more and more of the peer-to-peer learning activities that have become widespread with the advent of the Internet.

Other phrases and words I saw from other participants included “From infrastructure to engagement,” “Sanctuary for learning,” “Universally accessible,” “Nimble and Flexible,” “From Missionary to Mercenary,” “Ideas into Action,” and “Here, Now.” The official report also lists some of the words that were most used by participants. If you were to choose two words or phrases that capture what you think about libraries “from now on,” what would they be?

C. The Set-up

The Summit organizers had filled the room with multiple round tables, and for the first day’s morning and afternoon sessions and the second day’s morning session, participants sat at the table matching the number assigned on the back of their name badges. This was a good method that enabled participants to have discussions with different groups of people throughout the Summit.

As the Summit agenda shows, the Summit program started with a talk by a speaker. After that, participants were asked to personally reflect on the talk and then have a table discussion. This discussion was captured on large poster-size sheets by facilitators and collected by the event organizers. The papers on which we were asked to write our personal reflections were also collected, along with all our ribbons on which we wrote those two words or phrases. These were probably used to produce the official Summit report.

One thing I liked about the set-up was that every participant sat at a round table, including the speakers and all three ALA presidents (past, current, and president-elect). Throughout the Summit, I had a chance to talk to Lorcan Dempsey from OCLC, Corinne Hill, the director of Chattanooga Public Library, Courtney Young, the ALA president-elect, and Thomas Frey, a well-known futurist at the DaVinci Institute, which was neat.

Also, what struck me most during the Summit was that those from outside the library world took the guiding questions and the ensuing discussion much more seriously than those of us inside it. Maybe indeed we librarians are suffering from ‘library future fatigue.’ And/or maybe outsiders have more trust in libraries as institutions than we librarians do because they are less familiar with our daily struggles and challenges in library operations. Either way, the Summit seemed to have given them an opportunity to seriously consider the future of libraries. The desired impact of this would be more policymakers, thought leaders, and industry leaders who are well informed about today’s libraries and will articulate, support, and promote the significant work libraries do, to the benefit of society, in their own areas.

D. Talks, Table Discussion, and Some of My Thoughts and Take-aways

These were the talks given during the two days of the Summit:

  • “How to Think Like a Freak” – Stephen Dubner, Journalist
  • “What Are Libraries Good For?” – Joel Garreau, Journalist
  • “Education in the Future: Anywhere, Anytime” – Dr. Renu Khator, Chancellor and President at the University of Houston
  • “From an Internet of Things to a Library of Things” – Thomas Frey, Futurist
  • A Table Discussion of Choice:
    • Open – group decides the topic to discuss
    • Empowering individuals and families
    • Promoting literacy, particularly in children and youth
    • Building communities the library serves
    • Protecting and empowering access to information
    • Advancing research and scholarship at all levels
    • Preserving and/or creating cultural heritage
    • Supporting economic development and good government
  • “What Happened at the Summit?” – Joan Frye Williams, Library consultant

(0) Official Report, Liveblogging Posts, and Tweets

As I mentioned earlier, ALA released the 15-page official report of the Summit, which provides a detailed description of each talk and table discussion. Carolyn Foote, a school librarian and one of the Summit participants, also live-blogged all of these talks in detail. I highly recommend reading her notes on Day 1, Day 2, and Closing in addition to the official report. The tweets from the Summit participants with the official hashtag, #libfuturesummit, will also give you an idea of what participants found exciting at the Summit.

(1) Redefining a Problem

The most fascinating story in the talk by Dubner was that of Kobayashi, the hot dog eating contest champion from Japan. The secret of his success in the eating contest was rethinking the accepted but unchallenged artificial limits and redefining the problem, said Dubner. In Kobayashi’s case, he redefined the problem from ‘How can I eat more hot dogs?’ to ‘How can I eat one hot dog faster?’ and then removed artificial limits – widely accepted but unchallenged conventions – such as holding the hot dog in your hand and eating it from top to bottom. He experimented with breaking the hot dog into two pieces to feed himself faster with two hands. He further refined his technique by eating the frankfurter and the bun separately to make the eating even speedier.

So where can libraries apply this lesson? One thing I can think of is the problem of low attendance at some library programs. What if we ask what barriers we can remove instead of asking what kind of program will draw more people? Chattanooga Public Library did exactly this. Recently, they targeted parents who would want to attend the library’s author talk and created an event that would specifically address the child care issue. The library scheduled an evening story time for kids and fun activities for tweens and teens at the same time as the author talk. Then they asked parents to come to the library with their children, have the children participate in the library’s children’s programs, and enjoy the author talk themselves without worrying about the kids.

Another library service that I came to learn about at my table was the Zip Books service by the Yolo County Library in California. What if libraries ask what the fastest way to deliver a book the library doesn’t own to a patron’s door would be, instead of asking how quickly the cataloging department can catalog a newly acquired book to get it ready for circulation? The Yolo County Library’s Zip Books service came from that kind of redefinition of the problem. When a library user requests a book the library doesn’t have but that meets certain requirements, the Yolo County Library purchases the book from a bookseller and has it shipped directly to the patron’s home without processing it. Cataloging and processing are done when the book is returned to the library after the first use.

(2) What Can Happen to Higher Education

My favorite talk during the Summit was by Dr. Khator because she had deep insight into higher education, and I have been working at university libraries for a long time. The two most interesting observations she made were the possibility of (a) the decoupling of content development from content delivery and (b) the decoupling of teaching from credentialing in higher education.

The upside of (a) is that a wonderful class created by a world-class scholar may be taught by other instructors at places where the person who originally developed the class is not available. The downside of (a) is, of course, the possibility of its being used as a cookie-cutter, lowest-baseline form of quality control in higher education – the University of Phoenix was mentioned as an example of this by one of the participants at my table – instead of college and university students being exposed to classes developed and taught by their institutions’ own faculty members.

I have to admit that (b) was a completely mind-blowing idea to me. Imagine colleges and universities with no credentialing authority. Your degree would no longer be tied to a particular institution to which you were admitted and from which you graduated. Just consider the impact of what this may entail if it is ever realized. If both (a) and (b) take place at the same time, the impact would be even more significant. What kind of role could an academic library play in such a scenario?

(3) Futurizing Libraries

Joel Garreau observed that nowadays what drives the need for a physical trip is, more and more, face-to-face contact rather than anything else. He then pointed out that as technology allows more people to telework, people are flocking to smaller cities where they can have more meaningful contact with their community. If this is indeed the case, libraries that make their space a catalyst for face-to-face contact in a community will prosper. The last speaker, Thomas Frey, spoke mostly about the Internet of Things (IoT).

While I think that IoT is an important trend to note, for sure, what I most liked about Frey’s talk was his statement that the vision of the future we have today will change the decisions we make (towards that future). After the talk by Garreau, I had a chance to ask him a question about his somewhat idealized vision of the future, in which people live and work in small but closely connected communities in a society that is highly technological and collaborative. He called this ‘human evolution’.

But the reality that we see today is, in my opinion, not so idyllic.3 The current economy is highly volatile. It no longer offers job security, consistently reduces the number of jobs, and returns stagnant or decreasing income to those whose skills are not in high demand in the era of the digital revolution.4 As a result, today’s college students, who are preparing to become tomorrow’s knowledge workers, perceive their education and their lives afterward quite differently than their parents did.5

Garreau’s answer to my question was that this concern of mine may be coming from a kind of techno-determinism. While this may be a fair critique, I felt that his portrayal of human evolution may be just as techno-deterministic. (To be fair, he mentioned that he does not make predictions and that this is one of the future scenarios he sees.)

Regarding the Internet of Things (IoT), which was the main topic of Frey’s talk, privacy and the proper protection of the massive amount of data – which will result from the very many sensors that make IoT possible – will be the real barrier to implementing the IoT on a large scale. After his talk, I had a chance to briefly chat with him about this. (There was no Q&A because Frey’s talk went over the time allotted.) He mentioned the possibility of some kind of international gathering, similar in scale to the Geneva Conventions, to address the issue. While the likelihood of that is hard to assess, the idea seemed appropriate to the problem in question.

(4) What If…?

One of the slides from Thomas Frey’s Talk at the ALA Summit. (Photo by the author)

Some of the shiny things shown in the talk, whose value for library users may appear dubious and distant, prompted Eli Neiburger of the Ann Arbor District Library to ask what useful service libraries could offer now to provide the public with a significant benefit. He wondered, for example, what it would be like if many libraries ran Tor exit nodes to help protect the privacy and anonymity of web traffic.

For those who are unfamiliar, Tor (the Onion Router) is “free software and an open network that helps you defend against traffic analysis, a form of network surveillance that threatens personal freedom and privacy, confidential business activities and relationships, and state security.” Tor is not foolproof, but it is still the best tool for privacy and anonymity on the Web.

Eli’s idea is a truly wild one because there are so many libraries in the US and the public’s privacy in the US is in such a precarious state.6 Running a Tor exit node is not a walk in the park, as this post by someone who actually set up a Tor exit node on a hosted virtual server in Germany attests. But libraries have been serious and dedicated advocates for privacy and people’s intellectual freedom for a long time and have a strong network of allies. There are also useful guidelines and tips that Tor provides on its website.

Just pause a minute and imagine what kind of impact such a project by libraries might have on the privacy of the public. What if?

(5) Leadership and Sustainability

For the “Table Discussion of Choice” session, I opted for the “Open” table because I was curious about what other topics people were interested in. Two discussions at this session were most memorable to me. One was the great advice I got from Corinne Hill regarding leading people. A while ago, I read her interview, in which she commented that “the staff are just getting comfortable with making decisions.” In my role as a relatively new manager, I have also found empowering my team members to be more autonomous decision makers a challenge. Corinne particularly cautioned that leaders should be very careful not to be over-critical when the staff take an initiative but make a bad decision. Being over-critical in that case can discourage the staff from trying to make their own decisions in their areas of expertise, she said. Hearing her description of how she relies on the different strengths of her staff to move her library in the direction of innovation was also illuminating to me. (Lorcan Dempsey, who was also at our table, mentioned the “Birkman Quadrants,” a set of useful theoretical constructs, in relation to Corinne’s description. He also brought up the term ‘Normcore’ at another session. I forget the exact context of that term, but it was interesting enough that I wrote it down.) We also talked for a while about current LIS education and how it is not sufficiently aligned with the skills needed in everyday library operations.

The other interesting discussion started with a question about the sustainability of future libraries from Amy Garmer of the Aspen Institute. (She has been working on a library-related project with various policy makers, and PLA has a program related to this project at the upcoming 2014 ALA Annual Conference, if you are interested.) One thought that always comes to my mind whenever I think about the future of libraries is that while in the past the difference between small and large libraries was mostly quantitative – how many books and other resources were available – in the present and the future, the difference is and will be more qualitative. What the New York Public Library offers its patrons, a whole suite of digital library products from the NYPL Labs for example, cannot be easily replicated by a small rural library. Needless to say, this has significant implications for the core mission of the library, which is equalizing the public’s access to information and knowledge. What can we do to close that gap? Or will different types of libraries have different strategies for the future, as Lorcan Dempsey asked at our table discussion? These two things are not incompatible and could be worked out at the same time.

(6) Nimble and Media-Savvy

In her Summit summary, Joan Frye Williams, who moved around to observe discussions at all tables during the Summit, mentioned that one of the themes that surfaced was thinking about a library as a developing enterprise rather than a stable organization. This means that the modus operandi of a library should become more nimble and flexible so that the library keeps pace with the change its community goes through.

Another thread of discussion among the Summit participants was that not all library supporters have to be active users of library services. As long as those supporters know that the presence and the services of libraries make their communities strong, libraries are in a good place. Often libraries make the mistake of trying to reach all of their potential patrons to convert them into active library users. While this is admirable, it is not always practical or beneficial to library operations. More needed and useful is well-managed, strategic media relations that will effectively publicize the library’s services and programs and their benefits and impact to the community. (On a related note, one journalist who was at the Summit mentioned how she has noticed the recent coverage about libraries changing direction from “Are libraries going to be extinct?” to “No, libraries are not going to be extinct. And do you know libraries offer way more than books, such as … ?”, which is fantastic.)

E. What Now? Library Futurizing vs. Library Grounding

What all the discussion at the Summit reminded me of was that, ultimately, the time and effort we spend on trying to foresee what the future holds for us and on raising concerns about the future may be better directed at refining a positive vision of the desirable future for libraries and taking well-calculated, decisive actions towards the realization of that vision.

Technology is just a tool. It can be used to free people to engage in more meaningful work and creative pursuits. Or it can be used to generate a large number of unemployed people, who have to struggle to make ends meet and to retool themselves with the fast-changing skills that the labor market demands, alongside those in the top 1 or 0.1 percent of very rich people. And we have the power to influence and determine which path we should and will be on by what we do now.

Certainly, there are trends that we need to heed. For example, the shift in the economy that places a bigger emphasis on entrepreneurship than ever before requires more education and support for entrepreneurship for students at universities and colleges. The growing tendency of businesses to look for potential employees based upon their specific skill sets rather than their majors and grades has led universities and colleges to adopt digital badging systems (such as Purdue’s Passport) or other ways for their students to record and prove the job-related skills obtained during their study.

But when we talk about the future, many of us tend to assume that there are some kind of inevitable trends that we either get or miss and that those trends will determine what our future will be. We forget that what really determines our future is not some trends but (i) what we intend to achieve in the future and (ii) the actions we take today to realize that intention. (Also, always critically reflect on whatever is trendy; you may be in for a surprise.7) The fact that people will no longer need to physically visit a library to check out books or access library resources does not automatically mean that the library of the future will cease to have a building. The question is whether we will let that be the case. Suppose we decide that we want the library to be and stay the vibrant hub for a community’s freedom of inquiry and right to access human knowledge, no matter how much change takes place in society. Realizing this vision IS within our power. We only reach the future by walking through the present.

Notes

  1. Stripling, Barbara. “Report on the Summit on the Future of Libraries.” ALA Connect, May 19, 2014. http://connect.ala.org/node/223667.
  2. Kim, Bohyun. “Higher ‘Professional’ Ed, Lifelong Learning to Stay Employed, Quantified Self, and Libraries.” ACRL TechConnect Blog, March 23, 2014. http://acrl.ala.org/techconnect/?p=4180.
  3. Ibid.
  4. For a short but clear description of this phenomenon, see Brynjolfsson, Erik, and Andrew McAfee. Race against the Machine: How the Digital Revolution Is Accelerating Innovation, Driving Productivity, and Irreversibly Transforming Employment and the Economy. Lexington: Digital Frontier Press, 2012.
  5. Brooks, David. “The Streamlined Life.” The New York Times, May 5, 2014. http://www.nytimes.com/2014/05/06/opinion/brooks-the-streamlined-life.html.
  6. See Timm, Trevor. “Everyone Should Know Just How Much the Government Lied to Defend the NSA.” The Guardian, May 17, 2014. http://www.theguardian.com/commentisfree/2014/may/17/government-lies-nsa-justice-department-supreme-court.
  7. For example, see this article about what the wide adoption of 3D-printing may mean to the public. Sadowski, Jathan, and Paul Manson. “3-D Print Your Way to Freedom and Prosperity.” Al Jazeera America, May 17, 2014. http://america.aljazeera.com/opinions/2014/5/3d-printing-politics.html.

Library & Academic Tech Conferences Roundup

Here we present a summary of various library technology conferences that ACRL TechConnect authors have been to. There are a lot of them, and some are fairly niche, so we hope this guide helps neophytes and veterans alike decide how to spend their limited professional development monies. Do you attend one of these conferences every year because it’s awesome? Did we miss your favorite conference? Let us know in the comments!

The lisevents.com website might be of interest, as it compiles LIS conferences of all types. Also, one might be able to get a sense of the content of a conference by searching for its hashtag on Twitter. Most conferences list their hashtag on their website.

Access

  • Time: late in the year, typically September or October
  • Place: Canada
  • Website: http://accessconference.ca/
  • Access is Canada’s annual library technology conference. Although the focus is primarily on technology, a wide variety of topics are addressed – from linked data, innovation, and makerspaces to digital archiving – by librarians in various areas of specialization. (See the past conferences’ schedules: http://accessconference.ca/about/past-conferences/) Access provides an excellent opportunity to get an international perspective without traveling too far. Access is also a single-track conference, offers great opportunities to network, and starts with preconferences and a hackathon, which welcomes all types of librarians, not just library coders. Both the preconferences and the hackathon are optional but highly recommended. (p.s. One of the ACRL TechConnect authors thinks that this is the conference with the best conference lunch and snacks.)

Code4Lib

  • Time: early in the year, typically February but this year in late March
  • Place: varies
  • Website: http://code4lib.org/conference/
  • Code4Lib is unique in that it is organized by a group of volunteers and not supported by any formal organization. While it does cover some more general technology concepts, the conference tends to be focused on coding, naturally. Preconferences from past years have covered the Railsbridge curriculum for learning Ruby on Rails and Blacklight, the open source discovery interface. Code4Lib moves quickly—talks are short (20 minutes) with even shorter lightning talks thrown in—but is also all on one track in the same room; attendees can see every presentation.

Computers in Libraries

  • Time: Late March or early April
  • Place: Washington, DC
  • Website: http://www.infotoday.com/conferences.asp
  • Computers in Libraries is a for-profit conference hosted by Information Today. Its use of tracks, organizing presentations around a topic or group of topics, is a useful way to attend a conference and its overall size is more conducive to networking, socializing, and talking with vendors in the exhibit hall than many other conferences. However, the role of consultants in panel and presentation selection and conference management, as opposed to people who work in libraries, means that there is occasionally a focus on trends that are popular at the moment, but don’t pan out, as well as language more suited to an MBA than an MLIS. The conference also lacks a code of conduct and given the corporate nature of the conference, the website is surprisingly antiquated.
  • They also run Internet Librarian, which meets in Monterey, California, every fall.
    — Jacob Berg, Library Director, Trinity Washington University

Digital Library Federation Forum

  • Time: later in the year, October or November
  • Place: varies
  • Website: http://www.diglib.org/
  • We couldn’t find someone who attended this. If you have, please add your review of this conference in the comments section!

edUI

  • Time: late in the year, typically November
  • Place: Richmond, VA
  • Website: http://eduiconf.org/
  • Not a library conference, edUI is aimed at web professionals working in higher education but draws a fair number of librarians. The conference tends to draw excellent speakers, both from within higher education and the web industry at large. Sessions cover user experience, design, social media, and current tools of the trade. The talks suit a broad range of specialties, from programmers to people who work on the web but aren’t technologists foremost.

Electronic Resources & Libraries

  • Time: generally early in the year, late-February to mid-March.
  • Place: Austin, TX
  • Website: http://www.electroniclibrarian.com/
  • The main focus of this conference is workflows and issues surrounding electronic resources (such as licensed databases and online journals), and understanding these is crucial to anyone working with library technology, whether or not they manage e-resources on a daily basis. In recent years the conference has expanded greatly into areas such as open access and user experience, with tracks specifically dedicated to those areas. This year there were also some overlapping programs and themes with SXSW and the Leadership, Technology, Gender Summit.

Handheld Librarian

  • Time: held a few times throughout the year
  • Place: online
  • Website: http://handheldlibrarian.org
  • An online conference devoted specifically to mobile technologies. The advantage of this conference is that, without traveling, you can get a glimpse of current developments and applications of mobile technologies in libraries. It originally started in 2009 as an annual one-day online conference based upon accepted presentation proposals submitted in advance. The conference has gone through some changes in recent years; it now offers a separate day of workshops in addition to the conference and focuses on a different theme in mobile technologies in libraries each time. All conference presentations and workshops are recorded. If you are interested in attending, it is a good idea to check out the presentations and the speakers in advance.

Internet Librarian

  • Time: October
  • Place: Monterey, CA
  • Website: http://www.infotoday.com/conferences.asp
  • Internet Librarian is a for-profit conference hosted by Information Today. It is quite similar to Information Today’s Computers in Libraries, utilizing tracks to organize a large number of presentations covering a broad swath of library information technology topics. Internet Librarian also hosts the Internet @ Schools track that focuses on the IT needs of the K-12 library community. IL is held annually in Monterey, California, in October. The speaker list is deep and varied, and one can expect keynote speakers to be prominent and established names in the field. The conference is well attended and provides a good opportunity to network with library technology peers. As with Computers in Libraries, there is no conference code of conduct.

KohaCon

  • Time: varies, typically in the second half of the year
  • Place: varies, international
  • Website: http://koha-community.org/kohacon/
  • The annual conference devoted to the Koha open source ILS.

 Library Technology Conference

  • Time: mid-March
  • Place: St. Paul, MN
  • Website: http://libtechconf.org/
  • LTC is an annual library conference that takes place in March. It’s both organized by and takes place at Macalester College in St. Paul. Not as completely tech-heavy as Code4Lib or even Access, LTC talks tend to span a whole range of technical aptitudes. Given its time and location, LTC has historically been primarily of regional interest but has seen rising levels of participation nationally and internationally.
    — John Fink, Digital Scholarship Librarian, McMaster University

LITA Forum

  • Time: Late in the year, typically November
  • Place: varies
  • Website: http://www.ala.org/lita/conferences
  • A general library technology conference that’s moderately sized, with some 300 attendees most years. One of LITA’s nice aspects is that, because of the smaller size of the conference and the arranged networking dinners, it’s very easy to meet other librarians. You need not be involved with LITA to attend and there are no committee or business meetings.

Open Repositories

  • Time: mid-summer, June or July
  • Place: varies, international
  • Website: changes each year, here are the 2013 and 2014 sites
  • A mid-sized conference focused specifically on institutional repositories.

Online NorthWest

  • Time: February
  • Place: Corvallis, OR
  • Website: http://onlinenorthwest.org/
  • A small library technology conference in the Pacific Northwest. Hosted by the Oregon University System, but invites content from Public, Medical, Special, Legal, and Academic libraries.

THATcamps

  • Time: all the time
  • Place: varies, international
  • Website: http://thatcamp.org/
  • Every THATCamp is different, but all revolve around technology and the humanities (i.e., The Humanities And Technology Camp). They are unconferences with “no spectators,” and so reflect the interests of the participants. Some have specific themes such as digital pedagogy, others are attached to conferences as pre- or post-conference events, and some are more general regional events. Librarians are important participants in THATCamps, and if there is one in your area or at a conference you’re attending, you should go. They cost under $30 and are a great networking and education opportunity. Sign up for the THATCamp mailing list or subscribe to the RSS feed to find out about new THATCamps. They have an attendee limit and usually fill up quickly.

Analyzing Usage Logs with OpenRefine

Background

Like a lot of librarians, I have access to a lot of data, and sometimes no idea how to analyze it. When I learned about linked data and the ability to search against data sources with a piece of software called OpenRefine, I wondered if it would be possible to match our users’ discovery layer queries against the Library of Congress Subject Headings. From there I could use the linking in LCSH to find the Library of Congress Classification, and then get an overall picture of the subjects our users were searching for. As with many research projects, it didn’t really turn out like I anticipated, but it did open further areas of research.

At California State University, Fullerton, we use an open source application called Xerxes, developed by David Walker at the CSU Chancellor’s Office, in combination with the Summon API. Xerxes acts as an interface for any number of search tools, including Solr, federated search engines, and most of the major discovery service vendors. We call it the Basic Search, and it’s incredibly popular with students, with over 100,000 searches a month and growing. It’s also well-liked – in a survey, about 90% of users said they found what they were looking for. We have monthly files of our users’ queries, so I had all of the data I needed to go exploring with OpenRefine.

OpenRefine

OpenRefine is an open source tool that deals with data in a very different way than typical spreadsheets. It has been mentioned in TechConnect before, and Margaret Heller’s post, “A Librarian’s Guide to OpenRefine” provides an excellent summary and introduction. More resources are also available on Github.

One of the most powerful things OpenRefine does is to allow queries against open data sets through a function called reconciliation. In the open data world, reconciliation refers to matching the same concept among different data sets, although in this case we are matching unknown entities against “a well-known set of reference identifiers” (Re-using Cool URIs: Entity Reconciliation Against LOD Hubs).

Reconciling Against LCSH

In this case, we’re reconciling our discovery layer search queries with LCSH. This basically means it’s trying to match the entire user query (e.g. “artist” or “cost of assisted suicide”) against what’s included in the LCSH linked open data. According to the LCSH website this includes “all Library of Congress Subject Headings, free-floating subdivisions (topical and form), Genre/Form headings, Children’s (AC) headings, and validation strings* for which authority records have been created. The content includes a few name headings (personal and corporate), such as William Shakespeare, Jesus Christ, and Harvard University, and geographic headings that are added to LCSH as they are needed to establish subdivisions, provide a pattern for subdivision practice, or provide reference structure for other terms.”

I used the directions at Free Your Metadata to point me in the right direction. One note: the steps below apply to OpenRefine 2.5 and version 0.8 of the RDF extension. OpenRefine 2.6 requires version 0.9 of the RDF extension. Or you could use LODRefine, which bundles some major extensions and I hear is great, but personally haven’t tried. The basic process shouldn’t change too much.

(1) Import your data

OpenRefine has quite a few file type options, so your format is likely already supported.

 Screenshot of importing data

(2) Clean your data

In my case, this involves deduplicating by timestamp and removing leading and trailing whitespace. You can also remove weird punctuation, numbers, and even extremely short queries (<2 characters).
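
If you would rather do some of this cleanup in a script before importing, the same logic is easy to reproduce outside OpenRefine. Here is a minimal PHP sketch (not how I actually did it – I did the cleaning in OpenRefine itself – and it assumes a hypothetical queries.txt export with one query per line; it also simply drops exact duplicate lines rather than deduplicating by timestamp):

<?php

//read one query per line from a hypothetical export file
$queries = file('queries.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

//trim leading and trailing whitespace, then drop extremely short queries (< 2 characters)
$queries = array_map('trim', $queries);
$queries = array_filter($queries, function ($q) {
    return strlen($q) >= 2;
});

//remove exact duplicates and write the cleaned list back out for import into OpenRefine
$queries = array_unique($queries);
file_put_contents('queries-clean.txt', implode(PHP_EOL, $queries) . PHP_EOL);

?>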

(3) Add the RDF extension.

If you’ve done it correctly, you should see an RDF dropdown next to Freebase.

Screenshot of correctly installed RDF extension

(4) Decide which data you’d like to search on.

In this example, I’ve decided to use just queries that are four words or fewer, and I’ve removed duplicate search queries. (Xerxes handles facet clicks as if they were separate searches, so there are many duplicates. I usually don’t remove them, though, unless they happen at nearly the same time.) I’ve also experimented with limiting to 10 or 15 characters, but there were not many more matches with 15 characters than with 10, even though the data set was much larger. It depends on how much computing time you want to spend…it’s really a personal choice. In this case, I chose 4 words because of my experience with 15 characters – longer does not necessarily translate into more matches. A cursory glance at LCSH left me with the impression that the vast majority of headings (not including subdivisions, since they’d be searched individually) were 4 words or less. This, of course, means that your data with more than 4 words is unusable – more on that later.
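
The word-count cutoff itself is simple logic. For reference, here is a rough sketch of the equivalent filter outside OpenRefine, in the same hedged PHP style as above (in OpenRefine I did this by adding a word-count column, as shown in the screenshot below; the file name continues from the earlier sketch):

<?php

//continuing from the cleaned query list written out earlier
$queries = file('queries-clean.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

//keep only queries of four words or fewer -- a rough equivalent of adding a
//word-count column in OpenRefine and faceting on it
$shortQueries = array_filter($queries, function ($q) {
    $words = preg_split('/\s+/', trim($q), -1, PREG_SPLIT_NO_EMPTY);
    return count($words) <= 4;
});

?>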

Screenshot of adding a column based on word count using ngrams

(5) Go!

Shows OpenRefine reconciling

(6) Now you have your queries that were reconciled against LCSH, so you can limit to just those.

Screenshot of limiting to reconciled queries

Finding LC Classification

First, you’ll need to extract the cell.recon.match.id – the ID for the matched query, which in the case of LCSH is the URI of the concept.

Screenshot of using cell.recon.match.id to get URI of concept

At this point you can choose whether to grab the HTML or the JSON, and create a new column based on this one by fetching URLs. I’ve never been able to get the parseJson() function to work correctly with LC’s JSON outputs, so for both HTML and JSON I’ve just regexed the raw output to isolate the classification. For more on regex see Bohyun Kim’s previous TechConnect post, “Fear No Longer Regular Expressions.”

On the raw HTML, the easiest way to do it is to transform the cells or create a new column with:

replace(partition(value,/<li property="madsrdf:classification">(<[^>]+>)*([A-Z]{1,2})/)[1],/<li property="madsrdf:classification">(<[^>]+>)*([A-Z]{1,2})/,"$2")

Screenshot of using regex to get classification

You’ll note this will only pull out the first classification given, even if some have multiple classifications. That was a conscious choice for me, but obviously your needs may vary.
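
If the nested GREL above is hard to read, here is the same idea expressed as a hedged PHP sketch (the URI value is a placeholder for a real cell.recon.match.id, and, like the GREL, this only captures the first classification):

<?php

//placeholder: use a real cell.recon.match.id value from your reconciled data,
//i.e. the id.loc.gov URI of the matched heading
$uri = 'http://id.loc.gov/authorities/subjects/sh00000000';
$html = file_get_contents($uri);

//capture the one- or two-letter LC Classification from the
//madsrdf:classification list item, mirroring the GREL expression above
if (preg_match('/<li property="madsrdf:classification">(<[^>]+>)*([A-Z]{1,2})/', $html, $matches)) {
    $classification = $matches[2];   //e.g. "PZ"
}

?>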

(Also, although I’m only concentrating on classification for this project, there’s a huge amount of data that you could work with – you can see an example URI for Acting to see all of the different fields).

Once you have the classifications, you can export to Excel and create a pivot table to count the instances of each, and you get a pretty table.

Table of LC Classifications

Caveats & Further Explorations

As you can guess by the y-axis in the table above, the number of matches is a very small percentage of actual searches. First I limited to keyword searches (as opposed to title/subject), then of those only ones that were 4 or fewer words long (about 65% of keyword searches). Of those, only about 1000 of the 26000 queries matched, and resulted in about 360 actual LC Classifications. Most months I average around 500, but in this example I took out duplicates even if they were far apart in time, just to experiment.

One thing I haven’t done but am considering is allowing matches that aren’t 100%. From my example above, there are another 600 or so queries that matched at 50-99%. This could significantly increase the number of matches and thus give us more classifications to work with.

Some of this is related to the types of searches that students are doing (see Michael J DeMars’ and my presentation “Making Data Less Daunting” at Electronic Resources & Libraries 2014, which this article grew out of, for some crazy examples) and some to the way that LCSH is structured. I chose LCSH because I could get linked to the LC Classification and thus get a sense of the subjects, but I’m definitely open to ideas. If you know of a better linked data source, I’m all ears.

I must also note that this is a pretty inefficient way of matching against LCSH. If you know of a way I could download the entire set, I’m interested in investigating that way as well.

Another approach that I will explore is moving away from reconciliation with LCSH (which is really more appropriate for a controlled vocabulary) to named-entity extraction, which takes natural language inputs and tries to recognize or extract common concepts (name, place, etc). Here I would use it as a first step before trying to match against LCSH. Free Your Metadata has a new named-entity extraction extension for OpenRefine, so I’ll definitely explore that option.

Planned Research

In the end, although this is interesting, does it actually mean anything? My next step with this dataset is to take a subset of the search queries and assign classification numbers. Over the course of several months, I hope to see if what I’ve pulled in automatically resembles the hand-classified data, and then draw conclusions.

So far, most of the peaks are expected – psychology and nursing are quite strong departments. There are some surprises though – education has been consistently underrepresented, based on both our enrollment numbers and when you do word counts (see our presentation for one month’s top word counts). Education students have a robust information literacy program. Does this mean that education students do complex searches that don’t match LCSH? Do they mostly use subject databases? Once again, an area for future research, should these automatic results match the classifications I do by hand.

What do you think? I’d love to hear your feedback or suggestions.

About Our Guest Author

Jaclyn Bedoya has lived and worked on three continents, although currently she’s an ER Librarian at CSU Fullerton. It turns out that growing up in Southern California spoils you, and she’s happiest being back where there are 300 days of sunshine a year. Also Disneyland. Reach her @spamgirl on Twitter or jaclynbedoya@gmail.com


Getting Started with APIs

There has been a lot of discussion in the library community regarding the use of web service APIs over the past few years.  While APIs can be very powerful and provide awesome new ways to share, promote, manipulate and mashup your library’s data, getting started using APIs can be overwhelming.  This post is intended to provide a very basic overview of the technologies and terminology involved with web service APIs, and provides a brief example to get started using the Twitter API.

What is an API?

First, some definitions.  One of the steepest learning curves with APIs involves navigating the terminology, which unfortunately can be rather dense – but understanding a few key concepts makes a huge difference:

  • API stands for Application Programming Interface, which is a specification used by software components to communicate with each other.  If (when?) computers become self-aware, they could use APIs to retrieve information, tweet, post status updates, and essentially run most day-to-day functions for the machine uprising. There is no single API “standard,” though one of the most common methods of interacting with APIs involves RESTful requests.
  • REST / RESTful APIs  – Discussions regarding APIs often make references to “REST” or “RESTful” architecture.  REST stands for Representational State Transfer, and you probably utilize RESTful requests every day when browsing the web. Web browsing is enabled by HTTP (Hypertext Transfer Protocol) – as in http://example.org.  The exchange of information that occurs when you browse the web uses a set of HTTP methods to retrieve information, submit web forms, etc.  APIs that use these common HTTP methods (sometimes referred to as HTTP verbs) are considered to be RESTful.  RESTful APIs are simply APIs that leverage the existing architecture of the web to enable communication between machines via HTTP methods.

HTTP Methods used by RESTful APIs

Most web service APIs you will encounter utilize, at their core, the following HTTP methods for creating, retrieving, updating, and deleting information through that web service.1  Not all APIs allow each method (at least without authentication), but some common methods for interacting with APIs include:

    • GET – You can think of GET as a way to “read” or retrieve information via an API.  GET is a good starting point for interacting with an API you are unfamiliar with.  Many APIs utilize GET, and GET requests can often be used without complex authentication.  A common example of a GET request that you’ve probably used when browsing the web is the use of query strings in URLs (e.g., www.example.org/search?query=ebooks); see the sketch just after this list.
    • POST – POST can be used to “write” data over the web.  You have probably generated  POST requests through your browser when submitting data on a web form or making a comment on a forum.  In an API context, POST can be used to request that an API’s server accept some data contained in the POST request – Tweets, status updates, and other data that is added to a web service often utilize the POST method.
    • PUT – PUT is similar to POST, but can be used to send data to a web service that can assign that data a unique uniform resource identifier (URI) such as a URL.  Like POST, it can be used to create and update information, but PUT (in a sense) is a little more aggressive. PUT requests are designed to interact with a specific URI and can replace an existing resource at that URI or create one if there isn’t one.
    • DELETE – DELETE, well, deletes – it removes information at the URI specified by the request.  For example, consider an API web service that could interact with your catalog records by barcode.2 During a weeding project, an application could be built with DELETE that would delete the catalog records as you scanned barcodes.3
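
To make the GET case concrete, here is a minimal PHP sketch of issuing a GET request with a query string (the URL and parameter are just the placeholder example from above, not a real endpoint):

<?php

//ask the server for a representation of a resource; the query string
//carries the parameters, just like the browser example above
$ch = curl_init('http://www.example.org/search?query=ebooks');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   //return the body instead of printing it

$response = curl_exec($ch);   //the API's response body, often JSON or XML
curl_close($ch);

echo $response;

?>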

Understanding API Authentication Methods

To me, one of the trickiest parts of getting started with APIs is understanding authentication. When an API is made available, the publishers of that API are essentially creating a door to their application’s data.  This can be risky:  imagine opening that door up to bad people who might wish to maliciously manipulate or delete your data.  So often APIs will require a key (or a set of keys) to access data through an API.

One helpful way to contextualize how an API is secured is to consider access in terms of identification, authentication, and authorization.4  Some APIs only want to know where the request is coming from (identification), while others require you to have a valid account (authentication) to access data.  Beyond authentication, an API may also want to ensure your account has permission to do certain functions (authorization).  For example, you may be an authenticated user of an API that allows you to make GET requests of data, but your account may still not be authorized to make POST, PUT, or DELETE requests.

Some common methods used by APIs to store authentication and authorization include OAuth and WSKey:

  • OAuth - OAuth is a widely used open standard for authorizing access to HTTP services like APIs.5  If you have ever sent a tweet from an interface that’s not Twitter (like sharing a photo directly from your mobile phone) you’ve utilized the OAuth framework.  Applications that already store authentication data in the form of user accounts (like Twitter and Google) can utilize their existing authentication structures to assign authorization for API access.  API Keys, Secrets, and Tokens can be assigned to authorized users, and those variables can be used by 3rd party applications without requiring the sharing of passwords with those 3rd parties.
  • WSKey (Web Services Key) – This is an example from OCLC that is conceptually very similar to OAuth.  If you have an OCLC account (either a worldcat.org or an oclc.org account) you can request key access.  Your authorization – in other words, what services and REST requests you are permitted to access – may be dependent upon your relationship with an OCLC member organization.

Keys, Secrets, Tokens?  HMAC?!

API authorization mechanisms often require multiple values in order to successfully interact with the API.  For example, with the Twitter API, you may be assigned an API Key and a corresponding Secret.  The topic of secret key authentication can be fairly complex,6 but fundamentally a Key and its corresponding Secret are used to authenticate requests in a secure encrypted fashion that would be difficult to guess or decrypt by malicious third-parties.  Multiple keys may be required to perform particular requests – for example, the Twitter API requires a key and secret to access the API itself, as well as a token and secret for OAuth authorization.

Probably the most important thing to remember about secrets is to keep them secret.  Do not share them or post them anywhere, and definitely do not store secret values in code uploaded to Github 7 (.gitignore – a method to exclude files from a git repository – is your friend here). 8  To that end, one strategy that is used by RESTful APIs to further secure secret key values is an HMAC header (hash-based message authentication code).  When requests are sent, HMAC uses your secret key to sign the request without actually passing the secret key value in the request itself. 9
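
As a toy illustration only (real OAuth 1.0a signing, for example, builds a very specific signature base string, which the PHP wrapper used later in this post handles for you), HMAC signing in PHP looks roughly like this:

<?php

//a request string to be signed -- in practice the API's documentation
//dictates exactly which parts of the request go into this string
$requestData = 'GET&https://api.example.org/resource&count=1';
$secretKey   = 'YOUR_API_SECRET';   //placeholder value

//only the resulting signature travels with the request; the secret never does --
//the server recomputes the signature with its copy of the secret and compares
$signature = base64_encode(hash_hmac('sha1', $requestData, $secretKey, true));

?>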

Case Study:  The Twitter API

It’s easier to understand how APIs work when you can see them in action.  To do these steps yourself, you’ll need a Twitter account.  I strongly recommend creating a Twitter account separate from your personal or organizational accounts for initial experimentation with the API.  This code example is a very simple walkthrough, and does not cover securing your application’s server (and thus securing the keys that may be stored on that server).  Anytime you authorize API access to a Twitter account you may be exposing it to some level of vulnerability.  At the end of the walkthrough, I’ll list the steps you would need to take if your account does get compromised.

1.  Activate a developer account

Visit dev.twitter.com and click the sign in area in the upper right corner.  Sign in with your Twitter account. Once signed in, click on your account icon (again in the upper right corner of the page) and then select the My Applications option from the drop-down menu.

Screenshot of the Twitter Developer Network login screen

2.  Get authorization

In the My applications area, click the Create New App button, and then fill out the required fields (Name, Description, and Website where the app code will be stored).  If you don’t have a web server, don’t worry, you can still get started testing out the API without actually writing any code.

3.  Get your keys

After you’ve created the application and are looking at its settings, click on the API Keys tab.  Here’s where you’ll get the values you need.  Your API Access Level is probably limited to read access only.  Click the “modify app permissions” link to set up read and write access, which will allow you to post through the API.  You will have to associate a mobile phone number with your Twitter account to get this level of authorization.

Screenshot of Twitter API options that allow for configuring API read and write access.

Scroll down and note that in addition to an API Key and Secret, you also have an access token associated with OAuth access.  This Token Key and Secret are required to authorize account activity associated with your Twitter user account.

4.  Test Oauth Access / Make a GET call

From the application API Key page, click the Test OAuth button.  This is a good way to get a sense of the API calls.  Leave the key values as they are on the page, and scroll down to the Request Settings Area.  Let’s do a call to return the most recent tweet from our account.

With the GET request checked, enter the following values:

Request URI:

Request Query (obviously replace yourtwitterhandle with… your actual Twitter handle):

  • screen_name=yourtwitterhandle&count=1

For example, my GET request looks like this:

Screenshot of the GET request setup screen for OAuth testing.

Click “See OAuth signature for this request”.  On the next page, look for the cURL request.  You can copy and paste this into a terminal or console window to execute the GET request and see the response (there will be a lot more response text than what’s posted here):

* SSLv3, TLS alert, Client hello (1):
[{"created_at":"Sun Apr 20 19:37:53 +0000 2014","id":457966401483845632,
"id_str":"457966401483845632",
"text":"Just Added: The Fault in Our Stars by John Green; 
2nd Floor PZ7.G8233 Fau 2012","

As you can see, the above response to my cURL request includes the text of my account’s last tweet:

Screenshot of the tweet returned by the request

What to do if your Twitter API Key or OAuth Security is Compromised

If your Twitter account suddenly starts tweeting out spammy “secrets to weight loss success” that you did not authorize (or other tweets that you didn’t write), your account has been compromised.  If you can still login with your username and password, it’s likely that your OAuth Keys have been compromised.  If you can’t log in, your account has probably been hacked.10  Your account can be compromised if you’ve authorized a third party app to tweet, but if your Twitter account has an active developer application on dev.twitter.com, it could be your own application’s key storage that’s been compromised.

Here are the immediate steps to take to stop the spam:

  1. Revoke access to third party apps under Settings –> Apps.  You may want to re-authorize them later – but you’ll probably want to reset the password for the third-party accounts that you had authorized.
  2. If you have generated API keys, log into dev.twitter.com and re-generate your API Keys and Secrets and your OAuth Keys and Secrets.  You’ll have to update any apps using the keys with the new key and secret information – but only if you have verified the server running the app hasn’t also been compromised.
  3. Reset your Twitter account password.11

5.  Taking it further:  Posting a new titles Twitter feed

So now you know a lot about the Twitter API – what now?  One way to take this further might involve writing an application to post new books that are added to your library’s collection.  Maybe you want to highlight a particular subject or collection – you can use some text output from your library catalog to post the title, author, and call number of new books.

The first step to such an application could involve creating an app that can post to the Twitter API.  If you have access to a server that can run PHP, you can easily get started by downloading this incredibly helpful PHP wrapper (TwitterAPIExchange.php, which is used in the code below).

Then in the same directory create two new files:

  • settings.php, which contains the following code (replace all the values in quotes with your actual Twitter API Key information):
<?php

$settings = array(
 'oauth_access_token' => "YOUR_ACCESS_TOKEN",
 'oauth_access_token_secret' => "YOUR_ACCESS_TOKEN_SECRET",
 'consumer_key' => "YOUR_API_KEY",
 'consumer_secret' => "YOUR_API_KEY_SECRET",
);

?>
  • and twitterpost.php, which has the following code, but swap out the values of ‘screen_name’ with your Twitter handle, and change the ‘status’ value if desired:
<?php

//call the PHP wrapper and your API values
require_once('TwitterAPIExchange.php');
include 'settings.php';

//define the request URL and REST request type
$url = "https://api.twitter.com/1.1/statuses/update.json";
$requestMethod = "POST";

//define your account and what you want to tweet
$postfields = array(
  'screen_name' => 'YOUR_TWITTER_HANDLE',
  'status' => 'This is my first API test post!'
);

//put it all together and build the request
$twitter = new TwitterAPIExchange($settings);
echo $twitter->buildOauth($url, $requestMethod)
->setPostfields($postfields)
->performRequest();

?>

Save the files and run the twitterpost.php page in your browser. Check the Twitter account referenced by the screen_name variable.  There should now be a new post with the contents of the ‘status’ value.
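
One rough sketch of how that new-titles feed might eventually work (this is an assumption on my part, not part of the wrapper or the code above): if twitterpost.php were changed to read its status text from a command-line argument (PHP’s $argv) instead of the hard-coded string, a short Bash loop could walk a pipe-delimited export of new titles and post each one:

#!/bin/bash
# new_titles.txt is a hypothetical export file: one line per book, formatted title|author|call number
while IFS='|' read -r title author callnumber; do
  # assumes twitterpost.php has been modified to use $argv[1] as the 'status' value
  php twitterpost.php "Just Added: ${title} by ${author}; ${callnumber}"
  sleep 5   # brief pause between posts to stay well under Twitter's rate limits
done < new_titles.txt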

This is just a start – you would still need to get data out of your ILS and feed it to this application in some way – which brings me to one final point.

Is there an API for your ILS?  Should there be? (Answer:  Yes!)

Getting data out of traditional, legacy ILS systems can be a challenge.  Extending or adding on to traditional ILS software can be impossible (and in some cases may have been prohibited by license agreements).  One of the reasons for this might be that the architecture of such systems was designed for a world where the kind of data exchange facilitated by RESTful APIs didn’t yet exist.  However, there is definitely a major trend by ILS developers to move toward allowing access to library data within ILS systems via APIs.

It can be difficult to articulate exactly why this kind of access is necessary – especially when looking toward the future of rich functionality in emerging web-based library service platforms.  Why should we have to build custom applications using APIs – shouldn’t our ILS systems be built with all the functionality we need?

While libraries should certainly promote comprehensive and flexible architecture in the ILS solutions they purchase, there will almost certainly come a time when, no matter how comprehensive your ILS is, you’re going to wonder, “wouldn’t it be nice if our system did X?”  Moreover, consider how your patrons might use your library’s APIs; for example, integrating your library’s web services into other apps and services they already use, or building their own applications with your library’s web services. If you have web service API access to your data – bibliographic, circulation, acquisition data, etc. – you have the opportunity to meet those needs and to innovate collaboratively.  Without access to your data, you’re limited to the development cycle of your ILS vendor, and it may be years before you see the functionality you really need to do something cool with your data.  (It may still be years before you can find the time to develop your own app with an API, but that’s an entirely different problem.)

Examples of Library Applications built using APIs and ILS API Resources

Further Reading

Michel, Jason P. Web Service APIs and Libraries. Chicago, IL:  ALA Editions, 2013. Print.

Richardson, Leonard, and Michael Amundsen. RESTful Web APIs. Sebastopol, Calif.: O’Reilly, 2013.

 

About our Guest Author:

Lauren Magnuson is Systems & Emerging Technologies Librarian at California State University, Northridge and a Systems Coordinator for the Private Academic Library Network of Indiana (PALNI).  She can be reached at lauren.magnuson@csun.edu or on Twitter @lpmagnuson.

 

Notes

  1. create, retrieve, update, and delete is sometimes referred to by acronym: CRUD
  2. For example, via the OCLC Collection Management API: http://www.oclc.org/developer/develop/web-services/wms-collection-management-api.en.html
  3. For more detail on these and other HTTP verbs, http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
  4. https://blog.apigee.com/detail/do_you_need_api_keys_api_identity_vs._authorization
  5. Google, for example: https://developers.google.com/accounts/docs/OAuth2
  6. To learn a lot more about this, check out this web series: http://www.youtube.com/playlist?list=PLB4D701646DAF0817
  7. http://www.securityweek.com/github-search-makes-easy-discovery-encryption-keys-passwords-source-code
  8. Learn more about .gitignore here:  https://help.github.com/articles/ignoring-files
  9. A nice overview of HMAC is here: http://www.wolfe.id.au/2012/10/20/what-is-hmac-and-why-is-it-useful
  10. Here’s what to do if your account is hacked and you can’t log in:  https://support.twitter.com/articles/185703-my-account-has-been-hacked
  11. More information, and further steps you can take are here:  https://support.twitter.com/articles/31796-my-account-has-been-compromised

Lightweight Project Management Tools in the Real World

My life got extra complicated in the last few months. I gave birth to my first child in January, and in between the stress of a new baby, unexpected hospital visits, and the worst winter in 35 years, it was a trying time. While I was able to step back from many commitments during my 8 week maternity leave, I didn’t want to be completely out of the loop, and since I would return to three back-to-back conferences, I needed to be able to jump back in and monitor collaborative projects from wherever I was. All of us have times in our lives that are this hectic or even more so, but even in the regular busy thrum of our professional lives it’s too easy to let ongoing commitments like committee work disappear from our mental landscapes entirely, leaving only the nagging feeling that we are missing something.

There are various methods and tools to enhance productivity, which we’ve looked at before. Some basic collaboration tools such as Google Docs are always good to have any time you are working on a group project that builds into something like a presentation or report. But for committee work or everyday work in a department, something more specialized can be even better. I want to look at some real-life examples of using lightweight project management tools to keep projects that you work on with others going strong—or not so strong, depending on how they are used. Over the past 4-5 months I’ve gotten experience using Trello for committee work and Asana for work projects. Both of them have some great features, but as always, success doesn’t depend entirely on the software’s functionality. Beyond those two implementations, I’ll address a few other tools and what effective usage of them looks like.

Asana

I have the great fortune of having an entire wall of my office painted with whiteboard paint, and I use it to sketch out ideas and projects. For that to be useful, though, I need to physically be in the office. So before I went on maternity leave, I knew I needed to get all my projects at work organized in a way that let me hand tasks I would normally do to others, as well as monitor what was happening on large ongoing projects. I had used Asana before in another context, so I decided to give it a try for this purpose. Asana has projects, tasks, and due dates that anyone in a workspace can follow and assign. It’s a pretty flexible system–the screenshot shows one potential way of setting it up, but we use different models for different projects, and there are many ideas out there. My favorite feature is project templates, which I use in another workspace that I share with my graduate assistant. This allows you to create a new project based on a standard series of steps, which means that she could create new projects while I was away based on the normal workflow we follow and I could work on them when I returned. All of this requires very strict attention to keeping projects organized, however, and if you don’t have an agreed-upon system for naming and organizing tasks they can get out of hand very quickly.

We also use Asana as part of our help request system. We wanted to set up a system to track requests from all the library staff, not only for my maternity leave but in general. I looked at many different systems, but they were almost all too heavy-duty for what we needed. I made our own very lightweight system using the Webform module in Drupal on our intranet. Staff submit requests through that form, which sends an email using a departmental email address to our Issue Tracking queue in Asana. Once the task is completed, we explain the problem in an Asana comment (or just mark it completed if it’s a routine request such as a new user account), and then send a reply to the requestor through the intranet. They can see all the requests they’ve made plus the replies through that system. The nice thing about doing it this way is that everything is in one place–trouble tickets become projects with tasks very easily.

Trello

Trello is designed to mimic the experience of using index cards or sticky notes on a wall to track ideas and figure out what is going on at a glance. This is particularly useful for ongoing work where you have multiple projects in a set of pipelines divvied up among various people. You can easily see how many ideas you have in the inception stage and how many are closer to completion, which can be a good motivator to move items along. Another use is to store detailed project ideas and notes and then sort them into lists once you figure out a structure.

Trello starts with a virtual board, which is divided into lists of cards. Trello cards can be assigned to specific people, and anyone can follow a card to get notifications. Clicking on a card brings up a whole set of additional options, including who is working on the project, attachments, due dates, color coding, and anything else you might want. The screenshot shows how the LITA Education Committee uses Trello to plan educational offerings. The white areas with small boxes indicate cards (we use one card per program/potential idea) that are active and assigned, the gray areas indicate cards which haven’t been touched in a while and so probably need followup. Not surprisingly, there are many more cards, many of which are inactive, at the beginning of the pipeline than at the end with programs already set up. This is a good visual reminder that we need to keep things moving along.

In this case I didn’t set up Trello, and I am not always the best user of it. Using this for committee work has been useful, but there are a few items to keep in mind for it to actually work to keep projects going. First, and this goes for everything, including analog cards or sticky notes, all the people working on the project need to check into it on a regular basis and use it consistently. One thing that I found was important for getting it into a regular workflow was turning on email notifications. While it would be nice to stay out of email more, most of us are used to seeing work show up there, and if you have a sane relationship with your inbox (i.e. you don’t use it to store work in progress), it can be a helpful nudge to log in and work on something. I haven’t used the mobile app yet, but that is another option for notifications.

Other Tools

While I have started using Asana and Trello more heavily recently, there are a number of other tools out there that you may need to use in your job or professional life. Here are a few:

Box

Many institutions have some sort of “cloud” file system now such as Box or Google Drive. My work uses Box, and I find it very useful for parts of projects where I need many people (but a slightly different set each time) to collaborate on completing a single task. I upload a spreadsheet that I need everyone to look at, use the information to do something, and then add additional information to the spreadsheet. This is a very common scenario that organizations often use a shared drive to accomplish, but there are a number of problems with that approach. If you’ve ever been confronted with the filename “Spring2014_report-Copy-Copy-DRAFT.xlsx” or not been able to open a file because someone else left it open on her desktop and went to lunch, you know what I mean. Instead of that, I upload the file to Box, and assign a task to the usernames of all the people I need to look at the document. They can use a tool called Box Edit to open the file in Excel and any changes they make are immediately saved back to the shared document, just as a Google Doc would do. They can then mark the task complete, and the system only sends email reminders to people who haven’t yet finished the task.

ALA Connect

This section is only relevant to people working on projects with an American Library Association group, whether a committee or interest group. Since this happens to most people working in academic libraries at some point, I think it’s worth considering; if not, skip to the conclusion. ALA Connect is the central repository for institutional memory and documents for work around ALA, including committees and interest groups. It can also be a good place to work on projects collaboratively, but it takes some setup. As a committee chair, I freely admit that I need to organize my own ALA Connect page much better. My normal approach was to use an online document (so something editable by everyone) for each project and file each document under a subcommittee heading, but in practice I find it way too hard to find the right document to see what each subcommittee is working on. I am going to experiment with a new approach: I will create “groups” for each project and use the Group Headings sidebar to organize them. If you’re on a committee and not the chair, you don’t have access to reorganize the sidebar or posts, but suggest this approach to your chair if you can’t find anything in “General News & Discussions”. Also, try to document the approach you’ve taken so future chairs will know what you did, and let other chairs know what works for your committee.

You also need to make a firm commitment as a chair to hold certain types of discussions on your committee mailing list and certain discussions on ALA Connect, and then to document any pertinent mailing list discussions on ALA Connect. That way you won’t find yourself unable to figure out where you are on the project because half your work is in email and half on ALA Connect. (This obviously goes for any tool other than email as well.)

Conclusion

With all the tools above, you really have no excuse to be running projects through email, which is not very effective unless everyone you are working with is very strict with their email filing and reply times. (Hint: they aren’t—see above about a sane relationship with your inbox.) But any tool requires a good plan for how its strengths mesh with the work you have to accomplish. If your project is to complete a document by a certain date, a combination of Google Docs or Box (or ALA Connect for ALA work) and automated reminders might be best. If you want to throw a lot of ideas around and then organize them, Trello or Asana might work. Since these are all free to try, explore a few tools before starting a big project to see what works for you and your collaborators. Once you pick one, dedicate a bit of time on a weekly or monthly basis to keeping your virtual workspace organized. If you find it’s no longer working, figure out why. Did the scope of your project change over time, so that a different tool is now more effective? This can happen when you are planning to implement something and switch over from the implementation to ongoing work with the new system. Or maybe people have gotten complacent about checking in on work to do. Explore different types of notifications or mobile apps to reinvigorate your team.

I would love to hear about your own approach to lightweight project management with these tools or others in the comments.

 


One Shocking Tool Plus Two Simple Ideas That Will Forever Change How You Share Links

The Click Economy

The economy of the web runs on clicks and page views. The way web sites turn traffic into profit is complex, but I think we can get away with a broad gloss of the link economy as long as we acknowledge that greater underlying complexity exists. Basically speaking, traffic (measured in clicks, views, unique visitors, length of visit, etc.) leads to ad revenue. Web sites benefit when viewers click on links to their pages and when these viewers see and click on ads. The scale of the click economy is difficult to visualize. Direct benefits from a single click or page view are minuscule. Profits tend to be nonexistent or trivial on any scale smaller than unbelievably massive. This has the effect of making individual clicks relatively meaningless, but systems that can magnify clicks and aggregate them are extremely valuable. What this means for the individual person on the web is that unless we are Arianna Huffington, Sheryl Sandberg, Larry Page, or Mark Zuckerberg, we probably aren’t going to get rich off of clicks. However, we do have impact, and our online reputations can significantly influence which articles and posts go viral. If we understand how the click economy works, we can use our reputation and influence responsibly. If we are linking to content we think is good and virtuous, then there is no problem with spreading “link juice” indiscriminately. However, if we want to draw someone’s attention to content we object to, we can take steps to link responsibly and not have our outrage fuel profits for the content’s author. 1 We’ve seen that links benefit a site’s owners in two ways: directly through ad revenues and indirectly through “link juice,” or the positive effect that inbound links have on search engine ranking and social network trend lists. If our goal is to link without benefiting the owner of the page we are linking to, we will need a separate technique for each of the two ways a web site benefits from links.

For two excellent pieces on the click economy, check out Robinson Meyer’s Why Are Upworthy Headlines Suddenly Everywhere?2 in The Atlantic and Clay Johnson’s book The Information Diet, especially the New Journalists section of chapter three.3

Page Rank

Page Rank is the name of a key algorithm Google uses to rank the web pages it returns. 4 It counts inbound links to a page and keeps track of the relative importance of the sites the links come from. A site’s Page Rank score is a significant part of how Google decides to rank search results. 5 Search engines like Google recognize that there would be a massive problem if all inbound links were counted as votes for a site’s quality. 6 Without some mechanism to communicate “I’m linking to this site as an example of awful thinking,” there really would be no such thing as bad publicity, and a website with thousands of complaints and zero positive reviews would shoot to the top of search engine rankings. For example, every time a librarian used martinlutherking.org (a malicious propaganda site run by the white supremacist group Stormfront) as an example in a lesson about web site evaluation, the page would rise in Google’s ranking and more people would find it in the course of natural searches for information on Dr. King. When linking to malicious content, we can avoid increasing its Page Rank score by adding the rel=“nofollow” attribute to the anchor link tag. A normal link is written like this:

<a href="http://www.horriblesite.com/horriblecontent/" target="_blank">This is a horrible page.</a>

This link would add the referring page’s reputation or “link juice” to the horrible site’s Page Rank. To fix that, we need to add the rel=“nofollow” attribute.

<a href="http://www.horriblesite.com/horriblecontent/" target="_blank" rel="nofollow">This is a horrible page.</a>

This addition communicates to the search engine that the link should not count as a vote for the site’s value or reputation. Of course, not all linking takes place on web pages anymore. What happens if we want to share this link on Facebook or Twitter? Both Facebook and Twitter automatically add rel=“nofollow” to their links (you can see this if you view page source), but we should not rely on that alone. Social networks aggregate links and provide their own link juice similarly to search engines. When sharing links on social networks, we’ll want to employ a tool that keeps control of the link’s power in our own hands. donotlink.com is a very interesting tool for this purpose.

donotlink.com

donotlink.com is a service that creates safe links that don’t pass on any reputation or link juice. It is ideal for sharing links to sites we object to. On one level, it works similarly to a URL shortener like bit.ly or tinyurl.com. It creates a new URL customized for sharing on social networks. On deeper levels, it does some very clever stuff to make sure no link juice dribbles to the site being linked. They explain what, why, and how very well on their site. Basically speaking, donotlink.com passes the link through a new URL that uses javascript, a robots.txt file, and the nofollow and noindex link attributes to both ask search engines and social networks not to apply link juice and to make it structurally difficult to ignore these requests. 7 This makes donotlink.com’s link masking service an excellent solution to the problem of web sites indirectly profiting from negative attention.

Page Views & Traffic

All of the techniques listed above will deny a linked site the indirect benefits of link juice. They will not, however, deny the site the direct benefits from increased traffic or views and clicks on the page’s advertisements. There are ways to share content without generating any traffic or advertising revenues, but these involve capturing the content and posting it somewhere else, so they raise ethical questions about respect for intellectual property; I suggest using them only with caution and intentionality. A quick and easy way to direct traffic to content without benefiting the hosting site is to use a link to Google’s cache of the page. If you can find a page in a Google search, clicking the green arrow next to the URL (see image) will give the option of viewing the cached page. Then just copy the full URL and share that link instead of the original. Viewers can read the text without giving the content page views. Not all pages are visible on Google, so the Wayback Machine from the Internet Archive is a great alternative. The Wayback Machine provides access to archived versions of web pages and also has a mechanism (see the image on the right) for adding new pages to the archive.

Screengrab of Google Cache

Caching a site at the Wayback Machine
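
If you share pages this way often, the Wayback Machine step can be scripted. Here is a minimal sketch that assumes the archive’s public save endpoint (https://web.archive.org/save/) behaves as it does at the time of writing, with the snapshot URL typically reported back in a Location or Content-Location response header:

#!/bin/bash
# usage: ./archive_link.sh http://example.com/page-you-object-to
page="$1"

# ask the Wayback Machine to capture the page, print only the response headers,
# and pull out the header that points at the new /web/<timestamp>/<url> snapshot
curl -s -D - -o /dev/null "https://web.archive.org/save/${page}" | grep -i 'location'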

Both of these solutions rely on external hosts and if the owner of the content is serious about erasing a page, there are processes for removing content from both Google’s cache and the Wayback Machine archives. To be certain of archiving content, the simplest solution is to capture a screenshot and share the image file. This gives you control over the image, but may be unwieldy for larger documents. In these cases saving as a PDF may be a useful workaround. (Personally, I prefer to use the Clearly browser plugin with Evernote, but I have a paid Evernote account and am already invested in the Evernote infrastructure.)

Summing up

In conclusion, there are a number of steps we can take when we want to be responsible with how we distribute link juice. If we want to share information without donating our online reputation to the information’s owner, we can use donotlink.com to generate a link that does not improve their search engine ranking. If we want to go a step further, we can link to a cached version of the page or share a screenshot.

Notes

  1. Using outrageous or objectionable content to generate web traffic is a black-hat SEO technique known as “evil hooks.” There is a lot of profit in “You won’t believe what this person said!” links.
  2. http://www.theatlantic.com/technology/archive/2013/12/why-are-upworthy-headlines-suddenly-everywhere/282048/
  3. The Information Diet, page 35-41
  4. https://en.wikipedia.org/wiki/PageRank
  5. Matt Cutts’ How Search Works video.
  6. I’ve used this article http://www.nytimes.com/2010/11/28/business/28borker.html to explain this concept to my students. It is also referenced by donotlink.com in their documentation.
  7. Javascript is slightly less transparent to search engines and social networks than HTML; robots.txt is a file on a web server that tells search engine bots which pages to crawl (it works more like a no trespassing sign than a locked gate); noindex tells bots not to add the page to their indexes.

Higher ‘Professional’ Ed, Lifelong Learning to Stay Employed, Quantified Self, and Libraries

The 2014 Horizon Report is mostly a report on emerging technologies. Many academic librarians carefully read its Higher Ed edition issued every year to learn about the upcoming technology trends. But this year’s Horizon Report Higher Ed edition was interesting to me more in terms of how the current state of higher education is reflected in the report than in terms of the technologies on the near-term (one-to-five year) horizon of adoption. Let’s take a look.

A. Higher Ed or Higher Professional Ed?

To me, the most useful section of this year’s Horizon Report was ‘Wicked Challenges.’ The significant backdrop behind the first challenge “Expanding Access” is the fact that the knowledge economy is making higher education more and more closely and directly serve the needs of the labor market. The report says, “a postsecondary education is becoming less of an option and more of an economic imperative. Universities that were once bastions for the elite need to re-examine their trajectories in light of these issues of access, and the concept of a credit-based degree is currently in question.” (p.30)

Many of today’s students enter colleges and universities with a clear goal, i.e. obtaining a competitive edge and better earning potential in the labor market. The result, already familiar to many of us, is grade and degree inflation and the emergence of higher ed institutions that pursue profit over education itself. When the acquisition of skills takes precedence over intellectual inquiry for its own sake, higher education comes to resemble higher professional education or intensive vocational training. As the economy all but forces people to take up lifelong learning simply to stay employed, the friction between the traditional goal of higher education – intellectual pursuit for its own sake – and the changing expectations placed on it – a creative, adaptable, and flexible workforce – will only become more prominent.

Naturally, this socioeconomic background behind the expansion of postsecondary education raises the question of where its value lies. This is the second wicked challenge listed in the report, i.e. “Keeping Education Relevant.” The report says, “As online learning and free educational content become more pervasive, institutional stakeholders must address the question of what universities can provide that other approaches cannot, and rethink the value of higher education from a student’s perspective.” (p.32)

B. Lifelong Learning to Stay Employed

Today’s economy and labor market strongly prefer employees who can be hired, retooled, or let go at the same pace as changes in technology, as technology becomes one of the greatest driving forces of the economy. Workers are expected to enter the job market with more complex skills than in the past, to adjust themselves quickly as important skills at workplaces change, and increasingly to take on the role of creator/producer/entrepreneur in their thinking and work practices. Credit-based degree programs fall short in this regard. It is no surprise that the report selected “Agile Approaches to Change” and “Shift from Students as Consumers to Students as Creators” as two of the long-range and mid-range key trends in the report.

A strong focus on creativity, productivity, entrepreneurship, and lifelong learning, however, puts a heavier burden on both sides of education, i.e. instructors and students (full-time, part-time, and professional). While positive in emphasizing students’ active learning, the Flipped Classroom model selected as one of the key trends in the Horizon report often means additional work for instructors. In this model, instructors not only have to prepare the study materials for students to go over before the class, such as lecture videos, but also need to plan active learning activities for students during the class time. The Flipped Classroom model also assumes that students should be able to invest enough time outside the classroom to study.

The unfortunate side effect of this is that those who cannot afford to do so – for example, those who have to work multiple jobs or have many family obligations – will suffer and fall behind. Today’s students and workers are being asked to demonstrate their competencies with what they can produce, beyond simply presenting the credit hours that they spent in the classroom. Probably as a result of this, a clear demarcation between work, learning, and personal life seems to be disappearing. “The E-Learning Predictions for 2014 Report” from EdTech Europe predicts that ‘Learning Record Stores’, which track, record, and quantify an individual’s experiences and progress in both formal and informal learning, will emerge in step with the need for the continuous learning required by today’s job market. EdTech Europe also points out that learning is now being embedded in daily tasks and that we will see a significant increase in the availability and use of casual and informal learning apps both in education and in the workplace.

C. Quantified Self and Learning Analytics

Among the six emerging technologies in the 2014 Horizon Report Higher Education edition, ‘Quantified Self’ is by far the most interesting new trend. (Other technologies should be pretty familiar to those who have been following the Horizon Report every year, except maybe the 4D printing mentioned in the 3D printing section. If you are looking for the emerging technologies that are on a farther horizon of adoption, check out this article from the World Economic Forum’s Global Agenda Council on Emerging Technologies, which lists technologies such as screenless display and brain-computer interfaces.)

According to the report, “Quantified Self describes the phenomenon of consumers being able to closely track data that is relevant to their daily activities through the use of technology.” (ACRL TechConnect has covered personal data monitoring and action analytics previously.) Quantified Self is enabled by wearable technology devices, such as Fitbit or Google Glass, and the Mobile Web. Wearable technology devices automatically collect personal data. Fitbit, for example, keeps track of one’s sleep patterns, steps taken, and calories burned. And the Mobile Web is the platform that can store and present such personal data directly transferred from those devices. Through these devices and the resulting personal data, we get to observe our own behavior in a much more extensive and detailed manner than ever before. Instead of deciding which parts of our lives to keep a record of, we can now let these devices collect almost all types of data about ourselves and then see which data are of any use to us and whether any patterns emerge that we can perhaps utilize for self-improvement.

Quantified Self is a notable trend not because it involves an unprecedented technology but because it gives us a glimpse of what our daily lives will be like in the near future, in which many of the emerging technologies that we are just getting used to right now – mobile, big data, wearable technology – will come together in full bloom. ‘Learning Analytics,’ which the Horizon Report calls “the educational application of ‘big data’” (p.38) and can be thought of as the application of Quantified Self in education, has already been making significant progress in higher education. By collecting and analyzing data about student behavior in online courses, learning analytics aims at improving student engagement, providing more personalized learning experiences, detecting learning issues, and determining the behavior variables that are significant indicators of student performance.

While privacy is a natural concern for Quantified Self, it is worth noting that we ourselves often willingly participate in personal data monitoring through gamified self-tracking apps, a degree of monitoring that could be offensive in other contexts. In her article, “Gamifying the Quantified Self,” Jennifer Whitson writes:

Gamified self-tracking and participatory surveillance applications are seen and embraced as play because they are entered into freely, injecting the spirit of play into otherwise monotonous activities. These gamified self-improvement apps evoke a specific agency—that of an active subject choosing to expose and disclose their otherwise secret selves, selves that can only be made penetrable via the datastreams and algorithms which pin down and make this otherwise unreachable interiority amenable to being operated on and consciously manipulated by the user and shared with others. The fact that these tools are consumer monitoring devices run by corporations that create neoliberal, responsibilized subjectivities become less salient to the user because of this freedom to quit the game at any time. These gamified applications are playthings that can be abandoned at whim, especially if they fail to pleasure, entertain and amuse. In contrast, the case of gamified workplaces exemplifies an entirely different problematic. (p.173; emphasis my own and not by the author)

If libraries and higher education institutions become active in monitoring and collecting students’ learning behavior, the success of such an endeavor will depend on how well it creates and provides a sense of play that invites students’ willing participation. It will also be important for such a learning analytics project to offer an opt-out at any time and to keep private data as confidential and anonymous as possible.

D. Back to Libraries

The changed format of this year’s Horizon Report, with its ‘Key Trends’ and ‘Significant Challenges,’ shows much more clearly the forces in play behind the emerging technologies to look out for in higher education. One big take-away from this report, I believe, is that in spite of the doubt about the unique value of higher education, demand will increase because of students’ need to obtain a competitive advantage in entering or re-entering the workforce. Another is that higher ed institutions will endeavor to create appropriate means and tools – such as competency-based assessments and badge systems – to help students acquire and demonstrate skills and experience in ways that appeal to future employers beyond credit-hour-based degrees.

Considering that the pace of change in higher education tends to be slow, this can be an opportunity for academic libraries. Both instructors and students are under constant pressure to innovate and experiment in their teaching and learning processes. Instructors designing the Flipped Classroom model may require a studio where they can record and produce their lecture videos. Students may need to compile portfolios to demonstrate their knowledge and skills for job interviews. Returning adult students may need to acquire habitual lifelong learning practices with help from librarians. Local employers and students may mutually benefit from a place where certain co-projects can be tried. As a neutral player on the campus with tech-savvy librarians and knowledgeable staff, libraries can create a place where the most palpable student needs that are yet to be satisfied by individual academic departments or student services are directly addressed. Maker labs, gamified learning or self-tracking modules, and a competency dashboard are all such examples. From the emerging technology trends in higher ed, we see that the learning activities in higher education and academic libraries will be more and more closely tied to the economic imperative of constant innovation.

Academic libraries may even go further and take up the role of leading the changes in higher education. In his blog post for Inside Higher Ed, Joshua Kim suggests exactly this and also nicely sums up the challenges that today’s higher education faces:

  • How do we increase postsecondary productivity while guarding against commodification?
  • How do we increase quality while increasing access?
  • How do we leverage technologies without sacrificing the human element essential for authentic learning?

How will academic libraries be able to lead the changes necessary for higher education to successfully meet these challenges? It is a question that will stay with academic libraries for many years to come.


My First Hackathon & WikipeDPLA

Almost two months ago, I attended my first hackathon during ALA’s Midwinter Meeting. Libhack was coordinated by the Library Code Year Interest Group. Much credit is due to coordinators Zach Coble, Emily Flynn, Jesse Saunders, and Chris Strauber. The University of Pennsylvania graciously hosted the event in their Van Pelt Library.

What’s a hackathon? It’s a short event, usually a day or two, wherein coders and other folks get together to produce software. Hackathons typically work on a particular problem, application, or API (a source of structured data). LibHack focused on APIs from two major library organizations: OCLC and the Digital Public Library of America (DPLA).

Impressions & Mixed Content

Since this was my first hackathon and the gritty details below may be less than relevant to all our readers, I will front-load my general impressions of Libhack rather than talk about the code I wrote. First of all, splitting the hackathon into two halves focused on different APIs and catering to different skill levels worked well. There were roughly equal numbers of participants in both the structured, beginner-oriented OCLC group and the more independent DPLA group.

Having representatives from both of the participating institutions was wonderful. While I didn’t take advantage of the attending DPLA staff as much as I should have, it was great to have a few people to answer questions. What’s more, I think DPLA benefited from hearing about developers’ experiences with their API. For instance, there are a few metadata fields in their API which might contain an array or a string depending upon the record. If an application assumes one or the other, chances are it breaks at some point and the programmer has to locate the error and write code that handles either data format.

Secondly, the DPLA API is currently available only over unencrypted HTTP. Thus due to the mixed content policies of web browsers it is difficult to call the HTTP API on HTTPS pages. For the many HTTP sites on the web this isn’t a concern, but I wanted to call the DPLA API from Wikipedia which only serves content over HTTPS. To work around this limitation, users have to manually override mixed content blocking in their browser, a major limitation for my project. DPLA already had plans to roll out an HTTPS API, but I think hearing from developers may influence its priority.

Learn You Some Lessons

Personally, I walked away from Libhack with a few lessons. First of all, I had to throw away my initial code before creating something useful. While I had a general idea in mind—somehow connect DPLA content related to a given Wikipedia page—I wasn’t sure what type of project I should create. I started writing a command-line tool in Python, envisioning a command that could be passed a Wikipedia URL or article title and return a list of related items in the DPLA. But after struggling with a pretty unsatisfying project for a couple hours, including a detour into investigating the MediaWiki API, I threw everything aside and took a totally different approach by building a client-side script meant to run in a web browser. In the end, I’m a lot happier with the outcome after my initial failure. I love the command line, but the appeal of such a tool would be niche at best. What I wrote has a far broader appeal.1

Secondly, I worked closely with Wikipedian Jake Orlowitz.2 While he isn’t a coder, his intimate knowledge of Wikipedia was invaluable for our end product. Whenever I had a question about Wikipedia’s inner workings or needed someone to bounce ideas off of, he was there. While I blindly started writing some JavaScript without a firm idea of how we could embed it onto Wikipedia pages, it was Jake who pointed me towards User Scripts and created an excellent installation tour.3 In other groups, I heard people discussing metadata, subject terms, and copyright. I think that having people of varied expertise in a group is advantageous when compared with a group solely composed of coders. Many hackathons explicitly state that non-programmers are welcome, and with good reason; experts can outline goals, consider end-user interactions, and interpret API results. These are all invaluable contributions which are also hard to make with one’s face buried in a code editor.

While I did enjoy my hackathon experience, I was expecting a bit more structure and larger project groups. I arrived late, which doubtless didn’t help, but the DPLA groups were very fragmented. Some projects were only individuals, while others (like ours) were pairs. I had envisioned groups of at least four, where perhaps one person would compose plans and documentation, another would design a user interface, and the remainder would write back-end code. I can’t say that I was at all disappointed, but I could have benefited from the perspectives of a larger group.

What is WikipeDPLA?

So what did we build at Libhack anyway? As previously stated, we made a Wikipedia user script. I’ve dubbed it WikipeDPLA, though you can find it as FindDPLA on Wikipedia. Once installed, the script will query DPLA’s API on each article you visit, inserting related items towards the top.

WikipeDPLA in action

How does it work?

Here’s a step-by-step walkthrough of how WikipeDPLA works:

When you visit a Wikipedia article, it collects a few pieces of information about the article by copying text from the page’s HTML: the article’s title, any “X redirects here” notices, and the article’s categories.

First, WikipeDPLA constructs a DPLA query using the article’s title. Specifically, it constructs a JSONP query. JSONP is a means of working around the web’s same-origin policy, which would otherwise prevent a script from reading data loaded from another site. It works by including a script tag with a specially constructed URL containing a reference to one of your JavaScript functions:

<script src="//example.com/jsonp-api?q=search+term&callback=parseResponse"></script>

In responding to this request, the API plays a little trick; it doesn’t just return raw data, since that would be invalid JavaScript and thus cause a parsing error in the browser. Instead, it wraps the data in the function we’ve provided it. In the example above, that’s parseResponse:

parseResponse({
    "results": [
        {"title": "Searcher Searcherson",
        "id": 123123,
        "genre": "Omphaloskepsis"},
        {"title": "Terminated Term",
        "id": 321321,
        "genre": "Literalism"}
    ]
});

This is valid JavaScript; parseResponse receives an object which contains an array of search result records, each with some minimal metadata. This pattern has the handy feature that, as soon as our query results are available, they’re immediately passed to our callback function.

WikipeDPLA’s equivalent of parseResponse looks to see if there are any results. If the article’s title doesn’t return any results, then it’ll try again with any alternate titles culled from the article’s redirection notices. If those queries are also fruitless, it starts to go through the article’s categories.

Once we’ve guaranteed that we have some results from DPLA, we parse the API’s metadata into a simpler subset. This subset consists of the item’s title, a link to its content, and an “isImage” Boolean value noting whether or not the item is an image. With this simpler set of data in hand, we loop through our results to build a string of HTML which is then inserted onto the page. Voilà! DPLA search results in Wikipedia.

Honing

After putting the project together, I continued to refine it. I used the “isImage” Boolean to put a small image icon next to an item’s link. Then, after the hackathon, I noticed that my script was a nuisance if a user started reading a page anywhere other than at its start. For instance, if you start reading the Barack Obama article at the Presidency section, you will read for a moment and then suddenly be jarred as the DPLA results are inserted up top and push the rest of the article’s text down the page. In order to mitigate this behavior, we need to know if the top of the article is in view before inserting our results HTML. I used a jQuery visibility plug-in and an event listener on window scroll events to fix this.

Secondly, I was building a project with several targets: a user script for Wikipedia, a Grease/Tampermonkey user script4, and an (as yet inchoate) browser extension. To reuse the same basic JavaScript but in these different contexts, I chose to use the make command. Make is a common program used for projects which have multiple platform targets. It has an elegantly simple design: when you run make foo inside of a directory, make looks in a file named “makefile” for a line labelled “foo:” and then executes the shell command on the subsequent line. So if I have the following makefile:

hello:
    echo 'hello world!'

bye:
    echo 'goodbye!'

clean:
    rm *.log *.cache

Inside the same directory as this makefile, the commands make hello, make bye, and make clean respectively would print “hello world!” to my terminal, print “goodbye!”, and delete all files ending in extension “log” or “cache”. This contrived example doesn’t help much, but in my project I can run something like make userscript and the Grease/Tampermonkey script is automatically produced by prepending some header text to the main WikipeDPLA script. Similarly, make push produces all the various platform targets and then pushes the results up to the GitHub repo, saving a significant amount of typing on the command line.
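
For a concrete, though simplified and hypothetical, version of what those targets might look like, the relevant lines of such a makefile could read as follows (the file names are invented for illustration, and in a real makefile the recipe lines must be indented with a tab character, which blog formatting tends to flatten into spaces):

# build the Grease/Tampermonkey flavor by prepending a metadata header to the main script
userscript:
    cat userscript-header.js wikipedpla.js > wikipedpla.user.js

# rebuild the platform targets and push the results up to GitHub
push: userscript
    git add -A && git commit -m 'rebuild user script' && git push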

These bits of trivia about interface design and tooling allude to a more important idea: it’s vital to choose projects that help you learn, particularly in a low-stakes environment like a hackathon. No one expects greatness from a product duct taped together in a few hours, so seize the opportunity to practice rather than aim for perfection. I didn’t have to write a makefile, but I chose to spend time familiarizing myself with a useful tool.

What’s Next?

While I am quite happy with my work at Libhack I do have plans for improvement. My main goal is to turn WikipeDPLA into a browser extension, for Chrome and perhaps Firefox. An extension offers a couple advantages: it can avoid the mixed-content issue with DPLA’s HTTP-only API5 and it is available even for users who aren’t logged in to Wikipedia. It would also be nice to expand my approach to encompassing other major digital library APIs, such as Europeana or Australia’s Trove.

And, of course, I want to attend more hackathons. Libhack was a very positive event for me, both in terms of learning and producing something useful, so I’m encouraged and hope other library conferences offer collaborative coding opportunities.

Other Projects

Readers should head over to LITA Blog where organizer Zach Coble has a report on libhack which details several other projects created at the Midwinter hackathon. Or you could just follow @HistoricalCats on Twitter.

Notes

  1. An aside related to learning to program: being a relatively new coder, I often think about advice I can give others looking to start coding. One common question is “what language should I learn first?” There’s one stock response, that it’s important not to worry too much about this choice because learning the fundamentals of one language will enable you to learn others quickly. But that dodges the question, because what people want to hear is a proper noun like “Ruby” or “Python” or “JavaScript.” And JavaScript, despite not being nearly as user friendly as those other two options, is a great starting place because it lets you work on the web with little effort. All of this to say; if I didn’t know JavaScript fairly well, I would not have been able to make something so useful.
  2. Shameless plug: Jake works on the Wikipedia Library, an interesting project that aims to connect Wikipedian researchers with source material, from subscription databases and open access repositories alike.
  3. User Scripts are pieces of JavaScript that a user can choose to insert whenever they are signed into and browsing Wikipedia. They’re similar to Greasemonkey user scripts, except the scripts only apply to Wikipedia. These scripts can do anything from customize the site’s appearance to insert new content, which is exactly what we did.
  4. Greasemonkey is the Firefox add-on for installing scripts that run on specified sites or pages; Tampermonkey is an analogous extension for Chrome.
  5. How’s that for acronyms?

Bash Scripting: automating repetitive command line tasks

Introduction

One of my current tasks is to develop workflows for digital preservation procedures. These workflows begin with the acquisition of files – either disk images or logical file transfers – both of which end up on a designated server. Once acquired, the images (or files) are checked for viruses. If clean, they are bagged using BagIt and then copied over to a different server for processing.1 This work is all done at the command line, and as you can imagine, it gets quite repetitive. It’s also a bit error-prone, since our file naming conventions include a 10-digit media ID number, which is easily mistyped. So once all the individual processes were worked out, I decided to automate things a bit by placing the commands into a single script. I should mention here that I’m no Linux whiz – I use it as needed, which sometimes is daily, sometimes not. This is the first time I’ve ever tried to tie commands together in a Bash script, but I figured previous programming experience would help.

Creating a Script

To get started, I placed all the virus check commands for disk images into a script. These commands are different than logical file virus checks since the disk has to be mounted to get a read. This is a pretty simple step – first add:

#!/bin/bash

as the first line of the file (this line should not be indented or have any other whitespace in front of it). This tells the kernel which interpreter to invoke, in this case, Bash. You could substitute the path to another interpreter for other types of scripts – for example, #!/usr/bin/env python for Python.

Next, I changed the file permissions to make the script executable:

chmod +x myscript

I separated the virus check commands so that I could test those out and make sure they were working as expected before delving into other script functions.

Here’s what my initial script looked like (comments are preceded by a #):

#!/bin/bash

#mount disk
sudo mount -o ro,loop,nodev,noexec,nosuid,noatime /mnt/transferfiles/2013_072/2013_072_DM0000000001/2013_072_DM0000000001.iso /mnt/iso

#check to verify mount
mountpoint -q /mnt/iso && echo "mounted" || echo "not mounted"

#call the Clam AV program to run the virus check
clamscan -r /mnt/iso > "/mnt/transferfiles/2013_072/2013_072_DM0000000001/2013_072_DM0000000001_scan_test.txt"

#unmount disk
sudo umount /mnt/iso

#check disk unmounted
mountpoint -q /mnt/iso && echo "mounted" || echo "not mounted"

All those options on the mount command? They give me the peace of mind that accessing the disk will in no way alter it (or affect the host server), thus preserving its authenticity as an archival object. You may also be wondering about the use of “&&” and “||”.  These function as conditional AND and OR operators, respectively. So “&&” tells the shell to run the first command, AND if that’s successful, it will run the second command. Conversely, the “||” tells the shell to run the first command OR if that fails, run the second command. So the mount check command can be read as: check to see if the directory at /mnt/iso is a mountpoint. If the mount is successful, then echo “mounted.” If it’s not, echo “not mounted.” More on redirection.
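
A contrived pair of commands, run outside the script, shows the same behavior (the paths here are just for illustration):

# "&&": the echo runs only if mkdir succeeds
mkdir /tmp/iso_test && echo "directory created"

# "||": the echo runs only if the first command fails
mountpoint -q /tmp/iso_test || echo "not a mountpoint"

# combined, as in the script: report one outcome or the other
mountpoint -q /tmp/iso_test && echo "mounted" || echo "not mounted"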

Adding Variables

You may have noted that the script above only works on one disk image (2013_072_DM0000000001.iso), which isn’t very useful. I created variables for the accession number, the digital media number, and the file extension, since they all change depending on the disk image information. The file naming convention we use for disk images is consistent and strict. The top level directory is the accession number. Within that, each disk image acquired from that accession is stored within its own directory, named using its assigned number. The disk image is then named by a combination of the accession number and the disk image number. Yes, it’s repetitive, but it keeps track of where things came from and links to data we have stored elsewhere. Given that these disks may not be processed for 20 years, such redundancy is preferred.

At first I thought the accession number, digital media number, and extension variables would be best entered with the initial run command: type in one line to run many commands. Each variable is separated by a space; the .iso at the end is the extension for an optical disk image file:

$ ./virus_check.sh 2013_072 DM0000000001 .iso

In Bash, arguments passed to a script on the command line are available as $1 for the first argument, $2 for the second, and so on. This actually tripped me up for a day or so. I initially thought the $1, $2, etc. variable names used by the book I was referencing were for examples only, and that the first variables I referenced in the script would automatically map in order, so if 2013_072 was the first argument, and $accession was the first variable, $accession = 2013_072 (much like when you pass in a parameter to a Python function). Then I realized there was a reason that more than one reference book and/or site used the $1, $2, $3 system for variables passed in as command line arguments. I assigned each to its proper variable, and things were rolling again.

#!/bin/bash

#assign command line variables
accession=$1
digital_media=$2
extension=$3

#mount disk
sudo mount -o ro,loop,noatime /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} /mnt/iso

Note: variable names are often presented without curly braces; it’s recommended to place them in curly braces when they are adjacent to other strings.2
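
A quick demonstration of why the braces matter (the values here are just for illustration): without braces, Bash treats the underscore as part of the variable name and expands a variable that doesn’t exist.

accession="2013_072"
digital_media="DM0000000001"

# without braces, Bash looks for a variable named "accession_", which is unset
echo "$accession_$digital_media"       # prints: DM0000000001

# with braces, the variable name boundaries are explicit
echo "${accession}_${digital_media}"   # prints: 2013_072_DM0000000001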

Reading Data

After testing the script a bit, I realized I  didn’t like passing the variables in via the command line. I kept making typos, and it was annoying not to have any data validation done in a more interactive fashion. I reconfigured the script to prompt the user for input:

read -p "Please enter the accession number" accession

read -p "Please enter the digital media number" digital_media

read -p "Please enter the file extension, include the preceding period" extension

After reviewing some test options, I decided to test that $accession and $digital_media were valid directories, and that the combo of all three variables was a valid file. This test seems more conclusive than simply testing whether or not the variables fit the naming criteria, but it does mean that if the data entered is invalid, the feedback given to the user is limited. I’m considering adding tests for naming criteria as well, so that the user knows when the error is due to a typo vs. a non-existent directory or file. I also didn’t want the code to simply quit when one of the variables is invalid – that’s not very user-friendly. I decided to ask the user for input until valid input was received.

read -p "Please enter the accession number" accession
until [ -d /mnt/transferfiles/${accession} ]; do
     read -p "Invalid. Please enter the accession number." accession
done

read -p "Please enter the digital media number" digital_media
until [ -d /mnt/transferfiles/${accession}/${accession}_${digital_media} ]; do
     read -p "Invalid. Please enter the digital media number." digital_media
done

read -p  "Please enter the file extension, include the preceding period" extension
until [ -e/mnt/transferfiles/${accession}/${accession}_${digital_media}/
${accession}_${digital_media}${extension} ]; do
     read -p "Invalid. Please enter the file extension, including the preceding period" extension
done
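
For what it’s worth, the naming-criteria test I mentioned might look something like the sketch below. The patterns are only guesses based on the example numbers in this post (a year, an underscore, and three digits for the accession; DM plus ten digits for the digital media number), not our actual rules:

#format checks before touching the filesystem (the patterns here are illustrative guesses)
if [[ ! ${accession} =~ ^[0-9]{4}_[0-9]{3}$ ]]; then
     echo "That doesn't look like an accession number (expected something like 2013_072)."
fi
if [[ ! ${digital_media} =~ ^DM[0-9]{10}$ ]]; then
     echo "That doesn't look like a digital media number (expected something like DM0000000001)."
fi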

Creating Functions

You may have noted that the command used to test whether a disk is mounted is called twice. This is on purpose; it was a test I found helpful when running the virus checks, because the virus check runs and outputs a report even if the disk isn’t mounted. Occasionally disks won’t mount for various reasons. In such cases the resulting report states that it scanned no data, which is confusing because the disk itself could genuinely contain no data. Testing whether it’s mounted eliminates that confusion. The command is repeated after the disk has been unmounted, mainly because I found it easy to forget the unmount step, and testing helps reinforce good behavior. Since the command is used twice, it makes sense to make it a function rather than duplicate it.

check_mount () {
     #checks to see if disk is mounted or not
     mountpoint -q /mnt/iso && echo "mounted" || echo "not mounted"
}
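
Note that the function only reports; it doesn’t stop the script when the mount fails, which I come back to in the conclusion. One way I might handle that later is a guard placed right after the mount step. This is a sketch, not part of the current script:

#possible guard after the mount step: bail out if the disk image didn't mount
mountpoint -q /mnt/iso || { echo "Disk image did not mount; exiting."; exit 1; }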

Lastly, I created a function for the input variables. I’m sure there are prettier, more concise ways of writing this function, but since it’s still being refined and I’m still learning Bash scripting, I decided to leave it for now. I did want it placed in its own function because I’m planning to add code that will notify me and exit the program if the virus check is positive, or, if it’s negative, bag the disk image and corresponding files and copy them over to another server where they’ll wait for further processing (a rough sketch of that branch follows the function).

get_image () {
     #gets data from user, validates it    
     read -p "Please enter the accession number" accession
     until [ -d /mnt/transferfiles/${accession} ]; do
          read -p "Invalid. Please enter the accession number." accession
     done

     read -p "Please enter the digital media number" digital_media
     until [ -d /mnt/transferfiles/${accession}/${accession}_${digital_media} ]; do
          read -p "Invalid. Please enter the digital media number." digital_media
     done

     read -p  "Please enter the file extension, include the preceding period" extension
     until [ -e /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} ]; do
          read -p "Invalid. Please enter the file extension, including the preceding period" extension
     done
}
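
As a rough sketch of where that’s headed: clamscan exits with 0 when no virus is found, 1 when a virus is found, and another code on error, so the branch could hang off its exit status. The bagging and copy steps below are placeholders, not written yet:

#sketch: branch on clamscan's exit status (0 = clean, 1 = virus found, anything else = error)
scan_report="/mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}_scan_test.txt"
sudo clamscan -r /mnt/iso > "${scan_report}"
status=$?
if [ "${status}" -eq 0 ]; then
     echo "Scan clean; ready for bagging and transfer."
     #bagging and the copy to the processing server would go here
elif [ "${status}" -eq 1 ]; then
     echo "Virus found; see ${scan_report}."
     exit 1
else
     echo "clamscan reported an error (exit status ${status})."
     exit 1
fi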

Resulting (but not final!) Script

#!/bin/bash
#takes accession number, digital media number, and extension as variables to mount a disk image and run a virus check

check_mount () {
     #checks to see if disk is mounted or not
     mountpoint -q /mnt/iso && echo "mounted" || echo "not mounted"
}

get_image () {
     #gets disk image data from user, validates info
     read -p "Please enter the accession number" accession
     until [ -d /mnt/transferfiles/${accession} ]; do
          read -p "Invalid. Please enter the accesion number." accession
     done

     read -p "Please enter the digital media number" digital_media
     until [ -d /mnt/transferfiles/${accession}/${accession}_${digital_media} ]; do
          read -p "Invalid. Please enter the digital media number." digital_media
     done

     read -p  "Please enter the file extension, include the preceding period" extension
     until [ -e/mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} ]; do
          read -p "Invalid. Please enter the file extension, including the preceding period" extension
     done
}

get_image

#mount disk
sudo mount -o ro,loop,noatime /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} /mnt/iso

check_mount

#run virus check
sudo clamscan -r /mnt/iso > "/mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}_scan_test.txt"

#unmount disk
sudo umount /mnt/iso

check_mount
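
If you save the script as virus_check.sh (the name used in the earlier example), it only needs to be made executable once; after that it prompts for everything it needs:

chmod +x virus_check.sh
./virus_check.sh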

Conclusion

There’s a lot more I’d like to do with this script. In addition to what I’ve already mentioned, I’d love to enable it to run over a range of digital media numbers, since they are often sequential. It also doesn’t stop if the disk isn’t mounted, which is an issue. But I think it serves as a good example of how easy it is to take repetitive command line tasks and turn them into a script. Next time, I’ll write about the second phase of development, which will include combining this script with another one, virus scan reporting, bagging, and transfer to another server.
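
For the range idea, something along these lines might work, assuming the digital media numbers really are sequential and zero-padded to ten digits, and that the accession number and extension have already been collected. It’s a sketch, not part of the script yet:

#sketch: run the same steps over a sequential range of digital media numbers
read -p "Please enter the first digital media number (digits only)" first
read -p "Please enter the last digital media number (digits only)" last
for n in $(seq "${first}" "${last}"); do
     digital_media=$(printf "DM%010d" "${n}")
     echo "Processing ${accession}_${digital_media}${extension}"
     #the mount, check_mount, clamscan, and umount steps would go here
done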

Suggested References

An A-Z Index of the Bash command line for Linux

The books I used; both are good for basic command line work, but each includes only a small section on actual scripting:

Barrett, Daniel J., Linux Pocket Guide. O’Reilly Media, 2004.

Shotts, Jr., William E. The Linux Command Line: A Complete Introduction. No Starch Press, 2012.

The book I wished I used:

Robbins, Arnold and Beebe, Nelson H. F. Classic Shell Scripting. O’Reilly Media, 2005.

Notes

  1. Logical file transfers often arrive in bags, which are then validated and virus checked.
  2. Linux Pocket Guide