Python is a great programming language to know if you work in a library: it’s (relatively) easy to learn, its syntax is fairly clear and intuitive, and it has great, robust libraries for doing routine library tasks like hacking MARC records and working with delimited data, CSV files, JSON and XML. 1 In this post, I’ll describe a couple of projects I’ve worked on recently that have enabled me to Do Library Stuff Faster using Python. For reference, both of these scripts were written with Python 2.7 2 in mind, but could easily be adapted for other versions of Python.
Library Holdings Lookup with Beautiful Soup
Here’s a very common library dilemma: A generous and well-meaning patron, faculty member, or friend of the library has a large personal collection of books or other materials that they would like to bequeath to your library. They have carefully created a spreadsheet (or word document, or hand-written index) of all of the titles and authors (and maybe dates and ISBNs) in their library and want to know if you want the items.
Many libraries (for very good reason) have policies to just say “no” to these kinds of gifts. Well-meaning library gift givers don’t always realize that it’s an enormous amount of work for a library to evaluate materials and decide whether or not they can be added to the library’s collection. Beyond relevance to their users and condition of the items, libraries don’t want to accept gifts of duplicate copies of titles they already have in their collection due to limited shelf space.
It’s that final point – how to avoid adding duplicate titles to the collection – that led me to develop a very simple (and very hacky) script that takes a list of titles and authors and does a basic lookup to see if, at minimum, we already have those titles in the collection. Our ILS (Innovative Interfaces’ Millennium system) does not have a way to feed in a bunch of titles and generate a report of title matches – and I would venture to say that kind of functionality is probably not available in most library systems. Normally, when we need to check whether the library already has a set of titles, we’d sit down an unfortunate student worker and have them manually work through the list – copying and pasting titles into the library catalog and noting down any matches found. This work is incredibly boring for the student worker, and is a prime candidate for automation: the same task is done over and over again, with a very specific set of criteria as output (match or no match).
Python’s Beautiful Soup library is built for exactly this kind of task – instead of having your student worker scan a bunch of web pages in your catalog, the script can do approximately the same thing by sending search terms to your catalog via URL, and returning back page elements that can tell you whether or not any matches were found. In my example script, I’m using title and author elements, but you could modify this script to use other elements as long as they are indexed in your catalog – for example, you could send ISBNs, OCLC numbers, etc.
First, using Excel, I concatenate a list of titles and authors with a domain and other URL elements to search my library’s catalog. Here are a few examples of what the URLs look like:
http://suncat.csun.edu/search~S9/X?SEARCH=t:(Los%20Angeles%20Two%20Hundred)+and+a:(Lavender)&searchscope=9&SORT=DX
http://suncat.csun.edu/search~S9/X?SEARCH=t:(The%20Land%20of%20Journeys'%20Ending)+and+a:(Austin)&searchscope=9&SORT=DX
http://suncat.csun.edu/search~S9/X?SEARCH=t:(Mathematics%20and%20Sex)+and+a:(Ernest)&searchscope=9&SORT=DX
I’ll save the full list of these (in my example, I have over 1000 titles and authors to check) in a plain text file called advancedtitleauth.txt.
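The URL-building step does not have to happen in Excel. As a minimal sketch, assuming Python 3 and a hypothetical helper named build_search_url (it is not part of the original script), the same concatenation could look something like this, with the URL pattern copied from the examples above:

```python
from urllib.parse import quote

# URL pieces copied from the example URLs above; the helper name is hypothetical.
BASE = "http://suncat.csun.edu/search~S9/X?SEARCH="
SUFFIX = "&searchscope=9&SORT=DX"


def build_search_url(title, author):
    """Return a catalog search URL for one title/author pair."""
    # quote() percent-encodes spaces and other characters that are unsafe in a URL.
    return BASE + "t:(" + quote(title) + ")+and+a:(" + quote(author) + ")" + SUFFIX


url = build_search_url("Mathematics and Sex", "Ernest")
```

Note that quote() also encodes apostrophes (as %27), whereas the hand-built example URLs leave them as they are.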
Next, I start my Python script by importing the Beautiful Soup library, and some other libraries that are useful (urllib – a library built for fetching data by URLs; csv – a library for working with CSV files; and re, for working with regular expressions). You’ll probably have to install Beautiful Soup on your system first, which you can do if you have the pip Python package management system 3 installed by using sudo pip install beautifulsoup4 on your system’s command line.
from bs4 import BeautifulSoup
import urllib
import csv
import re
Then I create a blank array and define a CSV file into which to write the output of the script:
url_list = []
csv_out = csv.writer(open('output.txt', 'w'), delimiter = '\t')
The CSV file I’m creating will use tabs instead of commas as delimiters (hence delimiter = '\t'). Typically when working with library data, I prefer tab-delimited text files over comma-separated files, because you never know when a random comma is going to show up in a title and create a delimiter where there should not be one.
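To illustrate the point about commas, here is a small sketch (assuming Python 3, with an in-memory buffer standing in for output.txt and a made-up title that contains a comma). With a tab delimiter, the comma is just another character in the field:

```python
import csv
import io

buffer = io.StringIO()  # stands in for the output file
writer = csv.writer(buffer, delimiter="\t")

# Made-up example row: the title contains a comma.
writer.writerow(["Los Angeles, Two Hundred", "Lavender"])

line = buffer.getvalue()
# Splitting on tabs gives back exactly two fields, comma intact.
fields = line.rstrip("\r\n").split("\t")
```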
Then I open my list of URLs, read it, append each URL to my array, and feed each URL into Beautiful Soup:
try:
    f = open('advancedtitleauth.txt', 'rb')
    for line in f:
        url_list.append(line)
        r = urllib.urlopen(line).read()
        soup = BeautifulSoup(r)
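For readers on Python 3, where urllib.urlopen has moved to urllib.request.urlopen, the loading step might be sketched as follows. The function names load_urls and fetch_page are hypothetical, and an in-memory sample stands in for advancedtitleauth.txt. One detail worth noting is that each line read from a file ends with a newline, which this sketch strips before the URL is used:

```python
import io
from urllib.request import urlopen


def load_urls(lines):
    """Return the non-empty lines with surrounding whitespace removed."""
    return [line.strip() for line in lines if line.strip()]


def fetch_page(url):
    """Fetch one catalog page and return its raw bytes (needs network access)."""
    with urlopen(url) as response:
        return response.read()


# In-memory stand-in for advancedtitleauth.txt, including a blank line.
sample_file = io.StringIO(
    "http://suncat.csun.edu/search~S9/X?SEARCH=t:(Mathematics%20and%20Sex)"
    "+and+a:(Ernest)&searchscope=9&SORT=DX\n"
    "\n"
)
url_list = load_urls(sample_file)
```

Each entry in url_list could then be passed to fetch_page, and the result handed to BeautifulSoup.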
The script fetches the web page at each URL with urllib and hands it to Beautiful Soup. Now that I have the web pages, Beautiful Soup can parse out specific features of each page. In my case, my catalog returns a page with a single record if a match is found, and a browsable index when a match is not found (e.g., your title would be here, but it isn’t, so here’s some stuff with titles that would be nearby). I can use Beautiful Soup to return page elements that tell me whether a match was found, and if a match is found, to write the permanent URL of the match for later evaluation. This bit of code looks for an HTML div element with the class “bibRecordLink” on the page, which only appears when a single match is found. If this div is present on the page, the script grabs the link and drops it into the output file.
try:
    link = soup.find_all("div", class_="bibRecordLink")
    directlink = str(link)
    directlink = "http://suncat.csun.edu" + directlink[36:]
In the code above, [36:] is Python’s slice notation: it returns the part of the string from index 36 (counting from zero) through to the end. In my case, that part of the string contains the bibliographic ID number of the item, which allows me to construct a permalink.
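As a quick illustration of how slicing counts, here is a tiny sketch using a made-up string (the real one comes from calling str() on the Beautiful Soup result):

```python
# Hypothetical string for illustration only.
text = "PREFIX/record=b1234567"

tail = text[6:]  # from index 6 (the 7th character) to the end
head = text[:6]  # the first six characters, indices 0 through 5
```

Because indexes start at zero, text[6:] begins at the seventh character, not the sixth.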
If a title/author search results in multiple possible matches – that is, we might have multiple copies, or the title/author combo is too vague to land on just one item, the page that displays in our catalog shows a browsable list of brief record info. In my script, I just grab the top result:
try:
    briefcit = soup.find_all("span", class_="briefcitTitle")
    bestmatch = str(briefcit)
    sep = "&"
    # keep only the text before the first "&"
    bestmatch = bestmatch.split(sep, 1)[0]
    bestmatch = "http://suncat.csun.edu/" + bestmatch[39:]
In the code above, Beautiful Soup finds all the <span> elements with the class “briefcitTitle”, the script returns the first one, and again returns a URL stored in the bestmatch variable.
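Slicing the str() output at a fixed offset works as long as the catalog’s markup never changes. As a hedged alternative (not the method used in the script above), the link could be read from the href attribute of the first anchor inside the matching element. The sketch below assumes Python 3 with beautifulsoup4 installed, and the HTML snippets are invented for illustration, since the real catalog markup is not shown here. The function name classify_result is hypothetical.

```python
from bs4 import BeautifulSoup

CATALOG = "http://suncat.csun.edu"


def classify_result(html):
    """Return (kind, url) where kind is 'direct', 'best match' or 'nomatch'."""
    soup = BeautifulSoup(html, "html.parser")

    # A single-record page: look for the permanent-link div.
    div = soup.find("div", class_="bibRecordLink")
    if div is not None:
        anchor = div.find("a", href=True)
        if anchor is not None:
            return ("direct", CATALOG + anchor["href"])

    # A browse list: take the first brief citation, i.e. the top result.
    span = soup.find("span", class_="briefcitTitle")
    if span is not None:
        anchor = span.find("a", href=True)
        if anchor is not None:
            return ("best match", CATALOG + anchor["href"])

    return ("nomatch", "")


# Invented markup; the real pages may be structured differently.
direct_html = '<div class="bibRecordLink"><a href="/record=b1234567">Permalink</a></div>'
browse_html = '<span class="briefcitTitle"><a href="/record=b7654321">A title</a></span>'

direct = classify_result(direct_html)
browse = classify_result(browse_html)
```

Reading the attribute avoids depending on exactly how many characters precede the link in the serialized HTML.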
You can see a sample output of my lookup script here. You can see that for each entry, I include publication information, direct links, or a best match link if the elements are found. If none of the elements are found for a lookup URL, the line reads:
nopub nolink nomatch
We can now divide the output file into “no match” entries, direct links, or best match links. Direct links and best match links will need to be double-checked by a student worker to make sure they actually represent the item we looked up, including the date and edition. The “no match” entries represent titles we don’t have in our collection, so those can be evaluated more closely to determine if we want them.
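Dividing the output can itself be scripted. The following is a minimal sketch, assuming Python 3 and assuming each output line has three tab-separated columns (publication info, direct link, best match link) with the placeholders nopub, nolink and nomatch shown above. The sample rows and the function name split_results are made up for illustration.

```python
import csv


def split_results(lines):
    """Group tab-delimited output rows by the kind of match they record."""
    groups = {"direct": [], "best match": [], "no match": []}
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        link, match = row[1], row[2]
        if link != "nolink":
            groups["direct"].append(row)
        elif match != "nomatch":
            groups["best match"].append(row)
        else:
            groups["no match"].append(row)
    return groups


# Invented sample rows in the assumed three-column layout.
sample = [
    "nopub\tnolink\tnomatch",
    "1982\thttp://suncat.csun.edu/record=b0000001\tnomatch",
    "nopub\tnolink\thttp://suncat.csun.edu/record=b0000002",
]
groups = split_results(sample)
```

In practice the lines would come from open('output.txt') rather than a list.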
The script certainly has room for improvement; I could write in a lot more functionality to better identify publication information, for example, to possibly reduce or eliminate the need for manual review of direct or partial matches. But the return on investment for this script is fairly high for a 37-line script written in an afternoon; we can re-use this dozens of times, and hopefully save countless hours of student worker boredom (and re-assign those student workers to more complex and meaningful tasks!).
Rudimentary Keyword Frequency Analysis
This second example involves, again, dealing with a task that could be done manually, but can be done much more quickly with a script.
My university needed to submit data for the AASHE Sustainability Tracking, Assessment, and Rating System (STARS) Report (https://stars.aashe.org/), which requires the analysis of data from campus course offerings as well as research output by faculty. To submit the report, we needed to know how many courses we offer in sustainability (defined by AASHE “in an inclusive way, encompassing human and ecological health, social justice, secure livelihoods and a better world for all generations”) and how many faculty do research in sustainability. This project was broken up into two components: Analysis of research data by faculty and analysis of course data.
Before we even started analyzing research or course data, we needed to define an approach to identify what counts as “sustainability.” Thankfully, there was precedent from the University of North Carolina, which had developed a list of sustainability-related keywords used to search against faculty research output. 4 We adopted this list of keywords to look up in faculty research articles and course descriptions.
Research data by faculty
We don’t have a comprehensive inventory of research done by faculty at our campus. Because we were on a somewhat tight deadline to do the analysis, I came up with a very quick and dirty way of getting a lot of citations by using Web of Science. Web of Science enables you to do a search for research published by affiliates of your university. I was able to retrieve about 8,000 citations written by current or former faculty associated with my institution going back about 15 years. Of course, we cannot consider the data in Web of Science to be fully representative of faculty research output, but it seemed like a good start at least.
Web of Science enables you to export 500-record chunks of metadata, so it took an hour or so to export the metadata in several pieces (see Figure 1 for my Web of Science export criteria).
Once I had all of the metadata for the 8,000 or so records written by faculty at my institution, I combined them into a single file. Next, I needed to identify records that had sustainability keywords in either the title or abstract.
First, I created an array of all of the keywords, and turned that list into a Python set. A Python set is different from a list in that the order of terms does not matter, and is ideal for checking membership of items in the set against strings (or in my case, a bunch of citation and abstract strings).
word_list = 'Agriculture,Alternative,Applied%Science [..snip..]'
word_set = set(word_list.split(','))
Note the % in “Applied%Science”. For some reason my set lookup couldn’t match terms with spaces. My hacky solution was to replace spaces with % characters, and then do a find/replace in my spreadsheet of Web of Science data to replace all keyword matches with spaces (such as Applied Science) with percentage signs (Applied%Science). Luckily, there were only 10 or so keywords on the list with spaces, so the find/replace did not take very long. Note also that the set match lookup is case sensitive, so I actually found it easier to just turn everything to lower case in my Web of Science spreadsheet and match on the lower case term (though I kept both upper and lower case terms in my lookup set).
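A likely explanation for the trouble with spaces is that split() breaks text on whitespace, so a two-word keyword never appears as a single token that could be a member of the set. The sketch below (assuming Python 3, with a tiny stand-in keyword set and an invented abstract) shows both the miss and the % workaround, with everything lower-cased first:

```python
# Tiny stand-in for the full keyword set, already lower-cased.
keywords = {"agriculture", "applied%science"}

# Invented abstract text for illustration.
abstract = "Advances in Applied Science and agriculture"

# Plain split: "applied" and "science" become separate tokens.
plain_hits = set(abstract.lower().split()) & keywords

# Workaround: join the two-word term with % before splitting.
patched = abstract.lower().replace("applied science", "applied%science")
patched_hits = set(patched.split()) & keywords
```

Punctuation attached to a word (for example a trailing period) would also prevent a match, so stripping punctuation before splitting may be worth considering.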
Then I checked to see if any words were in the title, abstract, or both, and constructed my query so that a new column would be added to an output spreadsheet indicating *which* matches were found:
for row in csv_reader:
    # row[22] is the abstract (23rd cell); row[8] is the title (9th cell)
    if (set(row[22].split()) & word_set) & (set(row[8].split()) & word_set):
        csv_out.writerow(["title & abstract match",row,row,row,row,(set(row[8].split()) & word_set), (set(row[22].split()) & word_set)])
If any of the words in my set were found in the 23rd cell of the spreadsheet (the abstract) and the 9th cell of the spreadsheet (the title), then a row would be written to an output sheet indicating that sustainability keywords were found in the title and abstract, pulling in some citation details about the article (including author names), as well as a cell with a list of the matches found for both title and abstract fields.
I did similar conditionals for rows that found, for example, just a title match, or just an abstract match:
    elif set(row[8].split()) & word_set:
        csv_out.writerow(["title match",row,row,row,row, (set(row[8].split()) & word_set)])
    elif set(row[22].split()) & word_set:
        csv_out.writerow(["abstract match",row,row,row,row, (set(row[22].split()) & word_set)])
And that is pretty much the whole script! With the output file, I did have to do a bit more work to identify current faculty at my institution, but I basically used the same set matching method above using a list provided by HR.
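The conditionals above can be pulled together into one small function. This is a sketch under stated assumptions rather than the original script: it assumes Python 3, takes the title and abstract positions from the description above (9th and 23rd cells), lower-cases the text before matching, and adds a hypothetical “no match” label. The function names and sample data are invented.

```python
TITLE_COL = 8      # 9th cell of the export, per the description above
ABSTRACT_COL = 22  # 23rd cell of the export

word_set = {"agriculture", "sustainability"}  # tiny stand-in keyword set


def classify_row(row, keywords):
    """Return (label, title_hits, abstract_hits) for one exported row."""
    title_hits = set(row[TITLE_COL].lower().split()) & keywords
    abstract_hits = set(row[ABSTRACT_COL].lower().split()) & keywords
    if title_hits and abstract_hits:
        label = "title & abstract match"
    elif title_hits:
        label = "title match"
    elif abstract_hits:
        label = "abstract match"
    else:
        label = "no match"  # hypothetical label, not in the original output
    return label, title_hits, abstract_hits


def make_row(title, abstract):
    """Build a 23-cell row with only the title and abstract filled in."""
    row = [""] * 23
    row[TITLE_COL] = title
    row[ABSTRACT_COL] = abstract
    return row


both = classify_row(make_row("Urban agriculture", "A study of sustainability"), word_set)
title_only = classify_row(make_row("Urban agriculture", "A study of cities"), word_set)
```

One design choice to be aware of: this sketch uses `and`, so a row counts as a title and abstract match when each field contains at least one keyword, even if they are different keywords. Intersecting the two match sets with `&` instead would require the same keyword to appear in both fields.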
Because the STARS report also required analysis of courses related to sustainability, I also created a very similar script to look up key terms found in course titles and descriptions.
Of course, just because a research article or course description has a keyword, or even multiple keywords, does not mean it’s relevant at all to sustainability. One of the keywords identified as related to sustainability, for example, is “invest”, which basically meant that almost every finance class returned as a match. Manual work was required to review the matches and weed out false positives, but because the keyword matching was already done and we could easily see what matches were found, this work was done fairly quickly. We could, for example, identify courses and research articles that only had a single keyword match. If that single keyword match was something like “sustainability” it was likely a sustainability-related course and would merit further review; if the single keyword match was something like “systems” it could probably be weeded out.
As with my author/title lookup script, if I had a bit more time to fuss with the script, I could have probably optimized it further (for example, by assigning weight to more sustainability-related keywords to help calculate a relevance score). But again, a short amount of time invested in this script saved a huge amount of time, and enabled us to do something we would not otherwise have been able to do.
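The weighting idea could be sketched roughly as follows. The weights here are entirely hypothetical and chosen only to show the mechanics: a keyword like “sustainability” counts for more than a broad one like “invest” or “systems”, and any keyword without an explicit weight falls back to a default.

```python
# Hypothetical weights: higher means more specific to sustainability.
WEIGHTS = {"sustainability": 3.0, "invest": 0.5, "systems": 0.5}
DEFAULT_WEIGHT = 1.0


def relevance_score(matches):
    """Sum the weights of the matched keywords for one record."""
    return sum(WEIGHTS.get(word, DEFAULT_WEIGHT) for word in matches)


strong = relevance_score({"sustainability"})
weak = relevance_score({"systems"})
```

Records could then be sorted by score so that reviewers see the likeliest candidates first and the probable false positives last.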
If you’re interested in learning more about Python and its syntax, and don’t have a lot of Python experience, a good (free) place to start is Google’s Python Class, created by Nick Parlante for Google (I actually took a similar class several years ago, also created by Dr. Parlante, through Coursera, which looks to still be available). If you want to get started using Python right away and don’t want to have to fuss with installing it on your computer, you can check out the interactive course How to Think Like a Computer Scientist created by Brad Miller and David Ranum at Luther College. For more examples of usage in Python for library work, check out Charles Ed Hill, Heidi Frank, and Mark Pernotto’s Python chapter in the just-released LITA Guide The Librarian’s Introduction to Programming Languages, edited by Beth Thomsett-Scott (full-disclosure: I am a contributor to this book).
1. Working with CSV files and JSON Data. In Sweigart, Al (2015). Automate the Boring Stuff with Python: Practical Programming for Total Beginners. San Francisco: No Starch Press.
2. For an explanation of the difference between Python 2 and 3, see https://wiki.python.org/moin/Python2orPython3. The reason I use Python 2.7 for these scripts is because of my computing environment (in which Python 2 is installed by default), but if you have Python 3 installed on your computer, note that syntactical changes in Python 3 mean that many Python 2.x scripts may require revision in order to work.
3. For instructions on using Pip with your Python installation, see: https://pip.pypa.io/en/latest/installing/
4. Blank-White, Kristen. 2014. Researching the Researchers: Developing a Sustainability Research Inventory. Presented at the 2014 AASHE Conference and Expo, Portland OR. http://www.aashe.org/files/2014conference/presentations/secondPresentationUpload/Blank-White-Kristin_Researching-the-Researchers-Developing-a-Sustainability-Research-Inventory.pdf.