Hacking in Python with PyMARC
Posted: October 15, 2014 | Author: Lauren Magnuson | Filed under: coding, hacking | Tags: csv, KBART, MARC, Python
The pymarc Python library is an extremely handy tool for writing scripts that read, write, edit, and parse MARC records. I was first introduced to pymarc through Heidi Frank’s excellent Code4Lib Journal article on using pymarc in conjunction with MARCedit. [1] Here at ACRL TechConnect, Becky Yoose has also written about some cool applications of pymarc.
In this post, I’m going to walk through a couple of applications of pymarc: making comma-separated (.csv) files for batch loading into DSpace, and making files in the KBART format (e.g., for OCLC Knowledge Base purposes). I know what you’re thinking – why write a script when I can use MARCedit for that? Among MARCedit’s many indispensable features is its Export Tab-Delimited Data feature. [2] But there are a couple of use cases where scripts can come in handy:
- Your MARC records lack data that you need in the tab-delimited or .csv output – for example, a field used for processing in whatever system you’re loading the data into that isn’t found in the MARC record source.
- Your delimited file output requires things like empty fields or parsed, modified fields. MARCedit is great for exporting whole MARC fields and subfields into delimited files, but you may need to pull apart a MARC element or parse a MARC element out with some additional data. Pymarc is great for this.
- You have to process a lot of records frequently, and just want something tailor-made for your use case. While you can certainly use MARCedit to save a frequently used export template, editing that saved template when you need to change the output parameters of your delimited file isn’t easy. With a Python script, you can easily edit your script if your delimited file requirements change.
Some General Notes about Python
Python is a useful scripting language to know for a lot of reasons, and for small scripting projects it can have a fairly shallow learning curve. In order to start using Python, you’ll need to set up your development environment. Macs already come with Python installed, but to double-check which version you have installed, launch Terminal and type python -V (with a capital V; a lowercase -v starts the interpreter in verbose mode instead). In a Windows environment, you’ll need to do a few more steps, detailed here. Full Python documentation for 2.x can be found here (though personally, I find it a little dense at times!), and Codecademy features some pretty extensive tutorials to help you learn the syntax quickly. Google also has some pretty extensive tutorials on Python. Personally, I find it easiest to learn when I have a project that actually means something to me, so if you’re familiar with MARC records, just downloading the Python scripts mentioned in this post and learning how to run them can be a good start.
Spacing and indentation in Python are very important, so if you’re getting errors running scripts, make sure that your conditional statements are indented properly. [3] The version of Python on your machine also makes a really big difference, and it will determine whether your code runs successfully. These examples have all been tested with Python 2.6 and 2.7, but not with Python 3 or above. I find that Python 2.x has more help resources out on the web, which is nice to be able to take advantage of when you’re first starting out.
Use Case 1: MARC to KBART
The complete script described below, along with some sample MARC records, is on GitHub.
The KBART format is a NISO/United Kingdom Serials Group (UKSG) initiative designed to standardize information for use with knowledge base systems. It is a series of standardized fields that can be used to identify serial coverage (e.g., start date and end date), URLs, publisher information, and more. [4] Notably, it’s the required format for adding and modifying customized collections in OCLC’s Knowledge Base. [5]
In this use case, I’m using OCLC’s modified KBART file – most of the fields are KBART standard but a few fields are specific to OCLC’s Knowledge Base, e.g., oclc_collection_name. [6] Typically, these KBART files would be created either manually or by using some VLOOKUP Excel magic with an existing KBART file. [7] In my use case, I needed to batch migrate a bunch of data stored in MARC records to the OCLC Knowledge Base, and these particular records didn’t always correspond neatly to existing OCLC Knowledge Base Collections. For example, one collection we wanted to manage in the OCLC Knowledge Base was the library’s print holdings, so that these titles would be displayed in OCLC’s A-Z Journal lookup tool.
First, I’ll need a few helper libraries for my script:
import csv
from pymarc import MARCReader
from os import listdir
from re import search
Then, I’ll declare the source directory where my MARC records are (in this case, a folder called marc in the same directory as the script) and instruct the script to go find all the .mrc files in that directory. Note that this enables the script to process either a single file of multiple MARC records, or multiple distinct files, all at once.
# change the source directory to whatever directory your .mrc files are in
SRC_DIR = 'marc/'
The script identifies MARC records by looking for all files ending in .mrc using the Python filter function. Within the filter function, lambda creates a one-off anonymous function to define the filter parameters: [8]
# get a list of all .mrc files in source directory
# (the pattern escapes the dot and is anchored, so only names ending in .mrc match)
file_list = filter(lambda x: search(r'\.mrc$', x), listdir(SRC_DIR))
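For comparison, here is a minimal sketch of the same file-listing step without a regular expression. The function name list_marc_files is my own, not part of the original script, and the case-insensitive match is an assumption about what you might want:

```python
import os

def list_marc_files(src_dir):
    """Return the sorted names of regular files in src_dir ending in .mrc."""
    return sorted(
        name for name in os.listdir(src_dir)
        # endswith() avoids regex pitfalls such as an unescaped dot
        if name.lower().endswith('.mrc')
        and os.path.isfile(os.path.join(src_dir, name))
    )
```

Calling list_marc_files('marc') would then stand in for the filter/lambda line, and it returns a real list under both Python 2 and Python 3.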
Now I’ll define the output parameters. I need a tab-delimited file, not a comma-delimited file, but either way I’ll use the csv.writer function to create a .txt file and define the tab delimiter (\t). I also don’t really want the fields quoted unless there’s a special character or something that might cause a problem reading the file, so I’ll set quoting to minimal:
# create tab delimited text file that quotes if special characters are present
csv_out = csv.writer(open('kbart.txt', 'w'), delimiter = '\t', quotechar = '"', quoting = csv.QUOTE_MINIMAL)
And I’ll also create the header row, which includes the KBART fields required by the OCLC Knowledge Base:
# create the header row
csv_out.writerow(['publication_title', 'print_identifier', 'online_identifier',
    'date_first_issue_online', 'num_first_vol_online', 'num_first_issue_online',
    'date_last_issue_online', 'num_last_vol_online', 'num_last_issue_online',
    'title_url', 'first_author', 'title_id', 'coverage_depth', 'coverage_notes',
    'publisher_name', 'location', 'title_notes', 'oclc_collection_name',
    'oclc_collection_id', 'oclc_entry_id', 'oclc_linkscheme', 'oclc_number',
    'action'])
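To see what minimal quoting actually does, here is a small self-contained sketch that writes to an in-memory buffer instead of kbart.txt. It is written for Python 3 (the scripts in this post target Python 2), and the helper name rows_to_tsv and the sample rows are made up for illustration:

```python
import csv
import io

def rows_to_tsv(header, rows):
    """Serialize rows as tab-delimited text, quoting only when needed."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter='\t', quotechar='"',
                        quoting=csv.QUOTE_MINIMAL, lineterminator='\n')
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

# Hypothetical sample data: the second title contains a tab character
sample = rows_to_tsv(
    ['publication_title', 'print_identifier'],
    [['Journal of Examples', '1234-5678'],
     ['Title with\ta tab', '']],
)
print(sample)
```

Only the title containing a tab should come out wrapped in double quotes; the ordinary title and the empty identifier are written bare.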
Next, I’ll start a loop for writing a row into the tab-delimited text file for every MARC record found.
for item in file_list:
    # MARC is a binary format, so open each file in binary mode
    fd = open(SRC_DIR + '/' + item, 'rb')
    reader = MARCReader(fd)
By default, we’ll need to set each field’s value to blank (''), so that errors are avoided if the value is not present in the MARC record. We can set all the values to blank to start with in one statement:
    for record in reader:
        publication_title = print_identifier = online_identifier = \
            date_first_issue_online = num_first_vol_online = \
            num_first_issue_online = date_last_issue_online = \
            num_last_vol_online = num_last_issue_online = title_url = \
            first_author = title_id = coverage_depth = coverage_notes = \
            publisher_name = location = title_notes = \
            oclc_collection_name = oclc_collection_id = oclc_entry_id = \
            oclc_linkscheme = oclc_number = action = ''
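If the long chained assignment feels unwieldy, one alternative (not what the original script does) is to keep each row in a dictionary and let csv.DictWriter handle the column order. This is a sketch for Python 3, with the field list shortened to three columns for readability:

```python
import csv
import io

# Shortened, illustrative field list; the real header has 23 columns
KBART_FIELDS = ['publication_title', 'print_identifier', 'title_url']

def blank_row(fieldnames):
    """Return a dict with every field preset to an empty string."""
    return dict.fromkeys(fieldnames, '')

row = blank_row(KBART_FIELDS)
row['publication_title'] = 'Journal of Examples'  # hypothetical value

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=KBART_FIELDS, delimiter='\t',
                        lineterminator='\n')
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```

The trade-off is that fields are then addressed as row['title_url'] rather than as bare variables, but adding or reordering columns only means editing the field list.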
Now I can start pulling in data from MARC fields. At its simplest, for each field defined in my CSV, I can pull data out of MARC fields using a construction like this:
        # title_url
        if record['856'] is not None:
            title_url = record['856']['u']
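Because this check-then-read pattern repeats for every field, it can be wrapped in a small helper. The sketch below is my own and rests on an assumption worth verifying: depending on the pymarc version installed, a missing field or subfield may come back as None or may raise a KeyError, so the helper tolerates both. To keep the example runnable without pymarc, it is exercised against a plain dictionary stand-in rather than a real record:

```python
def first_subfield(record, tag, code, default=''):
    """Return subfield `code` of the first `tag` field, or `default`."""
    try:
        field = record[tag]
        value = field[code] if field is not None else None
    except KeyError:
        # Raised by a dict, and possibly by some pymarc versions (assumption)
        return default
    return value if value is not None else default

# Stand-in for a parsed record: a dict of dicts, NOT a real pymarc object
fake_record = {'856': {'u': 'http://example.org/journal'}}

print(first_subfield(fake_record, '856', 'u'))
print(first_subfield(fake_record, '245', 'a', default='[no title]'))
```

With a helper like this, each column becomes a one-liner such as title_url = first_subfield(record, '856', 'u').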
I can use generic Python string parsing tools to clean up fields. For example, if I need to strip the ending / out of a MARC 245$a field, I can use .rsplit to split the string on that final slash (/) and keep the first piece, which is everything before it:
        # publication_title
        if record['245'] is not None:
            # rsplit returns a list; [0] is the text before the final slash
            publication_title = record['245']['a'].rsplit('/', 1)[0].strip()
            if record['245']['b'] is not None:
                publication_title = publication_title + " " + record['245']['b']
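The same cleanup can be pulled out into a plain string function, which makes it easy to try on sample values without any MARC file. This is a sketch of one possible approach, not the original script's method; the name clean_title and the set of trailing characters stripped (common ISBD punctuation) are my assumptions:

```python
def clean_title(title_a, subtitle_b=None):
    """Join 245$a and 245$b, then drop trailing ISBD punctuation."""
    parts = [p.strip() for p in (title_a, subtitle_b) if p]
    title = ' '.join(parts)
    # Strip trailing spaces and the characters / : ; , =
    return title.rstrip(' /:;,=')

print(clean_title('Journal of examples /'))
print(clean_title('Journal of examples :', 'a quarterly review /'))
```

Unlike rsplit, this version leaves slashes in the middle of a title alone and only trims punctuation at the very end.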
Also note that once you’ve declared a variable value (like publication_title) you can re-use that variable name to add more onto the string, as I’ve done with the 245$b subtitle above.
As is usually the case for MARC records, the trickiest business has to do with parsing out serial ranges. The KBART format is really designed to capture, as accurately as possible, thresholds for beginning and ending dates. MARC makes this…complicated. Many libraries might use the 866 summary field to establish summary ranges, but there are varying local practices that determine what these might look like. In my case, the minimal information I was looking for included beginning and ending years, and the records I was processing with the script stored summary information fairly cleanly in the 863 and 866 fields:
        # date_first_issue_online
        if record['863'] is not None:
            # rsplit returns a list; [0] is the text before the last hyphen
            date_first_issue_online = record['863']['a'].rsplit('-', 1)[0]
        # date_last_issue_online
        if record['866'] is not None:
            # [-1] is the text after the last hyphen
            date_last_issue_online = record['866']['a'].rsplit('-', 1)[-1]
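Splitting on a hyphen works when the summary statement is as tidy as mine was. For messier holdings strings, a regular expression that looks for four-digit years may be more forgiving. The following is a rough sketch under my own assumptions (years between 1500 and 2099, and a trailing hyphen meaning "still being received"); the function name coverage_years is hypothetical:

```python
import re

# Four-digit years from 1500 through 2099 (an assumed, adjustable range)
YEAR = re.compile(r'\b(1[5-9]\d\d|20\d\d)\b')

def coverage_years(summary):
    """Return (first_year, last_year) found in a holdings summary string."""
    years = YEAR.findall(summary or '')
    if not years:
        return ('', '')
    if summary.rstrip().endswith('-'):
        # Open-ended range such as "v.1 (1995)-": leave the end year blank
        return (years[0], '')
    return (years[0], years[-1])

print(coverage_years('v.1 (1995)-v.20 (2014)'))
print(coverage_years('v.3 (1987)-'))
```

Anything this function cannot make sense of comes back as a pair of empty strings, which leaves those KBART columns blank for manual review.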
Where further adjustments were necessary, and I couldn’t easily account for the edge cases programmatically, I imported the KBART file into OpenRefine for further cleanup. A sample KBART file created by this script is available here.
Use Case 2: MARC to DSpace Batch Upload
Again, the complete script for transforming MARC records for DSpace ingest is on GitHub.
This use case is very similar to the KBART scenario. We use DSpace as our institutional repository, and we had about 2000 historical thesis and dissertation MARC records to transform and ingest into DSpace. DSpace facilitates batch-uploading metadata as CSV files according to the RFC 4180 format, which basically means all the field elements use double quotes. [9] While the fields being pulled from are different, the structure of the script is almost exactly the same.
When we first define the CSV output, we need to ensure that quoting is used for all elements – csv.QUOTE_ALL:
# create the output file with every element wrapped in double quotes
csv_out = csv.writer(open('output/theses.txt', 'w'), delimiter = '\t', quotechar = '"', quoting = csv.QUOTE_ALL)
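To make the difference between the two quoting modes concrete, this short Python 3 sketch writes the same made-up rows both ways, using an in-memory buffer and the csv module's default comma delimiter. The helper name to_csv and the sample thesis title are illustrative only:

```python
import csv
import io

def to_csv(rows, quoting):
    """Serialize rows to CSV text using the given quoting mode."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting, lineterminator='\n').writerows(rows)
    return buf.getvalue()

# Hypothetical rows; the title contains embedded double quotes
rows = [['dc.title', 'dc.date.issued'],
        ['A "quoted" thesis title', '1975']]

quoted_all = to_csv(rows, csv.QUOTE_ALL)
quoted_minimal = to_csv(rows, csv.QUOTE_MINIMAL)
print(quoted_all)
print(quoted_minimal)
```

In both modes the embedded double quotes are escaped by doubling them; QUOTE_ALL additionally wraps every element, including the header cells and the year.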
The other nice thing about using Python for this transformation is the ability to program in static text that would be the same for all the lines in the CSV file. For example, I pieced together a more friendly display of the department the thesis was submitted to like so:
        # dc.contributor.department
        if record['690'] is not None and record['690']['x'] is not None:
            # rsplit returns a list; [0] drops the trailing period
            dccontributordepartment = 'UniversityName. Department of ' + record['690']['x'].rsplit('.', 1)[0]
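As with the title cleanup, the static-text step can be written as a standalone function and tested on plain strings. This is a sketch with a hypothetical name (department_label); 'UniversityName' is the same placeholder used in the snippet above:

```python
def department_label(subfield_x, university='UniversityName'):
    """Build a display string such as 'UniversityName. Department of Biology'."""
    if not subfield_x:
        return ''
    # Remove surrounding whitespace and any trailing period
    dept = subfield_x.strip().rstrip('.').strip()
    return '{0}. Department of {1}'.format(university, dept)

print(department_label('Biology.'))
```

Returning an empty string for a missing subfield keeps the column blank rather than writing a dangling "Department of".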
You can view a sample output file created by this script here on GitHub.
Other PyMARC Projects
There are lots of other, potentially more interesting things that can be done with pymarc, although certainly one of its most common applications is converting MARC to more flexible formats, such as MARCXML. [10] If you’re interested in learning more, join the pymarc Google Group, where you can often get help from the original pymarc developers (Ed Summers, Mark Matienzo, Gabriel Farrell and Geoffrey Spear).
1. Frank, Heidi. “Augmenting the Cataloger’s Bag of Tricks: Using MarcEdit, Python, and pymarc for Batch-Processing MARC Records Generated from the Archivists’ Toolkit.” Code4Lib Journal, 20 (2013): http://journal.code4lib.org/articles/8336
2. https://www.youtube.com/watch?v=qkzJmNOvY00
3. For the specifications on indentation, take a look at http://legacy.python.org/dev/peps/pep-0008/#id12
4. http://www.uksg.org/kbart/s5/guidelines/data_fields
5. http://oclc.org/content/dam/support/knowledge-base/kb_modify.pdf
6. A sample blank KBART Excel sheet for OCLC’s Knowledge Base can be found here: http://www.oclc.org/support/documentation/collection-management/kb_kbart.xlsx. KBART files must be saved as tab-delimited .txt files prior to upload, but are obviously more easily edited manually in Excel.
7. If you happen to be in need of this for OCLC’s KB, or are in need of comparing two Excel files, here’s a walkthrough of how to use VLOOKUP to select owned titles from a large OCLC KBART file: http://youtu.be/mUhkMzpPnBE
8. A nice tutorial on using lambda and filter together can be found here: http://www.u.arizona.edu/~erdmann/mse350/topics/list_comprehensions.html#filter
9. https://wiki.duraspace.org/display/DSDOC18/Batch+Metadata+Editing
10. http://docs.evergreen-ils.org/2.3/_migrating_your_bibliographic_records.html