POST: Automating Digital Archival Processing at Johns Hopkins University (post updated)

Elizabeth England and Eric Hanson (both Johns Hopkins University) have authored a guest post on the Signal blog about the project England is working on during her National Digital Stewardship Residency at Johns Hopkins. England manages a digital preservation project focused on “a large backlog (about 50 terabytes) of photographs documenting the university’s born-digital visual history.”

England collaborated with Hanson, Digital Content Metadata Specialist at Johns Hopkins, to develop and automate a processing workflow for the photograph collection; as England describes: “I’ve relied heavily on the electronic records accessioning workflow written by my mentor, the Libraries’ Digital Archivist Lora Davis, and worked with her to adapt the workflow for the Homewood Photography collection.” England relates:

All this to say, the collection is being processed at a more granular level than may be expected for its size. From the beginning, I knew that using scripts to manage bulk actions would be a huge time-saver, but as someone with essentially zero scripting experience, I didn’t have a good sense of what could be automated. While reviewing the workflow with Lora, I showed her how time-consuming it was going to be to manually move the .NEF files in order to nest them directly below the descriptive job titles. She recommended I look into using a script to collapse the directory structures, and although I found some scripts that accomplish this, none could manage the variety of disc directory structures…I described the situation to Eric Hanson, the Digital Content Metadata Specialist here at Johns Hopkins, knowing that he had experience with Python and might be able to help.
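England and Hanson’s actual scripts (linked from the post) handle the full variety of disc directory layouts; as a rough illustration only, a minimal Python sketch of “collapsing” a directory structure — moving nested .NEF files up to sit directly under a job folder and removing the emptied subdirectories — might look like this (the function name and behavior here are hypothetical, not the project’s code):

```python
import shutil
from pathlib import Path

def collapse_directory(job_dir: Path, pattern: str = "*.NEF") -> None:
    """Move files matching `pattern` from nested subdirectories up to
    sit directly under `job_dir`, then delete the emptied folders.

    Hypothetical sketch: the project's real scripts accommodate many
    disc directory structures; this shows only the basic idea.
    """
    for f in list(job_dir.rglob(pattern)):
        if f.parent == job_dir:
            continue  # already at the top level
        target = job_dir / f.name
        if target.exists():
            # avoid clobbering duplicates: prefix with the parent folder name
            target = job_dir / f"{f.parent.name}_{f.name}"
        shutil.move(str(f), str(target))
    # remove now-empty subdirectories, deepest paths first
    for d in sorted(job_dir.rglob("*"), key=lambda p: len(p.parts), reverse=True):
        if d.is_dir() and not any(d.iterdir()):
            d.rmdir()
```

A one-off helper like this captures why scripting pays off at 50-terabyte scale: a bulk move that would take days by hand runs in minutes.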

In addition to documenting the approach and process that England and Hanson developed for the project, the post reflects on the importance of collaboration when engineering new processes and encountering unfamiliar technology. As England concludes:

Eric’s role in the greater landscape of my project is to assist with metadata clean-up (much of which is still forthcoming), and I couldn’t have predicted how extensive this collaboration would become back when Lora suggested I look into a script to collapse directory structures. One of the biggest takeaways for me has been to reach out to colleagues in other departments, ask for help, and you both might learn a new thing or two. Our collaboration has been successful not just in producing these scripts to automate processing. When we began this process in January, I was rather intimidated by Python. I still have a ways to go with learning Python, but I’m now more intrigued than apprehensive because I have a better sense of its capabilities and potential use in processing large, digital archival collections.

England shares lessons learned and links to the scripts on GitHub.

Note from dh+lib Review editors: This post was edited after publication to correct inaccuracies in the description of the automation project, to provide links to the full set of scripts, and to indicate that the project is ongoing. Thanks to Elizabeth England for bringing this to our attention. Our apologies for the problems with the initial post.

dh+lib Review

This post was produced through a cooperation among Md Intaj Ali, Nickoal Eichmann-Kalwara, Alix Keener, Douglas Luman, Elizabeth Tegeler, Shilpa Rele, and Allison Ringness (Editors-at-large for the week), Caro Pinto (Editor for the week), Sarah Potvin (Site Editor), and Caitlin Christian-Lamb, Roxanne Shirazi, and Patrick Williams (dh+lib Review Editors).