Looking Across the Digital Preservation Landscape

When it comes to digital preservation, everyone agrees that a little bit is better than nothing. Look no further than these two excellent presentations from Code4Lib 2016, “Can’t Wait for Perfect: Implementing “Good Enough” Digital Preservation” by Shira Peltzman and Alice Sara Prael, and “Digital Preservation 101, or, How to Keep Bits for Centuries” by Julie Swierczek. I highly suggest you go check those out before reading more of this post if you are new to digital preservation, since they get into some technical details that I won’t.

The takeaway from these for me was twofold. First, digital preservation doesn’t have to be hard, but it does have to be intentional, and secondly, it does require institutional commitment. If you’re new to the world of digital preservation, understanding all the basic issues and what your options are can be daunting. I’ve been fortunate enough to lead a group at my institution that has spent the last few years working through some of these issues, and so in this post I want to give a brief overview of the work we’ve done, as well as the current landscape for digital preservation systems. This won’t be an in-depth exploration, more like a key to the map. Note that ACRL TechConnect has covered a variety of digital preservation issues before, including data management and preservation in “The Library as Research Partner” and using bash scripts to automate digital preservation workflow tasks in “Bash Scripting: automating repetitive command line tasks”.

The committee I chair started examining born digital materials, but expanded focus to all digital materials, since our digitized materials were an easier test case for a lot of our ideas. The committee spent a long time understanding the basic tenets of digital preservation–and in truth, we’re still working on this. For this process, we found working through the NDSA Levels of Digital Preservation an extremely helpful exercise–you can find a helpfully annotated version with tools by Shira Peltzman and Alice Sara Prael, as well as an additional explanation by Shira Peltman. We also relied on the Library of Congress Signal blog and the work of Brad Houston, among other resources. A few of the tasks we accomplished were to create a rough inventory of digital materials, a workflow manual, and to acquire many terabytes (currently around 8) of secure networked storage space for files to replace all removable hard drives being used for backups. While backups aren’t exactly digital preservation, we wanted to at the very least secure the backups we did have. An inventory and workflow manual may sound impressive, but I want to emphasize that these are living and somewhat messy documents. The major advantage of having these is not so much for what we do have, but for identifying gaps in our processes. Through this process, we were able to develop a lengthy (but prioritized) list of tasks that need to be completed before we’ll be satisfied with our processes. An example of this is that one of the major workflow gaps we discovered is that we have many items on obsolete digital media formats, such as floppy disks, that needs to be imaged before it can even be inventoried. We identified the tool we wanted to use for that, but time and staffing pressures have left the completion of this project in limbo. We’re now working on hiring a graduate student who can help work on this and similar projects.

The other piece of our work has been trying to understand what systems are available for digital preservation. I’ll summarize my understanding of this below, with several major caveats. This is a world that is currently undergoing a huge amount of change as many companies and people work on developing new systems or improving existing systems, so there is a lot missing from what I will say. Second, none of these solutions are necessarily mutually exclusive. Some by design require various pieces to be used together, some may not require it, but your circumstances may dictate a different solution. For instance, you may not like the access layer built into one system, and so will choose something else. The dream that you can just throw money at the problem and it will go away is, at present, still just a dream–as are so many library technology problems.

The closest to such a dream is the end-to-end system. This is something where at one end you load in a file or set of files you want to preserve (for example, a large set of donated digital photographs in TIFF format), and at the other end have a processed archival package (which might include the TIFF files, some metadata about the processing, and a way to check for bit rot in your files), as well as an access copy (for example, a smaller sized JPG appropriate for display to the public) if you so desire–not all digital files should be available to the public, but still need to be preserved.

Examples of such systems include Preservica, ArchivesDirect, and Rosetta. All of these are hosted vended products, but ArchivesDirect is based on open source Archivematica so it is possible to get some idea of the experience of using it if you are able to install the tools on which it based. The issues with end-t0-end systems are similar to any other choice you make in library systems. First, they come at a high price–Preservica and ArchivesDirect are open about their pricing, and for a plan that will meet the needs of medium-sized libraries you will be looking at $10,000-$14,000 annual cost. You are pretty much stuck with the options offered in the product, though you still have many decisions to make within that framework. Migrating from one system to another if you change your mind may involve some very difficult processes, and so inertia dictates that you will be using that system for the long haul, which a short trial period or demos may not be enough to really tell you that it’s a good idea. But you do have the potential for more simplicity and therefore a stronger likelihood that you will actually use them, as well as being much more manageable for smaller staffs that lack dedicated positions for digital preservation work–or even room in the current positions for digital preservation work.  A hosted product is ideal if you don’t have the staff or servers to install anything yourself, and helps you get your long-term archival files onto Amazon Glacier. Amazon Glacier is, by the way, where pretty much all the services we’re discussing store everything you are submitting for long-term storage. It’s dirt cheap to store on Amazon Glacier and if you can restore slowly, not too expensive to restore–only expensive if you need to restore a lot quickly. But using it is somewhat technically challenging since you only interact with it through APIs–there’s no way to log in and upload files or download files as with a cloud storage service like Dropbox. For that reason, when you’re paying a service hundreds of dollars a terabyte that ultimately stores all your material on Amazon Glacier which costs pennies per gigabye, you’re paying for the technical infrastructure to get your stuff on and off of there as much as anything else. In another way you’re paying an insurance policy for accessing materials in a catastrophic situation where you do need to recover all your files–theoretically, you don’t have to pay extra for such a situation.

A related option to an end-to-end system that has some attractive features is to join a preservation network. Examples of these include Digital Preservation Network (DPN) or APTrust. In this model, you pay an annual membership fee (right now $20,000 annually, though this could change soon) to join the consortium. This gives you access to a network of preservation nodes (either Amazon Glacier or nodes at other institutions), access to tools, and a right (and requirement) to participate in the governance of the network. Another larger preservation goal of such networks is to ensure long-term access to material even if the owning institution disappears. Of course, $20,000 plus travel to meetings and work time to participate in governance may be out of reach of many, but it appears that both DPN and APTrust are investigating new pricing models that may meet the needs of smaller institutions who would like to participate but can’t contribute as much in money or time. This a world that I would recommend watching closely.

Up until recently, the way that many institutions were achieving digital preservation was through some kind of repository that they created themselves, either with open source repository software such as Fedora Repository or DSpace or some other type of DIY system. With open source Archivematica, and a few other tools, you can build your own end-to-end system that will allow you to process files, store the files and preservation metadata, and provide access as is appropriate for the collection. This is theoretically a great plan. You can make all the choices yourself about your workflows, storage, and access layer. You can do as much or as little as you need to do. But in practice for most of us, this just isn’t going to happen without a strong institutional commitment of staff and servers to maintain this long term, at possibly a higher cost than any of the other solutions. That realization is one of the driving forces behind Hydra-in-a-Box, which is an exciting initiative that is currently in development. The idea is to make it possible for many different sizes of institutions to take advantage of the robust feature sets for preservation in Fedora and workflow management/access in Hydra, but without the overhead of installing and maintaining them. You can follow the project on Twitter and by joining the mailing list.

After going through all this, I am reminded of one of my favorite slides from Julie Swierczek’s Code4Lib presentation. She works through the Open Archival Initiative System model graph to explain it in depth, and comes to a point in the workflow that calls for “Sustainable Financing”, and then zooms in on this. For many, this is the crux of the digital preservation problem. It’s possible to do a sort of ok job with digital preservation for nothing or very cheap, but to ensure long term preservation requires institutional commitment for the long haul, just as any library collection requires. Given how much attention digital preservation is starting to receive, we can hope that more libraries will see this as a priority and start to participate. This may lead to even more options, tools, and knowledge, but it will still require making it a priority and putting in the work.

Bash Scripting: automating repetitive command line tasks

Introduction

One of my current tasks is to develop workflows for digital preservation procedures. These workflows begin with the acquisition of files – either disk images or logical file transfers – both of which end up on a designated server. Once acquired, the images (or files) are checked for viruses. If clean, they are bagged using Bagit and then copied over to a different server for processing.1 This work is all done at the command line, and as you can imagine, it gets quite repetitive. It’s also a bit error-prone since our file naming conventions include a 10-digit media ID number, which is easily mistyped. So once all the individual processes were worked out, I decided to automate things a bit by placing the commands into a single script. I should mention here that I’m no Linux whiz- I use it as needed which sometimes is daily, sometimes not. This is the first time I’ve ever tried to tie commands together in a Bash script, but I figured previous programming experience would help.

Creating a Script

To get started, I placed all the virus check commands for disk images into a script. These commands are different than logical file virus checks since the disk has to be mounted to get a read. This is a pretty simple step – first add:

#!/bin/bash

as the first line of the file (this line should not be indented or have any other whitespace in front of it). This tells the kernel which kind of interpreter to invoke, in this case, Bash. You could substitute the path to another interpreter, like Python, for other types of scripts: #!/bin/python.

Next, I changed the file permissions to make the script executable:

chmod +x myscript

I separated the virus check commands so that I could test those out and make sure they were working as expected before delving into other script functions.

Here’s what my initial script looked like (comments are preceded by a #):

#!/bin/bash

#mount disk
sudo mount -o ro,loop,nodev,noexec,nosuid,noatime /mnt/transferfiles/2013_072/2013_072_DM0000000001/2013_072_DM0000000001.iso /mnt/iso

#check to verify mount
mountpoint -q /mnt/iso && echo "mounted" || "not mounted"

#call the Clam AV program to run the virus check
clamscan -r /mnt/iso > "/mnt/transferfiles/2013_072/2013_072_DM0000000001/2013_072_DM0000000001_scan_test.txt"

#unmount disk
sudo umount /mnt/iso

#check disk unmounted
mountpoint -q /mnt/iso && echo "mounted" || "not mounted"

All those options on the mount command? They give me the piece of mind that accessing the disk will in no way alter it (or affect the host server), thus preserving its authenticity as an archival object. You may also be wondering about the use of “&&” and “||”.  These function as conditional AND and OR operators, respectively. So “&&” tells the shell to run the first command, AND if that’s successful, it will run the second command. Conversely, the “||” tells the shell to run the first command OR if that fails, run the second command. So the mount check command can be read as: check to see if the directory at /mnt/iso is a mountpoint. If the mount is successful, then echo “mounted.” If it’s not, echo “not mounted.” More on redirection.

Adding Variables

You may have noted that the script above only works on one disk image (2013_072_DM0000000001.iso), which isn’t very useful. I created variables for the accession number, the digital media number, and the file extension, since they all changed depending on the disk image information.The file naming convention we use for disk images is consistent and strict. The top level directory is the accession number. Within that, each disk image acquired from that accession is stored within it’s own directory, named using it’s assigned number. The disk image is then named by a combination of the accession number and the disk image number. Yes, it’s repetitive, but it keeps track of where things came from and links to data we have stored elsewhere. Given that these disks may not be processed for 20 years, such redundancy is preferred.

At first I thought the accession number, digital media number, and extension variables would be best entered at the initial run command; type in one line to run many commands. Each variable is separated by a space, the .iso at the end is the extension for an optical disk image file:

$ ./virus_check.sh 2013_072 DM0000000001 .iso

In Bash, scripts run with arguments are named $1 for the first variable, $2 for the second, and so on. This actually tripped me up for a day or so. I initially thought the $1, $2, etc. variables names used by the book I was referencing were for examples only, and that the first variables I referenced in the script would automatically map in order, so if 2013_072 was the first argument, and $accession was the first variable, $accession = 2013_072 (much like when you pass in a parameter to a Python function). Then I realized there was a reason that more than one reference book and/or site used the $1, $2, $3 system for variables passed in as command line arguments. I assigned each to it’s proper variable, and things were rolling again.

#!/bin/bash

#assign command line variables
$1=$accession
$2=$digital_media
$3=$extension

#mount disk
sudo mount -o ro,loop,noatime /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} /mnt/iso<span style="line-height: 1.5em;"> </span>

Note: variables names are often presented without curly braces; it’s recommended to place them in curly braces when adjacent to other strings.2

Reading Data

After testing the script a bit, I realized I  didn’t like passing the variables in via the command line. I kept making typos, and it was annoying not to have any data validation done in a more interactive fashion. I reconfigured the script to prompt the user for input:

read -p "Please enter the accession number" accession

read -p "Please enter the digital media number" digital_media

read -p "Please enter the file extension, include the preceding period" extension

After reviewing some test options, I decided to test that $accession and $digital_media were valid directories, and that the combo of all three variables was a valid file. This test seems more conclusive than simply testing whether or not the variables fit the naming criteria, but it does mean that if the data entered is invalid, the feedback given to the user is limited. I’m considering adding tests for naming criteria as well, so that the user knows when the error is due to a typo vs. a non-existent directory or file. I also didn’t want the code to simply quit when one of the variables is invalid – that’s not very user-friendly. I decided to ask the user for input until valid input was received.

read -p "Please enter the accession number" accession
until [ -d /mnt/transferfiles/${accession} ]; do
     read -p "Invalid. Please enter the accession number." accession
done

read -p "Please enter the digital media number" digital_media
until [ -d /mnt/transferfiles/${accession}/${accession}_${digital_media} ]; do
     read -p "Invalid. Please enter the digital media number." digital_media
done

read -p  "Please enter the file extension, include the preceding period" extension
until [ -e/mnt/transferfiles/${accession}/${accession}_${digital_media}/
${accession}_${digital_media}${extension} ]; do
     read -p "Invalid. Please enter the file extension, including the preceding period" extension
done

Creating Functions

You may have noted that the command used to test if a disk is mounted or not is called twice. This is done on purpose, as it was a test I found helpful when running the virus checks; the virus check runs and outputs a report even if the disk isn’t mounted. Occasionally disks won’t mount for various reasons. In such cases, the resulting report will state that it scanned no data, which is confusing because the disk itself possibly could have contained no data. Testing if it’s mounted eliminates that confusion. The command is repeated after the disk has been unmounted, mainly because I found it easy to forget the unmount step, and testing helps reinforce good behavior. Given that the command is repeated twice, it makes sense to make it a function rather than duplicate it.

check_mount () {
     #checks to see if disk is mounted or not
     mountpoint -q /mnt/iso && echo "mounted" || "not mounted" 
}

Lastly, I created a function for the input variables. I’m sure there’s prettier, more concise ways of writing this function, but since it’s still being refined and I’m still learning Bash scripting, I decided to leave it for now. I did want it placed in it’s own function because I’m planning to add additional code that will notify me if the virus check is positive and exit the program, or if it’s negative, bag the disk image and corresponding files, and copy them over to another server where they’ll wait for further processing.

get_image () {
     #gets data from user, validates it    
     read -p "Please enter the accession number" accession
     until [ -d /mnt/transferfiles/${accession} ]; do
          read -p "Invalid. Please enter the accession number." accession
     done

     read -p "Please enter the digital media number" digital_media
     until [ -d /mnt/transferfiles/${accession}/${accession}_${digital_media} ]; do
          read -p "Invalid. Please enter the digital media number." digital_media
     done

     read -p  "Please enter the file extension, include the preceding period" extension
     until [ -e /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} ]; do
          read -p "Invalid. Please enter the file extension, including the preceding period" extension
     done
}

Resulting (but not final!) Script

#!/bin/bash
#takes accession number, digital media number, and extension as variables to mount a disk image and run a virus check

check_mount () {
     #checks to see if disk is mounted or not
     mountpoint -q /mnt/iso && echo "mounted" || "not mounted"
}

get_image () {
     #gets disk image data from user, validates info
     read -p "Please enter the accession number" accession
     until [ -d /mnt/transferfiles/${accession} ]; do
          read -p "Invalid. Please enter the accesion number." accession
     done

     read -p "Please enter the digital media number" digital_media
     until [ -d /mnt/transferfiles/${accession}/${accession}_${digital_media} ]; do
          read -p "Invalid. Please enter the digital media number." digital_media
     done

     read -p  "Please enter the file extension, include the preceding period" extension
     until [ -e/mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} ]; do
          read -p "Invalid. Please enter the file extension, including the preceding period" extension
     done
}

get_image

#mount disk
sudo mount -o ro,loop,noatime /mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}${extension} /mnt/${extension} 

check_mount

#run virus check
sudo clamscan -r /mnt/iso > "/mnt/transferfiles/${accession}/${accession}_${digital_media}/${accession}_${digital_media}_scan_test.txt"

#unmount disk
sudo umount /mnt/iso

check_mount

Conclusion

There’s a lot more I’d like to do with this script. In addition to what I’ve already mentioned, I’d love to enable it to run over a range of digital media numbers, since they often are sequential. It also doesn’t stop if the disk isn’t mounted, which is an issue. But I thought it served as a good example of how easy it is to take repetitive command line tasks and turn them into a script. Next time, I’ll write about the second phase of development, which will include combining this script with another one, virus scan reporting, bagging, and transfer to another server.

Suggested References

An A-Z Index of the Bash command line for Linux

The books I used, both are good for basic command line work, but they only include a small section for actual scripting:

Barrett, Daniel J., Linux Pocket Guide. O’Reilly Media, 2004.

Shotts, Jr., William E. The Linux Command Line: A complete introduction. no starch press. 2012.

The book I wished I used:

Robbins, Arnold and Beebe, Nelson H. F. Classic Shell Scripting. O’Reilly Media, 2005.

Notes

  1. Logical file transfers often arrive in bags, which are then validated and virus checked.
  2. Linux Pocket Guide