Looking Across the Digital Preservation Landscape

When it comes to digital preservation, everyone agrees that a little bit is better than nothing. Look no further than these two excellent presentations from Code4Lib 2016, “Can’t Wait for Perfect: Implementing “Good Enough” Digital Preservation” by Shira Peltzman and Alice Sara Prael, and “Digital Preservation 101, or, How to Keep Bits for Centuries” by Julie Swierczek. I highly suggest you go check those out before reading more of this post if you are new to digital preservation, since they get into some technical details that I won’t.

The takeaway from these for me was twofold. First, digital preservation doesn’t have to be hard, but it does have to be intentional; second, it requires institutional commitment. If you’re new to the world of digital preservation, understanding all the basic issues and what your options are can be daunting. I’ve been fortunate enough to lead a group at my institution that has spent the last few years working through some of these issues, and so in this post I want to give a brief overview of the work we’ve done, as well as the current landscape for digital preservation systems. This won’t be an in-depth exploration, more like a key to the map. Note that ACRL TechConnect has covered a variety of digital preservation issues before, including data management and preservation in “The Library as Research Partner” and using bash scripts to automate digital preservation workflow tasks in “Bash Scripting: automating repetitive command line tasks”.

The committee I chair started by examining born-digital materials, but expanded its focus to all digital materials, since our digitized materials were an easier test case for a lot of our ideas. The committee spent a long time understanding the basic tenets of digital preservation–and in truth, we’re still working on this. For this process, we found working through the NDSA Levels of Digital Preservation an extremely helpful exercise–you can find a helpfully annotated version with tools by Shira Peltzman and Alice Sara Prael, as well as an additional explanation by Shira Peltzman. We also relied on the Library of Congress Signal blog and the work of Brad Houston, among other resources. A few of the tasks we accomplished were to create a rough inventory of digital materials, to write a workflow manual, and to acquire many terabytes (currently around 8) of secure networked storage space for files to replace all the removable hard drives being used for backups. While backups aren’t exactly digital preservation, we wanted to at the very least secure the backups we did have. An inventory and workflow manual may sound impressive, but I want to emphasize that these are living and somewhat messy documents. The major advantage of having these is not so much in documenting what we do have as in identifying gaps in our processes. Through this process, we were able to develop a lengthy (but prioritized) list of tasks that need to be completed before we’ll be satisfied with our processes. For example, one of the major workflow gaps we discovered is that we have many items on obsolete digital media formats, such as floppy disks, that need to be imaged before they can even be inventoried. We identified the tool we wanted to use for that, but time and staffing pressures have left the completion of this project in limbo. We’re now working on hiring a graduate student who can help work on this and similar projects.

The other piece of our work has been trying to understand what systems are available for digital preservation. I’ll summarize my understanding of this below, with several major caveats. First, this is a world that is currently undergoing a huge amount of change as many companies and people work on developing new systems or improving existing systems, so there is a lot missing from what I will say. Second, none of these solutions are necessarily mutually exclusive. Some by design require various pieces to be used together; others may not require it, but your circumstances may dictate a different solution. For instance, you may not like the access layer built into one system, and so will choose something else. The dream that you can just throw money at the problem and it will go away is, at present, still just a dream–as is the case with so many library technology problems.

The closest to such a dream is the end-to-end system. This is something where at one end you load in a file or set of files you want to preserve (for example, a large set of donated digital photographs in TIFF format), and at the other end have a processed archival package (which might include the TIFF files, some metadata about the processing, and a way to check for bit rot in your files), as well as an access copy (for example, a smaller sized JPG appropriate for display to the public) if you so desire–not all digital files should be available to the public, but still need to be preserved.
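To make the “way to check for bit rot” a little more concrete, here is a minimal sketch of a fixity check in Python. It is not how any of the systems discussed in this post actually do it; it only illustrates the general idea of recording a checksum for every file at ingest and comparing against it later. The function names (`sha256_of`, `build_manifest`, `verify_manifest`) and the choice of SHA-256 are my own assumptions for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum.

    Hypothetical helper: run once at ingest and store the result
    alongside the files, e.g. as part of the archival package.
    """
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose checksum changed."""
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

In practice you would save the manifest somewhere separate from the files themselves and rerun `verify_manifest` on a schedule; any path it returns is a file that needs to be restored from another copy.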

Examples of such systems include Preservica, ArchivesDirect, and Rosetta. All of these are hosted vended products, but ArchivesDirect is based on open source Archivematica, so it is possible to get some idea of the experience of using it if you are able to install the tools on which it is based. The issues with end-to-end systems are similar to any other choice you make in library systems. First, they come at a high price–Preservica and ArchivesDirect are open about their pricing, and for a plan that will meet the needs of medium-sized libraries you will be looking at $10,000-$14,000 annual cost. You are pretty much stuck with the options offered in the product, though you still have many decisions to make within that framework. Migrating from one system to another if you change your mind may involve some very difficult processes, and so inertia dictates that you will be using that system for the long haul, and a short trial period or demo may not be enough to tell you whether that’s a good idea. But these systems do offer the potential for more simplicity, and therefore a stronger likelihood that you will actually use them, and they are much more manageable for smaller staffs that lack dedicated positions for digital preservation work–or even room in the current positions for digital preservation work. A hosted product is ideal if you don’t have the staff or servers to install anything yourself, and helps you get your long-term archival files onto Amazon Glacier. Amazon Glacier is, by the way, where pretty much all the services we’re discussing store everything you are submitting for long-term storage. It’s dirt cheap to store on Amazon Glacier and, if you can restore slowly, not too expensive to restore–only expensive if you need to restore a lot quickly. But using it is somewhat technically challenging since you only interact with it through APIs–there’s no way to log in and upload files or download files as with a cloud storage service like Dropbox.
For that reason, when you’re paying a service hundreds of dollars a terabyte that ultimately stores all your material on Amazon Glacier, which costs pennies per gigabyte, you’re paying for the technical infrastructure to get your stuff on and off of there as much as anything else. In another sense, you’re paying for an insurance policy for accessing materials in a catastrophic situation where you do need to recover all your files–theoretically, you don’t have to pay extra for such a situation.

A related option to an end-to-end system that has some attractive features is to join a preservation network. Examples of these include the Digital Preservation Network (DPN) and APTrust. In this model, you pay an annual membership fee (right now $20,000 annually, though this could change soon) to join the consortium. This gives you access to a network of preservation nodes (either Amazon Glacier or nodes at other institutions), access to tools, and a right (and requirement) to participate in the governance of the network. Another, larger preservation goal of such networks is to ensure long-term access to material even if the owning institution disappears. Of course, $20,000 plus travel to meetings and work time to participate in governance may be out of reach of many, but it appears that both DPN and APTrust are investigating new pricing models that may meet the needs of smaller institutions that would like to participate but can’t contribute as much in money or time. This is a world that I would recommend watching closely.

Up until recently, the way that many institutions were achieving digital preservation was through some kind of repository that they created themselves, either with open source repository software such as Fedora Repository or DSpace or some other type of DIY system. With open source Archivematica, and a few other tools, you can build your own end-to-end system that will allow you to process files, store the files and preservation metadata, and provide access as is appropriate for the collection. This is theoretically a great plan. You can make all the choices yourself about your workflows, storage, and access layer. You can do as much or as little as you need to do. But in practice for most of us, this just isn’t going to happen without a strong institutional commitment of staff and servers to maintain this long term, at possibly a higher cost than any of the other solutions. That realization is one of the driving forces behind Hydra-in-a-Box, which is an exciting initiative that is currently in development. The idea is to make it possible for many different sizes of institutions to take advantage of the robust feature sets for preservation in Fedora and workflow management/access in Hydra, but without the overhead of installing and maintaining them. You can follow the project on Twitter and by joining the mailing list.

After going through all this, I am reminded of one of my favorite slides from Julie Swierczek’s Code4Lib presentation. She works through the Open Archival Information System (OAIS) model diagram to explain it in depth, and comes to a point in the workflow that calls for “Sustainable Financing”, and then zooms in on this. For many, this is the crux of the digital preservation problem. It’s possible to do a sort of OK job with digital preservation for nothing or very cheap, but ensuring long-term preservation requires institutional commitment for the long haul, just as any library collection requires. Given how much attention digital preservation is starting to receive, we can hope that more libraries will see this as a priority and start to participate. This may lead to even more options, tools, and knowledge, but it will still require making it a priority and putting in the work.