When it comes to digital preservation, everyone agrees that a little bit is better than nothing. Look no further than these two excellent presentations from Code4Lib 2016, “Can’t Wait for Perfect: Implementing “Good Enough” Digital Preservation” by Shira Peltzman and Alice Sara Prael, and “Digital Preservation 101, or, How to Keep Bits for Centuries” by Julie Swierczek. I highly suggest you go check those out before reading more of this post if you are new to digital preservation, since they get into some technical details that I won’t.
The takeaway from these for me was twofold. First, digital preservation doesn’t have to be hard, but it does have to be intentional, and secondly, it does require institutional commitment. If you’re new to the world of digital preservation, understanding all the basic issues and what your options are can be daunting. I’ve been fortunate enough to lead a group at my institution that has spent the last few years working through some of these issues, and so in this post I want to give a brief overview of the work we’ve done, as well as the current landscape for digital preservation systems. This won’t be an in-depth exploration, more like a key to the map. Note that ACRL TechConnect has covered a variety of digital preservation issues before, including data management and preservation in “The Library as Research Partner” and using bash scripts to automate digital preservation workflow tasks in “Bash Scripting: automating repetitive command line tasks”.
The committee I chair started examining born digital materials, but expanded focus to all digital materials, since our digitized materials were an easier test case for a lot of our ideas. The committee spent a long time understanding the basic tenets of digital preservation–and in truth, we’re still working on this. For this process, we found working through the NDSA Levels of Digital Preservation an extremely helpful exercise–you can find a helpfully annotated version with tools by Shira Peltzman and Alice Sara Prael, as well as an additional explanation by Shira Peltzman. We also relied on the Library of Congress Signal blog and the work of Brad Houston, among other resources. A few of the tasks we accomplished were to create a rough inventory of digital materials and a workflow manual, and to acquire many terabytes (currently around 8) of secure networked storage space for files to replace all removable hard drives being used for backups. While backups aren’t exactly digital preservation, we wanted to at the very least secure the backups we did have. An inventory and workflow manual may sound impressive, but I want to emphasize that these are living and somewhat messy documents. The major advantage of having these is not so much for what we do have, but for identifying gaps in our processes. Through this process, we were able to develop a lengthy (but prioritized) list of tasks that need to be completed before we’ll be satisfied with our processes. For example, one of the major workflow gaps we discovered is that we have many items on obsolete digital media formats, such as floppy disks, that need to be imaged before they can even be inventoried. We identified the tool we wanted to use for that, but time and staffing pressures have left the completion of this project in limbo. We’re now working on hiring a graduate student who can help work on this and similar projects.
The other piece of our work has been trying to understand what systems are available for digital preservation. I’ll summarize my understanding of this below, with several major caveats. First, this is a world that is currently undergoing a huge amount of change as many companies and people work on developing new systems or improving existing systems, so there is a lot missing from what I will say. Second, none of these solutions are necessarily mutually exclusive. Some by design require various pieces to be used together, some may not require it, but your circumstances may dictate a different solution. For instance, you may not like the access layer built into one system, and so will choose something else. The dream that you can just throw money at the problem and it will go away is, at present, still just a dream–as with so many library technology problems.
The closest to such a dream is the end-to-end system. This is something where at one end you load in a file or set of files you want to preserve (for example, a large set of donated digital photographs in TIFF format), and at the other end have a processed archival package (which might include the TIFF files, some metadata about the processing, and a way to check for bit rot in your files), as well as an access copy (for example, a smaller sized JPG appropriate for display to the public) if you so desire–not all digital files should be available to the public, but still need to be preserved.
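To make "a way to check for bit rot" a little more concrete, here is a minimal sketch of a checksum manifest in Python. It is not how Preservica, ArchivesDirect, or Archivematica implement fixity checking; the function names and the choice of SHA-256 are my own assumptions for illustration. The idea is simply to record a checksum for every file at ingest, then recompute later and flag anything that is missing or no longer matches.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's relative path to its checksum at ingest time."""
    return {
        p.relative_to(folder).as_posix(): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def find_changed(folder: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or no longer match the stored checksum."""
    changed = []
    for name, expected in manifest.items():
        target = folder / name
        if not target.is_file() or sha256_of(target) != expected:
            changed.append(name)
    return changed
```

In practice you would save the manifest alongside the files (for example as a text file in the archival package) and run the comparison on a schedule; an empty list from find_changed means every file still matches what was ingested.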
Examples of such systems include Preservica, ArchivesDirect, and Rosetta. All of these are hosted vended products, but ArchivesDirect is based on open source Archivematica so it is possible to get some idea of the experience of using it if you are able to install the tools on which it is based. The issues with end-to-end systems are similar to any other choice you make in library systems. First, they come at a high price–Preservica and ArchivesDirect are open about their pricing, and for a plan that will meet the needs of medium-sized libraries you will be looking at $10,000-$14,000 annual cost. You are pretty much stuck with the options offered in the product, though you still have many decisions to make within that framework. Migrating from one system to another if you change your mind may involve some very difficult processes, and so inertia dictates that you will be using that system for the long haul, and a short trial period or demo may not be enough to tell you whether that’s a good idea. But you do have the potential for more simplicity and therefore a stronger likelihood that you will actually use them, and they are much more manageable for smaller staffs that lack dedicated positions for digital preservation work–or even room in the current positions for digital preservation work. A hosted product is ideal if you don’t have the staff or servers to install anything yourself, and helps you get your long-term archival files onto Amazon Glacier. Amazon Glacier is, by the way, where pretty much all the services we’re discussing store everything you are submitting for long-term storage. It’s dirt cheap to store on Amazon Glacier and if you can restore slowly, not too expensive to restore–only expensive if you need to restore a lot quickly. But using it is somewhat technically challenging since you only interact with it through APIs–there’s no way to log in and upload files or download files as with a cloud storage service like Dropbox.
For that reason, when you’re paying a service hundreds of dollars a terabyte that ultimately stores all your material on Amazon Glacier, which costs pennies per gigabyte, you’re paying for the technical infrastructure to get your stuff on and off of there as much as anything else. In another way you’re paying for an insurance policy for accessing materials in a catastrophic situation where you do need to recover all your files–theoretically, you don’t have to pay extra for such a situation.
A related option to an end-to-end system that has some attractive features is to join a preservation network. Examples of these include Digital Preservation Network (DPN) or APTrust. In this model, you pay an annual membership fee (right now $20,000 annually, though this could change soon) to join the consortium. This gives you access to a network of preservation nodes (either Amazon Glacier or nodes at other institutions), access to tools, and a right (and requirement) to participate in the governance of the network. Another larger preservation goal of such networks is to ensure long-term access to material even if the owning institution disappears. Of course, $20,000 plus travel to meetings and work time to participate in governance may be out of reach of many, but it appears that both DPN and APTrust are investigating new pricing models that may meet the needs of smaller institutions who would like to participate but can’t contribute as much in money or time. This is a world that I would recommend watching closely.
Up until recently, the way that many institutions were achieving digital preservation was through some kind of repository that they created themselves, either with open source repository software such as Fedora Repository or DSpace or some other type of DIY system. With open source Archivematica, and a few other tools, you can build your own end-to-end system that will allow you to process files, store the files and preservation metadata, and provide access as is appropriate for the collection. This is theoretically a great plan. You can make all the choices yourself about your workflows, storage, and access layer. You can do as much or as little as you need to do. But in practice for most of us, this just isn’t going to happen without a strong institutional commitment of staff and servers to maintain this long term, at possibly a higher cost than any of the other solutions. That realization is one of the driving forces behind Hydra-in-a-Box, which is an exciting initiative that is currently in development. The idea is to make it possible for many different sizes of institutions to take advantage of the robust feature sets for preservation in Fedora and workflow management/access in Hydra, but without the overhead of installing and maintaining them. You can follow the project on Twitter and by joining the mailing list.
After going through all this, I am reminded of one of my favorite slides from Julie Swierczek’s Code4Lib presentation. She works through the Open Archival Information System (OAIS) model diagram to explain it in depth, and comes to a point in the workflow that calls for “Sustainable Financing”, and then zooms in on this. For many, this is the crux of the digital preservation problem. It’s possible to do a sort of ok job with digital preservation for nothing or very cheap, but to ensure long term preservation requires institutional commitment for the long haul, just as any library collection requires. Given how much attention digital preservation is starting to receive, we can hope that more libraries will see this as a priority and start to participate. This may lead to even more options, tools, and knowledge, but it will still require making it a priority and putting in the work.
After much hard work over years by the Drupal community, Drupal users rejoiced when Drupal 8 came out late last year. The system has been completely rewritten and does a lot of great stuff–but can it do what we need Drupal websites to do for libraries? The quick answer seems to be that it’s not quite ready, but depending on your needs it might be worth a look.
For those who aren’t familiar with Drupal, it’s a content management system designed to manage complex sites with multiple types of content, users, features, and appearances. Certain “core” features are available to everyone out of the box, but even more useful are the “modules”, which extend the features to do all kinds of things from the mundane but essential backup of a site to a flashy carousel slider. However, the modules are created by individuals or companies and contributed back to the community, and thus when Drupal makes a major version change they need to be rewritten, quite drastically in the case of Drupal 8. That means that right now we are in a period where developers may or may not be redoing their modules, or they may be rethinking how a certain task should be done in the future. Because most of these developers are doing this work as volunteers, it’s not reasonable to expect that they will complete the work on your timeline. The expectation is that if a feature is really important to you, then you’ll work on development to make it happen. That is, of course, easier said than done for people who barely have enough time to do the basic web development asked of them, much less complex programming or learning a new system top to bottom, so most of us are stuck waiting or figuring out our own solutions.
Despite my knowledge of the reality of how Drupal works, I was very excited at the prospect of getting into Drupal 8 and learning all the new features. I installed it right away and started poking around, but realized pretty quickly I was going to have to do a complete evaluation for whether it was actually practical to use it for my library’s website. Our website has been on Drupal 7 since 2012, and works pretty well, though it does need a new theme to bring it into line with 2016 design and accessibility standards. Ideally, however, we could be doing even more with the site, such as providing better discovery for our digital special collections and making the site information more semantic web friendly. It was those latter, more advanced, feature desires that made me really wish to use Drupal 8, which includes semantic HTML5 integration and schema.org markup, as well as better integration with other tools and libraries. But the question remains–would it really be practical to work on migrating the site immediately, or would it make more sense to spend some development time on improving the Drupal 7 site to make it work for the next year or so while working on Drupal 8 development more slowly?
A bit of research online will tell you that there’s no right answer, but that the first thing to do in an evaluation is determine whether any of the modules on which your site depends are available for Drupal 8, and if not, whether there is a good alternative. I must add that while all the functions I am going to mention can be done manually or through custom code, a lot of that work would take more time to write and maintain than I expect to have going forward. In fact, we’ve been working to move more of our customized code to modules already, since that makes it possible to distribute some of the workload to others outside of the very few people at our library who write code or even know HTML well, not to mention taking advantage of all the great expertise of the Drupal community.
I tried two different methods for the evaluation. First, I created a spreadsheet with all the modules we actually use in Drupal 7, their versions, and the current status of those modules in Drupal 8 or if I found a reasonable substitute. Next, I tried a site that automates that process, d8upgrade.org. Basically you fill in your website URL and email, and wait a day for your report, which is very straightforward with a list of modules found for your site, whether there is a stable release, an alpha or beta release, or no Drupal 8 release found yet. This is a useful timesaver, but will need some manual work to complete and isn’t always completely up to date.
My manual analysis determined that there were 30 modules on which we depend to a greater or lesser extent. Of those, 10 either moved into Drupal core (so would automatically be included) or the functions for which we used them moved into another piece of core. Five had versions available in Drupal 8, with varying levels of release (i.e. several in stable alpha release, so questionable to use for production sites but probably fine), and 5 were not migrated but it was possible to identify substitute Drupal 8 modules. That’s pretty good–18 modules were available in Drupal 8, and in several cases one module could do the job that two or more had done in Drupal 7. Of the additional 11 modules that weren’t migrated and didn’t have an easy substitution, three are critical to maintaining our current site workflows. I’ll talk about those in more detail below.
d8upgrade.org found 21 modules in use, though I didn’t include all of them on my own spreadsheet if I didn’t intend to keep using them in the future. I’ve included a screenshot of the report, and there are a few things to note. This list does not have all the modules I had on my list, since some of those are used purely behind the scenes for administrative purposes and would have no indication of use without administrative access. The very last item on the list is Core, which of course isn’t going to be upgraded to Drupal 8–it is Drupal 8. I also found that it’s not completely up to date. For instance, my own analysis found a pre-release version of Workbench Moderation, but that information had not made it to this site yet. A quick email to them fixed it almost immediately, however, so this screenshot is out of date.
I decided that there were three dealbreaker modules for the upgrade, and I want to talk about why we rely on them, since I think my reasoning will be applicable to many libraries with limited web development time. I will also give honorable mention to a module that we are not currently using, but I know a lot of libraries rely on and that I would potentially like to use in the future.
Webform is a module that creates a very simple to use interface for creating webforms and doing all kinds of things with them beyond just simply sending emails. We have many, many custom PHP/MySQL forms throughout our website and intranet, but there are only two people on the staff who can edit those or download the submitted entries from them. They also occasionally have dreadful spam problems. We’ve been slowly working on migrating these custom forms to the Drupal Webform module, since that allows much more distribution of effort across the staff, and provides easier ways to stop spam using, for instance, the Honeypot module or Mollom. (We’ve found that the Honeypot module stopped nearly all our spam problems and didn’t need to move to Mollom, since we don’t have user comments to moderate). The thought of going back to coding all those webforms myself is not appealing, so for now I can’t move forward until I come up with a Drupal solution.
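For anyone curious why a honeypot stops so much form spam, here is a minimal sketch of the general technique in Python. This is not the Drupal Honeypot module's code; the field name and time threshold are hypothetical values I picked for illustration. The form includes a field hidden from human visitors with CSS, so anything that fills it in is almost certainly a bot, and a submission that arrives implausibly fast after the form was rendered is treated the same way.

```python
# Hypothetical name for a form field that is hidden from humans with CSS.
HONEYPOT_FIELD = "homepage"
# Hypothetical threshold: humans rarely complete a form in under 5 seconds.
MIN_SECONDS_TO_FILL = 5.0


def looks_like_spam(form: dict[str, str], rendered_at: float, submitted_at: float) -> bool:
    """Return True when a submission trips the honeypot or arrives too quickly."""
    # A human never sees the hidden field, so any value in it suggests a bot.
    if form.get(HONEYPOT_FIELD, "").strip():
        return True
    # Bots tend to submit almost instantly after fetching the form.
    return (submitted_at - rendered_at) < MIN_SECONDS_TO_FILL
```

The appeal of this approach over a CAPTCHA is that legitimate users never see or do anything extra, which matters when the forms are for things like room reservations or purchase suggestions.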
Redirect does a seemingly tiny job that’s extremely helpful. It allows you to create redirects for URLs on your site, which is incredibly helpful for all kinds of reasons. For instance, if you want to create a library site branded link that forwards somewhere else like a database vendor or another page on your university site, or if you want to change a page URL but ensure people with bookmarks to the old page will still find it. This is, of course, something that you can do on your web server, assuming you have access to it, but this module takes a lot of the administrative overhead away and helps keep things organized.
Backup and Migrate is my greatest helper in my goal to be someone who would like to at least be in the neighborhood of best practices for web development when web development is only half my job, or some weeks more like a quarter of my job. It makes a very quick process of keeping my development, staging, and production sites in sync, and since I created a workflow using this module I have been far more successful in keeping my development processes sane. It provides an interface for creating a backup of your site database, files directories, or your database and files that you can use in the Backup and Migrate module to completely restore a site. I use it at least every two weeks, or more often when working on a particular feature to move the database between servers (I don’t move the files with the module for this process, but that’s useful for backups that are for emergency restoration of the site). There are other ways to accomplish this work, but this particular workflow has been so helpful that I hate to dump a lot of time into redoing it just now.
One last honorable mention goes to Workbench, which we don’t use but I know a lot of libraries do use. This allows you to create a much more friendly interface for content editors so they don’t have to deal with the administrative backend of Drupal and allows them to just see their own content. We do use Workbench Moderation, which does have a Drupal 8 release, and allows a moderation queue for the six or so members of staff who can create or edit content but don’t have administrative rights to have their content checked by an administrator. None of them particularly like the standard Drupal content creation interface, and it’s not something that we would ever ask the rest of the staff to use. We know from the lack of use of our intranet, which also is on Drupal, that no one particularly cares for editing content there. So if we wanted to expand access to website editing, which we’ve talked about a lot, this would be a key module for us to use.
Given the current status of these modules, with rewrites in progress, it seems likely that by the end of the year it may be possible to migrate to Drupal 8 with our current setup, or that in playing around with Drupal 8 on a development site we will determine a different way to approach these needs. If you have the interest and time to do this, there are worse ways to pass the time. If you are creating a completely new Drupal site and don’t have a time crunch, starting in Drupal 8 now is probably the way to go, since by the time the site would be ready you may have additional modules available and get to take advantage of all the new features. If this is something you’re trying to roll out by the end of the semester, maybe wait on it.
Have you considered upgrading your library’s site to Drupal 8? Have you been successful? Let us know in the comments.
Anyone who has worked on an institutional repository for even a short time knows that collecting faculty scholarship is not a straightforward process, no matter how nice your workflow looks on paper or how dedicated you are. Keeping expectations for the process manageable (not necessarily low, as in my clickbaity title) and constant simplification and automation can make your process more manageable, however, and therefore work better. I’ve written before about some ways in which I’ve automated my process for faculty collection development, as well as how I’ve used lightweight project management tools to streamline processes. My newest technique for faculty scholarship collection development brings together pieces of all those to greatly improve our productivity.
Allocating Your Human and Machine Resources
First, here is the personnel situation we have for the institutional repository I manage. Your own circumstances will certainly vary, but I think institutions of all sizes will have some version of this distribution. I manage our repository as approximately half my position, and I have one graduate student assistant who works about 10-15 hours a week. From week to week we only average about 30-40 hours total to devote to all aspects of the repository, of which faculty collection development is only a part. We have 12 librarians who are liaisons with departments and do the majority of the outreach to faculty and promotion of the repository, but a limited amount of the collection development except for specific parts of the process. While they are certainly welcome to do more, in reality, they have so much else to do that it doesn’t make sense for them to spend their time on data entry unless they want to (and some of them do). The breakdown of work is roughly that the liaisons promote the repository to the faculty and answer basic questions; I answer more complex questions, develop procedures, train staff, make interpretations of publishing agreements, and verify metadata; and my GA does the simple research and data entry. From time to time we have additional graduate or undergraduate student help in the form of faculty research assistants, and we have a group of students available for digitization if needed.
Those are our human resources. The tools that we use for the day-to-day work include Digital Measures (our faculty activity system), Excel, OpenRefine, Box, and Asana. I’ll say a bit about what each of these is and how we use it below. By far the most important innovation for our faculty collection development workflow has been integration with the Faculty Activity System, which is how we refer to Digital Measures on our campus. Many colleges and universities have some type of faculty activity system or are in the process of implementing one. These generally are adopted for purposes of annual reports, retention, promotion, and tenure reviews. I have been at two different universities working on adopting such systems, and as you might imagine, it’s a slow process with varying levels of participation across departments. Faculty do not always like these systems for a variety of reasons, and so there may be hesitation to complete profiles even when required. Nevertheless, we felt in the library that this was a great source of faculty publication information that we could use for collection development for the repository and the collection in general.
We now have a required question about including the item in the repository on every item the faculty member enters in the Faculty Activity System. If a faculty member is saying they published an article, they also have to say whether it should be included in the repository. We started this in late 2014, and it revolutionized our ability to reach faculty and departments who never had participated in the repository before, as well as simplifying the lives of faculty who were eager participants but now only had to enter data in one place. Of course, there are still a number of people whom we are missing, but this is part of keeping your expectations low–if you can’t reach everyone, focus your efforts on the people you can. And anyway, we are now so swamped with submissions that we can’t keep up with them, which is a good if unusual problem to have in this realm. Note that the process I describe below is basically the same as when we analyze a faculty member’s CV (which I described in my OpenRefine post), but we spend relatively little time doing that these days since it’s easier for most people to just enter their material in Digital Measures and select that they want to include it in the repository.
The ease of integration between your own institution’s faculty activity system (assuming it exists) and your repository certainly will vary, but in most cases it should be possible for the library to get access to the data. It’s a great selling point for the faculty to participate in the system for your Office of Institutional Research or similar office who administers it, since it gives faculty a reason to keep it up to date when they may be in between review cycles. If your institution does not yet have such a system, you might still discuss a partnership with that office, since your repository may hold extremely useful information for them about research activity of which they are not aware.
We get reports from the Faculty Activity System on roughly a quarterly basis. Faculty member data entry tends to bunch around certain dates, so we focus on end of semesters as the times to get the reports. The reports come by email as Excel files with information about the person, their department, contact information, and the like, as well as information about each publication. We do some initial processing in Excel to clean them up, remove duplicates from prior reports, and remove irrelevant information. It is amazing how many people see a field like “Journal Title” as a chance to ask a question rather than provide information. We focus our efforts on items that have actually been published, since the vast majority of people have no interest in posting pre-prints and those that do prefer to post them in arXiv or similar. The few people who do know about pre-prints and don’t have a subject archive generally submit their items directly. This is another way to lower expectations of what can be done through the process. I’ve already described how I use OpenRefine for creating reports from faculty CVs using the SHERPA/RoMEO API, and we follow a similar but much simplified process since we already have the data in the correct columns. Of course, following this process doesn’t tell us what we can do with every item. The journal title may be entered incorrectly so the API call didn’t pick it up, or the journal may not be in SHERPA/RoMEO. My graduate student assistant fills in what he is able to determine, and I work on the complex cases. As we are doing this, the Excel spreadsheet is saved in Box so we have the change history tracked and can easily add collaborators.
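The cleanup we do in Excel could also be scripted. Here is a minimal sketch in Python of the two filters described above: dropping items that already appeared in a prior report, and keeping only items that have actually been published. The column names ("faculty", "title", "journal", "status") and the "Published" status value are assumptions for illustration, not the actual Digital Measures export format.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't hide duplicates."""
    return " ".join(text.lower().split())


def citation_key(row: dict[str, str]) -> tuple[str, str, str]:
    """Identify a citation by faculty member, title, and journal (assumed columns)."""
    return (normalize(row["faculty"]), normalize(row["title"]), normalize(row["journal"]))


def new_published_citations(current: list[dict[str, str]],
                            prior: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep published rows from the current report that no prior report contained."""
    seen = {citation_key(row) for row in prior}
    fresh = []
    for row in current:
        if normalize(row.get("status", "")) != "published":
            continue  # skip in-press, submitted, and similar items
        key = citation_key(row)
        if key not in seen:
            seen.add(key)  # also drops duplicates within the current report
            fresh.append(row)
    return fresh
```

The rows here are plain dictionaries, which is what csv.DictReader from the standard library produces if you save each report as a CSV file first. Matching on normalized text will not catch a journal title entered as a question, so a human pass is still needed.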
At this point, we are ready to move to Asana, which is a lightweight project management tool ideal for several people working on a group of related projects. Asana is far more fun and easy to work with than Excel spreadsheets, and this helps us work together better to manage workload and see where we are with all our on-going projects. For each report (or faculty member CV), we create a new project in Asana with several sections. While it doesn’t always happen in practice, in theory each citation is a task that moves between sections as it is completed, and finally checked off when it is either posted or moved off into some other fate not as glamorous as being archived as open access full text. The sections generally cover posting the publisher’s PDF, contacting publishers, reminders for followup, posting author’s manuscripts, or posting to SelectedWorks, which is our faculty profile service that is related to our repository but mainly holds citations rather than full text. Again, as part of the low expectations, we focus on posting final PDFs of articles or book chapters. We add books to a faculty book list, and don’t even attempt to include full text for these unless someone wants to make special arrangements with their publisher–this is rare, but again the people who really care make it happen. If we already know that the author’s manuscript is permitted, we don’t add these to Asana, but keep them in the spreadsheet until we are ready for them.
We contact publishers in batches, trying to group citations by journal and publisher to increase efficiency so we can send one letter to cover many articles or chapters. We note to follow up with a reminder in one month, and then again in a month after that. Usually the second notice is enough to catch the attention of the publisher. As they respond, we move the citation to either the posting publisher’s PDF section or the author’s manuscript section, or, if posting is not permitted at all, to the post to SelectedWorks section. While we’ve tried several different procedures, I’ve determined it’s best for the liaison librarians to ask just for author’s accepted manuscripts for items after we’ve verified that no other version may be posted. And if we don’t ever get them, we don’t worry about it too much.
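The batching and reminder schedule can be sketched in a few lines of Python. This is only an illustration of the logic, not something we run in Asana; the "publisher" and "journal" keys are assumed column names, and "one month" is approximated as 30 days.

```python
from collections import defaultdict
from datetime import date, timedelta


def batch_by_publisher(citations: list[dict[str, str]]) -> dict[str, list[dict[str, str]]]:
    """Group citations so one permission letter can cover a whole publisher."""
    batches = defaultdict(list)
    for citation in citations:
        batches[citation["publisher"]].append(citation)
    # Within each letter, list the items journal by journal.
    return {
        publisher: sorted(items, key=lambda c: c["journal"])
        for publisher, items in batches.items()
    }


def reminder_dates(sent_on: date, interval_days: int = 30, count: int = 2) -> list[date]:
    """Return follow-up dates: roughly one month out, then a month after that."""
    return [sent_on + timedelta(days=interval_days * n) for n in range(1, count + 1)]
```

For a letter sent on January 1, 2016, reminder_dates(date(2016, 1, 1)) gives January 31 and March 1.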
I hope you’ve gotten some ideas from this post about your own procedures and new tools you might try. Even more, I hope you’ll think about which pieces of your procedures are really working for you, and discard those that aren’t working any more. Your own situation will dictate which those are, but let’s all stop beating ourselves up about not achieving perfection. Make sure to let your repository stakeholders know what works and what doesn’t, and if something that isn’t working is still important, work collaboratively to figure out a way around that obstacle. That type of collaboration is what led to our partnership with the Office of Institutional Research to use the Digital Measures platform for our collection development, and that in turn has led to other collaborative opportunities.
A few of us at Tech Connect participated in the #1Lib1Ref campaign that’s running from January 15th to the 23rd. What’s #1Lib1Ref? It’s a campaign to encourage librarians to get involved with improving Wikipedia, specifically by citation chasing (one of my favorite pastimes!). From the project’s description:
Imagine a World where Every Librarian Added One More Reference to Wikipedia.
Wikipedia is a first stop for researchers: let’s make it better! Your goal today is to add one reference to Wikipedia! Any citation to a reliable source is a benefit to Wikipedia readers worldwide. When you add the reference to the article, make sure to include the hashtag #1Lib1Ref in the edit summary so that we can track participation.
Below, we each describe our experiences editing Wikipedia. Did you participate in #1Lib1Ref, too? Let us know in the comments or join the conversation on Twitter!
I recorded a short screencast of me adding a citation to the Darbhanga article.
— Eric Phetteplace
I used the Citation Hunt tool to find an article that needed a citation. I selected the second one I found, which was about urinary tract infections in space missions. That is very much up my alley. I discovered after a quick Google search that the paragraph in question was plagiarized from a book on Google Books! After a hunt through the Wikipedia policy on quotations, I decided to rewrite the paragraph to paraphrase the quote, and then added my citation. As is usual with plagiarism, the flow was wrong, since there was a reference to a theme in the previous paragraph of the book that wasn’t present in the Wikipedia article, so I chose to remove that entirely. The Wikipedia Citation Tool for Google Books was very helpful in automatically generating an acceptable citation for the appropriate page. Here’s my shiny new paragraph, complete with citation: https://en.wikipedia.org/wiki/
— Margaret Heller
I edited the “Library Facilities” section of the “University of Maryland Baltimore” article in Wikipedia. There was an outdated link in the existing citation, and I also wanted to add two additional sentences and citations. You can see how I went about doing this in my screen recording below. I used the “edit source” option to get the source first in the Text Editor and then made all the changes I wanted in advance. After that, I copy/pasted the changes I wanted from my text file to the Wikipedia page I was editing. Then, I previewed and saved the page. You can see that I also had a typo in my text and had to fix that again to make the citation display correctly, so I had to edit the article more than once. After my recording, I noticed another typo in there, which I fixed using the “edit” option. The “edit” option is much easier to use than the “edit source” option for those who are not familiar with editing Wiki pages. It offers a menu bar on the top with several convenient options.
The recording of editing a Wikipedia article:
— Bohyun Kim
It has been so long since I’ve edited anything on Wikipedia that I had to make a new account and read the “how to add a reference” link; which is to say, if I could do it in 30 minutes while on vacation, anyone can. There is a WYSIWYG option for the editing interface, but I learned to do all this in plain text and it’s still the easiest way for me to edit. See the screenshot below for a view of the HTML editor.
I wondered what entry I would find to add a citation to…there have been so many that I’d come across but now I was drawing a total blank. Happily, the 1Lib1Ref campaign gave some suggestions, including “Provinces of Afghanistan.” Since this is my fatherland, I thought it would be a good service to dive into. Many of Afghanistan’s citations are hard to provide for a multitude of reasons. A lot of our history has been an oral tradition. Also, not insignificantly, Afghanistan has been in conflict for a very long time, with much of its history captured from the lens of Great Game participants like England or Russia. Primary sources from the 20th century are difficult to come by because of the state of war from 1979 onwards and there are not many digitization efforts underway to capture what there is available (shout out to NYU and the Afghanistan Digital Library project).
Once I found a source that I thought would be an appropriate reference for a statement on the topography of Uruzgan Province, I did need to edit the sentence to remove the numeric values that had been written since I could not find a source that quantified the area. It’s not a precise entry, to be honest, but it does give the opportunity to link to a good map with other opportunities to find additional information related to Afghanistan’s agriculture. I also wanted to choose something relatively uncontroversial, like geographical features rather than historical or person-based, for this particular campaign.
— Yasmeen Shorish
Keeping any large technical project user-centered is challenging at best. Adding in something like an extremely tight timeline makes it too easy to dispense with this completely. Say, for instance, six months to migrate to a new integrated library system that combines your old ILS plus your link resolver and many other tools and a new discovery layer. I would argue, however, that it’s on a tight timeline like that that a major focus on user experience research can become a key component of your success. I am referring in this piece specifically to user experience on the web, but of course there are other aspects of user experience that go into such a project. While none of my observations about usability testing and user experience are new, I have realized from talking to others that they need help advocating for the importance of user research. As we turn to our hopes and goals for 2016, let’s all make a resolution to figure out a way to make better user experience research happen, even if it seems impossible.
Selling the Need For User Testing
When I worked on implementing a discovery layer at my job earlier this year, I had a team of 18 people from three campuses with varying levels of interest and experience in user testing. It was really important to us that we had an end product that would work for everyone at all levels, whether novice or experienced researcher, as well as for the library staff who would need to use the system on a daily basis. With so many people and such a tight timeline, building user testing into the schedule in the first place helped us to frame our decisions as a hypothesis to confirm or nullify in the next round of testing. We tried to involve as many people as possible in the testing, though we had a core group with experience in running the tests administer them. Doing a test as early as possible is a good way to convince others of the need for testing. People who had never seen a usability test done before found them convincing immediately and were much more on board for future tests.
Remembering Who Your Users Are
Reference and instruction librarians are users too. We sometimes get so focused on reminding librarians that they are not the users that we don’t make things work for them–and they do need to use the catalog too. Librarians who work with students in the classroom and in research consultations on a daily basis have a great deal of insight into seemingly minor issues that may lead to major frustrations. Here’s an example. The desktop view of our discovery layer search box was about 320 pixels long which works fine–if you are typing in just one word. Yet we were “selling” the discovery layer as something that handled known-item searching well, which meant that much of a pasted in citation wasn’t visible. The reference librarians who were doing this exact work knew this would be an issue. We expanded the search box so more words are visible and so it works better for known-item searching.
The same goes for course reserves, interlibrary loan, or other staff who work with a discovery layer frequently, often with the added pressure of tight deadlines. If you can shave seconds off for them, that adds up to a huge amount over the course of the year, and will potentially solve issues for other users as well. One example is that the print view of a book record had very small text–the print stylesheet was set to print at 85% font size, which meant it was challenging to read. The reserves staff relied on this print view to complete their daily work with the student worker. For one student the small print size created an accessibility issue which led to inefficient manual workarounds. We were able to increase the print stylesheet to greater than 100% font size which made the printed page easily readable, and therefore fix the accessibility issue for this specific use case. I suspect there are many other people whom this benefits as well.
Divide the Work
I firmly believe that everyone who is interested in user experience on the web should get some hands on experience with it. That said, not everyone needs to do the hands on work, and with a large project it is important that people focus on their core reason for being on the team. Dividing the group into overlapping teams who worked on data testing, interface testing, and user education and outreach helped us to see the big picture but not overwhelm everyone (a little Overwhelm is going to happen no matter what). These groups worked separately much of the time for deep dives into specific issues, but helped inform each other across the board. For instance, the data group might figure out a potential issue, for which the interface group would determine a test scenario. If testing indicated a change, the user education group could be aware of implications for outreach.
A Quick Timeline is Your Friend
Getting a new tool out with only a few months turnaround time is certainly challenging, but it forces you to forget about perfection and get features done. We got our hands on the discovery layer on Friday, and were doing tests the following Tuesday, with additional tests scheduled for two weeks after the first look. This meant that our first tests were on something very rough, but gave us a big list of items to fix in the next two weeks before the next test (or put on hold if lower priority). We ended up taking off two months from live usability in the middle of the process to focus on development and other types of testing (such as with trusted beta testers). But that early set of tests was crucial in setting the agenda and showing the importance of testing. We ultimately did 5 rounds of testing, 4 of which happened before the discovery layer went live, and 1 a few months after.
Think on the Long Scale
The vendor or the community of developers is presumably not going to stop working on the product, and neither should you. For this reason, it is helpful to make it clear who is doing the work and ensure that it is written into committee charges, job descriptions, or other appropriate documentation. Maintain a list of long-term goals, and in those short timescales figure out just one or two changes you could make. The academic year affords many peaks and lulls, and those lulls can be great times to make minor changes. Regular usability testing ensures that these changes are positive, as well as uncovering new needs as tools and needs change.
Iteration is the way to ensure that your long timescale stays manageable. Work never really stops, but that’s ok. You need a job, right? Back to that idea of a short timeline–borrow from the Agile method to think in timescales of 2 weeks-1 month. Have the end goal in mind, but know that getting there will happen in tiny pieces. This does require some faith that all the crucial pieces will happen, but as long as someone is keeping an eye on those (in our case, the vendor helped a lot with this), the pressure is off on being “finished”. If a test shows that something is broken that really needs to work, that can become high priority, and other desired features can move to a future cycle. Iteration helps you stay on track and get small pieces done regularly.
I hope I’ve made the case for why you need to have a user focus in any project, particularly a large and complex one. Whether you’re a reference librarian, project manager, web developer or cataloger, you have a responsibility to ensure the end result is usable, useful, and something people actually want to use. And no matter how tight your timeline, stick to making sure the process is user centered, and you’ll be amazed at how many impossible things you accomplished.
The Directory of Open Access Journals (DOAJ) is an international directory of journals and index of articles that are available open access. Dating back to 2003, the DOAJ was at the center of a controversy surrounding the “sting” conducted by John Bohannon in Science, which I covered in 2013. Essentially Bohannon used journals listed in DOAJ to try to find journals that would publish an article of poor quality as long as authors paid a fee. At the time many suggested that a crowdsourced journal reviewing platform might be the way to resolve the problem if DOAJ wasn’t a good source. While such a platform might still be a good idea, the simpler and more obvious solution is the one that seems to have happened: for DOAJ to be more strict with publishers about requirements for inclusion in the directory. 1.
The process of cleaning up the DOAJ has been going on for some time and is getting close to an important milestone. All the 10,000+ journals listed in DOAJ were required to reapply for inclusion, and the deadline for that is December 30, 2015. After that time, any journals that haven’t reapplied will be removed from the DOAJ.
“Proactive Not Reactive”
Contrary to popular belief, the process for this started well before the Bohannon piece was published 2. In December 2012 an organization called Infrastructure Services for Open Access (IS4OA) (founded by Alma Swan and Caroline Sutton) took over DOAJ from Lund University, and announced several initiatives, including a new platform, distributed editorial help, and improved criteria for inclusion. 3 Because DOAJ grew to be an important piece of the scholarly communications infrastructure it was inevitable that they would have to take such a step sooner or later. With nearly 10,000 journals and only a small team of editors it wouldn’t have been sustainable over time, and to lose the DOAJ would have been a blow to the open access community.
One of the remarkable things about the revitalization of the DOAJ is the transparency of the process. The DOAJ News Service blog has been detailing the behind-the-scenes processes since May 2014. One of the most useful things is a list of journals that have claimed to be listed in DOAJ but are not. Another important piece of information is the 2015-2016 development roadmap. There is a lot going on with the DOAJ update, however, so below I will pick out what I think is most important to know.
The New DOAJ
In March 2014, the DOAJ created a new application form with much higher standards for inclusion. Previously the form for inclusion was only 6 questions, but after working with the community they changed the application to require 58 questions. The requirements are detailed on a page for publishers, and the new application form is available as a spreadsheet.
While 58 questions seems like a lot, it is important to note that journals need not fulfill every single requirement, other than the basic requirements for inclusion. The idea is that journal publishers must be transparent about the structure and funding of the journal, and that journals explicitly labeled as open access meet some basic theoretical components of open access. For instance, one of the basic requirements is that “the full text of ALL content must be available for free and be Open Access without delay”. Certain other pieces are strong suggestions, but failing to meet them will not cause a journal to be rejected. For instance, the DOAJ takes a strong stand against impact factors and suggests that they not be presented on journal websites at all 4.
To highlight journals that have extremely high standards for “accessibility, openness, discoverability reuse and author rights”, the DOAJ has developed a “Seal” that is awarded to journals who answer “yes” to the following questions (taken from the DOAJ application form):
have an archival arrangement in place with an external party (Question 25). ‘No policy in place’ does not qualify for the Seal.
provide permanent identifiers in the papers published (Question 28). ‘None’ does not qualify for the Seal.
provide article level metadata to DOAJ (Question 29). ‘No’ or failure to provide metadata within 3 months do not qualify for the Seal.
embed machine-readable CC licensing information in article level metadata (Question 45). ‘No’ does not qualify for the Seal.
allow reuse and remixing of content in accordance with a CC BY, CC BY-SA or CC BY-NC license (Question 47). If CC BY-ND, CC BY-NC-ND, ‘No’ or ‘Other’ is selected the journal will not qualify for the Seal.
have a deposit policy registered in a deposit policy directory. (Question 51) ‘No’ does not qualify for the Seal.
allow the author to hold the copyright without restrictions. (Question 52) ‘No’ does not qualify for the Seal.
Part of the appeal of the Seal is that it focuses on the good things about open access journals rather than the questionable practices. Having a whitelist is much more appealing for people doing open access outreach than a blacklist. Journals with the Seal are available in a facet on the new DOAJ interface.
Getting In and Out of the DOAJ
Part of the reworking of the DOAJ was the requirement that all currently listed journals reapply–as of November 19 just over 1,700 journals had been accepted under the new criteria, and just over 800 had been removed (you can follow the list yourself here). For now you can find journals that have reapplied with a green check mark (what DOAJ calls The Tick!). That means that about 85% of journals that were previously listed either have not reapplied, or are still in the verification pipeline 5. While DOAJ does not discuss specific reasons a journal or publisher is removed, they do give a general category for removal. I did some analysis of the data provided in the added/removed/rejected spreadsheet.
At the time of analysis, there were 1776 journals on the accepted list. 20% of these were added since September, and with the deadline looming this number is sure to grow. Around 8% of the accepted journals have the DOAJ Seal.
There were 809 journals removed from the DOAJ, and the reasons fell into the following general categories. I manually checked some of the journals with only 1 or 2 titles, and suspect that some of these may be reinstated if the publisher chooses to reapply. Note that well over half the removed journals weren’t related to misconduct but were ceased or otherwise unavailable.
|Reason for removal|Number of journals|
|Inactive (has not published in the last calendar year)|233|
|Suspected editorial misconduct by publisher|229|
|Website URL no longer works|124|
|Journal not adhering to Best Practice|62|
|Journal is no longer Open Access|45|
|Has not published enough articles this calendar year|2|
|Other; delayed open access|1|
|Other; no content|1|
|Other; taken offline|1|
|Removed at publisher’s request|1|
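The split between misconduct and everything else can be checked with a quick tally. The sketch below uses only the rows listed above, which add up to 699 rather than the full 809 removed journals, so the share it reports applies to the listed rows only. It also assumes that just the “Suspected editorial misconduct by publisher” row counts as misconduct; that grouping and the variable names are illustrative choices, not DOAJ’s own categories.

```python
# Removal reasons and counts copied from the rows listed above.
removal_counts = {
    "Inactive (has not published in the last calendar year)": 233,
    "Suspected editorial misconduct by publisher": 229,
    "Website URL no longer works": 124,
    "Journal not adhering to Best Practice": 62,
    "Journal is no longer Open Access": 45,
    "Has not published enough articles this calendar year": 2,
    "Other; delayed open access": 1,
    "Other; no content": 1,
    "Other; taken offline": 1,
    "Removed at publisher's request": 1,
}

# Assumption: only this row is treated as misconduct-related.
MISCONDUCT_REASONS = {"Suspected editorial misconduct by publisher"}

listed_total = sum(removal_counts.values())
misconduct_total = sum(
    count for reason, count in removal_counts.items()
    if reason in MISCONDUCT_REASONS
)
other_total = listed_total - misconduct_total
other_share = 100 * other_total / listed_total

print(f"Listed removals: {listed_total}")
print(f"Not misconduct-related: {other_total} ({other_share:.1f}%)")
```

Moving “Journal not adhering to Best Practice” into the misconduct group changes the share, so the grouping is worth stating whenever a figure like this is quoted.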
The spreadsheet lists 26 journals that were rejected. Rejected journals will know the specific reasons why their applications were rejected, but those specific reasons are not made public. Journals may reapply after 6 months once they have had an opportunity to amend the issues. 6 The general stated reasons were as follows:
|Reason for rejection|Number of journals|
|Has not published enough articles|2|
|Journal website lacks necessary information|2|
|Not an academic/scholarly journal|1|
|Web site URL doesn’t work|1|
The work that DOAJ is doing to improve transparency and the screening process is very important for open access advocates, who will soon have a tool that they can trust to provide much more complete information for scholars and librarians. For too long we have been forced to use the concept of a list of “questionable” or even “predatory” journals. A directory of journals with robust standards and an easy-to-understand interface will be a fresh start for the rhetoric of open access journals.
Are you the editor of an open access journal? What do you think of the new application process? Leave your thoughts in the comments (anonymously if you like).
I have been mostly absent from ACRL Tech Connect this year because the last nine months have been spent migrating to a new library systems platform and discovery layer. As one of the key members of the implementation team, I have devoted more time to meetings, planning, development, more meetings, and more planning than any other part of my job has required thus far. We have just completed the official implementation project and are regular old customers by now. At this point I finally feel I can take a deep breath and step back to think about the past nine months in a holistic manner to glean some lessons learned from this incredible professional opportunity that was also incredibly challenging at times.
In this post I won’t go into the details of exactly which system we implemented and how, since it’s irrelevant to the larger discussion. Rather I’d like to stay at a high level to think about what working on such a project is like for a professional working with others on a team and as an individual trying to make things happen. For those who are curious about the details of the project, including management and process, those will be detailed in a forthcoming book chapter in Exploring Discovery (ALA Editions) edited by Ken Varnum. I will also be participating in an AL Live episode on this topic on October 8.
A project like this doesn’t come as a surprise. My library had been planning a move to a new platform for a number of years, and had an extremely inclusive selection process when selecting a new platform. When we found out that we would be able to go ahead with the implementation process I knew that I would have the opportunity to lead the implementation of the new discovery layer on the technical side, as well as coordinate much of the effort on the user outreach and education side. That was an exciting and terrifying role, since while it was far less challenging technically to my mind than working on the data migration, it would be the most public piece of the project. In addition it quickly became clear that our multi-campus situation wasn’t going to fit exactly into line with the built in solutions in the products, which required a great deal of additional work to understand the interoperability of the products and how they interacted with other systems. Ultimately it was a great education, but in the thick of it seemed to have no end in sight.
To that end, I wanted to share some of the lessons I learned from this process both as a leader and a member of a team. Of course, many of these are widely applicable to any project, whether it’s in a library systems department or any work place.
Someone has to say the obvious thing
One of the joys of doing something that is new to everyone is that the dread of impostor syndrome is diminished. If no one knows the answer, then no one can look like an idiot for not knowing, after all. Yet that is not always clear to everyone working on the project, and as the leader it’s useful to make it clear you have no idea how something works when you don’t, or, if something is “simple” to you, to still say exactly how it works to make sure everyone understands. There’s a point at which assuming others do know the obvious thing is forgetting your own path to learning, in which it’s helpful to hear the simple thing stated clearly, which may take several attempts. Besides the obvious implications of people not understanding how something works, it robs them of a chance to investigate something of interest and become a real contributor. Try to not make other people have to admit they have no idea what you’re talking about, whether or not you think they should have known it. This also forces you to actually know what you’re talking about. Teaching something is, after all, the best way to learn it.
Don’t answer questions all the time
Human brains can be rather pathetic moment to moment even if they do all right in the end. A service mentality leads (or in some cases requires) us to answer questions as fast as we can, but it’s better to give the correct answer or the well-considered answer a little later than answer something in haste and get the answer wrong or say something in a poor manner. If you are trying to figure out things as you go along, there’s no reason for you to know anything off the top of your head. If you get a question in a meeting and need to double check, no one would be surprised. If you get an email at 5:13 PM after a long day and need to postpone even thinking about the answer until the following day, that is the best thing for your sanity and for the success of the project both.
Keep the end goal in mind, and know when to abandon pieces
This is an obvious insight, but crucial to feeling like you’ve got some control of the process. We tend to think of way more than we can possibly accomplish in a timeframe, and continual re-prioritization is essential. Some features you were sold on in the sales demo end up being lackluster, and other features you didn’t know existed will end up thrilling you. Competing opportunities and priorities will always exist. Good project management can account for those variables and still keep the core goals central and happening on time. But that said…
Project management is not a panacea
The whole past nine months I’ve had a vision that with perfect project management everything could go perfectly. This has crept into all areas of my life and made me imagine that I could project manage my way to perfection in my life with a toddler (way too many variables) or my house (110 year old houses are nearly as tricky as toddlers). We had excellent project management support from the vendor as well as internally, but I kept seeing room for improvement in everything. “If only we had foreseen that, we could have avoided this.” “If only I had communicated the action items more clearly after that meeting, we wouldn’t be so behind.” We actually learned very late in our project that other libraries undertaking similar projects hired a consultant to do nothing but project management on the library side which seemed like a very good idea–though we managed all right without one. In any event, a project manager wouldn’t have changed some of the most challenging issues, which didn’t have anything to do with timelines or resources but with differences in approach and values between departments and libraries. Everyone wants the “best” for the users, but the “best” for one person doesn’t work at all for another. Coming to a compromise is the right way to handle this, but there’s no way to avoid conflict and the resulting change in the plan.
Hopefully we all get to experience projects in our careers of this magnitude, whether technical or not. Anything that shifts an institution to something new that touches everyone is something to take very seriously. It’s time-consuming and stressful because it should be! Nevertheless, managing time and stress is key to ensure that you view the work as thrilling rather than diminishing.
A decade ago, Stephen Colbert introduced the concept of “truthiness”, or a fact that was so because it felt right “from the gut.” When we search for information online, we are always up against the risk that the creator of a page is someone who, like Stephen Colbert’s character, doesn’t trust books, because “they’re all fact, no heart.” 1 Since sites with questionable or outright false facts that “feel right” often end up at the top of Google search results, librarians teach students how to evaluate online sources for accuracy, relevancy, and so on rather than just trusting the top result. But what if there were a way to ensure that truthiness was removed, and only sites with true information appeared at the top of the results?
This idea is what underlies a new Google algorithm called Knowledge-Based Trust (KBT) 2. Google’s original founding principles and the PageRank algorithm were based on academic citation practices–loosely summarized, pages linked to by a number of other pages are more likely to be useful than those with fewer links. The content of the page, while it needs to match the search query, is less crucial to its ranking than outside factors, which is otherwise known as an exogenous model. The KBT, by contrast, is an endogenous model relying on the actual content of the page. Ranking is based on the probability that the page is accurate, and therefore more trustworthy. This is designed to address the problem of sites with high PageRank scores that aren’t accurate, either because their truthiness quotient is high, or because they have gamed the system by scraping content and applying misleading SEO. On the other side, pages with great information that aren’t very popular may be buried.
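To make the “links as citations” idea concrete, here is a minimal sketch of the textbook version of PageRank, computed by repeated redistribution of scores over a tiny made-up link graph. The page names, the damping value of 0.85, and the number of iterations are all assumptions for illustration; what Google actually runs in production is far more elaborate and is not described in this post.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Every linked page must also appear as a key.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score everywhere.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "c", and "c" links to "a".
toy_web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

scores = pagerank(toy_web)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Notice that nothing in this calculation looks at what the pages say. A page rises purely because other pages point to it, which is exactly the exogenous quality described above.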
“Wait a second,” you are now asking yourself, “Google now determines what is true?” The answer is: sort of, but of course it’s not as simple as that. Let’s look at the paper in detail, and then come back to the philosophical questions.
Digging Into the KBT
First, this paper is technical, but the basic information is fairly straightforward. This model is based on extracting facts from a web source, evaluating whether those facts are true or not, and then whether a source is accurate or not. This leads to a determination that the facts are correct in an iterative process. Of course, verifying that determination is essential to ensuring that all the algorithms are working correctly, and this paper describes ways of checking the extracted facts for accuracy.
The extractors are described more fully in an earlier version of this work, Knowledge Vault (KV), which was designed to fill in large-scale knowledge bases such as Freebase by extracting facts from a web source using techniques like Natural Language Processing of text files followed by machine learning, HTML DOM trees, HTML tables, and human processed pages with schema.org metadata. The extractors themselves can perform poorly in creating these triples, however, and this is more common than the facts being wrong, and so sites may be unfairly flagged as inaccurate. The KBT project aims to introduce an algorithm for determining what type of error is present, as well as how to judge sites with many or few facts accurately, and lastly to test their assumptions using real world data against known facts.
The specific example given in the paper is the birthplace of President Barack Obama. The extractor would determine a (subject, predicate, object) triple from a web source and match these strings to Freebase (for example). This can lead to a number of errors–there is a huge problem in computationally determining the truth even when the semantics are straightforward (which we all know they rarely are). For this example, it’s possible to check data from the web against the known value in Freebase, and to record a 1 if the extractor worked and a 0 if it did not. These values can then be charted in a two-dimensional or three-dimensional matrix that helps show the probability of a given extractor working, as well as whether the value pulled by the extractor was true or not.
They go on to examine two models for computing the data, single-layer and multi-layer. The single-layer model, which looks at each web source and its facts separately, is easier to work with using standard techniques but is limited because it doesn’t take into account extraction errors. The multi-layer model is more complex to analyze, but takes the extraction errors into account along with the truth errors. I am not qualified to comment on the algorithm math in detail, but essentially it computes probability of accuracy for each variable in turn, ultimately arriving at an equation that estimates how accurate a source is, weighted by the likelihood that source contains those facts. There are additional considerations for precision and recall, as well as confidence levels returned by extractors.
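The back-and-forth between “which facts are true” and “which sources are accurate” can be illustrated with a much simpler scheme than the one in the paper. The sketch below is a plain weighted-voting loop, not the paper’s probabilistic model: like the single-layer approach it ignores extraction errors entirely, and the source names, the claims, and the starting accuracy of 0.8 are all invented for the example.

```python
from collections import defaultdict


def estimate_trust(claims, rounds=10, initial_accuracy=0.8):
    """Alternate between scoring values and scoring sources.

    `claims` is a list of (source, item, value) tuples, where `item`
    identifies the fact being stated (for example a subject and predicate).
    """
    accuracy = {source: initial_accuracy for source, _, _ in claims}
    belief = {}

    for _ in range(rounds):
        # Step 1: each source votes for its value, weighted by its accuracy.
        votes = defaultdict(lambda: defaultdict(float))
        for source, item, value in claims:
            votes[item][value] += accuracy[source]

        # Normalize the votes for each item so they sum to 1.
        belief = {}
        for item, values in votes.items():
            total = sum(values.values())
            belief[item] = {value: weight / total for value, weight in values.items()}

        # Step 2: a source's accuracy is the average belief in what it claimed.
        per_source = defaultdict(list)
        for source, item, value in claims:
            per_source[source].append(belief[item][value])
        accuracy = {source: sum(p) / len(p) for source, p in per_source.items()}

    return accuracy, belief


# Hypothetical claims: two sites agree with each other, one disagrees.
claims = [
    ("site_a", ("France", "capital"), "Paris"),
    ("site_b", ("France", "capital"), "Paris"),
    ("site_c", ("France", "capital"), "Lyon"),
    ("site_a", ("Japan", "capital"), "Tokyo"),
    ("site_b", ("Japan", "capital"), "Tokyo"),
    ("site_c", ("Japan", "capital"), "Kyoto"),
]

accuracy, belief = estimate_trust(claims)
print(accuracy)
print(belief[("France", "capital")])
```

A scheme like this rewards agreement, so a popular mistake repeated across many sites would win the vote. Checking against a knowledge base and modeling extractor errors separately, as the multi-layer model does, is meant to reduce exactly that kind of problem.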
Lastly, they consider how to split up large sources to avoid computational bottlenecks, as well as to merge sources with few facts in order to not penalize them but not accidentally combine unrelated sources. Their experimental results determined that generally PageRank and KBT are orthogonal, but with a few outliers. In some cases, the site has a low PageRank but a high KBT. They manually verified the top three predicates with high extraction accuracy scores for web sources with a high KBT to check what was happening. 85% of these sources were trustworthy without extraction errors and with predicates related to the topic of the page, but only 23% of these sources had PageRank scores over 0.5. In other cases, sources had a low KBT but high PageRank, which included sites such as celebrity gossip sites and forums such as Yahoo Answers. Yes, indeed, Google computer scientists finally have definitive proof that Yahoo Answers tends to be inaccurate.
The conclusion of the article with future improvements reads like the learning outcomes for any basic information literacy workshop. First, the algorithm would need to be able to tell the main topic of the website and filter out unrelated facts, to understand which triples are trivial, to have better comprehension of what is a fact, and to correctly remove sites with data scraped from other sources. That said, for what it does, this is a much more sophisticated model than anything else out there, and at least proves that there is a possibility to computationally determine the accuracy of a web source.
What is Truth, Anyway?
Despite the promise of this model there are clearly many potential problems, of which I’ll mention just a few. The source for this exercise, Freebase, is currently in read-only mode as its data migrates to Wikidata. Google is apparently dropping Freebase to focus on their Open Knowledge Graph, which is partially Freebase/Wikidata content and partially schema.org data 3. One interesting wrinkle is that much of Freebase content cites Wikipedia as a source, which means there are currently recursive citations that must be properly cited before they will be accepted as facts. We already know that Wikipedia suffers from a lack of diversity in contributors and topic coverage, so a focus on content from Wikipedia has the danger of reducing the sources of information from which the KBT could check triples.
That said, most of human knowledge and understanding is difficult to fit into triples. While surely no one would search Google for “What is love?” or similar and expect to get a factual answer, there are plenty of less extreme examples that are unclear. For instance, how does this account for controversial topics? Consider “anthropogenic global warming is real” vs. “global warming is real, but it’s not anthropogenic.” 97% of scientists agree with the former, but what if you are looking for what the 3% are saying?
And we might question whether it’s a good idea to trust an algorithm’s definition of what is true. As Bess Sadler and Chris Bourg remind us, algorithms are not neutral, and may ignore large parts of human experience, particularly from groups underrepresented in computer science and technology. Librarians should have a role in reducing that ignorance by supporting “inclusion, plurality, participation and transparency.” 4 Given the limitations of what is available to the KBT it seems unlikely that this algorithm would markedly reduce this inequity, though I could see how it could be possible if Wikidata could be seeded with more information about diverse groups.
Librarians take note: this algorithm is still under development, and most likely won’t be appearing in our Google results any time in the near future. And even once it does, we need to ensure that we are still paying attention to nuance, edge cases, and our own sense of truthiness–and more importantly, truth–as we evaluate web sources.
- http://thecolbertreport.cc.com/videos/63ite2/the-word---truthiness. ↩
- Dong, X. et al. “Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources”. Proceedings of the VLDB Endowment, 2015. Retrieved from http://arxiv.org/abs/1502.03519 ↩
- https://www.wikidata.org/wiki/Help:FAQ/Freebase ↩
- Sadler, Bess and Chris Bourg, “Feminism and the Future of Library Discovery.” Code4Lib Journal 28, April 2015. ↩
The recent publication of Monica Berger and Jill Cirasella’s piece in College and Research Libraries News “Beyond Beall’s List: Better understanding predatory publishers” is a reminder that the issue of “predatory publishers” continues to require focus for those working in scholarly communication. Berger and Cirasella have done an exemplary job of laying out some of the issues with Beall’s list, and called on librarians to be able “to describe the beast, its implications, and its limitations—neither understating nor overstating its size and danger.”
At my institution academic deans have identified “predatory” journals as an area of concern, and I am sure similar conversations are happening at other institutions. Here’s how I’ve “described the beast” at my institution, and models for services we all can provide, whether subject librarian or scholarly communication librarian.
What is a Predatory Publisher? And Why Does the Dean Care?
The concept of predatory publishers became much more widely known in 2013 with the publication of an open access sting by John Bohannon in Science, which I covered in this post. As a recap, Bohannon created a fake but initially believable poor quality scientific article, and submitted it to open access journals. He found that the majority of journals accepted the poor quality paper, 45% of which were included in the Directory of Open Access Journals. At the time of publication in October 2013 the response to this article was explosive in the scholarly communications world. It seems that more than a year later the reaction continues to spread. Late in the fall semester of 2014, library administration asked me to prepare a guide about predatory publishers, due to concern among the deans that unscrupulous publishers might be taking advantage of faculty. This was a topic I’d been educating faculty about on an ad hoc basis for years, but I never realized we needed to address it more systematically. That has all changed, with senior library administration now doing regular presentations about predatory publishers to faculty.
If we are to be advocates of open access, we need to focus on the positive impact that open access has rather than dwell for too long on the bad sides of it. We also need faculty to be clear on their own goals for making their work open access so that they may make more informed choices. Librarians get only limited faculty attention on the topic, and so focusing on education about self-archiving articles (otherwise known as green open access) or choosing no-fee (also known as gold) open access journals is a better way to achieve advocacy goals than suggesting faculty choose only a certain set of gold open access journals. Unless we are offering money for paying article fees, we also don’t have much say about where faculty choose to publish. Education about how to choose a journal and a license responsibly is what we should focus on, even if it diverges from certain ideals (see Meredith Farkas on choosing Creative Commons licenses).
Understanding the Needs and Preparing the Material
As I mentioned, my library administration asked for a guide that they could use in presentations and share with faculty. In preparing this guide, I worked with our library’s Scholarly Communications committee (of which I am co-chair) to determine the format and content.
We decided that adding this material to our existing Open Access research guide would be the best move, since it was already up and we had already shared the URL widely. We have a robust series of Open Access Week events (which I wrote about last fall) and this seemed the ideal place to continue engaging people. That said, we determined that the guide needed an overhaul to make it clearer that open access was an on-going area of concern, not a once a year event. Since faculty are not always immediately thinking of making work open access but of the mechanics of publishing, I preferred to start with the title “Publishing Your Own Work”.
To describe its features a bit more, I wanted to start from the mindset of self-archiving work to make it open access with a description of our repository and Peter Suber’s useful guide to making one’s own work open access. I then continued with an explanation of article publication fees, since I often get questions along those lines. They are not unique to open access journals, and don’t imply that a fee buys acceptance for publication, which was a fear that I heard more than once during Open Access Week last year. Only then did I discuss the concept of predatory journals, with the hope that a basic understanding of the process would allay fears. I then presented a list of steps to research a journal. I thought these steps were more common sense than anything, but after conversations with faculty and administration, I realized that what type of journal I am dealing with seems obvious to me only because I have daily practice and experience. For people new to the topic I tried to break down research into easy steps that help them to figure out where a journal is on the continuum from outright scams to legitimate but new or unusual journals. It was also important to me to emphasize self-archiving as a strategy no matter the journal publication model.
Lastly, while most academic libraries have a model of liaison librarians engaging in scholarly communications activities, the person who spends every day working on these issues is likely to be more versed in emerging trends. So it is important to work with liaisons to help them research journals and to identify quality open access journals in their disciplines. We plan to add this information to the guide in a future version.
Taking it on the Road
We felt that in-person instruction on these matters with faculty was a crucial next step, particularly for people who publish in traditional journals but want to make their work available. Traditional journals’ copyright transfer agreements can be predatory, even if we don’t think about it in those terms. Taking inspiration from the ACRL Scholarly Communications Roadshow I attended a few years ago, I decided to take the curriculum from that program and offer it to faculty and graduate students. We read through three publication agreements as a group, and then discussed how open the publishers were to reuse of material, or whether they mentioned it at all. We then included a section on addenda to contracts for negotiation about additional rights.
The first workshop received modest attendance, but included some thoughtful conversations, and we have promised to run it again. Some people may never have read their agreements closely, and never realized they were doing something illegal or not specifically allowed by, for instance, sharing an article they wrote with their students. That concrete realization is more likely to spur action than more abstract arguments about the benefits of open access.
Escaping the Predator Metaphor
If I could go back, I would get rid of the concept of “predator” attached to open access journals. Let’s call it instead unscrupulous entrants into an emerging business model. That’s not as catchy, but it explains why this has happened. I would argue, personally, that the hybrid gold journals by large publishers are just as predatory, as they capitalize on funding requirements to make articles open access with high fees. They too are trying new business models, and those may not be tenable either. As I said above, choosing a journal with eyes wide open and understanding all the ramifications of different publication models is the only way forward. To suggest that faculty are innocently waiting to be pounced on by predators is to deny their agency and their ability to make choices about their own work. There may be days where that metaphor seems apt, but I think overall this is a damaging mentality to librarians interested in promoting new models of scholarly communication. I hope we can provide better resources and programming to escape this, as well as to help administration to understand how to choose to fund open access initiatives.
In the comments I’d like to hear more suggestions about how to escape the “predator” metaphor, as well as your own techniques for educating faculty on your campus.
Imagine this scenario: you don’t normally have a whole lot to do at your job. It’s a complex job, sure, but day-to-day you’re spending most of your time monitoring a computer and typing in data. But one day, something goes wrong. The computer fails. You are suddenly asked to perform basic job functions that the computer normally takes care of for you, and you don’t really remember well how to do them. In the meantime, the computer is screaming at you about an error, and asking for additional inputs. How well do you function?
The Glass Cage
In Nicholas Carr’s new book The Glass Cage, this scenario is the frightening result of malfunctions with airplanes, which in the cases he describes resulted in crashes and massive loss of life. As librarians, we are thankfully not responsible on a daily basis for the lives of hundreds of people, but like pilots, we too have automated much of our work and depend on systems that we often have no control over. What happens when a database we rely on goes down–say, all OCLC services go down for a few hours in December when many students are trying to get a few last sources for their papers? Are we able to take over seamlessly from the machines in guiding students?
Carr is not against automation, nor indeed against technology in general, though this is a criticism frequently leveled at him. But he is against the uncritical abnegation of our faculties to technology companies. In his 2011 book The Shallows, he argues that offloading memory to the internet and apps makes us more shallow, distractible thinkers. While I didn’t buy all his arguments (after all, Socrates didn’t approve of off-loading memory to writing since it would make us all shallow, distractible thinkers), it was thought-provoking. In The Glass Cage, he focuses on automation specifically, using autopilot technologies as the focal point–“the glass cage” is the name pilots use for cockpits since they are surrounded by screens. Besides the danger of not knowing what to do when the automated systems fail, we create potentially more dangerous situations by not paying attention to what choices automated systems make. As Carr writes, “If we don’t understand the commercial, political, intellectual, and ethical motivations of the people writing our software, or the limitations inherent in automated data processing, we open ourselves to manipulation.” 1
We have automated many mundane functions of library operation that have no real effect, or a positive effect. For instance, no longer do students sign out books by writing their names on paper cards which are filed away in drawers. While some mourn for the lost history of who had out the book–or even the romance novel scenario of meeting the other person who checks out the same books–by tracking checkouts in a secure computerized system we can keep better track of where books are, as well as maintain privacy by not showing who has checked out each book. And when the checkout system goes down, it is easy to figure out how to keep things going in the interim. We can understand on an instinctual level how such a system works and what it does. Like a traditional computerized library catalog, we know more or less how data gets in the system, and how data gets out. We have more access points to the data, but it still follows its paper counterpart in creation and structure.
Over the past decade, however, we have moved away more and more from those traditional systems. We want to provide students with systems that align with their (and our) experience outside libraries. Discovery layers take traditional library data and transform it with indexes and algorithms to create a new, easier way to find research material. If traditional automated systems, like autopilot systems, removed the physical effort of moving between card catalogs, print indexes, and microfilm machines, these new systems remove much of the mental effort of determining where to search for that type of information and the particular skills needed to search the relevant database. That is surely a useful and good development. When one is immersed in a research question, the system shouldn’t get in the way.
That said, the nearly wholesale adoption of discovery systems provided by vendors leaves academic librarians in an awkward position. We can find a parallel in medicine. Carr relates the rush into electronic medical records (EMR) starting in 2004 with the Health Information Technology Adoption Initiative. This meant huge amounts of money available for digitizing records, as well as a huge windfall for health information companies. An early study by the RAND corporation (funded in part by those health information companies) indicated enormous promise from electronic medical records to save money and improve care. 2 But in actual fact, these systems did not do everything they were supposed to do. All the data that was supposed to be easy to share between providers was locked up in proprietary systems. 3 In addition, other studies showed that these systems did not merely substitute automated record-keeping for manual, they changed the way medicine was practiced. 4 EMR systems provide additional functions beyond note-taking, such as checklists and prompts with suggestions for questions and tests, which in turn create additional and more costly bills, test requests, and prescriptions. 5 The EMR systems change the dynamic between doctor and patient as well. The systems encourage the use of boilerplate text that lacks the personalized story of an individual patient, and the inability to flip through pages tended to diminish the long view of a patient’s entire medical history. 6 The presence of the computer in the room and the constant multitasking of typing notes into a computer means that doctors cannot be fully present with the patient. 7 With the constant presence of the EMR and its checklists, warnings, and prompts, doctors lose the ability to gain intuition and new understandings that the EMR could never provide. 8
The reference librarian has an interaction with patrons that is not all that different from doctors with patients (though as with pilots, the stakes are usually quite different). We work one on one with people on problems that are often undefined or misunderstood at the beginning of the interaction, and work towards a solution through conversation and cursory examinations of resources. We either provide the resource that solves the problem (e.g. the prescription), or make sure the patron has the tools available to solve the problem over time (e.g. diet and exercise recommendations). We need to use subtle cues of body language and tone of voice to see how things are going, and use instinctive knowledge to understand if there is a deeper but unexpressed problem. We need our tools at hand to work with patrons, but we need to be present and use our own experience and judgment in knowing the appropriate tool to use. That means that we have to understand how the tool we have works, and ideally have some way of controlling it. Unfortunately that has not always been the case with vendor discovery systems. We are at the mercy of the system, and reactions to this vary. Some people avoid using it at all costs and won’t teach using the discovery system, which means that students are even less likely to use it, preferring the easier to get to even if less robust Google search. Or, if students do use it, they may still be missing out on the benefits of having academic librarians available–people who have spent years developing domain knowledge and learning the best resources available at the library, knowledge that can’t be replaced by an algorithm. Furthermore, the vendor platforms and content only interoperate to the extent the vendors are willing to work together, and many of them have a disincentive to do so since they want their own index to come out on top.
Enter the ODI
Just as doctors may have given up some of their professional ability and autonomy to proprietary databases of patient information, academic librarians seem to have done something similar with discovery systems. But the NISO Open Discovery Initiative (ODI) has potential to make the black box more transparent. This group has been working for two years to develop a set of practices that aim to make some aspects of discovery consistent across providers, and so give customers and users more control in understanding what they are seeing and ensure that indexes are complete. The Recommended Practice addresses some (but not all) major concerns in discovery service platforms. Essentially it covers requirements for metadata that content providers must provide to discovery service providers and to libraries, as well as best practices for content providers and discovery service providers. The required core metadata is followed by the “enriched” content which is optional–keywords, abstract, and full text. (Though the ODI makes it clear that including these is important–one might argue that the abstract is essential). 9 Discovery service providers are in turn strongly encouraged to make clear to their customers what content their repositories hold, and to supply the metadata required to do so. Discovery service providers should follow suggested practices to ensure “fair linking”, specifically to not use business relationships as a ranking or ordering consideration, and allow libraries to set their own preferences about choice of providers and wording for links. ODI suggests a fairly simple set of usage statistics that should be provided and exactly what they should measure. 10
While this all sets a good baseline, what is out of scope for ODI is equally important. It “does not address issues related to performance or features of the discovery services, as these are inherently business and design decisions guided by competitive market forces.” 11 Performance and features include the user interface and experience, the relevancy ranking algorithms, APIs, specific mechanisms for fair linking, and data exchange (which is covered by other protocols). The last section of the Recommended Practice covers some of those in “Recommended Next Steps”. One of those that jumps out is the “on-demand lookup by discovery service users” 12, which suggests that users should be able to query the discovery service to determine “…whether or not a particular collection, journal, or book is included in the indexed content” 13–seemingly the very goal of discovery in the first place.
“Automation of Intellect”
We know that many users only look at the first page of results for the resource they want. If we don’t know what results should be there, or how they get there, we are leaving users at the mercy of the tool. Disclosure of relevancy rankings is a major piece of transparency that ODI leaves out, and without understanding or controlling that piece of discovery, I think academic librarians are still caught in the trap of the glass cage–or become the chauffeur in the age of the self-driving car. This has been happening in all professional fields as machine learning algorithms and processing power to crunch big data sets improve. Medicine, finance, law, business, and information technology itself have been increasingly automated as software can run algorithms to analyze scenarios that in the past would require a senior practitioner. 14 So what’s the problem with this? If humans are fallible (and research shows that experts are equally if not more fallible), why let them touch anything? Carr argues that “what makes us smart is not our ability to pull facts from documents.…It’s our ability to make sense of things…” 15 We can grow to trust the automated system’s algorithms beyond our own experience and judgment, and lose the possibility of novel insights. 16
This is not to say that discovery systems do not solve major problems or that libraries should not use them. They do, and as much as practical libraries should make discovery as easy as possible. But as this ODI Recommended Practice makes clear, much remains a secret business decision for discovery service vendors, and thus something over which academic librarians can exercise control only through their dollars in choosing a platform and their advocacy in working with vendors to ensure they understand the system and it does what they need.
- Nicholas Carr, The Glass Cage: Automation and Us (New York: Norton, 2014), 208. ↩
- Carr, 93. ↩
- Carr, 95. ↩
- Carr, 97. ↩
- Carr, 98. ↩
- Carr, 101-102. ↩
- Carr, 103. ↩
- Carr, 105-106. ↩
- National Information Standards Organization (NISO) Open Discovery Initiative (ODI) Working Group, Open Discovery Initiative: Promoting Transparency in Discovery (Baltimore: NISO, 2014): 25-26. ↩
- NISO ODI, 25-27. ↩
- NISO ODI, 3. ↩
- NISO ODI, 32. ↩
- NISO ODI, 32. ↩
- Carr, 115-117. ↩
- Carr, 121. ↩
- Carr, 124. ↩