It’s Open Access Week, which for scholarly communications librarians and institutional repository managers is one of the big events of the year, a time to reflect on our work and educate others. Over the years, it has become less necessary to explain what open access is. Rather, everyone seems to have a perception of open access and an opinion about it. But those perceptions and opinions may not be based on the original tenets of the open access movement. The commercialization of open access means that it may now seem too expensive for individuals to publish open access, and too complicated for institutions to attempt without buying a product.
In some ways, the open access movement is analogous to punk music–a movement that grew out of protest and DIY sensibilities, but was quickly coopted as soon as it became financially successful. While it was never free or easy to make work open access, recent changes in the market may make it feel even more expensive and complicated. Those who want to continue to build open access repositories and promote open access need to understand where their institution fits in the larger picture and what motivates researchers and administrators, and they must be able to put the right solutions together to avoid serious missteps that would set back the open access movement.
Like many exciting new ideas, open access is partially a victim of its own success. For the past ten years, Heather Morrison has kept a tally of the dramatic growth of open access in an ongoing series; her post for this year’s Open Access Week is the source for the statistics in this paragraph. Open access content now makes up a sizeable portion of online content, and is therefore more of a market force. BASE now includes 100 million articles. The Directory of Open Access Journals, even after its stricter inclusion process, has seen 11% growth in article-level search, with around 500,000 items. There are well over a billion items with a Creative Commons license. These numbers are staggering, and they give a picture of how overwhelming the total amount of online content is, never mind the open access portion. They also mean that almost everyone doing academic research will have benefited from open access content. Not everyone who has used open access (or Creative Commons licensed) content will know what it is, but as it permeates more of the web it becomes more and more relevant. It also becomes much harder to manage, and dealing with that complexity requires new solutions–which may bring new areas of profit.
An example of this new type of service is 1Science, which launched a year ago. This service helps libraries manage their open access collections, both by identifying what is available in their subscribed collections and by tracking their faculty output. 1Science grew out of longer-term research projects around emerging bibliometrics started by Eric Archambault; according to its About Us page, it began as a way to improve the findability of open access content and grew into a suite of tools that analyzes collections for open access availability. The market is now there for libraries to purchase such a service. Similar moves have happened with alternative metrics in the last few years as well (for instance, Plum Analytics).
But the big story for commercial open access in 2016 was Elsevier. Elsevier already had a large stable of open access author-pays journals, with fees of up to $5,000–the traditional way that large commercial publishers have participated in open access. But Elsevier made several moves in 2016 that are changing the face of open access. It acquired SSRN in May, which built on its acquisition of Mendeley in 2013 and hints at a longer-term strategy of combining a content platform with a social citation network–something that could potentially form a new type of open access product marketed to libraries. Elsevier’s move into different business models for open access is also illustrated by its controversial partnership with the University of Florida. This partnership uses an API to harvest content published by UF researchers from ScienceDirect, but it does not provide access to those without subscriptions except for certain accepted manuscripts. It grew out of a recognition that UF researchers publish heavily in Elsevier journals and that working directly with Elsevier would make it much easier to assemble a large dataset of their researchers’ content and funder compliance status.[1] There is a lot to unpack in this partnership, but the fact that it can even take place shows that open access–particularly funder compliance for open access versions–is something that university administrators outside the library, such as the Office of Research Services, are taking note of. Such a partnership serves certain institutional needs, but it does not create an open access repository, and in most ways it serves the needs of the publisher by driving content to their platform (though UF did get interlibrary loan worked into the process rather than just a paywall). It also removes incentives for UF faculty to publish in non-Elsevier journals, since their content in Elsevier journals will become even easier to find, and there will be no need to look elsewhere for open access grant compliance. Either way, this type of move takes the control of open access out of the hands of libraries, just as so many previous deals with commercial enterprises have done.
As I said at the beginning of this piece, more and more people already know about and benefit from open access, but those people have different motivations. I break these into three categories, noting for each the administrative unit I think is most likely to care about that aspect of open access:
- Open access is about the justice of wider access to academic content or getting back at the big publishers for exploitative practices. These people aren’t going to be that interested in a commercial open access solution, except inasmuch as it allows more access for a lower cost–for instance, a hosted institutional repository that doesn’t require institutional investment in developers. This group may include librarians and individual researchers.
- Open access is about following the rules for a grant-funded project, since so many funders require open access versions of articles. Such requirements lead to an increase in author-pays open access, since publishers can command a higher fee that can be part of the grant award or subsidized by an institution. Repositories to serve these requirements exist or are in progress, but the processes remain murky to many. This group may include the Office of Research Services or the Office of Institutional Research.
- “Open access” is synonymous with putting articles or article citations online to create a portfolio for reputation-building purposes. This group is going to find something like the UF/Elsevier partnership a great solution, since they may not be aware of how many people cannot actually read the published articles. This group may include administrators concerned with building the institution’s reputation.
For librarians who fall into the first category but are sensitive to the needs of everyone in each category, it’s important to select the right balance of solutions to meet everyone’s needs while still maintaining the integrity of the open access repository. That is not easy. Meeting this variety of needs is exactly why some of these new products are entering the market, and it may seem easier to go with one of them even if it’s not exactly the right long-term solution. I see this as an important continuing challenge for librarians who believe in open access, and one that must underpin future repository and outreach strategies.
1. Russell, Judith C., Alicia Wise, Chelsea S. Dinsmore, Laura I. Spears, Robert V. Phillips, and Laurie Taylor. “Academic Library and Publisher Collaboration: Utilizing an Institutional Repository to Maximize the Visibility and Impact of Articles by University Authors.” Collaborative Librarianship 8, no. 2 (2016): Article 4.
Carousels are a popular website feature because they allow one to fit extra information within the same footprint and provide visual interest on a page. But as you most likely know, there is wide disagreement about whether they should ever be used. Reasons include: they can be annoying, no one spends long enough on a page to ever see beyond the first item, people rarely click on them (even if they read the information) and they add bloat to pages (Michael Schofield has a very compelling set of slides on this topic). But by far the most compelling argument against them is that they are difficult if not impossible to make accessible, and accessibility issues exist for all types of users.
In reality, however, it’s not always possible to avoid carousels or other features that may be less than ideal. We all work within frameworks, both technical and political, and we need to figure out how to create the best-case scenario within those frameworks. If you work in a university or college library, you may be constrained by a particular CMS, a particular set of brand requirements, and historical design choices that may be slower to go away in academia than elsewhere. This post describes how I made some small improvements to my library website’s carousel to increase accessibility, but I hope it can also serve as a starting point for a larger discussion of how we can always make small improvements within whatever frameworks we work in.
What Makes an Accessible Carousel?
We’ve covered accessibility extensively on ACRL TechConnect before. Cynthia Ng wrote a three-part series in 2013 on making your website accessible, and Lauren Magnuson wrote about accessibility testing LibGuides in 2015. I am not an expert by any means on web accessibility, and I encourage you to do additional research about the basics of accessibility. For this specific project I needed to understand what it is about carousels that makes them particularly inaccessible, and how to ameliorate that; the resources I found most helpful in my research inform the summary below.
The basic issues with carousels are that they move at their own pace, in a way that may be difficult to predict, and that they are an inherently visual medium. For people with visual impairments, the slideshow images are useless unless their content is conveyed in some other way, and their presence on the page causes difficulty for screen reading software. For people with motor or cognitive impairments (which covers nearly everyone at some point in their lives), a constantly shifting image may be distracting, and even if the content is interesting it may not be possible to click on the image at the rate it is set to move.
You can increase the accessibility of carousels by making it obvious and easy for users to stop the slideshow and view images at their own pace, making the role of the slideshow and its controls obvious to screen reading software, making it possible to control the slideshow without a mouse, and making it still work when stylesheets are unavailable. Alternative methods of accessing the content have to be available and useful.
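To make the first three points concrete, here is a minimal sketch of a pause control with keyboard-friendly behavior for a generic carousel. This is not the markup from our site–the IDs, classes, and labels are invented for illustration–and the cycle('pause')/cycle('resume') calls follow the jQuery Cycle plugin’s documented commands.

```html
<!-- A minimal sketch, not production markup; IDs and labels are invented. -->
<div id="slideshow" role="region" aria-label="Featured library news">
  <!-- slides go here; each image needs meaningful alt text -->
</div>
<button id="slideshow-toggle" aria-pressed="false">Pause slideshow</button>

<script>
  var paused = false;
  var $slideshow = jQuery('#slideshow');
  var $toggle = jQuery('#slideshow-toggle');

  // One obvious control that both stops and restarts the rotation.
  $toggle.on('click', function () {
    paused = !paused;
    $slideshow.cycle(paused ? 'pause' : 'resume');
    $toggle.attr('aria-pressed', paused)
           .text(paused ? 'Play slideshow' : 'Pause slideshow');
  });

  // Pause on keyboard focus so keyboard-only users aren't racing the timer.
  $slideshow.add($toggle).on('focusin', function () {
    $slideshow.cycle('pause');
  });
</script>
```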
I chose to work on the slideshow as part of a retheming of the library website to bring it up to current university branding standards and to make it responsive. The existing slideshow lacked obvious controls or any instructions for screen readers, and it could not be controlled without a mouse. My general plan was to ensure that there were obvious controls for the slideshow (and that it would pause quickly without a lot of work), to add ARIA roles for screen readers, and to make it keyboard controllable. I had the additional constraints of making something that would work in Drupal, be responsive, and allow the marketing committee to post their own images without my intervention while still requiring alt text and other items crucial for accessibility.
Because the library’s website uses Drupal, it made sense to look for a solution designed to work with Drupal. Many options exist, and everyone has a favorite or a more appropriate choice for a particular situation, so if you are looking for a good Drupal solution you’ll want to do your own research. After looking at several options, I chose a Drupal module called Views Slideshow. It seemed customizable enough that I was fairly sure I could make it accessible, even though it lacked some of those features out of the box. The important thing to me was that it would make it possible to hand the keys of the slideshow to someone else. The way our slideshow traditionally worked required writing HTML into the middle of a hardcoded homepage and uploading the image to the server in a separate process. This meant that my department was a roadblock to updating the images, and it required careful coordination before vacations or other times away to ensure the images got changed. We all agreed that if the slideshow was going to stay, this process had to improve.
Why not just remove the slideshow entirely? That’s one option we definitely considered, but one important ground rule I set early in the redesign process was to leave the site content and features alone and just update the look and feel. Thus I wanted to keep every piece of information that was an important part of the homepage, though slightly reorganized. I also didn’t want to change the size of the homepage slideshow images, since the PR committee already had a large stock of images they were using and I didn’t want them to have to redesign everything. In general, we are moving to a much more flexible and iterative process for changing website features and content, so nothing is ruled out for the future.
I won’t go into a lot of detail about the technical fixes I made, since they won’t be widely applicable. Views Slideshow relies on a very standard Drupal module called Views to create a list of content. While Views is a very popular module, I found it challenging to install correctly without a lot of help (I mainly used this site), since the settings are hard to figure out. In setting up the module, you can control things like whether alt text is required–the most basic accessibility feature, which allows users who cannot see images to understand their content through screen readers or other assistive technologies. Beyond that, you can set some things up in the module’s templates. First I created a Drupal content type called Featured Slideshow, with fields for the title of the slide, the image, and the link it should go to. The image has alt and title fields, which can be set automatically using tokens (text templates) or manually by the person entering data. The module uses jQuery Cycle to control which image is displayed. I then customized the templates (several PHP files) to include ARIA roles and edited the controls to use plain English rather than icons (I can think of downsides to this approach for sure, but at least it makes their purpose clear for many people).
The slideshow region takes the “marquee” ARIA role, which denotes frequently updated but non-essential page content. Its default ARIA live state is “off”, meaning that unless the user is focused on it, changes in state won’t be announced. You can change this to “polite” as well, which means a change in state will be announced at the next convenient opportunity. You would never want to use “assertive”, since that would interrupt the user for no reason.
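As a generic illustration (this is a sketch, not our actual Drupal template output, and the class name and image are invented), the resulting markup looks something like this:

```html
<!-- Sketch of the slideshow wrapper; class name and image are invented. -->
<div class="featured-slideshow" role="marquee" aria-live="off">
  <img src="/sites/default/files/slide1.jpg"
       alt="Students using the new group study rooms" />
</div>
<!-- aria-live="polite" would announce slide changes at the next pause;
     "assertive" would interrupt the user and should be avoided here. -->
```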
Features I’m still working on are detailed in The Unbearable Inaccessibility of Slideshows, specifically keyboard focus order and improved performance when stylesheets are unavailable. Still, with a few small changes I’ve improved the accessibility of one feature on the site–and this technique can be applied to any feature on any site.
Making Small Improvements for Accessibility
While librarians who get the privilege of working on their own library’s website have the opportunity to guide design choices, we are not always able to create exactly the ideal situation. Whether you are dealing with a carousel or any other feature that requires some work to improve accessibility, I would suggest the following strategy:
- Review what the basic requirements are for making the feature work with your platform and situation. This means both technically and politically.
- Research the approaches others have taken. You probably won’t be able to use someone else’s technique unless they are in a very similar situation, but you can at least use lessons learned.
- Create a step by step plan to ensure you’re not missing anything, as well as a list of questions to answer as you are working through the development process.
- Test the feature. You can use achecker or WAVE, which has a browser plugin to help you test sites in a local development environment.
- Review errors and fix them. If you can’t fix everything, list the problems and plan for future development, or see if you can pick a new solution.
- Test with actual users.
This may seem overwhelming, but taking it slow and only working on one feature at a time can be a good way to manage the process. And even better, you’ll improve your practices so that the next time you start a project you can do it correctly from the beginning.
When it comes to digital preservation, everyone agrees that a little bit is better than nothing. Look no further than these two excellent presentations from Code4Lib 2016: “Can’t Wait for Perfect: Implementing ‘Good Enough’ Digital Preservation” by Shira Peltzman and Alice Sara Prael, and “Digital Preservation 101, or, How to Keep Bits for Centuries” by Julie Swierczek. If you are new to digital preservation, I highly suggest you check those out before reading more of this post, since they get into some technical details that I won’t.
The takeaway for me was twofold: first, digital preservation doesn’t have to be hard, but it does have to be intentional; second, it requires institutional commitment. If you’re new to the world of digital preservation, understanding all the basic issues and your options can be daunting. I’ve been fortunate enough to lead a group at my institution that has spent the last few years working through some of these issues, so in this post I want to give a brief overview of the work we’ve done, as well as the current landscape of digital preservation systems. This won’t be an in-depth exploration–more like a key to the map. Note that ACRL TechConnect has covered a variety of digital preservation issues before, including data management and preservation in “The Library as Research Partner” and using bash scripts to automate digital preservation workflow tasks in “Bash Scripting: automating repetitive command line tasks”.
The committee I chair started by examining born-digital materials but expanded its focus to all digital materials, since our digitized materials were an easier test case for many of our ideas. The committee spent a long time understanding the basic tenets of digital preservation–and in truth, we’re still working on this. We found working through the NDSA Levels of Digital Preservation an extremely helpful exercise–you can find a version helpfully annotated with tools by Shira Peltzman and Alice Sara Prael, as well as an additional explanation by Shira Peltzman. We also relied on the Library of Congress Signal blog and the work of Brad Houston, among other resources. A few of the tasks we accomplished: creating a rough inventory of digital materials, writing a workflow manual, and acquiring many terabytes (currently around 8) of secure networked storage space to replace the removable hard drives being used for backups. While backups aren’t exactly digital preservation, we wanted at the very least to secure the backups we did have. An inventory and a workflow manual may sound impressive, but I want to emphasize that these are living and somewhat messy documents. Their major advantage is not so much what we do have but how they identify gaps in our processes. Through this work we developed a lengthy (but prioritized) list of tasks to complete before we’ll be satisfied with our processes. For example, one major workflow gap we discovered is that we have many items on obsolete digital media, such as floppy disks, that need to be imaged before they can even be inventoried. We identified the tool we wanted to use for that, but time and staffing pressures have left the project in limbo. We’re now working on hiring a graduate student who can help with this and similar projects.
The other piece of our work has been trying to understand what systems are available for digital preservation. I’ll summarize my understanding below, with two major caveats. First, this world is currently undergoing a huge amount of change as many companies and people work on developing new systems or improving existing ones, so a lot will be missing from what I say. Second, none of these solutions is necessarily mutually exclusive: some require various pieces to be used together by design, some may not, and your circumstances may dictate a different combination. For instance, you may not like the access layer built into one system and so will choose something else. The dream that you can just throw money at the problem and it will go away is, at present, still just a dream–as with so many library technology problems.
The closest thing to such a dream is the end-to-end system. At one end you load in a file or set of files you want to preserve (for example, a large set of donated digital photographs in TIFF format), and at the other end you get a processed archival package (which might include the TIFF files, some metadata about the processing, and a way to check for bit rot in your files), as well as an access copy (for example, a smaller JPG appropriate for public display) if you so desire–not all digital files should be available to the public, but they still need to be preserved.
Examples of such systems include Preservica, ArchivesDirect, and Rosetta. All of these are hosted, vended products, but ArchivesDirect is based on open source Archivematica, so it is possible to get some idea of the experience of using it if you are able to install the tools on which it is based. The issues with end-to-end systems are similar to any other choice you make in library systems. First, they come at a high price–Preservica and ArchivesDirect are open about their pricing, and for a plan that will meet the needs of a medium-sized library you will be looking at an annual cost of $10,000-$14,000. You are also pretty much stuck with the options offered in the product, though you still have many decisions to make within that framework. Migrating from one system to another if you change your mind may involve some very difficult processes, so inertia dictates that you will be using the system for the long haul–and a short trial period or demo may not be enough to tell you whether that’s a good idea. On the other hand, you get more simplicity, and therefore a stronger likelihood that you will actually use the system, which makes these products much more manageable for smaller staffs that lack dedicated positions for digital preservation work–or even room in current positions for it. A hosted product is ideal if you don’t have the staff or servers to install anything yourself, and it helps you get your long-term archival files onto Amazon Glacier. Amazon Glacier is, by the way, where pretty much all the services we’re discussing store everything you submit for long-term storage. It’s dirt cheap to store on Amazon Glacier, and if you can restore slowly it’s not too expensive to restore–it is only expensive if you need to restore a lot quickly. But using it is somewhat technically challenging, since you only interact with it through APIs–there’s no way to log in and upload or download files as with a cloud storage service like Dropbox. For that reason, when you’re paying a service hundreds of dollars per terabyte that ultimately stores all your material on Amazon Glacier, which costs pennies per gigabyte, you’re paying for the technical infrastructure to get your material on and off of there as much as anything else. In another sense, you’re paying for an insurance policy for a catastrophic situation where you do need to recover all your files–theoretically, you don’t have to pay extra in such a situation.
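To give a flavor of what “interacting only through APIs” means, here is a minimal sketch of storing one archive in Glacier with the AWS SDK for JavaScript; the vault name, file, and description are invented for illustration.

```javascript
// Minimal sketch of a Glacier upload; vault name and file are invented.
var AWS = require('aws-sdk');
var fs = require('fs');

var glacier = new AWS.Glacier({ region: 'us-east-1' });

glacier.uploadArchive({
  accountId: '-', // '-' means the account making the request
  vaultName: 'library-preservation', // hypothetical vault
  archiveDescription: 'Master TIFFs, donor collection',
  body: fs.readFileSync('collection.tar') // the package to store
}, function (err, data) {
  if (err) return console.error('Upload failed:', err);
  // Glacier has no browsable file listing, so you must record this ID
  // yourself in order to ever retrieve the archive again.
  console.log('Stored archive with ID:', data.archiveId);
});
```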
A related option with some attractive features is joining a preservation network, such as the Digital Preservation Network (DPN) or APTrust. In this model, you pay an annual membership fee (right now $20,000, though this could change soon) to join the consortium. This gives you access to a network of preservation nodes (either Amazon Glacier or nodes at other institutions), access to tools, and a right (and requirement) to participate in the governance of the network. A larger preservation goal of such networks is to ensure long-term access to material even if the owning institution disappears. Of course, $20,000 plus travel to meetings and work time for governance may be out of reach for many, but both DPN and APTrust appear to be investigating new pricing models to meet the needs of smaller institutions that would like to participate but can’t contribute as much money or time. This is a world I would recommend watching closely.
Until recently, many institutions achieved digital preservation through some kind of repository they created themselves, whether with open source repository software such as Fedora Repository or DSpace or with some other type of DIY system. With open source Archivematica and a few other tools, you can build your own end-to-end system that processes files, stores the files and preservation metadata, and provides access as appropriate for the collection. This is theoretically a great plan: you make all the choices yourself about workflows, storage, and the access layer, and you do as much or as little as you need. But in practice, for most of us this just isn’t going to happen without a strong institutional commitment of staff and servers for long-term maintenance, possibly at a higher cost than any of the other solutions. That realization is one of the driving forces behind Hydra-in-a-Box, an exciting initiative currently in development. The idea is to let institutions of many different sizes take advantage of the robust preservation features of Fedora and the workflow management and access features of Hydra, without the overhead of installing and maintaining them. You can follow the project on Twitter and by joining the mailing list.
After going through all this, I am reminded of one of my favorite slides from Julie Swierczek’s Code4Lib presentation. She works through the Open Archival Information System (OAIS) reference model diagram to explain it in depth, comes to the point in the workflow that calls for “Sustainable Financing”, and zooms in on it. For many, this is the crux of the digital preservation problem. It’s possible to do a sort-of-OK job at digital preservation for nothing or very cheap, but ensuring long-term preservation requires institutional commitment for the long haul, just as any library collection does. Given how much attention digital preservation is starting to receive, we can hope that more libraries will see it as a priority and start to participate. That may lead to even more options, tools, and knowledge, but it will still require making preservation a priority and putting in the work.
After much hard work over years by the Drupal community, Drupal users rejoiced when Drupal 8 came out late last year. The system has been completely rewritten and does a lot of great stuff–but can it do what we need Drupal websites to do for libraries? The quick answer seems to be that it’s not quite ready, but depending on your needs it might be worth a look.
For those who aren’t familiar with Drupal, it’s a content management system designed to manage complex sites with multiple types of content, users, features, and appearances. Certain “core” features are available to everyone out of the box, but even more useful are the “modules”, which extend the features to do all kinds of things, from the mundane but essential backup of a site to a flashy carousel slider. The modules are created by individuals or companies and contributed back to the community, and when Drupal makes a major version change they need to be rewritten–quite drastically in the case of Drupal 8. That means we are now in a period where developers may or may not be redoing their modules, or may be rethinking how a certain task should be done in the future. Because most of these developers are volunteers, it’s not reasonable to expect that they will complete the work on your timeline. The expectation is that if a feature is really important to you, you’ll work on development to make it happen. That is, of course, easier said than done for people who barely have enough time for the basic web development asked of them, much less complex programming or learning a new system top to bottom, so most of us are stuck waiting or figuring out our own solutions.
Despite my knowledge of the reality of how Drupal works, I was very excited at the prospect of getting into Drupal 8 and learning all the new features. I installed it right away and started poking around, but realized pretty quickly I was going to have to do a complete evaluation for whether it was actually practical to use it for my library’s website. Our website has been on Drupal 7 since 2012, and works pretty well, though it does need a new theme to bring it into line with 2016 design and accessibility standards. Ideally, however, we could be doing even more with the site, such as providing better discovery for our digital special collections and making the site information more semantic web friendly. It was those latter, more advanced, feature desires that made me really wish to use Drupal 8, which includes semantic HTML5 integration and schema.org markup, as well as better integration with other tools and libraries. But the question remains–would it really be practical to work on migrating the site immediately, or would it make more sense to spend some development time on improving the Drupal 7 site to make it work for the next year or so while working on Drupal 8 development more slowly?
A bit of research online will tell you that there’s no right answer, but that the first step in an evaluation is to determine whether the modules on which your site depends are available for Drupal 8, and if not, whether there are good alternatives. I should add that while all the functions I mention can be done manually or through custom code, a lot of that work would take more time to write and maintain than I expect to have going forward. In fact, we’ve already been moving more of our customized code into modules, since that distributes some of the workload to people outside the very few at our library who write code or even know HTML well, not to mention taking advantage of the great expertise of the Drupal community.
I tried two different methods for the evaluation. First, I created a spreadsheet of all the modules we actually use in Drupal 7, their versions, and the current status of each module in Drupal 8 (or a reasonable substitute where I found one). Next, I tried a site that automates that process, d8upgrade.org. You fill in your website URL and email, and a day later you receive a very straightforward report: a list of the modules found on your site and, for each, whether there is a stable release, an alpha or beta release, or no Drupal 8 release yet. This is a useful timesaver, but it needs some manual work to complete and isn’t always completely up to date.
My manual analysis determined that there were 30 modules on which we depend to a greater or lesser extent. Of those, 10 either moved into Drupal core (so would automatically be included) or had the functions for which we used them moved into another piece of core. Five had versions available in Drupal 8, at varying levels of release (several only in alpha, so questionable for production sites but probably fine), and five were not migrated, but it was possible to identify substitute Drupal 8 modules. That’s pretty good–18 modules were covered in Drupal 8, and in several cases one module could do the job that two or more had done in Drupal 7. Of the additional 11 modules that weren’t migrated and didn’t have an easy substitution, three are critical to maintaining our current site workflows. I’ll talk about those in more detail below.
d8upgrade.org found 21 modules in use, though I didn’t include all of them on my own spreadsheet, since I don’t intend to keep using some of them. I’ve included a screenshot of the report, and there are a few things to note. The list doesn’t show everything on my spreadsheet, since some modules are used purely behind the scenes for administrative purposes and leave no visible trace without administrative access. The very last item on the list is Core, which of course isn’t going to be upgraded to Drupal 8–it is Drupal 8. I also found that the site isn’t completely up to date: my own analysis found a pre-release version of Workbench Moderation, but that information hadn’t made it to d8upgrade.org yet. A quick email to them fixed it almost immediately, however, so the screenshot is already out of date.
I decided that there were three dealbreaker modules for the upgrade, and I want to explain why we rely on them, since I think my reasoning will be applicable to many libraries with limited web development time. I will also give an honorable mention to a module that we don’t currently use, but that I know many libraries rely on and that I would potentially like to use in the future.
Webform is a module that provides a simple interface for creating webforms and doing all kinds of things with them beyond simply sending emails. We have many, many custom PHP/MySQL forms throughout our website and intranet, but only two people on the staff can edit them or download the submitted entries, and they occasionally have dreadful spam problems. We’ve been slowly migrating these custom forms to the Drupal Webform module, since that distributes the effort across the staff and provides easier ways to stop spam using, for instance, the Honeypot module or Mollom. (We’ve found that the Honeypot module stopped nearly all our spam, so we didn’t need to move to Mollom, since we don’t have user comments to moderate.) The thought of going back to coding all those webforms myself is not appealing, so for now I can’t move forward until I find a Drupal 8 solution.
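For anyone unfamiliar with the honeypot technique that module implements, here is a generic sketch (the field name and class are invented; the Drupal module generates its own equivalent): a field hidden from human visitors that spam bots tend to fill in, so any submission with a value there can be rejected.

```html
<!-- Generic honeypot sketch; field name and class are invented. -->
<form method="post" action="/contact">
  <label for="email">Email</label>
  <input type="email" id="email" name="email">

  <!-- Hidden from humans via CSS, but present in the markup for bots.
       Server-side, reject any submission where this field is non-empty. -->
  <input type="text" name="homepage" class="visually-hidden"
         tabindex="-1" autocomplete="off" aria-hidden="true">

  <button type="submit">Send</button>
</form>
```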
Redirect does a seemingly tiny job that’s extremely helpful: it lets you create redirects for URLs on your site. This is useful for all kinds of reasons–for instance, creating a library-branded link that forwards somewhere else, like a database vendor or another page on your university site, or changing a page URL while ensuring that people with bookmarks to the old page will still find it. You can, of course, do this on your web server, assuming you have access to it, but this module takes away a lot of the administrative overhead and helps keep things organized.
Backup and Migrate is my greatest helper in staying at least in the neighborhood of best practices for web development, when web development is only half my job–or some weeks more like a quarter of it. It makes quick work of keeping my development, staging, and production sites in sync, and since I created a workflow around this module I have been far more successful in keeping my development processes sane. It provides an interface for creating a backup of your site’s database, files directories, or both, which you can later use to completely restore a site. I use it at least every two weeks, or more often when working on a particular feature, to move the database between servers. (I don’t move the files with the module for this process, but that capability is useful for backups meant for emergency restoration of the site.) There are other ways to accomplish this work, but this particular workflow has been so helpful that I hate to dump a lot of time into redoing it just now.
One last honorable mention goes to Workbench, which we don’t use but many libraries do. It creates a much friendlier interface for content editors, so they don’t have to deal with the administrative backend of Drupal and can see just their own content. We do use Workbench Moderation, which does have a Drupal 8 release; it provides a moderation queue so that the six or so staff members who can create or edit content but don’t have administrative rights can have their work checked by an administrator. None of them particularly likes the standard Drupal content creation interface, and it’s not something we would ever ask the rest of the staff to use. We know from the lack of use of our intranet, which is also on Drupal, that no one particularly cares for editing content there. So if we wanted to expand access to website editing, which we’ve talked about a lot, Workbench would be a key module for us.
Given the current status of these modules, with rewrites in progress, it seems likely that by the end of the year it will be possible to migrate to Drupal 8 with our current setup–or that, in playing around with Drupal 8 on a development site, we will find a different way to approach these needs. If you have the interest and time, there are worse ways to pass it. If you are creating a completely new Drupal site and don’t have a time crunch, starting in Drupal 8 now is probably the way to go, since by the time the site is ready you may have additional modules available and will get to take advantage of all the new features. If this is something you’re trying to roll out by the end of the semester, maybe wait.
Have you considered upgrading your library’s site to Drupal 8? Have you been successful? Let us know in the comments.
Anyone who has worked on an institutional repository for even a short time knows that collecting faculty scholarship is not a straightforward process, no matter how nice your workflow looks on paper or how dedicated you are. Keeping expectations for the process realistic (not necessarily low, as in my clickbaity title), along with constant simplification and automation, can nevertheless make the process more manageable, and therefore work better. I’ve written before about some ways I’ve automated my process for faculty collection development, as well as how I’ve used lightweight project management tools to streamline processes. My newest technique for faculty scholarship collection development brings pieces of both together to greatly improve our productivity.
Allocating Your Human and Machine Resources
First, here is the personnel situation for the institutional repository I manage. Your own circumstances will certainly vary, but I think institutions of all sizes will have some version of this distribution. Managing the repository is approximately half my position, and I have one graduate student assistant who works about 10-15 hours a week. From week to week we average only about 30-40 hours total to devote to all aspects of the repository, of which faculty collection development is only a part. We have 12 librarians who are liaisons with departments; they do the majority of the outreach to faculty and promotion of the repository, but only limited collection development beyond specific parts of the process. While they are certainly welcome to do more, in reality they have so much else to do that it doesn’t make sense for them to spend their time on data entry unless they want to (and some of them do). The breakdown of work is roughly this: the liaisons promote the repository to the faculty and answer basic questions; I answer more complex questions, develop procedures, train staff, interpret publishing agreements, and verify metadata; and my GA does the simple research and data entry. From time to time we have additional graduate or undergraduate help in the form of faculty research assistants, and we have a group of students available for digitization if needed.
Those are our human resources. The tools we use for the day-to-day work include Digital Measures (our faculty activity system), Excel, OpenRefine, Box, and Asana; I’ll say a bit about what each of these is and how we use it below. By far the most important innovation in our faculty collection development workflow has been integration with the Faculty Activity System, which is how we refer to Digital Measures on our campus. Many colleges and universities have some type of faculty activity system or are in the process of implementing one. These are generally adopted for annual reports and retention, promotion, and tenure reviews. I have been at two different universities working on adopting such systems, and as you might imagine, it’s a slow process with varying levels of participation across departments. Faculty do not always like these systems, for a variety of reasons, and so they may hesitate to complete profiles even when required. Nevertheless, we felt in the library that this was a great source of faculty publication information we could use for collection development, both for the repository and for the collection in general.
We now have a required question about the repository on every item a faculty member enters in the Faculty Activity System: if they say they published an article, they also have to say whether it should be included in the repository. We started this in late 2014, and it revolutionized our ability to reach faculty and departments who had never participated in the repository before, as well as simplifying the lives of faculty who were eager participants but now only had to enter data in one place. Of course, there are still a number of people we are missing, but this is part of keeping your expectations low–if you can’t reach everyone, focus your efforts on the people you can. And anyway, we are now so swamped with submissions that we can’t keep up with them, which is a good if unusual problem to have in this realm. Note that the process I describe below is basically the same as when we analyze a faculty member’s CV (which I described in my OpenRefine post), but we spend relatively little time doing that these days, since it’s easier for most people to enter their material in Digital Measures and indicate that they want it in the repository.
The ease of integration between your own institution’s faculty activity system (assuming it exists) and your repository will certainly vary, but in most cases it should be possible for the library to get access to the data. For the Office of Institutional Research or whichever office administers the system, repository inclusion is a great selling point for faculty participation, since it gives faculty a reason to keep their profiles up to date even between review cycles. If your institution does not yet have such a system, you might still discuss a partnership with that office, since your repository may hold extremely useful information for them about research activity of which they are not aware.
We get reports from the Faculty Activity System roughly quarterly. Faculty data entry tends to bunch around certain dates, so we request the reports at the ends of semesters. The reports come by email as Excel files with information about each person, their department, contact information, and the like, as well as information about each publication. We do some initial processing in Excel to clean them up, remove duplicates from prior reports, and remove irrelevant information. It is amazing how many people see a field like “Journal Title” as a chance to ask a question rather than provide information. We focus our efforts on items that have actually been published, since the vast majority of people have no interest in posting pre-prints, and those who do prefer to post them in arXiv or a similar archive; the few who know about pre-prints and don’t have a subject archive generally submit their items directly. This is another way of lowering expectations about what the process can do. I’ve already described how I use OpenRefine to create reports from faculty CVs using the SHERPA/RoMEO API, and we follow a similar but much simplified process, since we already have the data in the correct columns. Of course, this process doesn’t tell us what we can do with every item. The journal title may be entered incorrectly, so the API call doesn’t pick it up, or the journal may not be in SHERPA/RoMEO at all. My graduate student assistant fills in what he can determine, and I work on the complex cases. As we do this, the Excel spreadsheet is saved in Box, so we have the change history tracked and can easily add collaborators.
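For anyone curious what that API lookup involves, here is a rough sketch of a single journal query against the SHERPA/RoMEO API as it existed at the time (the api29.php endpoint); in OpenRefine, the same URL pattern drives the “Add column by fetching URLs” step. The ISSN, and the assumption that the policy summary can be pulled from a romeocolour element, are for illustration only.

```javascript
// Rough sketch of one SHERPA/RoMEO lookup (the pre-2020 api29.php
// endpoint). The ISSN is an example; the response is XML describing
// the publisher's self-archiving policy.
var http = require('http');

var issn = '0028-0836'; // example ISSN
var url = 'http://www.sherpa.ac.uk/romeo/api29.php?issn=' + issn;

http.get(url, function (res) {
  var xml = '';
  res.on('data', function (chunk) { xml += chunk; });
  res.on('end', function () {
    // Crude parse: the romeocolour element summarized the policy
    // (green, blue, yellow, white) in the colour system of that era.
    var match = xml.match(/<romeocolour>(\w+)<\/romeocolour>/);
    console.log(match ? 'RoMEO colour: ' + match[1] : 'No policy found');
  });
});
```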
At this point, we are ready to move to Asana, a lightweight project management tool ideal for several people working on a group of related projects. Asana is far more fun and easy to work with than Excel spreadsheets, and it helps us work together to manage workload and see where we are with all our ongoing projects. For each report (or faculty member CV), we create a new project in Asana with several sections. While it doesn’t always happen in practice, in theory each citation is a task that moves between sections as it is worked on, and it is finally checked off when it is either posted or moved off to some other fate not as glamorous as being archived as open access full text. The sections generally cover posting the publisher’s PDF, contacting publishers, reminders for follow-up, posting authors’ manuscripts, and posting to SelectedWorks, our faculty profile service, which is related to our repository but mainly holds citations rather than full text. Again, as part of keeping expectations low, we focus on posting final PDFs of articles or book chapters. We add books to a faculty book list and don’t even attempt to include full text for them unless someone wants to make special arrangements with their publisher–this is rare, but again, the people who really care make it happen. If we already know that the author’s manuscript is permitted, we don’t add those items to Asana but keep them in the spreadsheet until we are ready for them.
We contact publishers in batches, grouping citations by journal and publisher so one letter can cover many articles or chapters. We set a reminder to follow up in one month, and again a month after that; usually the second notice is enough to catch the publisher’s attention. As they respond, we move each citation to the publisher’s PDF section or the author’s manuscript section, or, if posting isn’t permitted at all, to the SelectedWorks section. While we’ve tried several different procedures, I’ve determined it’s best for the liaison librarians to ask for authors’ accepted manuscripts only after we’ve verified that no other version may be posted. And if we never get them, we don’t worry about it too much.
I hope you’ve gotten some ideas from this post about your own procedures and new tools you might try. Even more, I hope you’ll think about which pieces of your procedures are really working for you, and discard those that aren’t working any more. Your own situation will dictate which those are, but let’s all stop beating ourselves up about not achieving perfection. Make sure to let your repository stakeholders know what works and what doesn’t, and if something that isn’t working is still important, work collaboratively to figure out a way around that obstacle. That type of collaboration is what led to our partnership with the Office of Institutional Research to use the Digital Measures platform for our collection development, and that in turn has led to other collaborative opportunities.
A few of us at Tech Connect participated in the #1Lib1Ref campaign that’s running from January 15th to the 23rd. What’s #1Lib1Ref? It’s a campaign to encourage librarians to get involved with improving Wikipedia, specifically by citation chasing (one of my favorite pastimes!). From the project’s description:
Imagine a World where Every Librarian Added One More Reference to Wikipedia.
Wikipedia is a first stop for researchers: let’s make it better! Your goal today is to add one reference to Wikipedia! Any citation to a reliable source is a benefit to Wikipedia readers worldwide. When you add the reference to the article, make sure to include the hashtag #1Lib1Ref in the edit summary so that we can track participation.
Below, we each describe our experiences editing Wikipedia. Did you participate in #1Lib1Ref, too? Let us know in the comments or join the conversation on Twitter!
I recorded a short screencast of me adding a citation to the Darbhanga article.
— Eric Phetteplace
I used the Citation Hunt tool to find an article that needed a citation. I selected the second one I found, which was about urinary tract infections in space missions. That is very much up my alley. I discovered after a quick Google search that the paragraph in question was plagiarized from a book on Google Books! After a hunt through the Wikipedia policy on quotations, I decided to rewrite the paragraph to paraphrase the quote, and then added my citation. As is usual with plagiarism, the flow was wrong, since there was a reference to a theme in the previous paragraph of the book that wasn’t present in the Wikipedia article, so I chose to remove that entirely. The Wikipedia Citation Tool for Google Books was very helpful in automatically generating an acceptable citation for the appropriate page. Here’s my shiny new paragraph, complete with citation: https://en.wikipedia.org/wiki/
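As a generic illustration of what the tool generates (not my actual citation; every parameter value below is invented), a Google Books reference ends up as wikitext using the cite book template inside a ref tag:

```wikitext
<ref>{{cite book |last=Author |first=A. |title=Example Title
 |publisher=Example Press |year=2010 |page=42
 |url=https://books.google.com/books?id=EXAMPLE}}</ref>
```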
— Margaret Heller
I edited the “Library Facilities” section of the “University of Maryland Baltimore” article in Wikipedia. There was an outdated link in the existing citation, and I also wanted to add two additional sentences and citations. You can see how I went about this in my screen recording below. I used the “edit source” option to get the source into a text editor first and made all the changes I wanted in advance. After that, I copied and pasted the changes from my text file into the Wikipedia page I was editing, then previewed and saved the page. You can see that I had a typo in my text and had to fix it to make the citation display correctly, so I had to edit the article more than once. After my recording, I noticed another typo, which I fixed using the “edit” option. The “edit” option is much easier to use than “edit source” for those who are not familiar with editing wiki pages; it offers a menu bar on top with several convenient options.
The recording of editing a Wikipedia article:
— Bohyun Kim
It has been so long since I’ve edited anything on Wikipedia that I had to make a new account and read the “how to add a reference” link–which is to say, if I could do it in 30 minutes while on vacation, anyone can. There is a WYSIWYG option for the editing interface, but I learned to do all this in plain text, and that is still the easiest way for me to edit. See the screenshot below for a view of the HTML editor.
I wondered which entry I would add a citation to… there have been so many I’d come across, but now I was drawing a total blank. Happily, the 1Lib1Ref campaign gave some suggestions, including “Provinces of Afghanistan.” Since this is my fatherland, I thought it would be a good service to dive into. Many of Afghanistan’s citations are hard to provide for a multitude of reasons. A lot of our history has been an oral tradition. Also, not insignificantly, Afghanistan has been in conflict for a very long time, with much of its history captured through the lens of Great Game participants like England or Russia. Primary sources from the 20th century are difficult to come by because of the state of war from 1979 onwards, and there are not many digitization efforts underway to capture what is available (shout out to NYU and the Afghanistan Digital Library project).
Once I found a source that I thought would be an appropriate reference for a statement on the topography of Uruzgan Province, I did need to edit the sentence to remove the numeric values that had been written there, since I could not find a source that quantified the area. It’s not a precise entry, to be honest, but it does provide a link to a good map, with further opportunities to find additional information related to Afghanistan’s agriculture. I also wanted to choose something relatively uncontroversial, like geographical features rather than history or people, for this particular campaign.
— Yasmeen Shorish
Keeping any large technical project user-centered is challenging at best, and adding something like an extremely tight timeline makes it too easy to dispense with user focus completely. Say, for instance, you have six months to migrate to a new integrated library system that combines your old ILS, your link resolver, and many other tools with a new discovery layer. I would argue, however, that it’s precisely on a tight timeline like that that a major focus on user experience research can become a key component of your success. I am referring here specifically to user experience on the web, though of course other aspects of user experience go into such a project. While none of my observations about usability testing and user experience are new, I have realized from talking to others that they need help advocating for the importance of user research. As we turn to our hopes and goals for 2016, let’s all resolve to figure out a way to make better user experience research happen, even if it seems impossible.
Selling the Need For User Testing
When I worked on implementing a discovery layer at my job earlier this year, I had a team of 18 people from three campuses with varying levels of interest and experience in user testing. It was really important to us that the end product work for everyone, whether novice or experienced researcher, as well as for the library staff who would need to use the system daily. With so many people and such a tight timeline, building user testing into the schedule from the start helped us frame our decisions as hypotheses to confirm or nullify in the next round of testing. We tried to involve as many people as possible in the testing, though a core group with experience running the tests administered them. Doing a test as early as possible is a good way to convince others of the need for testing: people who had never seen a usability test before found them convincing immediately and were much more on board for future tests.
Remembering Who Your Users Are
Reference and instruction librarians are users too. We sometimes get so focused on reminding librarians that they are not the users that we forget to make things work for them–and they do need to use the catalog too. Librarians who work with students in the classroom and in research consultations every day have a great deal of insight into seemingly minor issues that can lead to major frustrations. Here’s an example: the desktop view of our discovery layer search box was about 320 pixels wide, which works fine–if you are typing in just one word. Yet we were “selling” the discovery layer as something that handled known-item searching well, and much of a pasted-in citation wasn’t visible. The reference librarians who were doing this exact work knew it would be an issue. We expanded the search box so more words are visible and it works better for known-item searching.
The same goes for course reserves, interlibrary loan, and other staff who work with a discovery layer frequently, often under the added pressure of tight deadlines. If you can shave seconds off their routine tasks, that adds up to a huge amount of time over the course of a year, and will often solve issues for other users as well. One example: the print view of a book record had very small text, since the print stylesheet was set to print at 85% font size, which made it challenging to read. The reserves staff relied on this print view to complete their daily work with the student workers, and for one student the small print created an accessibility issue that led to inefficient manual workarounds. We increased the font size in the print stylesheet to greater than 100%, which made the printed page easily readable and fixed the accessibility issue for this specific use case. I suspect many other people benefit from this change as well.
Divide the Work
I firmly believe that everyone who is interested in user experience on the web should get some hands-on experience with it. That said, not everyone needs to do the hands-on work, and with a large project it is important that people focus on their core reason for being on the team. Dividing the group into overlapping teams working on data testing, interface testing, and user education and outreach helped us see the big picture without overwhelming everyone (a little overwhelm is going to happen no matter what). These groups worked separately much of the time for deep dives into specific issues, but kept each other informed across the board. For instance, the data group might identify a potential issue, for which the interface group would design a test scenario. If testing indicated a change, the user education group could plan for the implications for outreach.
A Quick Timeline is Your Friend
Getting a new tool out with only a few months of turnaround time is certainly challenging, but it forces you to forget about perfection and get features done. We got our hands on the discovery layer on a Friday and were running tests the following Tuesday, with additional tests scheduled for two weeks after that first look. This meant our first tests were on something very rough, but they gave us a big list of items to fix in the two weeks before the next test (or to put on hold if lower priority). We did take two months off from live usability testing in the middle of the process to focus on development and other types of testing (such as with trusted beta testers), but that early set of tests was crucial in setting the agenda and showing the importance of testing. We ultimately did five rounds of testing: four before the discovery layer went live, and one a few months after.
Think on the Long Scale
The vendor or the community of developers is presumably not going to stop working on the product, and neither should you. For this reason, it is helpful to make clear who is doing the ongoing work and to write it into committee charges, job descriptions, or other appropriate documentation. Maintain a list of long-term goals, and on those short timescales pick just one or two changes to make. The academic year affords many peaks and lulls, and the lulls can be great times for minor changes. Regular usability testing ensures that these changes are positive, and it uncovers new needs as tools and user expectations change.
Iteration is the way to keep that long timescale manageable. The work never really stops, but that’s ok. You need a job, right? Back to the idea of a short timeline: borrow from the Agile method and think in cycles of two weeks to one month. Keep the end goal in mind, but know that getting there will happen in tiny pieces. This does require some faith that all the crucial pieces will happen, but as long as someone is keeping an eye on those (in our case, the vendor helped a lot with this), the pressure to be “finished” is off. If a test shows that something important is broken, it can become high priority, and other desired features can move to a future cycle. Iteration helps you stay on track and get small pieces done regularly.
I hope I’ve made the case for why you need a user focus in any project, particularly a large and complex one. Whether you’re a reference librarian, project manager, web developer, or cataloger, you have a responsibility to ensure that the end result is usable, useful, and something people actually want to use. No matter how tight your timeline, keep the process user-centered, and you’ll be amazed at how many impossible things you accomplish.
The Directory of Open Access Journals (DOAJ) is an international directory of journals and index of articles that are available open access. Dating back to 2003, the DOAJ was at the center of the controversy surrounding the “sting” conducted by John Bohannon in Science, which I covered in 2013. Essentially, Bohannon used journals listed in DOAJ to find journals that would publish an article of poor quality as long as the authors paid a fee. At the time many suggested that a crowdsourced journal-reviewing platform might be the way to resolve the problem if DOAJ wasn’t a good source. While such a platform might still be a good idea, the simpler and more obvious solution is the one that seems to have happened: for DOAJ to be stricter with publishers about requirements for inclusion in the directory. 1
The process of cleaning up the DOAJ has been going on for some time and is getting close to an important milestone. All the 10,000+ journals listed in DOAJ were required to reapply for inclusion, and the deadline for that is December 30, 2015. After that time, any journals that haven’t reapplied will be removed from the DOAJ.
“Proactive Not Reactive”
Contrary to popular belief, the process started well before the Bohannon piece was published. 2 In December 2012 an organization called Infrastructure Services for Open Access (IS4OA), founded by Alma Swan and Caroline Sutton, took over DOAJ from Lund University and announced several initiatives, including a new platform, distributed editorial help, and improved criteria for inclusion. 3 Because DOAJ grew to be an important piece of the scholarly communications infrastructure, it was inevitable that it would have to take such a step sooner or later. With nearly 10,000 journals and only a small team of editors, the directory wouldn’t have been sustainable over time, and losing the DOAJ would have been a blow to the open access community.
One of the remarkable things about the revitalization of the DOAJ is the transparency of the process. The DOAJ News Service blog has been documenting the behind-the-scenes work since May 2014. Among the most useful resources is a list of journals that have claimed to be listed in DOAJ but are not; another important piece of information is the 2015-2016 development roadmap. There is a lot going on with the DOAJ update, however, so below I will pick out what I think is most important to know.
The New DOAJ
In March 2014, the DOAJ created a new application form with much higher standards for inclusion. Previously the form for inclusion was only 6 questions, but after working with the community they changed the application to require 58 questions. The requirements are detailed on a page for publishers, and the new application form is available as a spreadsheet.
While 58 questions seems like a lot, it is important to note that journals need not fulfill every single requirement beyond the basic requirements for inclusion. The idea is that journal publishers must be transparent about the structure and funding of the journal, and that journals explicitly labeled as open access meet some basic components of open access. For instance, one of the basic requirements is that “the full text of ALL content must be available for free and be Open Access without delay”. Certain other pieces are strong suggestions, but failing to meet them will not disqualify a journal. For instance, the DOAJ takes a strong stand against impact factors and suggests that they not be presented on journal websites at all. 4
To highlight journals with extremely high standards for “accessibility, openness, discoverability, reuse and author rights”, the DOAJ has developed a “Seal” that is awarded to journals that answer “yes” to the following questions (taken from the DOAJ application form):
have an archival arrangement in place with an external party (Question 25). ‘No policy in place’ does not qualify for the Seal.
provide permanent identifiers in the papers published (Question 28). ‘None’ does not qualify for the Seal.
provide article level metadata to DOAJ (Question 29). ‘No’ or failure to provide metadata within 3 months do not qualify for the Seal.
embed machine-readable CC licensing information in article level metadata (Question 45). ‘No’ does not qualify for the Seal.
allow reuse and remixing of content in accordance with a CC BY, CC BY-SA or CC BY-NC license (Question 47). If CC BY-ND, CC BY-NC-ND, ‘No’ or ‘Other’ is selected the journal will not qualify for the Seal.
have a deposit policy registered in a deposit policy directory (Question 51). ‘No’ does not qualify for the Seal.
allow the author to hold the copyright without restrictions (Question 52). ‘No’ does not qualify for the Seal.
Part of the appeal of the Seal is that it focuses on the good things about open access journals rather than the questionable practices. Having a whitelist is much more appealing for people doing open access outreach than a blacklist. Journals with the Seal are available in a facet on the new DOAJ interface.
Getting In and Out of the DOAJ
Part of the reworking of the DOAJ was the requirement that all currently listed journals reapply. As of November 19, just over 1,700 journals had been accepted under the new criteria, and just over 800 had been removed (you can follow the list yourself here). For now you can spot journals that have reapplied successfully by a green check mark (what DOAJ calls The Tick!). That means that about 85% of previously listed journals either have not reapplied or are still in the verification pipeline. 5 While DOAJ does not discuss the specific reasons a journal or publisher is removed, it does give a general category for each removal. I did some analysis of the data provided in the added/removed/rejected spreadsheet.
At the time of analysis, there were 1776 journals on the accepted list. 20% of these were added since September, and with the deadline looming this number is sure to grow. Around 8% of the accepted journals have the DOAJ Seal.
There were 809 journals removed from the DOAJ, and the reasons fell into the general categories below; a sketch of how such a tally might be reproduced follows the table. I manually checked some of the categories with only one or two titles, and suspect that some of these journals may be reinstated if the publisher chooses to reapply. Note that well over half of the removed journals weren’t removed for misconduct, but because they had ceased or were otherwise unavailable.
Inactive (has not published in the last calendar year): 233
Suspected editorial misconduct by publisher: 229
Website URL no longer works: 124
Journal not adhering to Best Practice: 62
Journal is no longer Open Access: 45
Has not published enough articles this calendar year: 2
Other; delayed open access: 1
Other; no content: 1
Other; taken offline: 1
Removed at publisher’s request: 1
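For the curious, a tally like the one above is easy to reproduce. Below is a minimal sketch assuming the removed journals from the spreadsheet have been exported to a CSV file with a column named Reason; the file name and column name are my guesses for illustration, not the spreadsheet’s actual layout.

```python
# A rough sketch, not the spreadsheet's real structure: assumes the
# removed journals were exported to CSV with a "Reason" column.
import pandas as pd

removed = pd.read_csv("doaj_removed.csv")
counts = removed["Reason"].value_counts()
print(counts)

# Share of removals unrelated to misconduct (ceased, unavailable, etc.)
non_misconduct = counts.drop("Suspected editorial misconduct by publisher").sum()
print(round(non_misconduct / counts.sum(), 2))
```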
The spreadsheet lists 26 journals that were rejected. Rejected journals are told the specific reasons their applications were rejected, but those reasons are not made public; journals may reapply after 6 months once they have had an opportunity to address the issues. 6 The general stated reasons were as follows:
Has not published enough articles: 2
Journal website lacks necessary information: 2
Not an academic/scholarly journal: 1
Web site URL doesn’t work: 1
The work that DOAJ is doing to improve transparency and the screening process is very important for open access advocates, who will soon have a tool they can trust to provide much more complete information for scholars and librarians. For too long we have been forced to rely on the concept of a list of “questionable” or even “predatory” journals. A directory of journals with robust standards and an easy-to-understand interface will be a fresh start for the rhetoric of open access journals.
Are you the editor of an open access journal? What do you think of the new application process? Leave your thoughts in the comments (anonymously if you like).
I have been mostly absent from ACRL Tech Connect this year because the last nine months have been spent migrating to a new library systems platform and discovery layer. As one of the key members of the implementation team, I have devoted more time to meetings, planning, development, more meetings, and more planning than any other part of my job has required thus far. We have just completed the official implementation project and are regular old customers by now. At this point I finally feel I can take a deep breath and step back to think about the past nine months in a holistic manner to glean some lessons learned from this incredible professional opportunity that was also incredibly challenging at times.
In this post I won’t go into the details of exactly which system we implemented and how, since it’s irrelevant to the larger discussion. Rather I’d like to stay at a high level to think about what working on such a project is like for a professional working with others on a team and as an individual trying to make things happen. For those who are curious about the details of the project, including management and process, those will be detailed in a forthcoming book chapter in Exploring Discovery (ALA Editions) edited by Ken Varnum. I will also be participating in an AL Live episode on this topic on October 8.
A project like this doesn’t come as a surprise. My library had been planning a move for a number of years and ran an extremely inclusive process to select the new platform. When we found out we would be able to go ahead with the implementation, I knew I would have the opportunity to lead the discovery layer implementation on the technical side, as well as coordinate much of the effort on the user outreach and education side. That was an exciting and terrifying role: while to my mind it was far less challenging technically than working on the data migration, it would be the most public piece of the project. In addition, it quickly became clear that our multi-campus situation wasn’t going to line up exactly with the products’ built-in solutions, which required a great deal of additional work to understand how the products interoperated with each other and with other systems. Ultimately it was a great education, but in the thick of it the work seemed to have no end in sight.
To that end, I wanted to share some of the lessons I learned from this process both as a leader and a member of a team. Of course, many of these are widely applicable to any project, whether it’s in a library systems department or any work place.
Someone has to say the obvious thing
One of the joys of doing something that is new to everyone is that the dread of impostor syndrome is diminished. If no one knows the answer, then no one can look like an idiot for not knowing, after all. Yet that is not always clear to everyone working on the project, so as the leader it’s useful to say plainly when you have no idea how something works, and, when something seems “simple” to you, to still spell out exactly how it works so everyone understands. Assuming that others already know the obvious thing means forgetting your own path to learning, along which it was helpful to hear the simple thing stated clearly, perhaps several times. Besides the obvious implications of people not understanding how something works, staying silent robs them of a chance to investigate something of interest and become a real contributor. Try not to make other people admit they have no idea what you’re talking about, whether or not you think they should have known it. This practice also forces you to actually know what you’re talking about: teaching something is, after all, the best way to learn it.
Don’t answer questions all the time
Human brains can be rather pathetic moment to moment, even if they do all right in the end. A service mentality leads (or in some cases requires) us to answer questions as fast as we can, but it’s better to give the correct, well-considered answer a little later than to answer in haste and get it wrong or phrase it poorly. If you are figuring things out as you go along, there’s no reason you should know everything off the top of your head. If you get a question in a meeting and need to double-check, no one will be surprised. If you get an email at 5:13 PM after a long day and need to postpone even thinking about the answer until the following day, that is the best thing both for your sanity and for the success of the project.
Keep the end goal in mind, and know when to abandon pieces
This is an obvious insight, but crucial to feeling like you’ve got some control of the process. We tend to think of way more than we can possibly accomplish in a timeframe, and continual re-prioritization is essential. Some features you were sold on in the sales demo end up being lackluster, and other features you didn’t know existed will end up thrilling you. Competing opportunities and priorities will always exist. Good project management can account for those variables and still keep the core goals central and happening on time. But that said…
Project management is not a panacea
For the whole past nine months I’ve had a vision that with perfect project management everything could go perfectly. This has crept into all areas of my life and made me imagine that I could project-manage my way to perfection with a toddler (way too many variables) or my house (110-year-old houses are nearly as tricky as toddlers). We had excellent project management support from the vendor as well as internally, but I kept seeing room for improvement in everything. “If only we had foreseen that, we could have avoided this.” “If only I had communicated the action items more clearly after that meeting, we wouldn’t be so behind.” We actually learned very late in our project that other libraries undertaking similar projects had hired a consultant to do nothing but project management on the library side, which seemed like a very good idea, though we managed all right without one. In any event, a project manager wouldn’t have changed some of the most challenging issues, which had nothing to do with timelines or resources and everything to do with differences in approach and values between departments and libraries. Everyone wants the “best” for the users, but the “best” for one person doesn’t work at all for another. Coming to a compromise is the right way to handle this; there’s no way to avoid conflict and the resulting changes to the plan.
Hopefully we all get to experience projects in our careers of this magnitude, whether technical or not. Anything that shifts an institution to something new that touches everyone is something to take very seriously. It’s time-consuming and stressful because it should be! Nevertheless, managing time and stress is key to ensure that you view the work as thrilling rather than diminishing.
A decade ago, Stephen Colbert introduced the concept of “truthiness”: a fact that is so because it feels right “from the gut.” When we search for information online, we always run the risk that the creator of a page is someone who, like Stephen Colbert’s character, doesn’t trust books, because “they’re all fact, no heart.” 1 Since sites with questionable or outright false facts that “feel right” often end up at the top of Google search results, librarians teach students to evaluate online sources for accuracy, relevancy, and so on, rather than just trusting the top result. But what if there were a way to ensure that truthiness was removed, and only sites with true information appeared at the top of the results?
This idea is what underlies a new Google algorithm called Knowledge-Based Trust (KBT). 2 Google’s original founding principles and the PageRank algorithm were based on academic citation practices–loosely summarized, pages linked to by many other pages are more likely to be useful than those with fewer links. The content of the page, while it needs to match the search query, is less crucial to its ranking than outside factors; this is known as an exogenous model. KBT, by contrast, is an endogenous model that relies on the actual content of the page: ranking is based on the probability that the page is accurate, and therefore more trustworthy. This is designed to address the problem of sites with high PageRank scores that aren’t accurate, either because their truthiness quotient is high or because they have gamed the system by scraping content and applying misleading SEO. On the other side, pages with great information that aren’t very popular may be buried.
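To make the contrast concrete, here is a minimal sketch of the exogenous idea behind PageRank: the textbook power-iteration form over a tiny invented link graph, not anything resembling Google’s production systems. Notice that the pages’ content never appears anywhere in the calculation.

```python
# Toy link graph (invented): page -> pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

damping = 0.85
rank = {page: 1 / len(links) for page in links}

# Power iteration: each page's rank comes entirely from the ranks of
# the pages linking to it.
for _ in range(50):
    new = {page: (1 - damping) / len(links) for page in links}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new[target] += share
    rank = new

print(rank)  # "c", with the most inbound links, ranks highest
```

KBT replaces that link signal with an estimate of how often the page’s extractable facts are correct.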
“Wait a second,” you are now asking yourself, “Google now determines what is true?” The answer is: sort of, but of course it’s not as simple as that. Let’s look at the paper in detail, and then come back to the philosophical questions.
Digging Into the KBT
The paper is technical, but the basic idea is fairly straightforward. The model extracts facts from a web source, evaluates whether those facts are true, and from that estimates whether the source is accurate; the fact and source judgments are refined together in an iterative process. Of course, verifying those determinations is essential to ensuring that the algorithms work correctly, and the paper describes ways of checking the extracted facts for accuracy.
The extractors are described more fully in an earlier version of this work, Knowledge Vault (KV), which was designed to fill in large-scale knowledge bases such as Freebase by extracting facts from web sources using techniques like natural language processing of text followed by machine learning, HTML DOM trees, HTML tables, and human-processed pages with schema.org metadata. The extractors themselves can perform poorly in creating these triples, however, and extraction errors are actually more common than false facts, so sites may be unfairly flagged as inaccurate. The KBT project aims to introduce an algorithm for determining which type of error is present, to judge sites with many or few facts fairly, and to test these assumptions by comparing real-world data against known facts.
The specific example given in the paper is the birthplace of President Barack Obama. An extractor pulls a (subject, predicate, object) triple from a web source and matches those strings to Freebase (for example). This can lead to a number of errors: computationally determining the truth is hugely difficult even when the semantics are straightforward (and we all know they rarely are). For this example, it’s possible to check the extracted value against the known value in Freebase, marking the extraction 1 if it is correct and 0 if not. These marks can then be charted in a two- or three-dimensional matrix that shows the probability of a given extractor working, as well as whether the value it pulled was true.
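As a toy illustration of that checking step, with invented extractor names and a plain dictionary standing in for Freebase:

```python
# "Known" value from a knowledge base; a dict stands in for Freebase here.
known = {("Barack Obama", "born_in"): "Honolulu"}

# (subject, predicate, object) triples pulled from the same page by
# three invented extractors.
extractions = {
    "text_extractor": ("Barack Obama", "born_in", "Honolulu"),
    "table_extractor": ("Barack Obama", "born_in", "Kenya"),
    "dom_extractor": ("Barack Obama", "born_in", "Honolulu"),
}

# Mark each extraction 1 if it matches the known value, else 0.
correctness = {
    name: int(known.get((subj, pred)) == obj)
    for name, (subj, pred, obj) in extractions.items()
}
print(correctness)
# {'text_extractor': 1, 'table_extractor': 0, 'dom_extractor': 1}
```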
They go on to examine two models for computing the data: single-layer and multi-layer. The single-layer model, which looks at each web source and its facts separately, is easier to work with using standard techniques, but it is limited because it doesn’t take extraction errors into account. The multi-layer model is more complex to analyze, but it accounts for extraction errors along with truth errors. I am not qualified to comment on the algorithm’s math in detail, but essentially it computes the probability of accuracy for each variable in turn, ultimately arriving at an equation that estimates how accurate a source is, weighted by the likelihood that the source contains those facts. There are additional considerations for precision and recall, as well as for the confidence levels returned by extractors.
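In the same spirit, here is a heavily simplified sketch of the iterative estimation, closer in shape to the single-layer idea than to the paper’s actual equations; every source and fact below is invented. Each claimed value is scored by the accuracy of the sources asserting it, and each source’s accuracy is then re-estimated from its claims, until the two sets of estimates settle.

```python
# Invented data: each source asserts values for some facts.
sources = {
    "site1": {"obama_birthplace": "Honolulu", "paper_year": "2015"},
    "site2": {"obama_birthplace": "Honolulu"},
    "site3": {"obama_birthplace": "Kenya", "paper_year": "2015"},
}

accuracy = {s: 0.8 for s in sources}  # initial guess for every source

for _ in range(10):  # iterate until the estimates settle
    # Score each claimed value by the summed accuracy of its sources,
    # then normalize over competing values for the same fact.
    scores, totals = {}, {}
    for s, facts in sources.items():
        for fact, value in facts.items():
            scores[(fact, value)] = scores.get((fact, value), 0.0) + accuracy[s]
    for (fact, value), score in scores.items():
        totals[fact] = totals.get(fact, 0.0) + score
    truth = {(f, v): sc / totals[f] for (f, v), sc in scores.items()}

    # Re-estimate each source's accuracy as the mean truth probability
    # of the values it asserts.
    for s, facts in sources.items():
        accuracy[s] = sum(truth[(f, v)] for f, v in facts.items()) / len(facts)

print(accuracy)  # site3, asserting the minority birthplace, ends up lowest
```

The multi-layer model described in the paper adds another layer on top of this, modeling the chance that the extractor, rather than the source itself, got a value wrong.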
Lastly, they consider how to split up large sources to avoid computational bottlenecks, and how to merge sources with few facts so that small sources are not penalized without accidentally combining unrelated ones. Their experimental results determined that PageRank and KBT are generally orthogonal, with a few outliers. In some cases a site has a low PageRank but a high KBT; they manually verified the top three predicates with high extraction accuracy scores for such sources to check what was happening. 85% of these sources were trustworthy, with no extraction errors and with predicates related to the topic of the page, but only 23% of them had PageRank scores over 0.5. In other cases, sources had a low KBT but a high PageRank; these included celebrity gossip sites and forums such as Yahoo Answers. Yes, indeed, Google computer scientists finally have definitive proof that Yahoo Answers tends to be inaccurate.
The article’s conclusion, with its list of future improvements, reads like the learning outcomes for a basic information literacy workshop: the algorithm would need to identify the main topic of a website and filter out unrelated facts, understand which triples are trivial, have a better comprehension of what counts as a fact, and correctly discount sites whose data is scraped from other sources. That said, for what it does, this is a much more sophisticated model than anything else out there, and it at least proves that it is possible to computationally estimate the accuracy of a web source.
What is Truth, Anyway?
Despite the promise of this model there are clearly many potential problems, of which I’ll mention just a few. The source for this exercise, Freebase, is currently in read-only mode as its data migrates to Wikidata. Google is apparently dropping Freebase to focus on its Knowledge Graph, which is partially Freebase/Wikidata content and partially schema.org data. 3 One interesting wrinkle is that much of the Freebase content cites Wikipedia as a source, which creates circular citations that must be replaced with proper sources before the statements will be accepted as facts in Wikidata. We already know that Wikipedia suffers from a lack of diversity in contributors and topic coverage, so a heavy reliance on content derived from Wikipedia risks narrowing the pool of sources against which the KBT could check triples.
That said, most human knowledge and understanding is difficult to fit into triples. Surely no one would search Google for “What is love?” and expect a factual answer, but there are plenty of less extreme examples that are unclear. How does the model account for controversial topics? Take “anthropogenic global warming is real” versus “global warming is real, but it’s not anthropogenic.” 97% of climate scientists agree with the former, but what if you are looking for what the other 3% are saying?
And we might question whether it’s a good idea to trust an algorithm’s definition of what is true. As Bess Sadler and Chris Bourg remind us, algorithms are not neutral, and may ignore large parts of human experience, particularly from groups underrepresented in computer science and technology. Librarians should have a role in reducing that ignorance by supporting “inclusion, plurality, participation and transparency.” 4 Given the limitations of what is available to the KBT, it seems unlikely that this algorithm would markedly reduce this inequity, though I could see how it might if Wikidata could be seeded with more information about diverse groups.
Librarians, take note: this algorithm is still under development and most likely won’t be appearing in our Google results any time soon. Even once it does, we will need to keep paying attention to nuance, edge cases, and our own sense of truthiness–and more importantly, truth–as we evaluate web sources.
1. http://thecolbertreport.cc.com/videos/63ite2/the-word—truthiness
2. Dong, X. et al. “Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources.” Proceedings of the VLDB Endowment, 2015. http://arxiv.org/abs/1502.03519
3. https://www.wikidata.org/wiki/Help:FAQ/Freebase
4. Sadler, Bess and Chris Bourg. “Feminism and the Future of Library Discovery.” Code4Lib Journal 28, April 2015.