Domesticating Wild Metadata: Harvesting Your Metadata into a Discovery Layer Using OAI-PMH Feeds
Margaret Heller
February 2017

Setting up a discovery layer is an excellent opportunity to see the metadata from your digital collections in another system, and often the first time you will see your metadata outside the platform where it was created. This was my experience when Loyola University Chicago implemented Primo in 2015. We were excited about the opportunity to increase the discoverability of our digital collections by indexing them in Primo. Over the years, we had created unique content in many platforms: Omeka, CONTENTdm, WordPress, and Digital Commons, plus our LibGuides and a website in Drupal. I'd always been told that seeing my data in a discovery layer for the first time would be a big shock, and that was definitely the case for me. Some of the collection harvesting is still a work in progress; some we decided not to pursue. In every case, it was a good lesson in what we will need to think about as we create new digital collections and about the choices we make going forward in existing systems.

Some brief background on OAI-PMH is necessary, since finding or creating an OAI-PMH feed is the first step in the process for everything I discuss below (there are certainly other ways to harvest metadata into other systems, but this is the easiest way to start in most cases). OAI-PMH stands for the Open Archives Initiative Protocol for Metadata Harvesting. As the name indicates, it is a protocol that works over HTTP (the same protocol we use for accessing content on the web) to make metadata available for harvest. You can be a "data provider" who creates OAI-PMH data, or a "service provider" who harvests it. Practically speaking, however, most of us do a little of both at different stages in our repositories. For instance, you might harvest data from one service and repackage it to provide as data in another repository; that is exactly what ends up happening in most discovery layers. While you do not have to understand all the technical details, you can work through a tutorial to understand the basics.
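To make this concrete, here is a minimal sketch in Python (standard library only) of what a harvester actually sees: a ListRecords response wrapping Dublin Core records, parsed with namespace-aware queries. The sample XML is invented for illustration, but its structure follows the OAI-PMH 2.0 and oai_dc schemas.

```python
import xml.etree.ElementTree as ET

# An invented, trimmed example of a ListRecords response with Dublin Core
# metadata (metadataPrefix=oai_dc); real feeds contain many records and
# paginate with resumption tokens.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:item/1</identifier>
        <datestamp>2016-05-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Sample Digital Object</dc:title>
          <dc:creator>Example, Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

# OAI-PMH responses are heavily namespaced, so the prefixes must be
# mapped explicitly when querying the tree.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

root = ET.fromstring(SAMPLE)
records = []
for record in root.findall(".//oai:record", NS):
    identifier = record.find("./oai:header/oai:identifier", NS).text
    title = record.find(".//dc:title", NS).text
    records.append((identifier, title))
```

Even a toy example like this shows the two layers you have to think about: the protocol wrapper (header, identifier, datestamp) and the descriptive metadata inside it.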

Our discovery layer (and our only catalog) is Ex Libris' Primo. Primo can harvest a variety of data, and has a built-in harvester and parser for Dublin Core metadata in an OAI-PMH feed. While the default parser works well for standard bibliographic data, it needs some customization to display certain fields correctly. The exact mechanisms are not relevant here, but I learned quickly that I needed to view the original OAI feed and analyze it myself to determine how to set up the discovery layer to work as I wanted. My suggestion is that even if you never intend to harvest your data, take the time to find the OAI-PMH feed anyway. Why? You may want to provide that data to others, and it is useful to know what data you are already providing. It also gives you a chance to review what your metadata looks like "in the wild" and see how well it works out of context.
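Viewing a feed yourself is as simple as constructing the right request URLs, since OAI-PMH requests are ordinary HTTP GET requests distinguished by a "verb" argument. A small helper for building URLs you can paste into a browser might look like this (the base URL is hypothetical; substitute your own repository's OAI endpoint):

```python
from urllib.parse import urlencode

def oai_url(base_url, verb, **params):
    """Build an OAI-PMH request URL for a given verb and arguments."""
    return base_url + "?" + urlencode({"verb": verb, **params})

# Hypothetical endpoint standing in for a real repository's OAI base URL.
base = "https://repository.example.edu/do/oai/"

identify = oai_url(base, "Identify")                        # who is this repository?
formats = oai_url(base, "ListMetadataFormats")              # what schemas are offered?
list_records = oai_url(base, "ListRecords", metadataPrefix="oai_dc")
```

Starting with Identify and ListMetadataFormats before ListRecords is a good habit: it tells you what the repository claims about itself and which metadata formats you can actually request.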

My first harvesting effort was our institutional repository, which runs on the bepress Digital Commons platform. It was not hard to find the OAI feed, since bepress provides a lengthy document that describes how to work with it. The metadata provided by Digital Commons was in Dublin Core and included the normal bibliographic elements, so it was reasonably straightforward to harvest and initially looked good. Only a few display adjustments were required, including figuring out how to deduplicate print and electronic theses and dissertations so the two versions would appear in a single record. After some time, however, we discovered a big problem: some of the records in our institutional repository were metadata only, but they appeared to be full text in the discovery layer. There were a few options for addressing this, but because those particular records were not essential to include, we adjusted the OAI feed to exclude them, and the problem went away. The takeaway for me is that it is simpler to start by asking whether you can eliminate a problem at the source rather than trying to program around it.
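Primo's deduplication is handled through system configuration rather than code you write yourself, but the underlying idea, matching records on a normalized key, can be sketched briefly. The title-plus-creator key below is an illustrative simplification, not Primo's actual match logic:

```python
from collections import defaultdict

def group_duplicates(records):
    """Group records sharing a normalized title and creator, a rough
    stand-in for the match key a discovery layer's dedup might use."""
    groups = defaultdict(list)
    for rec in records:
        key = (rec["title"].strip().lower(), rec["creator"].strip().lower())
        groups[key].append(rec)
    return list(groups.values())

# Invented sample data: a print and an electronic version of one thesis,
# with the kind of minor inconsistencies real records have.
theses = [
    {"title": "A Study of Things", "creator": "Smith, Jane", "format": "print"},
    {"title": "A Study of Things ", "creator": "smith, jane", "format": "electronic"},
]
merged = group_duplicates(theses)
```

The point of the sketch is that deduplication only works as well as the consistency of the fields the key is built from, which is another reason to look at the raw feed first.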

Our next attempt at harvesting an OAI feed was Springshare's LibGuides. Springshare provides data export options that include an OAI-PMH feed, but because of the nature of LibGuides, the data was sparse, including only title, creator, description, publication, date originally created, and the URL as the identifier. There are other options for exporting and working with LibGuides data in our discovery layer, but we decided that the sparse data was enough for the use case we had in mind (based on usability testing with students). The main issue was that the OAI feed added "Research Guides ::" to each title, which we did not want to display in the discovery layer. We were able to remove it from display, which also helped the research guides rank higher in the results list when someone searches for a subject area. One unsolved problem is that the date provided is the date of creation rather than the date last updated, which makes the pages look a bit old when viewed through the discovery layer until you actually open the link.
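Stripping a boilerplate prefix like this is trivial once you can see it in the raw feed. A sketch of the kind of cleanup rule involved, whether you apply it in a preprocessing script or express the equivalent logic in the discovery layer's normalization rules:

```python
PREFIX = "Research Guides ::"

def clean_title(title):
    """Strip the boilerplate prefix the LibGuides OAI feed prepends,
    leaving titles without the prefix untouched."""
    if title.startswith(PREFIX):
        return title[len(PREFIX):].strip()
    return title
```

Removing noise words like "Research Guides" from the indexed title is what improves relevance ranking: the title then matches a subject-area search more exactly.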

We recently decided that we wanted to harvest our library website into the discovery layer, since users frequently searched the discovery layer for information that existed only on the website; harvesting the site is also a way to get database titles into the discovery layer. This is still a work in progress, as it currently exists only on a test server, but we expect to implement it very soon. Our library website runs on Drupal, an open source content management system, which means we have to set up our own OAI feed. I will not get into how that works in detail here (you can find out more on the Drupal project page), but there are several things to consider when setting up an OAI feed: what data is important to convey from your originating system, and what metadata schema will you use? Again, we used Dublin Core, but we had to be creative in mapping metadata about pages on our site to bibliographic metadata. The initial harvest revealed a few missteps in our mapping, but the great thing about creating our own feed is that I could go back in and adjust the data being sent, which meant less work on the discovery layer side. The lesson here is that the best scenario for sharing data is to be able to create and control your metadata yourself, but that is not possible until you understand what the result should look like.
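As a sketch of the kind of creative mapping involved, here is a hypothetical crosswalk from Drupal node fields to Dublin Core elements. The field names are invented for illustration; your content types will differ, and the interesting decisions are which fields map to which elements at all:

```python
# Hypothetical Drupal field names mapped to Dublin Core elements.
FIELD_MAP = {
    "title": "dc:title",
    "body_summary": "dc:description",
    "author": "dc:creator",
    "changed": "dc:date",        # the last-updated date, not creation
    "url": "dc:identifier",
}

def node_to_dc(node):
    """Map a dict of Drupal node fields to Dublin Core element names,
    dropping anything with no bibliographic equivalent."""
    return {FIELD_MAP[k]: v for k, v in node.items() if k in FIELD_MAP}
```

Making the crosswalk explicit like this is also what lets you fix missteps on the feed side instead of patching around them in the discovery layer.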

The last harvest attempt I will mention is our digital collections in CONTENTdm, which we ultimately did not complete. We had a number of digitized collections stored in CONTENTdm and wanted to make them more discoverable outside that platform. The OAI feed automatically created by CONTENTdm had too much granularity for us to cope with easily: the default is to include all files from all collections, including the individual pages of compound objects. When that came into our discovery layer, we realized right away that while some collections had great metadata and well-described individual objects, in general we only had good descriptions at a higher level. It is possible to adjust a few pieces of the CONTENTdm OAI feed to address this: you can harvest just specified collections, and you can harvest only the compound object metadata rather than all the files within. There would still have been a great deal of metadata cleanup required, however, and we did not end up pursuing this. We will be moving away from CONTENTdm later in 2017 and migrating our digital collections to a new system, so we will be revisiting all the lessons learned so far as we harvest metadata from that system.
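Harvesting just specified collections relies on OAI-PMH's standard selective-harvesting mechanism, the "set" argument to ListRecords; in CONTENTdm the set name corresponds to a collection alias. The endpoint and alias below are invented for illustration:

```python
from urllib.parse import urlencode

# A selective ListRecords request restricted to one set. In CONTENTdm the
# set is the collection alias; "p123coll4" here is made up.
base = "https://server.example.edu/oai/oai.php"  # hypothetical endpoint
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc", "set": "p123coll4"}
url = base + "?" + urlencode(params)
```

Restricting the harvest by set is often the quickest way to keep well-described collections in and sparsely described ones out, without touching the feed itself.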

Implementing a discovery layer truly is a good opportunity to see your data as it exists outside the systems you have poured so much time into getting just right, but it does not have to be the only time you think about it. While the discovery layer was my impetus, spending time looking at my data in the wild has taught me much more about which choices make sense for my library and our systems when it comes to metadata. It is all too easy to base your decisions about data on the tool you use to create it, even though your digital objects will not always live in, or be found through, that tool. You may or may not share your data with others to use in other systems, but you should at least be able to share metadata with yourself, and understand what the experience of others will be if you ever do share data with them.