
Northern Illinois University Data Dictionary
Matthew Short, Metadata Librarian
November 2016

Introduction

Since 2012, Northern Illinois University (NIU) has been using a data dictionary when creating all non-MARC metadata. This document details local decisions we've made, clarifications of schemas and standards, and best practices for shareable metadata, including which elements are required and which vocabularies and values to use. The format and much of the contents are based on the Digital Library Federation's Implementation Guidelines for Shareable MODS Records. In addition to the data dictionary, we also create metadata specifications for each new project (see example), based on the "Specifications for Metadata Creation" from UCLA's Digital Projects Toolkit. These document any divergences from the data dictionary for a particular project, indicating any additional requirements (for more on data dictionaries and application profiles, see CARLI's Metadata Matters webinar "Marrying Local Metadata Needs with Accepted Standards: The Creation of a Data Dictionary at UIC Library").

Creating this documentation can be a lot of work, requiring consultation with content providers and other cataloging staff. The documentation also changes frequently, sometimes because new schema versions are released and sometimes simply to reflect changes in local practice. Right now, almost all of our non-MARC metadata is created by just two people--myself and a Library Specialist--so it may seem like a lot of effort for little reward. And in fact, we very seldom refer to the data dictionary or project specifications unless a problem comes up.

While it's important to clarify guidelines and to document local practice, I've found that the real value of these documents has been in the act of writing itself. They've given us a reason to think critically about what we're doing when we do metadata--to consider who and what we're creating all of these records for. How is the data going to be indexed and displayed to the user? Where and with whom might we want to share this data for maximum exposure? How easily will we be able to incorporate a new collection into our existing collections? Until recently, many of these questions were never asked, which has led to a variety of issues that I've spent much of my time at NIU trying to resolve. This case study details how the data dictionary has fit into that clean-up work and how project specifications have helped us avoid many of those problems moving forward.

Learning from our mistakes

NIU has over a dozen digital collections, half of which were created in the late 90s and early 00s through multiple grant-funded projects. Although project staff were located in the Library, the Library itself had little involvement with digitization. Equipment and salaries were paid by the grants, which made the unit more-or-less independent. These included projects like Lincoln/Net, Prairie Fire, Illinois During the Gilded Age, and Illinois During the Civil War--interactive websites with curriculum materials and contextual essays about the history of Illinois--and the Southeast Asia Digital Library (SEADL), a collaborative project to digitize materials from Southeast Asia with partners from around the world.

The various historical projects shared two databases: one that stored metadata about images and another that stored full text, with Dublin Core in the document headers. Because there was often significant overlap between periods and subjects (e.g. Lincoln and the Civil War), images and texts were reused many times across projects. All of the image metadata was created using the same local schema, and students were given input guidelines for dates and formats, enabling some interoperability. The text metadata was similarly consistent, though it used Dublin Core instead. Because each format used a different schema, there was no searching across text and images. No controlled vocabularies for names or subjects were used for either format, except in the case of local "themes." The metadata itself was created almost exclusively by undergraduate students, who produced nearly 50,000 records.

All of the metadata in SEADL was contributed by project partners, many of whom were located in Thailand, Vietnam, Malaysia, or Indonesia. Although the digitization unit eventually began using Dublin Core for this project, partners were provided with very little guidance or oversight. Resource types were fairly consistent, because partners were limited to DCMI types, but a wide variety of dates and formats were used. Except in a few rare cases (e.g. materials contributed by the British Library), no controlled vocabularies were used for names or subjects. Most of the metadata within a collection was fairly consistent, but the lack of any input guidelines made searching across all of the collections difficult. Partners contributed around 8,000 records.

Beginning in 2012, the digitization unit formally became a part of the Library. As part of this restructuring, my position--Metadata Librarian--was created. At the same time that I was being interviewed, the Library decided to migrate all of its digital collections into Islandora, which it had been testing as an alternative to LAMP for SEADL. Before I was hired, this migration had already begun.

The Library contracted with a consultant to convert all existing metadata into MODS prior to the migration. Unfortunately, the consultant didn't have any catalogers on staff at the time and relied on staff from the former digitization unit to provide a metadata mapping. While there was at least one librarian in this unit, they had no cataloging training and no prior experience working with MODS. The consultant wrote one transformation for the images based on this mapping, and used the Library of Congress's default DC-to-MODS XSLT for the historical texts and all of SEADL. This ignored all of the variability between collections, which was especially problematic in the case of SEADL.

The resulting MODS records were a disaster.

Most of the records included HTML markup in titles, notes, and names, which had made it possible to italicize or underline titles in the previous system but had no use in the current one. All subjects were wrapped in a single subject element, many using a "type" attribute that does not exist. There was no normalization of values for mods:typeOfResource, which combined both format and genre. Perhaps worst of all, virtually every image record used an element--mods:physicalDescript/mods:form/mods:type--that does not exist in the MODS standard, which meant that none of the records would validate. Not only did all of the old problems with the data persist, but a whole host of new ones had been introduced.

The first thing I did when I started in July 2012 was write a data dictionary.

Beginning with the data dictionary gave me an opportunity to think carefully about what needed to be prioritized as I attempted to normalize our data. Full-level artisanal records were impossible without touching each record individually, which I didn't have time to do because new projects were already underway. Instead, I planned to write as many global changes as possible, then slowly--over years, it turned out--work my way through persistent problems that stubbornly resisted bulk transformation. I started by writing transformations for each collection. Many of the changes that needed to be made were obvious, such as making the records validate, but other decisions--the choice of vocabularies, for example--weren't so obvious. The data dictionary codified these decisions, so that they didn’t have to be revisited each time I started to work on a new collection.
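As a rough illustration, a global fix of this kind can be scripted in a few lines. The sketch below uses Python and lxml; it is not our actual transformation code, and the file layout and cleanup rules are only examples. It deletes the invalid element that kept the image records from validating and strips stray HTML markup out of titles, names, and notes.

```python
# Illustrative batch cleanup of MODS records; paths and rules are examples.
import re
from pathlib import Path
from lxml import etree

MODS = "http://www.loc.gov/mods/v3"

for path in Path("records").glob("*.xml"):
    tree = etree.parse(str(path))
    root = tree.getroot()

    # Remove the invalid "type" element nested under "form", which does not
    # exist in the MODS standard and prevented records from validating.
    for bad in root.xpath(
        ".//*[local-name()='type' and parent::*[local-name()='form']]"
    ):
        bad.getparent().remove(bad)

    # Strip leftover HTML markup (e.g. <i>, <u>) from titles, names, and notes.
    for el in root.iter(f"{{{MODS}}}title", f"{{{MODS}}}namePart", f"{{{MODS}}}note"):
        if el.text:
            el.text = re.sub(r"</?[a-zA-Z][^>]*>", "", el.text)

    tree.write(str(path), encoding="UTF-8", xml_declaration=True)
```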

The data dictionary is also one of the standards against which our metadata is evaluated. After a batch transformation is run on a collection, metadata quality assurance scripts compare the records against our requirements. This is a simple check to see whether or not required elements exist, which then informs follow-up remediation work.
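A minimal version of such a check might look like the following, again in Python with lxml. The list of required elements here is a hypothetical excerpt from a data dictionary, not our actual requirements.

```python
# Report records that are missing elements the data dictionary requires.
from pathlib import Path
from lxml import etree

NS = {"mods": "http://www.loc.gov/mods/v3"}

# XPaths for required elements (illustrative, not NIU's actual list).
REQUIRED = {
    "title": "mods:titleInfo/mods:title",
    "resource type": "mods:typeOfResource",
    "identifier": "mods:identifier",
    "access condition": "mods:accessCondition",
}

for path in Path("records").glob("*.xml"):
    root = etree.parse(str(path)).getroot()
    missing = [name for name, xpath in REQUIRED.items()
               if not root.xpath(xpath, namespaces=NS)]
    if missing:
        print(f"{path.name}: missing {', '.join(missing)}")
```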

My long-term goal is to build a digital collections gateway that will allow users to search across all of our collections. When I started to work with NIU's metadata, that was impossible, because the data was so different from collection to collection and had been so badly mangled by the migration into Islandora. The data dictionary defined the ideal form of our MODS records, so that I knew what I was working towards as I developed normalization scripts. And as more progress has been made on cleaning up these records, it has also served as a tool for evaluating how close we are to that ideal.

Moving forward

Before a new project is started, I sit down with the content provider or project lead to discuss their metadata needs. This discussion usually focuses not on MODS, but on what expectations they have for using the collection, including search, facets, display, citations, etc. I then review any existing best practices for similar collections, and, within the bounds of our data dictionary, modify our requirements as needed to suit the particular project. This involves striking a balance between interoperability and shareability, on the one hand, and specific requirements of a particular audience, format, or collection, on the other. These modifications are documented in the project specification.

As a rule, I prefer to only create non-MARC metadata from MARC records, so before materials are even scanned, they're first cataloged. This allows us to take advantage of our catalogers' expertise without having to retrain the entire department, and it fits much better into existing workflows. Whenever possible, the project specification is created before the catalogers start, long before we even think about creating MODS and DC records. If we know exactly what we want out of our MODS records, that will often inform the creation of our MARC records. This creates far less clean-up and enhancement work on the back end, after the MARC records have been transformed.
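For readers unfamiliar with this workflow, the transformation step can be as simple as applying the Library of Congress's MARCXML-to-MODS stylesheet to exported MARCXML. The sketch below assumes a local copy of that stylesheet (along with any utility stylesheets it imports); the file names are placeholders, and because lxml only supports XSLT 1.0, a stylesheet version requiring XSLT 2.0 would need a processor such as Saxon instead.

```python
# Apply a MARCXML-to-MODS stylesheet to a batch of exported MARC records.
# File names are placeholders; a local copy of the LOC stylesheet is assumed.
from lxml import etree

transform = etree.XSLT(etree.parse("MARC21slim2MODS3.xsl"))

marc = etree.parse("batch_of_marcxml_records.xml")
mods = transform(marc)

with open("mods_output.xml", "wb") as out:
    out.write(etree.tostring(mods, pretty_print=True,
                             xml_declaration=True, encoding="UTF-8"))
```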

The data dictionary and project specification have a limited role in our day-to-day work. If we consult them, it's only to review previous decisions in order to fill gaps in our memory. Where they've been important is at the beginning of a project. In the case of remediation, the data dictionary defines the ideal that we're working towards. When it comes to new projects, specifications help us to avoid many of these problems to begin with.