Cleaning Up Legacy Data at Lewis University
Alice Creason, Head of Library Technology and Technical Services
May 2017

Introduction

The Lewis University Library created its first collection in the CARLI Digital Collections around 2008, or possibly earlier. This first collection, the "Adele Fay Williams Collection of Drawings and Prints," includes more than 100 examples of artwork by the late-19th- and early-20th-century Joliet artist. Many more digital collections followed in 2011, and the Library has continued to work with departments on campus to digitize institutional and cultural heritage materials since then. By mid-2015, however, nearly all of the staff who had worked on those initial projects had left the institution, and the new Library staff needed to chart a new direction for the digital collections.

The new Library staff established three priorities for the digital collections. First, we needed to create a data dictionary to guide the metadata process. The first case study in this CARLI series covers the importance of the data dictionary. [https://www.carli.illinois.edu/products-services/contentdm/dpla/case_study1] There did not appear to be an existing data dictionary for the Library's digital collections, so this was an important first step in moving away from making metadata decisions on a collection-by-collection basis and towards establishing consistent standards. Second, we needed to review the legacy collections and clean up their metadata by applying the standards established in the data dictionary where needed. Finally, we needed to perform this clean-up with no additional staff or budget. As a result, we focused on small changes that could have a meaningful impact on metadata quality.

Method

The Lewis University Library has two staff members responsible for both digitization projects and metadata—the Head of Library Technology and Technical Services and one full-time Library Technical Assistant. Because of staff and time constraints, we knew we could not address all of the issues in the legacy digital collections, but we tried to identify a small number of collections we could improve as test cases.

Of the more than a dozen existing digital collections, only four were targeted for this initial clean-up process. When we selected the collections, we prioritized several issues:

  • Inconsistent data and formats
    Inconsistent data included data entry errors such as misspellings or lack of capitalization of proper names and instances where creator names were entered in direct order format rather than inverted order. Inconsistent formats included issues such as records where PDFs were identified as JPEGs—or vice versa—in the Format field.
  • Lack of standardization
    A wide variety of encoding schemes exist to provide standardization for data such as languages, dates, geographic regions, media types, and more. CARLI provides documentation on the best practices for digital collections, which served as a starting point. [https://www.carli.illinois.edu/products-services/contentdm/cdm-documentation] We also used the Dublin Core Metadata Initiative documentation to guide our efforts. The legacy digital collections did not consistently utilize existing standards.
  • Use of natural language instead of controlled vocabularies
    In addition to the standards above, we were also interested in consistently applying controlled vocabularies such as Library of Congress Subject Headings and the Library of Congress Name Authority File wherever possible.
  • Unused fields
    We found that many of the legacy digital collections had fields defined in the records that contained no data. If we could not easily identify the original purpose or intent of these fields, we eliminated them.

Fortunately, most of these issues could be resolved without too much effort.

One of the collections identified for this project was the Adele Fay Williams Collection of Drawings and Prints. We selected this collection not just because it was the earliest digital collection created by the Library, but also because the metadata exhibited all of the issues discussed above—inconsistent formats, lack of standardization, use of natural language, and unused fields.

To start, we exported a copy of the metadata as a tab-delimited text file from CONTENTdm. This was important for preservation purposes, so we had a copy of the metadata in case we needed to undo any changes, and it allowed us to see all of the records at a glance and identify areas for clean-up.
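Getting an at-a-glance view of an export like this can also be scripted. The following is a minimal sketch, not the process used in this project, that tallies the distinct values in each column of a tab-delimited export so that inconsistencies stand out. The function name, the sample rows, and the column names are assumptions for illustration; a real export would be read from the file saved from CONTENTdm.

```python
import csv
import io
from collections import Counter


def summarize_fields(tsv_text):
    """Return {field name: Counter of values} for tab-delimited text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    summary = {name: Counter() for name in (reader.fieldnames or [])}
    for row in reader:
        for name in summary:
            # Missing cells come back as None; treat them as empty strings.
            summary[name][(row.get(name) or "").strip()] += 1
    return summary


# Hypothetical sample standing in for an exported file.
SAMPLE_EXPORT = (
    "Title\tCreator\tLanguage\n"
    "Old Flagstone Store\tWilliams, Adele Fay\tEnglish\n"
    "Street scene\tAdele Fay Williams\tEnglish\n"
)

field_summary = summarize_fields(SAMPLE_EXPORT)
for field, values in field_summary.items():
    print(field, dict(values))
```

A column with more distinct values than expected, such as two spellings of the same creator, is a candidate for clean-up.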

The sections below describe, field by field, what we did to improve the metadata quality in the Williams collection:

Creator

The artist’s name had been variously entered both in "Last Name, First Name" format and in direct order format. We changed all instances of the name to follow "Last Name, First Name" format, to conform with LC Name Authorities and best practices for personal names in the creator field.
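For a simple name, the inversion itself is mechanical. The sketch below is one hypothetical way to express it, assuming the surname is the last word of a direct-order name; the function name is an illustration only. It does not handle multi-word surnames, suffixes, or corporate names, and the result would still need to be checked against the LC Name Authority File.

```python
def invert_name(name):
    """Convert 'First [Middle] Last' to 'Last, First [Middle]'.

    Names that already contain a comma, or that consist of a single
    word, are returned with only their whitespace normalized.
    """
    cleaned = " ".join(name.split())
    if "," in cleaned or " " not in cleaned:
        return cleaned
    forenames, surname = cleaned.rsplit(" ", 1)
    return f"{surname}, {forenames}"


print(invert_name("Adele Fay Williams"))
print(invert_name("Williams, Adele Fay"))
```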

Date

Dates in the date field were originally entered as MM/DD/YYYY. All of the dates in the records were updated so they conformed to the Extended Date/Time Format (EDTF) standard of YYYY-MM-DD.
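A date conversion of this kind lends itself to scripting. The following is a minimal sketch, assuming every value is either a complete MM/DD/YYYY date or something that should be left for manual review; the function name and the sample values are hypothetical.

```python
from datetime import datetime


def to_iso_date(value):
    """Convert 'MM/DD/YYYY' to 'YYYY-MM-DD'; return other values unchanged."""
    try:
        parsed = datetime.strptime(value.strip(), "%m/%d/%Y")
    except ValueError:
        # Not a complete MM/DD/YYYY date (for example, already converted
        # or a free-text date), so leave it for a person to review.
        return value
    return parsed.date().isoformat()


print(to_iso_date("05/03/1910"))
print(to_iso_date("circa 1910"))
```

Returning unparseable values unchanged keeps the script from silently discarding partial or approximate dates, which need a cataloger's judgment.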

Dimensions

Dimensions were less of an issue in the Williams collection, but in the other collections targeted for clean-up, the dimensions of the original works had been recorded variously in inches or centimeters, and sometimes both units appeared in the same collection. Using our data dictionary as a guide, we updated this field so that all dimensions are recorded in centimeters.
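The unit conversion is simple arithmetic: one inch is exactly 2.54 centimeters. The sketch below shows how it might be applied to a dimensions string. The input pattern ("8 x 10 in.") is an assumption, since the actual layout of the field is not described here, and the sketch does not handle fractions such as "8 1/2".

```python
import re

CM_PER_INCH = 2.54
_NUMBER = re.compile(r"\d+(?:\.\d+)?")
# An inch unit at the end of the value, directly after the last number.
_INCH_UNIT = re.compile(r"(?<=\d)\s*(?:inches|inch|in\.?)\s*$", re.IGNORECASE)


def inches_to_cm(value):
    """Convert e.g. '8 x 10 in.' to centimeters, rounded to one decimal.

    Values that do not end in an inch unit are returned unchanged.
    """
    if not _INCH_UNIT.search(value):
        return value
    body = _INCH_UNIT.sub("", value)
    converted = _NUMBER.sub(
        lambda m: f"{float(m.group()) * CM_PER_INCH:.1f}", body
    )
    return f"{converted} cm"


print(inches_to_cm("8 x 10 in."))
print(inches_to_cm("20 x 25 cm"))
```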

Language

The data in the Language field was entered as "English," which is a natural language expression. This field was updated to conform to ISO 639-2, so "English" was replaced with the 3-letter code "eng" throughout the collection.
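A replacement like this amounts to a lookup table from language names to codes. The following sketch assumes a small, hand-built table; only the "English" to "eng" mapping comes from this project, and the other entries are included purely as illustrations of ISO 639-2 codes.

```python
# Natural-language names mapped to ISO 639-2 codes.
# Only the English entry reflects the clean-up described above.
ISO_639_2_CODES = {
    "english": "eng",
    "spanish": "spa",
    "italian": "ita",
    "latin": "lat",
}


def to_iso_639_2(value):
    """Return the ISO 639-2 code for a language name, if it is in the table.

    Unrecognized values (including values that are already codes) are
    returned unchanged.
    """
    return ISO_639_2_CODES.get(value.strip().lower(), value)


print(to_iso_639_2("English"))
```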

Rights

The rights field contained the phone number of a specific staff member who no longer worked at the Library. As a result, the rights field was globally updated with a general email address for the Library as the contact for the collection. If we were doing this project today, we might have considered using a statement from http://rightsstatements.org/en/, but when we performed this clean-up in the fall of 2015, the rights statement initiative had not yet launched. This is an area we may have to revisit in order to increase standardization across the digital collections.

Subject

Unauthorized headings were either removed or updated to conform with Library of Congress Subject Headings or FAST terms. Past practice for the digital collections included taking keywords from the constructed title or description of a collection item and using them as subject headings, often subdivided geographically. For example, a drawing in the Williams collection depicting a shopfront originally was assigned the following subject headings:

Williams, Adele Fay; Old Flagstone Store -- Joliet (Ill.) -- History; Flagstone Store -- Joliet (Ill.)-- History; Joliet Street -- Joliet (Ill.); Joliet (Ill.) -- History

A note written in pencil on the scanned print reads, "Old Flagstone Store," but it is not clear whether this note refers to the business depicted or merely the building material of the shop itself. Either way, "Old Flagstone Store" did not appear to be a true proper business name. As a result, this record was updated to the following headings:

Williams, Adele Fay; Joliet (Ill.) -- History
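Deciding which headings are authorized requires a person to check each term against LCSH or FAST, but once that list exists, pruning a record can be expressed in a few lines. The sketch below is an illustration under that assumption, not the method used in this project; the function name and the `authorized` set are hypothetical, and the set here contains only the two headings kept in the example above.

```python
def filter_subjects(field_value, authorized):
    """Keep only authorized headings from a semicolon-delimited subject field."""
    kept = []
    for heading in field_value.split(";"):
        heading = heading.strip()
        # Drop unauthorized headings and exact duplicates.
        if heading in authorized and heading not in kept:
            kept.append(heading)
    return "; ".join(kept)


authorized = {"Williams, Adele Fay", "Joliet (Ill.) -- History"}
original = (
    "Williams, Adele Fay; Old Flagstone Store -- Joliet (Ill.) -- History; "
    "Flagstone Store -- Joliet (Ill.)-- History; "
    "Joliet Street -- Joliet (Ill.); Joliet (Ill.) -- History"
)
print(filter_subjects(original, authorized))
```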

Removing Unused Fields

Finally, we deleted at least two metadata fields in the Williams collection that were defined in the record but unused: an Audience field and an OCLC number field. Since the original staff members who worked on the project were gone, we did not know what the original intention might have been for those fields, but since they contained no data, we decided to eliminate them.
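Fields that are empty in every record can be found automatically from an export. The following is a minimal sketch, assuming each record has been loaded as a dictionary of field name to value; the function name and sample records are hypothetical, with the two empty fields named after the ones described above.

```python
def find_unused_fields(records):
    """Return field names whose value is empty or whitespace in every record."""
    field_names = []
    for record in records:
        for name in record:
            if name not in field_names:
                field_names.append(name)
    return [
        name
        for name in field_names
        if all(not (record.get(name) or "").strip() for record in records)
    ]


# Hypothetical records standing in for rows of an exported collection.
sample_records = [
    {"Title": "Old Flagstone Store", "Audience": "", "OCLC number": ""},
    {"Title": "Street scene", "Audience": " ", "OCLC number": ""},
]
print(find_unused_fields(sample_records))
```

A field flagged this way is only a candidate for removal; whether it had an intended purpose is still a judgment call.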

Conclusion

The Library Technical Assistant performed most of the clean-up tasks unaided. Many of the changes were made manually, one record at a time. Fixing data entry errors and updating the subject fields to remove natural language terms required manual, record-by-record evaluation. Fortunately, none of the digital collections were very large. The largest collection we targeted for this project had only 134 records. With our limited staffing, this project might have been more difficult if we had been working with larger sets of metadata. Even with our modestly sized digital collections, we managed to clean up only four out of at least a dozen legacy collections.

While some of the metadata changes may seem minor, by standardizing data and taking advantage of controlled vocabularies wherever possible, we tried to ensure that our metadata is shareable within the CARLI Digital Collections and interoperable with portals such as the Digital Public Library of America. We would like to update all of the legacy collections, but we currently have no specific timeline to complete this project. At the same time, cleaning up legacy metadata serves as a reminder that digital collections are never finished or perfect. They need to be revisited over and over as standards and technologies evolve.