Streamlining Word-Based Classification with AI

In my role as a librarian, I took on a data science project to streamline the cataloging process for our diverse team. Using scikit-learn in Python, I trained a model on a large dataset of nonfiction library items, with titles drawn from many subject areas.

The model predicts the most suitable area of the library from an item’s title alone. This gave our catalogers a useful starting point, which was particularly helpful for those who weren’t native English speakers.

In the US Marine Corps Libraries, we use a word-based classification system known as Semper FindIt! Unlike the widely used Dewey Decimal or Library of Congress Classification, this system takes a modular approach. Items are categorized into specific “Neighborhoods” such as Cooking, History, Social Science, Biography, Education, Test Prep, Language, and many others. These Neighborhoods can then be placed in the library as needed so that patrons can browse and discover content intuitively, as in a bookstore. Dewey and LoC require items to stay in a rigid sequence that spans the entire collection, but Semper FindIt! allows sections that make sense together, such as Language and Travel, to be shelved near each other.

Semper FindIt! is nearly fully implemented in the libraries, so most items already have a Semper FindIt! classification. These existing classifications amount to a large labeled dataset, which made the problem a natural fit for supervised machine learning: the model learns from titles that catalogers have already assigned to Neighborhoods and reproduces those patterns for new items. Pairing traditional cataloging with machine learning in this way became the foundation of our effort to improve cataloging efficiency and accuracy.
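As a rough illustration of this kind of supervised setup, here is a minimal sketch of a title classifier in scikit-learn. The toy titles, the TF-IDF features, and the logistic regression classifier are all assumptions made for the example; this post does not describe the actual features or algorithm used in the project, and a real model would be trained on the full catalog export rather than a handful of rows.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: titles paired with the Neighborhood a
# cataloger already assigned. Real data would come from the catalog.
titles = [
    "The Joy of Cooking",
    "Mastering the Art of French Cooking",
    "Quick Weeknight Cooking for Families",
    "A History of the Roman Empire",
    "The History of the Second World War",
    "A Short History of Medieval Europe",
    "Spanish Language Grammar Made Simple",
    "Learn the Japanese Language in 30 Days",
    "French Language Phrases for Beginners",
]
neighborhoods = ["Cooking"] * 3 + ["History"] * 3 + ["Language"] * 3

# Assumed design: turn each title into TF-IDF word weights, then fit a
# multiclass linear classifier on top of those weights.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(titles, neighborhoods)

# Suggest a Neighborhood for a title the model has not seen before.
prediction = model.predict(["Italian Cooking at Home"])[0]
print(prediction)
```

Wrapping the vectorizer and classifier in one pipeline means a raw title string can be passed straight to `predict`, which keeps any downstream tool simple.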

I built the model to help with new items that lack clear classification cues. Approximately 20% of our items fell into this category, yet they consumed a disproportionate 80% of catalogers’ time. For these items, the model’s predictions offered guidance where traditional sources, such as other libraries’ classifications or cataloging-in-publication data, were absent.

To make the model easier to use, I took the project a step further and built a lightweight Flask app that makes it available on the web. This made the model easy to fit into our workflow and gave catalogers a user-friendly tool that significantly reduced the time spent on initial classification.
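A web wrapper of this kind can be quite small. The sketch below shows one way a Flask app might expose a title classifier over HTTP. The `/predict` route, the JSON field names, and the tiny inline model are hypothetical choices for illustration, not a description of the deployed app; in practice the trained model would more likely be loaded from disk than fitted at startup.

```python
from flask import Flask, jsonify, request
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def build_model():
    """Fit a stand-in classifier on a few hypothetical titles."""
    titles = [
        "The Joy of Cooking",
        "Quick Weeknight Cooking for Families",
        "A History of the Roman Empire",
        "The History of the Second World War",
    ]
    labels = ["Cooking", "Cooking", "History", "History"]
    model = make_pipeline(
        TfidfVectorizer(stop_words="english"),
        LogisticRegression(max_iter=1000),
    )
    return model.fit(titles, labels)


app = Flask(__name__)
model = build_model()


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body such as {"title": "Italian Cooking at Home"}.
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        payload = {}
    title = str(payload.get("title", "")).strip()
    if not title:
        return jsonify(error="Provide a non-empty 'title' field."), 400
    # Convert the NumPy string to a plain str so it serializes cleanly.
    neighborhood = str(model.predict([title])[0])
    return jsonify(title=title, neighborhood=neighborhood)
```

Assuming the code is saved as app.py (a hypothetical file name), it can be started locally with `flask --app app run`, and a cataloger-facing form or script can then POST a title to `/predict` and show the suggested Neighborhood.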

This approach sped up cataloging and made our team’s work more efficient and collaborative. By applying data science to an everyday cataloging problem, we gave our catalogers a reliable tool that supports a more streamlined and effective workflow, whatever their first language.