Modeling the Massive HathiTrust Corpus: Creating Concept-Based Representations of 15 Million Volumes

J. Stephen Downie
College: School of Information Sciences
Award year: 2017-2018

The goal of this project is to make the HathiTrust book collection available for large-scale research use through optimized, concept-based representations. The massive HathiTrust corpus, containing 15 million books spanning multiple centuries, provides invaluable raw material for learning about and modeling historical, cultural, linguistic, scientific, and structural trends. With the assistance of NCSA, the HathiTrust Research Center (HTRC) is now uniquely positioned to train and evaluate reduced-dimensional term-topic matrix models for generalized use, inferring the implicit patterns of word co-occurrence in different languages. The resulting term-concept data matrix will be made openly available, and we plan to integrate the generative model into NCSA's Brown Dog project framework for use with new, non-HathiTrust text collections.