New technology for old text

09.21.10

by Vince Dixon

Computing time from I-CHASS and NCSA helps scholars improve digitization of 18th century text.

From the works of Fielding to Austen, many pieces of 18th century literature are still preserved today. While modern editions of such classics can be found in libraries or bookstores, the 18th century originals are sometimes stored for safekeeping.

One way to provide copies of the text to the public is optical character recognition (OCR), software that converts scanned pages into machine-readable digital text. Current OCR software works well with most modern print, but converting scanned copies of pre-19th century works into plain text (ASCII) poses a challenge to scholars. Lines of 18th century type are not as straight, and the ink is not as clear. As a result, some symbols show up distorted in the plain text copy, with a typo about every seven words.

Inspired by the success of NINES, an online database of digital 19th century resources for scholars, Laura Mandell, professor of English literature at Miami University, felt a similar space was needed for 18th century literature. Through a multi-university collaboration, Mandell and a group of researchers created 18thConnect.

"It is a peer-reviewing community and online finding aid in the same way that NINES is," Mandell said. "This is basically a new form of cyberinfrastructure for the humanities that has been built from the ground up."

At 18thConnect.org, scholars will be able to search and tag texts from Eighteenth Century Collections Online (ECCO), a digital library created by Gale Cengage Learning that stores over 150,000 scanned and digitized texts. Users can then interact over the site by creating personal pages, participating in user groups, and submitting work for peer review.

While users will be able to search and sift through the collection of 18th century writing, a problem occurs when preserving the scanned works, Mandell said.

"You think 'oh they're all digitized as PDFs; they're all preserved,' but they're not because PDF is, of course, a fungible format—it's a format of our moment, but everybody knows how fast technology changes," she said. "So what we need to do is digitize it as plain text like ASCII, text that has been encoded ultimately (at bottom) as zeros and ones.

"That's the only way we can be sure that it will be transmitted safely into the future," Mandell said.

This process is especially difficult for 18th century text, she said. Prior to the adoption of mathematical precision in typesetting during the 19th century, printing produced off-centered text that is poorly read by current OCR programs. OCR software typically requires letters to be in exact relation to one another on a line, Mandell said. If not, the result is a flawed plain text copy of the piece. "And you can see this on Google when you click on plain text for anything before 1800," she said.

The solution: gradually train open-source OCR software to raise character recognition rates from 90 percent to nearly 100 percent.
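A character recognition rate can be made concrete by comparing OCR output against a hand-corrected transcription. The sketch below is illustrative only and is not the 18thConnect team's method: it assumes accuracy is defined as one minus the edit distance divided by the length of the correct text, and the function names and the sample strings are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the fewest single-character insertions,
    deletions, and substitutions that turn hypothesis into reference."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - (edit distance / length of the correct text)."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical example: a correct line and an invented OCR misreading of it.
correct_text = "the history of tom jones"
ocr_output = "tbe hiftory of tom iones"
print(character_accuracy(correct_text, ocr_output))
```

Under this definition, three misread characters in a 24-character line give an accuracy of 87.5 percent, which shows how quickly a handful of distorted letters pulls a page below the 90 percent mark.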

Robert Markley is a professor of English at the University of Illinois at Urbana-Champaign and director of development for 18thConnect. Markley and his team used over 25,000 hours of computing time from I-CHASS and NCSA to run thousands of tests against various OCR programs, training the software to recognize off-center, misaligned, or poorly reproduced type from the 18th century.

"The only way to do this, really, is with pure supercomputing power," Markley said. "[This] part of 18thConnect is to get us to a point where all the material can be ingested into sophisticated software environments."

"Supercomputing allows humanities researchers to focus on the most important problems in their fields without limiting the scope of their work," said Kevin Franklin, executive director of I-CHASS. "The 18thConnect project illustrates how easy access to this type of resource allows scholars to apply advanced computing to their work and explore new paradigms of research."

With the help of I-CHASS and NCSA, the team is on the right path toward developing better OCR software for 18th century text, Markley said.

"Ultimately, the hope would be that by making the OCR software more robust, more flexible, more accurate, it will have other uses beyond simply looking at 18th century text," he said. "We'll be able to import it backwards and deal with books published before 1700 in England and there, you're dealing with a greater frequency of misreads by off-the-shelf OCR software."