They love a challenge

04.15.09

University of Illinois engineers use NCSA resources to score in an international data retrieval competition and to advance automatic speech and video recognition.

Apparently, Mark Hasegawa-Johnson and I are an object lesson in the challenges of automated speech recognition, his field of expertise. As we discuss his work, we frequently throw around the name "Barack Obama" as a phrase that might be parsed by the algorithms and software that his team creates. Just a name in the news that he can use to illustrate points to me.

We talk about the way that subunits in human speech are turned into sounds and how those sounds form words. We talk about how challenging it is to build an acoustic model of that process. And we talk about training a computer to use that model to identify words when accents vary so widely and when languages have a host of sounds that appear with varying frequency or not at all in other languages.

All the while, one of us is saying Bear-ack while the other is saying Buh-rock.

"The traditional model [for training a speech recognition system] is to provide 300 hours [of speech] exactly like what you're testing on and test it only on that. That's almost licked. You can get greater than 90 percent on a medium-sized vocabulary," Hasegawa-Johnson explains. "But if you change the conditions..."

His research team, made up of graduate students in electrical and computer engineering at the University of Illinois, works on the problems that surround those changed conditions. With colleague Tom Huang's students, they tackled the Star Challenge, a data retrieval competition hosted by Singapore's Agency for Science, Technology, and Research in October 2008.

Nearly a year of planning, algorithm design, and software development went into the Star Challenge's three preliminary rounds and grand final, along with nearly 50,000 hours of compute time on NCSA's recently retired Tungsten supercomputer.

Members of the Illinois team included Yuxiao Hu, Dennis Lin, Xiaodan Zhuang, Jui-Ting Huang, Xi Zhou, and Zhen Li.

The teamwork was exceptional among these electrical and computer engineering students. "My students work together but not to this extent. It was a very intense experience," Huang told Synergy, a publication that highlights work done by faculty at Illinois' Beckman Institute.

The challenge

The Star Challenge tested their speech and video recognition systems' abilities under conditions that varied not only in terms of accent but also in terms of language. For the final round, for example, they had to find eight 30-second snippets of video and audio in about 15 hours of footage from Singaporean television news. The footage was in English, Mandarin, Malay, and Tamil, the four official languages of the Southeast Asian city-state. Which languages would be used wasn't revealed until the competition was under way.

The team's speech recognition software considered that footage in 10-millisecond segments, comparing the segments to a dictionary of words and trying to identify snippets that matched the terms the team had been tasked with finding. It also kept track of as many as 200 preceding words that it had already identified and used that information to improve the likelihood of getting a correct match. Prior words provide hints about what a word or sound might be: in a given language, particular sounds tend to follow other sounds and certain words tend to go together.
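To make those two ideas concrete, here is a minimal Python sketch. It is not the team's software. It assumes a toy training text, made-up acoustic scores, and a model that looks back only one word, where the article describes a history of as many as 200 words. The names frame_count, train_bigrams, and pick_word are invented for illustration.

```python
import math
from collections import Counter, defaultdict

FRAME_MS = 10  # segment length mentioned in the article


def frame_count(hours, frame_ms=FRAME_MS):
    """Number of fixed-length segments in a recording of the given length."""
    return int(hours * 3600 * 1000 // frame_ms)


def train_bigrams(sentences):
    """Count which word follows which in a (toy) training text."""
    counts = defaultdict(Counter)
    vocab = set()
    for sentence in sentences:
        words = sentence.lower().split()
        vocab.update(words)
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts, vocab


def pick_word(prev_word, acoustic_log_scores, counts, vocab_size):
    """Pick the candidate with the best combined score.

    The combined score is the acoustic log score plus an add-one
    smoothed bigram log probability given the previous word.
    """
    following = counts.get(prev_word, Counter())
    total = sum(following.values())
    best_word, best_score = None, -math.inf
    for word, acoustic in acoustic_log_scores.items():
        context = math.log((following[word] + 1) / (total + vocab_size))
        score = acoustic + context
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Hypothetical training text and acoustic scores, for illustration only.
counts, vocab = train_bigrams([
    "barack obama spoke today",
    "president barack obama visited",
    "the governor of alabama spoke",
])
# Assume the acoustics alone slightly favor the wrong word.
candidates = {"obama": math.log(0.4), "alabama": math.log(0.6)}

print(frame_count(15))                                        # 5400000
print(pick_word("barack", candidates, counts, len(vocab)))     # obama
print(pick_word("yesterday", candidates, counts, len(vocab)))  # alabama
```

About 15 hours of footage works out to roughly 5.4 million 10-millisecond segments. In the toy example, knowing that the previous word was "barack" is enough to overturn an acoustic score that leans the wrong way; with an unfamiliar previous word, the acoustic score decides on its own.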

For the qualifying rounds, those comparisons meant running more than a thousand jobs at a time on Tungsten over the course of 48 hours. The team used the best algorithms it could find, combining existing software with newly written code and in-house approaches with techniques developed elsewhere.

The result was a multilingual speech recognition system. It can search media in different languages based on the sounds that are produced instead of being focused on a single language's particulars. For the Star Challenge, the Hasegawa-Johnson team trained it in 12 languages.
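One way to picture searching by sound rather than by a single language's words is to treat both the query and the recording as sequences of sound labels and look for the closest stretch. The Python sketch below is an illustration under that assumption, not a description of the team's system. The single-letter sound labels and the best_match function are hypothetical, and a real recognizer would work with probabilities rather than exact labels.

```python
def best_match(query, stream):
    """Find where in `stream` the sounds in `query` match most closely.

    Uses edit distance (substitutions, insertions, deletions) in which a
    match may start anywhere in the stream. Returns (end, distance): the
    match ends just before position `end` of the stream, and `distance`
    is the number of edits needed.
    """
    # Row for the empty query: starting anywhere in the stream is free.
    prev = [0] * (len(stream) + 1)
    for i, q in enumerate(query, 1):
        cur = [i] + [0] * len(stream)
        for j, s in enumerate(stream, 1):
            cost = 0 if q == s else 1
            cur[j] = min(prev[j - 1] + cost,  # match or substitute
                         prev[j] + 1,         # query sound missing
                         cur[j - 1] + 1)      # extra sound in stream
        prev = cur
    end = min(range(len(prev)), key=prev.__getitem__)
    return end, prev[end]


# Hypothetical sound labels: the query and the recording differ by one
# vowel, much as two speakers might pronounce the same name differently.
query = ["b", "a", "r", "a", "k"]
stream = ["s", "i", "b", "a", "r", "o", "k", "t"]

print(best_match(query, stream))  # (7, 1)
```

The search still finds the name in the stream, one substitution away, even though the two pronunciations do not match sound for sound.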

The payoff

After leading throughout the qualifying rounds, the Illinois team ultimately came in third. But Hasegawa-Johnson expects that the speech recognition system built for the Star Challenge will serve as a testbed for their work for the next few years.

"Language-independent speech recognition is something we have less of a handle on," Hasegawa-Johnson says. But that's the direction his team is headed.

Language-independent speech recognition systems move researchers away from systems that are trained for a single language. That flexibility, in turn, can be used to improve the operation of systems that today are easily tricked by varying accents or noise, like automated telephone systems that ask you to say the name of the party you are trying to reach or to say "two" for Spanish.

Their work is also being applied to speech recognition software for people with disabilities. A study under way in Hasegawa-Johnson's group considers software for those with cerebral palsy, who often can't use keyboards because of the condition's impact on their muscles and motor control and whose speech is often slurred. The researchers are using data from 20 speakers with cerebral palsy and similar conditions, statistically modeling both the small and large impacts these conditions can have on speech.

By creating a dictionary of words customized to a speaker with cerebral palsy, researchers can improve the performance of contemporary speech recognition systems. One participant in the study, for example, is understood only six percent of the time by a human listening to her and transcribing what she says. With an automated speaker-independent system, accuracy drops to two percent. With a custom dictionary trained on her speech, accuracy rockets to 75 percent for the automated system, according to findings from the Hasegawa-Johnson team.

The problem is that custom dictionaries are expensive, time-consuming, and difficult to create. Language- and speaker-independent speech recognition systems may someday overcome that hurdle, understanding or automatically adapting to the nuances of that speech.

"We would like to make it possible to start from a speaker-independent system and adapt it gradually—or with relatively little training data—to the speech of a talker with cerebral palsy. Similar to the way that dictation software usually adapts to the speech of talkers with less unique characteristics, but using algorithms that are aware of the particular kinds of distortions that can happen in the speech of talkers with cerebral palsy," Hasegawa-Johnson says. "It would, I think, give us the chance to improve accuracy. A lot, I hope."

For more information, go to http://www.ifp.uiuc.edu/~hasegawa/.

This project was funded by the National Science Foundation and the National Institutes of Health.

Team members
Mark Hasegawa-Johnson
Yuxiao Hu
Jui-Ting Huang
Tom Huang
Heejin Kim
Zhen Li
Dennis Lin
Harsh Sharma
Xiaodan Zhuang
Xi Zhou