Word search

10.20.15

by Trish Barker

Social science researchers tap advanced computing and data techniques and resources to utilize thousands of research documents.

Ruby Mendenhall is an associate professor in Sociology, African American Studies, Urban and Regional Planning, and Social Work at the University of Illinois at Urbana-Champaign. Her research focuses on issues of race and social inequality, and she uses both quantitative and qualitative methods to analyze data.

The story of Mendenhall’s collaboration with NCSA exemplifies the center’s philosophy of transdisciplinary convergence—NCSA aims to be a hub where faculty and students from different disciplines, NCSA staff with diverse expertise, and digital resources can unite to tackle challenging problems in new ways, yielding unprecedented results.

The social sciences aren’t necessarily the first use cases that leap to mind when someone mentions supercomputers and big data, but one of NCSA’s six thematic focus areas is Culture & Society.

“There are many new opportunities that advanced computing and data techniques and resources can provide in the social sciences, humanities, and the arts, whether it’s understanding social aspects of successful collaboration, pioneering new ways for audiences to interact with performances or developing tools to analyze massive quantities of image data,” says Gabrielle Allen, NCSA associate director for research and education. “And there are complex challenges in other disciplines that can really benefit from collaborations with researchers in the social sciences, arts, and humanities.”

Not long after joining the Illinois faculty in 2006, Mendenhall attended a talk by Kevin Franklin, the executive director of the Illinois Institute for Computing in Humanities, Arts and Social Sciences (I-CHASS). She was intrigued by Franklin’s descriptions of I-CHASS projects that help scholars in the humanities and social sciences leverage digital tools and saw potential applications to her own research. She filed the information away.

I-CHASS is a key component of NCSA’s focus on Culture & Society. The center helps connect and support scholars in the humanities, arts, and social sciences to enable them to use digital techniques and resources to spur their research.

In 2013, Mendenhall applied for and was awarded an NCSA Fellowship. The Fellowship program supports joint research and development projects that pair U of I faculty and researchers with staff, faculty, and students at NCSA. The goal is to catalyze long-term collaborative projects.

Cutting through the words

Mendenhall used her fellowship to support her effort to apply text mining techniques, topic modeling, and data visualization to determine key concepts and relationships in documents available through JSTOR and the HathiTrust Digital Library. Her goal was to “capture the nuances of African American women’s experiences and their efforts to negotiate and maximize resources in their everyday lives.”

“The traditional humanities method has been to just go out and read everything you want to talk about. But if you’re talking about thousands or tens of thousands of volumes, you can’t read that. Text mining is a good way to go through something that is beyond the human capacity to read and make some sort of valuative or discursive judgment,” says project collaborator Mike Black, until recently the associate director of I-CHASS and a post-doctoral research associate (now an assistant professor at the University of Massachusetts).

Topic modeling, Black explains, “looks for patterns of word distribution within and across documents. To get even simpler, it looks for words that are likely to appear together in segments of text—usually a page or a paragraph. The idea being that when we talk about some topics, we use the same words over and over again, and we don’t use other words.”
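The intuition Black describes, words that tend to appear together in the same segment of text, can be illustrated with a minimal Python sketch. This is only a simplified co-occurrence count, not a topic model and not the algorithm the team ran. The function name `cooccurrence_counts`, the stopword list, and the toy segments are all hypothetical and made up for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(segments, stopwords=frozenset()):
    """Count how often each pair of distinct words shares a segment."""
    pair_counts = Counter()
    for segment in segments:
        # Treat each segment (a page or paragraph) as a set of lowercase words.
        words = {w.strip('.,;:!?"()').lower() for w in segment.split()}
        words = {w for w in words if w and w not in stopwords}
        # Sort so each pair is stored under one canonical ordering.
        for pair in combinations(sorted(words), 2):
            pair_counts[pair] += 1
    return pair_counts

# Made-up toy segments for illustration only; not from the project's corpus.
toy_segments = [
    "The women formed a club and elected members.",
    "Club members joined the league.",
    "The river flooded the valley.",
]
counts = cooccurrence_counts(toy_segments, stopwords={"the", "a", "and"})
print(counts.most_common(3))
```

In this toy input, "club" and "members" share two segments, while words from the unrelated third segment never pair with them. A real topic model builds on the same signal but fits a probabilistic model over the whole collection rather than counting pairs directly.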

First, the team used the Illinois Campus Cluster to test different techniques and develop their analysis workflow. Because I-CHASS is an investor in the Campus Cluster program, it has guaranteed access to computing power, which can help projects like Mendenhall’s launch more quickly and easily.

As the team scaled up to analyzing 6,000 documents, they needed greater computational power, so Mendenhall obtained a small start-up allocation through the National Science Foundation’s Extreme Science and Engineering Discovery Environment (XSEDE) program. This also allowed the team to tap into XSEDE’s Extended Collaborative Support Services (ECSS), including help from Sergiu Sanielevici at the Pittsburgh Supercomputing Center and Drew Schmidt at the National Institute for Computational Sciences.

“ECSS support is invaluable, especially for people who don’t have a tech background,” says Black. “Sergiu was very instrumental in helping us negotiate with HathiTrust, assuring them about the security of the data.” Schmidt, for his part, tested various software options, determined that the Mahout library was much too slow, and helped the team decide to use MALLET instead.

Viewing results

Data visualization was also an important component of the project. Mendenhall worked with Mark Van Moer, a senior visualization programmer at NCSA who says he has enjoyed the challenges of working with textual data.

“I had been doing work with a lot of physical data, data that have a real-world analog, like fluids and molecules,” he says. “This is a bunch of text. The raw data is text, and the analyzed data is more text. But they don’t have any spatial meaning. So it gives you a lot of freedom—how do you want to arrange these items in an image?”

The pilot analysis yielded 100 “topics,” with each topic consisting of a list of words that the model grouped together. Topic 79, for example, clustered words including “women,” “members,” “club,” “association,” and “league.” Van Moer visualized this topic and other topics both with a word cloud, in which the size of each word indicates its frequency, and with histograms that tracked a topic’s emergence over time. The latter clearly showed Topic 79 emerging and growing in prominence in the 1910s and 1920s.
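The two views described above, word size scaled by frequency and a topic's prominence tracked over time, can be sketched roughly as follows. This is a hedged illustration, not Van Moer's actual pipeline: the function names, the linear font scaling, the per-decade averaging, and every number in the toy data are assumptions invented for the example.

```python
from collections import defaultdict

def topic_share_by_decade(doc_years, doc_topic_weights, topic_id):
    """Average weight of one topic across documents, grouped by decade."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for year, weights in zip(doc_years, doc_topic_weights):
        decade = (year // 10) * 10
        # A document that lacks the topic contributes a weight of zero.
        totals[decade] += weights.get(topic_id, 0.0)
        counts[decade] += 1
    return {d: totals[d] / counts[d] for d in sorted(totals)}

def font_sizes(word_freqs, min_pt=10, max_pt=48):
    """Scale word frequencies linearly to font sizes for a word cloud."""
    lo, hi = min(word_freqs.values()), max(word_freqs.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts match
    return {w: min_pt + (f - lo) * (max_pt - min_pt) / span
            for w, f in word_freqs.items()}

# Made-up toy data for illustration only; not results from the project.
years = [1895, 1912, 1918, 1925]
weights = [{79: 0.0}, {79: 0.2}, {79: 0.4}, {79: 0.5}]
shares = topic_share_by_decade(years, weights, topic_id=79)
sizes = font_sizes({"women": 5, "club": 3, "league": 1})
print(shares)
print(sizes)
```

Plotting `shares` as bars, one per decade, gives the kind of over-time chart the article describes, and `sizes` could feed any drawing routine that places words on a canvas.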

The team concluded that Topic 79 appeared to be related to the black women’s club movement of the early 20th century, in which black women created reform-minded organizations that worked for societal improvement. The club movement was already a well-documented part of history, but seeing it emerge from the topic modeling was a strong “proof of concept,” demonstrating that this technique could confirm known patterns.

“Even if this method just confirms things we already know, it shows the method is valid,” Black says. And Mendenhall believes this technique can help reveal new themes and uncover new connections.

“You can use big data to try to capture the erasure of black women’s history,” she says. “You have this corpus of data, but it’s not neutral. There have been these forces that have suppressed certain voices. And now theoretically, you can analyze the corpus and show what’s missing.”

Profound impact

Nicole Brown says this project had a profound impact on the trajectory of her research as a sociologist. She began working on the project as a PhD student and, having recently received her PhD, will begin a post-doc position at NCSA.

“This project definitely changed my view on what sociological research could be,” she says. “The project allowed me to learn about computational analysis generally, and topic modeling specifically, and because it was an interdisciplinary group, I was exposed to the terminology related to coding, algorithms and visualizations. I had not previously considered the implications of big data on my own research, or the ability to make meaning of large corpora without doing close readings. It changed the scope of my research questions and shaped my vision of what my research could be. I was surprised to learn how under-utilized these tools are within my discipline,” she added, “and I hope to rectify that. I predict sociology will soon embrace topic modeling and other computational tools on levels previously unimagined.”

To continue scaling up the research, Mendenhall applied for and received a full XSEDE allocation on the Blacklight system at the Pittsburgh Supercomputing Center, starting in spring 2015.

For other researchers in the humanities and social sciences—and in other fields—Mendenhall’s story shows that “if faculty are interested in this sort of thing, they can find a partner here to help them get started,” says Black. “The pathway and the resources are here.”

“I see Illinois as a unique place to cross disciplines and take advantage of expertise from different areas…to be able to go across campus and get people with other expertise, and have an environment where you can sit together and ask questions,” says Mendenhall.

Brown hopes to help more scholars discover the path to NCSA. “I think the humanities and social sciences have a lot to gain from partnering with NCSA, and NCSA has a tremendous amount to gain as well by engaging the campus [beyond the traditional STEM areas],” she says. “I hope to contribute to that innovation through my research and outreach.”