03.29.12
by Trish Barker
NCSA is helping humanities and social science scholars analyze troves of data about worlds both real and virtual, shedding light on human behavior.
Across the spectrum, data has gotten BIG.
"If you look at the trend, databases are getting bigger and bigger," says NCSA database architect Dora Cai. While 50 gigabytes would have been considered a large database not that long ago, "now we're talking about terabytes and hundreds of terabytes and even petabytes."
The Virtual Worlds Exploratorium and an ongoing census analysis project are two examples of data-intensive research in the humanities that show how NCSA's infrastructure and staff can help researchers address the challenges of big data.
Interacting in Virtual Worlds
Millions of people around the world play massively multi-player online role-playing games. And as they play, their every action (each time they fight a dragon, buy or sell armor, or talk to another player) is logged by the game, creating a wealth of information about how people interact in these "virtual worlds."
Several years ago, Sony approached researcher Dmitri Williams, then at the University of Illinois and now at the University of Southern California, to see if he could use data gathered from EverQuest II to determine which players were likely to leave the game (and therefore stop paying to play). Williams was also interested in questions about whether in-game behavior correlated with behavior in the real world. Will someone with a violent, aggressive game character be more violent or aggressive in the real world, for example?
Williams teamed with co-principal investigators Marshall Scott Poole (Illinois), Nosh Contractor (Northwestern), and computer scientist Jaideep Srivastava (Minnesota) to investigate a massive collection of game log data from EverQuest II and other games: Dragon's Nest and Chevalier's Romance, which are popular in China, and Denmark-based EVE Online. They call their collaboration the Virtual Worlds Exploratorium (VWE).
The researchers faced several challenges in working with these data:
- Volume: The logs added up to tens of terabytes.
- Security: The data needed to be kept secure and confidential.
- Heterogeneity: The data originated from different sources, was generated by different programs, and was in different formats and even different languages.
The data is housed at NCSA because "they have a lot of experience with large data and with making sure the data is securely handled," Contractor says. And the VWE team worked with NCSA database architect Dora Cai to create an organized database from the "messy" collection of log files.
If you aren't a data-focused researcher or computer scientist, you might miss the significance of that crucial step, but a collection of data isn't a useful database until it has been organized and structured and can be queried. On this project and many others, Cai was responsible for "architecting the solution so the database becomes a useful research tool," says Tim Cockerill, associate project director for XSEDE (the Extreme Science and Engineering Discovery Environment).
"We have all these great data, and we can ask loads of questions about interaction in the space," Contractor says. Some of those questions have addressed group formation (Why do people team up with one another in the game? Do groups form based on similarities, complementary differences, proximity, etc.?) and leadership. The latter is one of the areas of interest to the Army and the Air Force, which have both provided funding for VWE projects. "This might be the best training ground for the kinds of leaders we will see tomorrow," Contractor says.
The researchers have also studied "illegal" transactions in which players sell currency, items, and even high-level characters to wealthier players who want the perks without putting in hours of game play to earn them. As games try to crack down on this behavior, the illicit sellers and buyers adopt new tricks to conceal their actions. One of Contractor's students, Brian Keegan, along with fellow student Muhammad Ahmad, found that the "illegal" networks in the game employ virtually identical strategies to those used by drug traffickers. The researchers also found that people who engage in illegal conduct in the game are more likely to have real-world criminal records.
Unlocking secrets of the Census
Another NCSA big data project involves the real world.
A treasure trove of U.S. Census data is released to the public after remaining confidential for 70 years. The standard practice has been for the Census Bureau to create microfilm images of the millions of paper forms. Companies that cater to genealogy buffs, like Ancestry.com, then hire thousands of people to spend months transcribing the microfilm so the data can be searched and sorted online.
But this April the detailed information on the more than 132 million people who lived in the United States in 1940 will be released in digital format. No more microfilm.
The Census Bureau would like to provide something more usable than 3.8 million JPEG images of census forms, but manual transcription is too expensive, and optical character recognition of the handwritten entries is not accurate enough. So NCSA's Image, Spatial, and Data Analysis group, led by Kenton McHenry, has been working for the past year on a prototype framework using content-based image retrieval to allow people to search the census form images directly. The project is supported by the National Archives and Records Administration.
The framework enables a user to input a handwritten query, either by writing with a stylus or by typing a word that is then rendered in a handwriting font, to search a database of images of handwritten text for potential matches. A computer vision technique known as word spotting then returns the top-ranked results.
While not all will be perfect matches, the system's users will help improve the results over time through a passive form of crowd sourcing. For instance, after searching for "Smith" a user isn't likely to click on results that are not "Smith." The query text entered by the user can be connected to the image results the user selected, allowing the image database to be slowly annotated. Over time, the validated matches can be returned to users rather than relying solely on the word spotting technique.
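The passive crowd sourcing idea can be sketched in a few lines of Python. This is only an illustration, not the ISDA group's implementation: the class name `ClickAnnotator`, the `min_votes` threshold, and the image identifiers are all hypothetical, and the article does not say how many clicks the real system requires before it trusts a match.

```python
from collections import Counter, defaultdict


class ClickAnnotator:
    """Hypothetical sketch: turn search clicks into image annotations."""

    def __init__(self, min_votes=2):
        # Assumed rule: a label is trusted once enough users have agreed.
        self.min_votes = min_votes
        # image id -> how often each query text led to a click on that image
        self.votes = defaultdict(Counter)

    def record_click(self, query_text, image_id):
        """Connect the query a user typed to the result they selected."""
        self.votes[image_id][query_text.strip().lower()] += 1

    def label(self, image_id):
        """Return the most-clicked query text, or None if not yet validated."""
        counts = self.votes.get(image_id)
        if not counts:
            return None
        text, count = counts.most_common(1)[0]
        return text if count >= self.min_votes else None

    def validated_matches(self, query_text):
        """Images whose validated label equals the query text."""
        wanted = query_text.strip().lower()
        return sorted(i for i in self.votes if self.label(i) == wanted)
```

In a sketch like this, a search would first consult `validated_matches` and fall back to word spotting only for images that users have not yet annotated.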
A significant amount of computation is required in order to pre-process the data to allow for the planned word spotting and passive crowd sourcing. The first step is to split the spreadsheet-like Census forms into individual data cells by finding the form lines and fitting a template over the images. Next, each extracted cell must be converted into a numerical feature vector that roughly represents the handwritten contents of that image. A word spotting technique compares the feature vector of the search query (such as a name, like Smith) to the feature vectors of the many, many cells, looking for similarities. To search all 70 billion cell images would be excessively time-consuming and computationally expensive, so a third step groups similar feature vectors and constructs a hierarchy on the data to narrow the search space and return results with reasonable speed.
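The comparison and grouping steps can be illustrated with a small Python sketch. It makes several assumptions the article does not confirm: feature vectors are plain lists of numbers, similarity is Euclidean distance, and the "hierarchy" is reduced to a single level of k-means clusters. The function names (`build_index`, `search`) are invented for illustration.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def assign(vectors, centroids):
    """Place each vector's index in the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)),
                      key=lambda c: distance(vec, centroids[c]))
        clusters[nearest].append(idx)
    return clusters


def build_index(vectors, k, iterations=10):
    """Group similar vectors with k-means (a stand-in for the hierarchy)."""
    # Deterministic start: first vector, then the farthest remaining ones.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors,
                       key=lambda v: min(distance(v, c) for c in centroids))
        centroids.append(list(farthest))
    for _ in range(iterations):
        clusters = assign(vectors, centroids)
        centroids = [mean([vectors[i] for i in members]) if members
                     else centroids[c]
                     for c, members in enumerate(clusters)]
    return centroids, assign(vectors, centroids)


def search(query, vectors, centroids, clusters, top_n=3):
    """Rank only the cells in the cluster nearest to the query vector."""
    nearest = min(range(len(centroids)),
                  key=lambda c: distance(query, centroids[c]))
    ranked = sorted(clusters[nearest],
                    key=lambda i: distance(query, vectors[i]))
    return ranked[:top_n]
```

The design point is that `search` compares the query against `k` centroids plus the members of one cluster rather than against every cell, trading a little accuracy near cluster boundaries for a much smaller search space.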
The team is using an XSEDE start-up allocation to develop their system. An XSEDE Extended Collaborative Support Services team led by NCSA's Jay Alameda has helped the group get optimal performance out of their code, assisting with mapping processes to hardware and with I/O issues. The team has applied through XSEDE for 2 million CPU hours to be used to process the 1940 census records.
These projects were funded by the National Science Foundation, the Army Research Institute, the Air Force Research Lab, and the National Archives and Records Administration.
Virtual World Exploratorium team members
Nosh Contractor (Northwestern)
Marshall Scott Poole (Illinois)
Jaideep Srivastava (Minnesota)
Dmitri Williams (USC)
Census Project/ISDA team members