Deep and wide | National Center for Supercomputing Applications at the University of Illinois
Deep and wide
08.28.12
by Barbara Jewett
An Illinois team uses NCSA resources to help humans understand their environment.
"Everybody is talking about big data nowadays," Shaowen Wang chuckles. "But we've been dealing with big data for years!"
The "we" Wang is referring to is himself and his colleagues at the CyberInfrastructure and Geospatial Information (CIGI) Laboratory, especially research scientists Yan Liu and Anand Padmanabhan. Founded by Wang, CIGI is located at the University of Illinois at Urbana-Champaign, where the trio holds appointments in both NCSA and the geography and geographic information science department.
Geographic information systems (GIS) and related science is a highly interdisciplinary field, encompassing the earth environment, people, and technologies, explains Wang. Citing the "noticeable gap" between high-performance computing computational thinking and geospatial thinking, Wang says CIGI focuses on bringing together the two different realms, merging GIS tools, methods and applications with modern computation. That's why CIGI includes collaborators from multiple institutions that bring a wide variety of expertise to bear on the group's projects, most of which involve massive amounts of data.
"We are experiencing dramatic digital transformations of managing our global environment across many scales, understanding people's behaviors, activities, and how they are interacting with the environment, more than ever before," says Wang. "This is happening across science and engineering. In the geospatial world we see a huge acceleration happening, especially in such things as Google Maps and Google Earth, which just came out around seven years ago. This is now a huge industry together with vibrant science and engineering research. With GIS, we're talking about digital assets, and digital representations of our life experiences, and how to understand those experiences mirroring the real world."
The CIGI dimensions
The CIGI team hopes the research they conduct will soon empower a significant mass of users to have access to more advanced, more customized and more individualized geospatial intelligence services. Their mission is a little bit like Lewis and Clark's, Wang says. The famed explorers' "Corps of Discovery" mission two hundred years ago focused on the scientific and the commercial; CIGI seeks to aid scientific discoveries while also contributing societal benefits.
Supercomputing technologies let the team do detailed, large-scale and multi-scale analysis and modeling that otherwise would not be possible, and also help them create a benefit for the masses. Thus they like to think of their team's work as having both deep and wide dimensions.
Nowadays geospatial data and technologies broadly, and GIS in particular, are widely accessible. People can do simple analysis on their smartphones, says Wang, like find the nearest restaurant.
"This broad dimension to GIS is touching upon a lot of people's lives on the planet," he says. "There are some games like where things are, those kinds of fun things, that help people actually appreciate more and more what this digital world is about."
With a $4.4 million grant from the National Science Foundation (NSF), the CIGI team and their collaborators are working to develop CyberGIS, a comprehensive software framework that harnesses the power of HPC and cyberenvironments and integrates it with data management and visualization for GIS and associated applications.
CyberGIS is focused on scientific problem solving and user-friendliness, and allows anyone interested in GIS (scientist, student, or Grandpa Joe) to access GIS tools. It is a true merger of "deep and wide." Through the GISolve Middleware project, the team develops the middleware necessary for seamlessly gluing software together. The GISolve middleware is getting a good test through the CyberGIS project.
The team is putting together a CyberGIS Gateway prototype. A number of users can simultaneously submit data-intensive computing tasks that run seamlessly on supercomputers, without being exposed to the supercomputers' complexity.
One problem that they've run into is that all the analysis and modeling is conducted using NSF XSEDE resources, e.g., NCSA's Forge. Users' jobs are submitted and must wait in the queue, just like any other job. Academic researchers are used to queue waits, but for a problem that needs information back quickly, the wait could be problematic.
And the wait time may cause a casual user interested in learning more about GIS to become frustrated and lose interest, and most likely not return to the site. For instance, the small job created for me as part of their CyberGIS demonstration was still sitting in the queue 20 minutes later.
Another concern is that Forge is being retired in September, which means the team needs to transfer to another XSEDE resource. Every resource change means an overhead of migrating/porting analysis and computation to a new computing environment, which takes time away from their research efforts. The team's dream is for a dedicated set of cyber resources for the project.
Social media and public health
Social media even gets into the geospatial digital mix. CIGI is using Twitter data to understand the possibility of early detection of flu outbreaks. They use keywords from tweets to get real-time data and see spatiotemporal distributions of possible flu cases and how spreading is occurring, using multiple spatial analytical methods.
When Padmanabhan talks about "who" has flu it is in a very broad, impersonal sense, he notes. "Twitter may be collecting data, but in what we get we may not identify a particular person," he reassures.
Currently there are two projects in the Twitter flu domain. Graduate student Yanli Zhao uses NCSA's Forge to analyze spatial patterns over days, weeks, and months. She looks at where tweets reporting flu originate, hoping to identify flu hotspots. Preliminary results of spatial patterns of flu risk in the United States, generated on a daily basis, indicate the potential of the approach to serve as an early warning and detection system for flu risk. More work is underway to validate the results with other flu reports. The other project looks at the transportation patterns of these people, since in our mobile society disease may spread more rapidly than in the past.
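As a rough illustration of the hotspot idea described above, the sketch below filters geotagged tweets by keyword, bins them into a latitude/longitude grid per day, and flags cells that reach a threshold. It is not the CIGI team's method: the `Tweet` record, the keyword list, the one-degree grid, and the `min_count` threshold are all assumptions made for this example, and real work would use proper spatial statistics rather than raw counts.

```python
import math
from collections import Counter
from dataclasses import dataclass
from datetime import date

# Hypothetical keyword list; the article does not say which keywords are used.
FLU_KEYWORDS = ("flu", "influenza")


@dataclass(frozen=True)
class Tweet:
    """Minimal, assumed record for a geotagged tweet."""
    text: str
    lat: float
    lon: float
    day: date


def mentions_flu(text, keywords=FLU_KEYWORDS):
    """True if any whole word in the text matches a keyword (case-insensitive)."""
    words = (w.strip(".,!?#@:;\"'") for w in text.lower().split())
    return any(w in keywords for w in words)


def grid_cell(lat, lon, cell_deg=1.0):
    """Map a coordinate to the integer index of its grid cell."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def daily_counts(tweets, cell_deg=1.0):
    """Count flu-related tweets per (day, grid cell)."""
    counts = Counter()
    for t in tweets:
        if mentions_flu(t.text):
            counts[(t.day, grid_cell(t.lat, t.lon, cell_deg))] += 1
    return counts


def hotspots(counts, day, min_count=2):
    """Cells whose count on the given day reaches the (assumed) threshold."""
    return sorted(cell for (d, cell), n in counts.items()
                  if d == day and n >= min_count)
```

Matching whole words rather than substrings keeps a word such as "fluent" from counting as a flu report; a raw count threshold is only a stand-in for the multiple spatial analytical methods the article mentions.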
"Today you tweeted you have flu. Are you on the move? Where are you going with this flu symptom and how are you spreading it?" Padmanabhan says, noting that where possible they try to identify from tweets the mode of transportation, such as airplane or car or train, as well as the geographic location.
The projects are further examples of the "deep and wide" dimensions of CIGI's vision. Graduate student Eric Shook uses supercomputers to model large-scale geospatial dynamics at the individual level using flu spread as a case study, which fits with the deeper aspect of data modeling. Padmanabhan says one key aspect still in development is how to link the social media data with other local, regional, and global information.
One day soon public health officials may be able to employ social media-based GIS and spatial analysis tools in their work. In the meantime, a paper the team wrote about their work with social media data will be published in an upcoming book.
The big data
And then there's the big data aspect to GIS.
Probably the CIGI team's most data-intensive project was one they did last year to help model the Midwest portion of the Mississippi River flood of 1927. The CIGI team began with locating, retrieving, and processing 102 gigabytes of raw data from the multi-terabyte data inventory in the U.S. Geological Survey (USGS).
"The project was a very interesting experience," says Wang. "Donna Cox from NCSA's Advanced Visualization Lab (AVL) approached us. She wanted to create data-driven visualizations of the Mississippi River Valley showing the extent of the destructive 1927 floodwaters for two artists. We were intrigued because the original idea Donna had was a digital terrain, and she knew it was going to involve GIS."
The team wanted to understand the terrain from the geographic perspective using digital terrain data, and Cox's group had access to some historical maps, so the CIGI team worked hard to bring them together, says Wang.
"This is what is called geospatial synthesis, meaning you have different sources of geospatial data you are trying to slice and dice," he explains. "In this case it would be reconstructing the difficult environment and also making sense out of that reconstruction, being able to communicate through the art forms to the audience."
That art form was "The Great Flood," a 75-minute multimedia work of original music and film by experimental filmmaker Bill Morrison and guitarist and composer Bill Frisell.
All involved in the movie's production agree the CIGI team's efforts were a success. Morrison's film employs documentary footage from the era, showing people coping with and fleeing from the flooding. Cox's team drew on a 1920s map of the river and the CIGI team's contemporary geospatial data to visualize a journey along the Mississippi, providing the context for those historical images.
Wang is quick to credit Liu as the key person on the technical side when it came to synthesizing the digital elevation model data from USGS.
USGS had the data, but the data is not designed for the purpose the CIGI team had: they wanted access to almost the entire collection of high-resolution data for the Midwest area. And while all the information they needed was available from public government databases, it was not organized for friendly access.
Getting the information from the huge USGS data inventory by hand, dealing with various web sites, formats, resolutions, versions, and the non-trivial downloading operation, required 1.5 weeks of human effort. A CIGI team led by Liu developed an intelligent tool to identify and automate the retrieval of the data into a format that could be used in the visualizations. With this tool, the entire collection of data was retrieved within 12 hours overnight. This tool, called NED Fusion, has been open sourced and contributed back to USGS to help improve data accessibility. The CIGI team also helped to assign coordinates to and re-project the historical map (called rectification in GIS terms) so contemporary data, such as the Mississippi's current channel, could be aligned to it.
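One small piece of automating such a retrieval is working out which elevation tiles cover the area of interest before anything is downloaded. The sketch below is not NED Fusion's code: it assumes, for illustration only, that the data is cut into one-degree tiles, and the `tile_label` naming scheme is made up here rather than taken from USGS.

```python
import math


def tiles_for_bbox(south, west, north, east):
    """List the integer (lat, lon) southwest corners of the one-degree
    tiles that intersect a bounding box given in decimal degrees."""
    tiles = []
    for lat in range(math.floor(south), math.ceil(north)):
        for lon in range(math.floor(west), math.ceil(east)):
            tiles.append((lat, lon))
    return tiles


def tile_label(lat, lon):
    """Hypothetical label for a tile, keyed by its southwest corner."""
    ns = "n" if lat >= 0 else "s"
    ew = "e" if lon >= 0 else "w"
    return f"{ns}{abs(lat):02d}{ew}{abs(lon):03d}"
```

For example, `tiles_for_bbox(36.5, -91.2, 38.0, -89.5)` yields six tiles, which a retrieval script could then loop over to fetch, convert, and merge the data.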
But the average user cannot handle the complexity of locating and retrieving 102 GB of data from a multi-terabyte data inventory. That's why one of the lab's key focus areas is how to make big-data analysis widely available, especially when information is derived from large data sets or from several different sources in different formats with different semantics.
"We want the methods, theories, and tools we are developing in the lab to empower and enable those in the arts, humanities, sciences, and engineering to understand the complex geospatial environment we have, and also how the environment and humans coexist together. Through this digital transformation process we hope to have rigorous scientific principles advanced and applied," says Wang.
Project at a glance
Babak Behzad (Computer Science)
Dan Dong (Geography)
Yizhao Gao (Geography)
Su Yeon Han (Geography)
Hao Hu (Geography)
Heejun Kim (Geography)
Eric Shook (Geography)
Kiumars Soltani (Informatics)
Zhenhua Zhang (Geography)
Yanli Zhao (Geography)
Lauren Blackburn (Art and Design, Informatics)
Julie Carlson (Geography)
Brandon Kleszynski (Communications, Computer Science, Informatics)
Yuhan Li (Industrial Design)
National Science Foundation