Social insights

05.28.15 -

by Barbara Jewett

The CyberGIS Center relies on Hadoop for social media analytics.

When people at the CyberGIS Center for Advanced Digital and Spatial Studies say they are going to Hadoop or YARN, they're not breaking into a new dance step or pulling out the knitting needles. They're actually working, and very efficiently, thanks to the ISL 2.0 Hadoop cluster, an exploration and evaluation platform for data management.

"It is not only about data movement, it is about conducting data-driven computation, bringing computation to data," explains Yan Liu, the center's technical coordinator. "Usually in cluster computing we deploy our software, then if we want to run the software to handle data we move the data using some data transfer protocol. When the results are produced we transfer that data back. But using Hadoop we deploy the data first. The Hadoop platform will store the data efficiently using a scale-out model. If today you have 1 TB of data stored on 10 data nodes and tomorrow you have 100 TB, you only need to add data nodes to achieve the same level of query and analytical performance. Then we develop the code, and the Hadoop platform will dispatch it. It knows where our data is and will send the code to those nodes. That saves transfer time and provides data parallelism as well as computing power."
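Liu's description matches the classic MapReduce pattern: the framework ships the code to the nodes that hold the data, each node processes its local block, and partial results are merged by key. A minimal sketch in Python, written in the style of a Hadoop Streaming mapper and reducer (the hashtag-counting task and all function names here are illustrative, not the center's actual code):

```python
from collections import defaultdict

def mapper(lines):
    """Runs on each data node against its local block of tweets:
    emits a (hashtag, 1) pair for every hashtag it sees."""
    for line in lines:
        for token in line.split():
            if token.startswith("#"):
                yield token.lower(), 1

def reducer(pairs):
    """Hadoop groups mapper output by key before the reduce phase;
    the reducer then sums the counts for each hashtag."""
    totals = defaultdict(int)
    for tag, count in pairs:
        totals[tag] += count
    return dict(totals)

if __name__ == "__main__":
    # In a real job each node's mapper sees only its local HDFS block;
    # here one small in-memory list stands in for that block.
    tweets = [
        "Feeling sick today #flu",
        "Stay warm out there #flu #winter",
        "Game night! #winter",
    ]
    print(reducer(mapper(tweets)))
```

In an actual Hadoop Streaming job the mapper and reducer would read from standard input and write tab-separated key-value pairs to standard output; the in-process version above just makes the data flow visible.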

Kiumars Soltani, a PhD student with the CyberGIS Center, likes that easy scalability, along with the fact that all he sees is one file system even though data and code are transparently replicated across several machines. That matters when he's dealing with billions of pieces of social media data, such as the tweets pulled daily from Twitter for the center's social media analytics projects.

The CyberGIS team got serious about Hadoop in the summer of 2013. For more than a year the team had been using Twitter data to explore early detection of flu outbreaks: keywords from tweets provided real-time signals such as time and location, which the team compared against reported influenza outbreaks to see if they could detect a pattern in how flu was spreading. As they began developing FluMapper, a platform to detect the spread of influenza from Twitter data that could potentially serve as an early warning and detection system, they knew they needed a better way to manage the data.
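The keyword-matching step described above can be sketched as a simple filter-and-bucket pass over geotagged tweets. Everything in this sketch is an assumption for illustration — the keyword list, the record layout, and the function name are not FluMapper's actual code:

```python
from collections import Counter

# Hypothetical flu-related keywords; the project's real list is not given.
FLU_KEYWORDS = {"flu", "influenza", "fever", "cough"}

def flu_mentions(tweets):
    """Count flu-related tweets per (day, city) bucket.

    Each tweet is assumed to be a dict with 'text', 'day', and
    'city' fields, standing in for a parsed geotagged tweet.
    """
    buckets = Counter()
    for tweet in tweets:
        words = {w.strip("#.,!?").lower() for w in tweet["text"].split()}
        if words & FLU_KEYWORDS:
            buckets[(tweet["day"], tweet["city"])] += 1
    return buckets
```

A time series of these per-location counts could then be compared against reported influenza case data to look for the early-warning pattern the article describes.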

Soltani says the flu project evolved into a new, broader project called Move Pattern. Now the team is using their Hadoop experience to expand into a broader range of GIS operations and spatial analytics on social media, including studying the spatial diffusion of ideas through Twitter hashtags, dynamic analysis of city neighborhoods based on people's movements, and a health and safety warning system.

"People send billions of tweets every year," says Liu, "and we have to process this large amount of data. Hadoop is a good solution for us and Vlad Kindratenko of ISL is very responsive to any problems we encounter, often resolving our issues in less than a day. He wants to see us succeed in what we do."

Before discovering ISL, the team was using Hadoop on an XSEDE-allocated resource. It was not friendly, says Soltani: every time he wanted to run Hadoop code he had to submit a job, wait for it to be assigned, copy all the data to the Hadoop Distributed File System (HDFS), and then run his code. That was very time-consuming and slowed the research process, particularly because he had to copy the data back and forth on each use.

"One huge advantage of the NCSA ISL cluster," he says, "is its instant availability, which enables us to run our experiments quickly, and having the data hosted on HDFS permanently. This advantage helped to dramatically speed up our development and research, and for that we are very thankful to Vlad and all the people who helped set up this cluster."

In fact, says Liu, the ISL Hadoop experience was so beneficial that a portion of a new cluster the CyberGIS Center is building and deploying through a Major Research Instrumentation grant from the National Science Foundation will be dedicated to Hadoop and data-driven geospatial analytics and visualization research.