Mimicking human intuition

released 12.02.08

NCSA's Automated Learning Group applies data mining

For more than 20 years, NCSA has been developing sophisticated software to identify undiscovered patterns and relevant information in large multi-modal data sets. Today, the center's Automated Learning Group (ALG) collaborates with researchers and scholars in academia, industry, and government to invent new approaches and tools that will become the basis for new discoveries, improved processes, and in some cases, new commercial developments. The group has many successes, including:

  • SEASR, Software Environment for the Advancement of Scholarly Research, a software engineering project funded by the Andrew W. Mellon Foundation to aid those in humanities with extracting and analyzing information stored in incompatible formats;
  • D2K, Data to Knowledge, a rapid, flexible data mining and machine learning system that integrates analytical data mining methods with data and information visualization tools;
  • Evolution Highway, a set of D2K components developed in collaboration with University of Illinois' Institute for Genomic Biology for mammalian genome comparative analysis;
  • DISCUS, Distributed Innovation and Scalable Collaboration in Uncertain Settings, a combination of information technologies to support collaboration and the integration of multiple data sources; and
  • RiverGlass, a company started by members of the ALG to turn the D2K software into a commercial venture. ALG members also founded One Llama Media, which develops acoustic analysis, cultural analysis, and collaborative filtering tools for music and video navigation, discovery, and search.

    By Barbara Jewett

    Retailers commonly use computers to gather and analyze consumer data, but the practice is not widespread in the law enforcement arena owing to privacy issues. A Rutgers University team is using NCSA's Abe and Cobalt to develop privacy‑enhancing data analytics tools to aid law enforcement and fight terrorism.

    Remember the date you last bought a new pair of jeans or a bag of potato chips? No? Don't worry, your favorite stores can tell you.

    From grocery stores to call logs to real estate records at the courthouse, there is more information being collected about us than ever before. Some find this information gathering an abuse of the Bill of Rights while others view portions of it as self-protective action in a terror-filled world.

    Enter data analytics expert William M. (Bill) Pottenger of Rutgers University. He's developing technology that can be used to help keep people safe without violating established rules of privacy, thus protecting the American ideals of democracy and independence. His team is using NCSA resources to develop tools that can be used in law enforcement and counterterrorism investigations, allowing investigators to ferret out potentially useful information for follow-up. The technology works in a distributed environment and keeps a human user in the loop. Pottenger calls it privacy-enhancing higher-order knowledge discovery.

    His methods would have been a big help to the U.S. Drug Enforcement Administration (DEA) a few years ago. In 2003, the DEA joined forces with the Royal Canadian Mounted Police to investigate the production and sale of methamphetamine in North America. After 18 months of investigation, 67 people in 10 cities were arrested. How was the case cracked? DEA agents spent months gathering information from various sources and entering it into a database, says Pottenger, then searched it for connections linking addresses, phone numbers, and names, and followed up on the leads. The agents' months-long investigation method was a manual version of what the Pottenger team's higher-order knowledge discovery algorithm, DI-HOPE KD, does in minutes on a computer. And though the research sounds complicated, the work can be easily explained.

    Think shopping, says Pottenger. Stores track our purchases by payment method or by gathering information every time we scan the store's discount card. The stores then do data analytics to determine purchasing patterns, like buying potato chips and soft drinks together. Retail databases usually have the information in a spreadsheet-like format, making it easily searchable.
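
    The retail case can be sketched in a few lines of Python. This is a minimal illustration of pair counting and rule confidence, the basic ingredients of association rule mining, and not any retailer's actual system; the baskets and the function names `pair_support` and `confidence` are made up for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; each set is one checkout transaction.
baskets = [
    {"chips", "soda", "salsa"},
    {"chips", "soda"},
    {"jeans", "socks"},
    {"chips", "salsa"},
    {"soda", "chips", "jeans"},
]


def pair_support(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


def confidence(transactions, antecedent, consequent):
    """Estimate the share of baskets with `antecedent` that also hold `consequent`."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    both = sum(1 for t in with_antecedent if consequent in t)
    return both / len(with_antecedent)


support = pair_support(baskets)
top_pair, top_count = support.most_common(1)[0]
print(top_pair, top_count)                    # ('chips', 'soda') 3
print(confidence(baskets, "chips", "soda"))   # 0.75
```

    Here chips and soda turn up together in three of five baskets, and three of the four chip buyers also bought soda. That kind of signal is easy to find because every record has the same tidy shape.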

    The DI-HOPE KD algorithm is unique in that it can find associations between items in databases or text documents that are often a hybrid of organizational styles, from printed documents and reports to spreadsheet-like formats to news archives; in fact, a data style doesn't need to be specified. This allows the analysis to go a step beyond recognizing obvious patterns and look for the important but obscure related bits that are so crucial in criminal investigations, just as the DEA agents did.

    It's a process

    With data analytics, "it's not just a single thing like an application or an algorithm," Pottenger, a former NCSA staffer, explains. "There are several steps to the process. The first of them, believe it or not, is establishing some sort of an objective." For law enforcement and counterterrorism, the objective is to discover the perpetrators of a particular crime, or to uncover some kind of unfolding plot.

    The second step is the data selection and "cleaning." Selection involves choosing exactly what data or subset of data will be used; cleaning is preparing it for use in an algorithm by accounting for missing data, removing incorrect or unusable data, and ensuring that everything is in a uniform format.
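
    As a rough idea of what cleaning involves, the sketch below drops records with missing or unusable fields and puts the rest into one format. The records, the field names, and the 10-digit phone rule are assumptions made for the example, not details of the team's data.

```python
import re

# Hypothetical raw records; field names and values are illustrative only.
raw_records = [
    {"name": " Jane DOE ", "phone": "(555) 010-2000"},
    {"name": "john roe", "phone": "555.010.3000"},
    {"name": "", "phone": "555-010-4000"},   # missing name
    {"name": "Sam Poe", "phone": "n/a"},     # unusable phone
]


def normalize_phone(value):
    """Keep digits only; return None unless exactly 10 digits remain."""
    digits = re.sub(r"\D", "", value or "")
    return digits if len(digits) == 10 else None


def clean(records):
    """Drop records with missing or unusable fields and unify formats."""
    cleaned = []
    for record in records:
        # Collapse stray whitespace and use one capitalization style.
        name = " ".join((record.get("name") or "").split()).title()
        phone = normalize_phone(record.get("phone"))
        if not name or phone is None:
            continue  # a real pipeline might log or repair instead of dropping
        cleaned.append({"name": name, "phone": phone})
    return cleaned


cleaned_records = clean(raw_records)
print(cleaned_records)
```

    Two of the four records survive, both with names and phone numbers in a single consistent form.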

    The team works with data located in various data repositories, which is where the distributed part of the name comes into play. One repository is OpenSource, a collection of over 7 million documents from around the world relating to foreign policy and national security issues with access controlled by the federal government. Another is the Global Terrorism Database (GTD), developed by the National Consortium for the Study of Terrorism and Responses to Terrorism (START) and based at the University of Maryland, which contains details on over 80,000 incidents.

    The third step is algorithm selection. While retailers use a simple association rule-mining algorithm to discover what items are purchased together, "in our case it is not so easy," says Pottenger. "That's why we've done so much research into what we call higher-order learning algorithms that are based on the fact that different pieces of information are connected in intuitive ways."

    As humans, we often have situations where we don't know exactly why we think something is the case; we just have a hunch that it's true. The team's algorithms mimic human intuition in a limited way by following higher-order connections between ideas and concepts, then making a jump and linking concepts that might not normally be linked. The team has successfully shown that these methods can be used to discover new knowledge relevant to the objective.
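
    One simple way to picture a higher-order connection: two items that never appear in the same record but are each tied to a third. The sketch below finds such second-order links. It is a toy illustration under that assumption, not the DI-HOPE KD algorithm, and the records and function names are invented for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records; each set lists entities mentioned together.
records = [
    {"Alice", "555-0100"},
    {"555-0100", "12 Elm St"},
    {"12 Elm St", "Bob"},
    {"Carol", "99 Oak Ave"},
]


def first_order_links(recs):
    """Map each entity to the entities it co-occurs with in some record."""
    neighbors = defaultdict(set)
    for rec in recs:
        for a, b in combinations(rec, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def second_order_links(neighbors):
    """Find pairs that never co-occur directly but share a bridging entity."""
    links = {}
    for bridge, linked in neighbors.items():
        for a, c in combinations(sorted(linked), 2):
            if c not in neighbors.get(a, set()):
                links.setdefault((a, c), set()).add(bridge)
    return links


neighbors = first_order_links(records)
higher = second_order_links(neighbors)
for (a, c), bridges in sorted(higher.items()):
    print(a, "<->", c, "via", sorted(bridges))
```

    Alice and the Elm Street address never share a record, yet the phone number ties them together. Running the same step again over the new links would reach Bob, one more hop away.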

    The fourth step in data analytics is to take the results and apply them. Stores use the information to refine store layouts, encouraging cross-selling by placing items that are frequently purchased together in close proximity, or to develop and send targeted marketing messages to shoppers. Law enforcement and counterterrorism officers can follow protocols to nail down leads, arrest criminals, or foil plots.

    Sharing data discreetly

    Privacy really only becomes an issue when agencies want to share data, says Pottenger, because law enforcement agencies have established policies for handling people's data, but jurisdictional policies on sharing data vary. Pottenger's technology lets users share data in a way that doesn't reveal the actual information, but still lets the agency they're sharing with know that there may be potentially useful information. Established inter-agency protocols are then followed to obtain the information.
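
    The details of the team's method are not public, so the sketch below shows only a generic, well-known stand-in for the idea: two parties compare keyed hashes of their values and learn that they hold something in common without exchanging the values themselves. The shared key, the sample holdings, and the function names are all assumptions made for the example.

```python
import hashlib
import hmac

# Assumption: both agencies hold the same secret key, exchanged out of band.
SHARED_KEY = b"demo-key-not-for-real-use"


def blind(value, key=SHARED_KEY):
    """Return a keyed digest that stands in for the raw value."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


def blinded_index(values):
    """Build the set of digests an agency would be willing to expose."""
    return {blind(v) for v in values}


# Hypothetical holdings of two agencies.
agency_a = ["555-0100", "12 Elm St", "Alice Example"]
agency_b = ["12 elm st", "77 Pine Rd", "555-0199"]

# Only digests are compared; a match signals that follow-up is worthwhile.
overlap = blinded_index(agency_a) & blinded_index(agency_b)
print(len(overlap))  # 1
```

    A match says only that both agencies hold the same item. Short values such as phone numbers can be guessed by anyone who has the key, so a scheme this simple would not be enough for real casework.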

    "You know there's something shady going on, but you can't get on CNN or Fox News and say 'Hey, does anybody know about this address?' You'll blow your investigation," he says with a laugh. "The technology we developed allows for sharing the data without revealing the actual value, like the actual phone number or address or name. I'm not going to go into more detail than that, except to say it does use encryption technologies, it does use the various distributed communication technologies, and it's based on our higher-order learning research."

    Another unique aspect of the team's work is that it includes computational steering. With computational steering you can get a result more quickly, or you can get a more precise result, by viewing real-time performance analysis and making adjustments, such as focusing on a particular area of the simulation or tweaking the parameters of what is being explored. It's a "time when human and machine are better working together than each working alone," notes Christopher Janneck, a PhD student in computer science who is a member of Pottenger's team.

    While scientific simulations have long included the ability to tweak parameters while the simulation is running, Pottenger says that to his knowledge no one has ever explored steering data analytics, and especially not the higher-order knowledge discovery that his team targets. The team has made significant progress, but they still have a ways to go to make knowledge discovery the interactive, synergistic process they envision.

    Simulating the real world

    NCSA's Cobalt cluster and the recently retired Tungsten were crucial to developing the privacy-enhancing higher-order knowledge discovery by letting the team simulate real-world situations. Access to multiple processors meant the team could designate each processor as a "data repository," allowing them to simulate a large number of repositories like those a counterterrorism analyst would explore. They have also conducted real-world tests using datasets from the Richmond, Virginia, and other police departments. But the simulations would not have been possible if not for the help of NCSA's Susan John and now-retired David McWilliams, who assisted with parallel debugging and code porting.
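
    The one-repository-per-processor setup can be imitated on a laptop. In the sketch below, worker threads stand in for processors, each searching its own small repository and reporting back only a count. It uses Python's standard thread pool rather than the team's code, and the repositories and function names are invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical repositories; in the team's setup each maps to a processor.
repositories = {
    "repo_a": ["555-0100 called 12 Elm St", "Alice seen at 12 Elm St"],
    "repo_b": ["Bob rents 12 Elm St", "Carol lives at 99 Oak Ave"],
    "repo_c": ["No relevant entries"],
}


def local_search(item, term):
    """Run inside one simulated repository; only a count leaves the worker."""
    name, documents = item
    hits = sum(1 for doc in documents if term.lower() in doc.lower())
    return name, hits


def distributed_search(repos, term, workers=3):
    """Fan the query out to every repository and merge the local counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda item: local_search(item, term), repos.items())
        return Counter(dict(results))


hits = distributed_search(repositories, "12 Elm St")
print(hits.most_common())
```

    Adding entries to `repositories` simulates more sites, which is the kind of growth a scalability test has to cope with.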

    Now the team is using NCSA machines to evaluate the scalability of the algorithm. Because of the large datasets involved, says Pottenger, an approach that works on a simple problem in the lab will not necessarily scale.

    "We're at the point now where we really need to scale this, and it's in a distributed environment so we really have to use multiprocessors. We have to use both non-uniform memory access machines like Cobalt, and we have to use distributed memory architectures like Abe, which use MPI. In this way we can actually simulate the environment of knowledge discovery," he says.

    Commercial application

    Intuidex is the company Pottenger started to incorporate his work in knowledge discovery in applications for the general public. Part of what the company is doing will help small groups search and share information in a confidential way. One tool is for families to share their pictures and other family information. There are currently lots of picture-sharing methods, but Pottenger believes families often want to share confidential items they may not be comfortable sending via email, like medical, financial, or legal documents.

    "The technology we're developing at Intuidex, named IxP2P Groupware™, will allow family members, colleagues, or friends to create a secure, private network on a certain topic and then search and share information on that topic," says Pottenger. Network users join a topic, say "Grandma and Grandpa," and are then able to search other topic members' computers for items that contain information on Grandma and Grandpa and view those items. It's like combining a virtual private network with an Internet newsgroup.

    In developing these technologies, Pottenger says it is important to take human nature into account. "We do things in small groups that grow into larger and larger things. And so if we build technology, the technology needs to work that way, too. If we are going to really enhance privacy, we've really got to understand culture and how it works. And if we're going to help people to solve problems and to share data and to fight crime, we've got to do it in a way that respects how human culture works, how human society works."

    This work is funded by the National Science Foundation, the Department of Homeland Security, and the National Institute of Justice.

    Team members
    William M. Pottenger
    Mark J. Dilsizian
    Cibin George
    Christopher D. Janneck
    Shenzhi Li
    Nikita Lytkin
    Vikas Menon
    Aleksandar Nikolov
    Jason M. Perry

