PRAGMA 13  |  University of Illinois at Urbana-Champaign  | September 23 - 25th

PRAGMA Institute Abstracts

Keynote: The Emerging Global Collaboratory for Microbial Metagenomics Researchers

Larry Smarr
University of California at San Diego / California Institute for Telecommunications and Information Technology

CalIT2, the J. Craig Venter Institute, and UCSD's SDSC and Scripps Institution of Oceanography are creating a metagenomic Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA), funded by the Gordon and Betty Moore Foundation.

The CAMERA computational and storage cluster, which contains multiple ocean microbial metagenomic datasets, as well as the full genomes of ~166 marine microbes, is actively in use. End users can access the metagenomic data either via the web or over novel dedicated 10 Gb/s light paths (termed "lambdas") through the National LambdaRail. The end user clusters are reconfigured as "OptIPortals," providing the end user with local scalable visualization, computing, and storage.

CAMERA currently has over 1,200 registered users from over 45 countries, and over a dozen remote OptIPortal sites are becoming active. Smarr will review the status of users from PRAGMA countries and discuss the possibility of PRAGMA becoming a global LambdaGrid "living laboratory" for this emerging high performance collaboratory.



Keynote: The Impact of Petascale Computing on Biological Science

Rick Stevens
Argonne National Laboratory / The University of Chicago

Since the beginning of the computer era, scientists have dreamed about using the power of computation to develop a predictive understanding (a computational "theory") that could help explain the structure, function, and evolution of biological systems. Early attempts to develop computer models and the associated theory of biological systems were limited by our lack of a detailed molecular-level understanding of the mechanisms involved in even the most basic functions of growth, development, and genetic inheritance. Later, as modern molecular biology cracked the genetic code and biochemists filled in the chemical details of metabolism, available computer power became the bottleneck in extending our understanding towards prediction. Today new classes of supercomputers provide the necessary power, and the recent dramatic advances in the availability of biological data yield information ranging from the DNA sequences of nearly 500 organisms to detailed snapshots of cellular machinery in action.

Stevens will discuss some of the computational and computer science requirements of the emerging science of systems biology and how this new science may exploit capabilities under development for petascale computing and life science grids. Of particular interest is the evolving mixture of life science computing applications that can efficiently exploit large-scale tightly coupled computing systems (e.g. macromolecular modeling) and those that can effectively exploit more distributed systems (e.g. genome annotation). The field of systems biology has an extremely broad range of computational methods and techniques in use, perhaps broader than any other scientific domain. This diversity of methods means that a rich collection of computing systems and software infrastructures is needed to match the needs of the biological community. He will also discuss how the twin revolutions in computation and biological science are combining to develop theoretical biology, and the enormous impact this will have on science, medicine, and engineering.



Tutorial 1: PRAGMA Grid—Lessons Learned

Cindy Zheng
PRAGMA Grid

This presentation describes the coordination, design and implementation of the PRAGMA Grid. Applications in genomics, quantum mechanics, climate simulation, organic chemistry and molecular simulation have driven the middleware requirements, and the PRAGMA Grid provides a mechanism for science and technology teams to collaborate, for grids to interoperate, and for international users to share software beyond the essential, de facto standard Globus core.

Several middleware tools developed by researchers within PRAGMA have been deployed in the PRAGMA Grid, and this has enabled significant insights, improvements and new collaborations to flourish. In this presentation, we describe how human factors, resource availability and performance issues have affected the middleware, applications and the grid design. We also describe how middleware components in grid monitoring, grid accounting and grid file systems have dealt with some of the major characteristics of our grid. Finally, we briefly describe a number of mechanisms that we have employed to make software easily available to PRAGMA and global grid communities.

This tutorial is designed for people who are interested in learning what is going on in and around the PRAGMA Grid, for people who are interested in using the PRAGMA Grid, and for people who are interested in collaborating with PRAGMA Grid teams to develop and integrate grid applications and middleware and to explore grid interoperability issues and solutions.



Tutorial 2: The Nimrod Family of Tools

David Abramson
Monash University, Australia

Grids couple geographically distributed resources such as high-performance computers, workstations, clusters of computers, data repositories and scientific instruments. They have begun to provide the infrastructure to support global collaboration in ways that were not previously possible by facilitating the construction of virtual organisations.

Monash University's e-Science and Grid Engineering Lab (MESSAGE) has developed a number of innovative software tools aimed at Grid-enabling legacy applications. One of these, Nimrod/G, allows users to explore robust design options by supporting parametric execution on the Grid.

Nimrod/G operates both as a user-level tool, complete with a Web-based portal, and as a middleware layer that can be targeted by application programs. It supports the design and execution of very large computational experiments in which a given application is run on a diverse range of distributed resources. Nimrod/G uses the Globus Toolkit as well as stand-alone schedulers such as PBS. A variant of Nimrod/G, called Nimrod/O, allows scientists to perform robust design by automatically exploring the search space across multiple model executions. Another variant, Nimrod/E, supports the design of experiments and helps users evaluate parameter effects.
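To make the idea of parametric execution concrete, the following minimal Python sketch expands a set of parameter ranges into one job per combination. It is an illustration only and does not use Nimrod's own experiment description format or API; the executable name `./model`, the parameter names, and the helper functions are all hypothetical.

```python
from itertools import product


def expand_parameter_sweep(parameters):
    """Return one dict of parameter values per job (full Cartesian product)."""
    names = list(parameters)
    value_lists = [parameters[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


def build_command(executable, job):
    """Render a hypothetical command line for a single job."""
    args = " ".join(f"--{name}={value}" for name, value in job.items())
    return f"{executable} {args}"


# Hypothetical experiment: names and values are illustrative assumptions.
sweep = {"wing_angle": [0, 5, 10], "mach": [0.6, 0.8]}
jobs = expand_parameter_sweep(sweep)
commands = [build_command("./model", job) for job in jobs]

for command in commands:
    print(command)
```

Each generated command is independent of the others, which is what allows a parametric experiment of this kind to be spread across many distributed resources.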

This tutorial will provide an introduction to robust design principles and parametric computing. Attendees will learn how to perform parametric search on clusters with a tool called EnFuzion, as well as how to use Nimrod/G, Nimrod/O and Nimrod/E on the Grid.

This course is designed for scientists and engineers who can utilise distributed high-performance computers in their daily work.



Tutorial 3: CSF4

Xiaohui Wei
Jilin University, China

The presentation will cover:

  1. The definition of a meta-scheduler
  2. The difference between meta-schedulers and local schedulers
  3. CSF4 and its main functionalities
  4. The architecture of CSF4
  5. The challenges of meta-scheduling
  6. The CSF4 scheduling plug-in mechanism
  7. Using CSF4 in the PRAGMA Grid testbed
  8. Meta-scheduling for cross-domain parallel jobs
  9. Future plans for CSF4 development


Tutorial 4: Opal: The Application Wrapper Web Service

Sriram Krishnan and Wes Goodman
National Biomedical Computation Resource, San Diego Supercomputer Center, University of California, San Diego

Opal, the Application Wrapper Web service, provides a way to rapidly wrap existing scientific applications as Web services and expose them to various clients. The implementation provides features such as scheduling (e.g. using Condor/SGE via Globus or DRMAA) and security (using GSI-based certificates). Furthermore, the service provides job and data management (by executing every job in a separate working directory) and state management (by storing the service state in a PostgreSQL database). The application developer specifies a configuration for a scientific application and deploys the application as a service in a small number of steps. End users can then access the application remotely using the WSDL of the service. Opal has been used in a number of projects by biomedical applications such as MEME, APBS, PDB2PQR, Continuity, AutoDock, PMV, BLAST, and HMMER. It has also been ported to Opal-OP, a WSRF-compliant operation provider, by Osaka University, and is in production use at PDBj.
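The job and data management idea mentioned above, running every job in its own working directory, can be sketched in a few lines of Python. This is not Opal's implementation or API: the `run_job` helper, the file names `stdout.txt` and `stderr.txt`, and the stand-in command are assumptions made for illustration.

```python
import subprocess
import sys
import tempfile
import uuid
from pathlib import Path


def run_job(command, base_dir):
    """Run one command in its own freshly created working directory."""
    job_id = uuid.uuid4().hex                 # unique identifier for this job
    work_dir = Path(base_dir) / job_id
    work_dir.mkdir(parents=True)
    # Capture output inside the job directory so runs never overwrite each other.
    with open(work_dir / "stdout.txt", "w") as out, \
         open(work_dir / "stderr.txt", "w") as err:
        result = subprocess.run(command, cwd=work_dir, stdout=out, stderr=err)
    return {"job_id": job_id, "work_dir": work_dir, "exit_code": result.returncode}


# A stand-in for a wrapped scientific application (hypothetical).
base = tempfile.mkdtemp()
command = [sys.executable, "-c", "print('hello from the wrapped application')"]
first = run_job(command, base)
second = run_job(command, base)
print(first["work_dir"], second["work_dir"])
```

Because each invocation gets a distinct directory, concurrent jobs for the same application cannot clobber each other's input or output files, and the returned record is the kind of per-job state a service could persist in a database.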



Tutorial 5: GridChem

Sudhakar Pamidighantam
National Center for Supercomputing Applications

In this tutorial we will introduce and describe in detail GridChem, also known as the Computational Chemistry Grid, a production cyberinfrastructure provided by a virtual organization serving the computational chemistry community. This project, funded by the National Science Foundation, provides multiple popular molecular modeling applications across a Grid consisting of several HPC sites and provides intuitively familiar, application-specific user interfaces for research and education in molecular modeling. The tutorial will provide a detailed view of how Web services are integrated into application-specific information provisioning frameworks for the end user's benefit. Examples will be provided of specific implementations of services and user interfaces for at least two popular applications. Some implementation details of how user data is managed will be demonstrated. Specific usage examples for a user and for a portal administrator will be provided.



Keynote: GEON: Networking Indian Geosciences Community through iGEON

Arun Agarwal
University of Hyderabad, India

As part of its international activities, the Geosciences Network project (GEON) has initiated collaborations with the University of Hyderabad, India. This effort, called iGEON-India, is funded by the Indo-U.S. Science and Technology Forum. A fundamental objective of the iGEON-India project is to develop data-sharing frameworks and, in the process, to identify best practices, develop capabilities and tools that enable advances in how geosciences research is done, and share our experiences.

High-Performance Computing and Information Technology in the Geosciences

Donald J. Wuebbles
University of Illinois at Urbana-Champaign

High-end computing, data storage requirements, data mining capabilities, and visualization have become ever-increasing challenges as geoscientists attempt to deal with the massive data sets and computing requirements developing out of their study of the complexity of the Earth system and its relationship to its inhabitants. Highly complex numerical models of the relevant physical, chemical, and biological processes and analyses of massive data sets based on observations are integral to the research done in the geosciences. The geosciences are currently poised for an era of rapid scientific progress. As examples, we are on the verge of more accurate advance warning of extreme weather and earthquakes, detailed evaluation of river basins and their interactions with changing landscapes and weather, high-resolution modeling and analyses of the Earth's climate system and the concerns about global warming, and eddy-resolving modeling of the oceans and their changing environment. The single most important factor in achieving these advances in geosciences research is a major expansion of supercomputing capabilities and information technologies. This presentation is aimed at exploring the ongoing research and future needs of the geosciences for advanced computing and analysis capabilities.



Tutorial 7: SP2LEARN—A Framework for Geospatial Models from Sparse Field Measurements Using Image Processing and Machine Learning

Peter Bajcsy
National Center for Supercomputing Applications

This tutorial presents a framework for accurate estimation of geospatial models from sparse field measurements using image processing and machine learning. Our work is motivated by the cost of field measurements and by the limitations of currently available physics-based modeling techniques. The goal is to improve our understanding of the underlying physical phenomena and increase the accuracy of geospatial models. Our approach is to interpolate sparse field measurements, apply existing physics-based models, incorporate spatial constraints using image processing techniques, explore the use of auxiliary raster measurements through machine learning, and optimize all algorithmic parameters in both a supervised and an unsupervised manner.

The tutorial will illustrate the application of the framework to groundwater recharge and discharge (R/D) rate models. We use a physics-based R/D rate model that takes field measurements of hydraulic conductivity, water table level and bedrock elevation as inputs. We will explore the accuracy improvements obtained when several image de-noising techniques are combined with a decision tree machine learning technique and when several remote sensing and terrestrial raster measurements, such as slope, soil type and proximity to water bodies, are used. Participants will be exposed to the analysis of spatially sparse geospatial measurements and the use of image processing and data-driven modeling in that analysis.
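As an illustration of the first step of the approach, interpolating sparse field measurements onto a raster, here is a minimal Python sketch using inverse distance weighting. The abstract does not say which interpolation method the framework uses, so inverse distance weighting is an assumption chosen for simplicity, and the sample coordinates and values are invented.

```python
import math


def idw_interpolate(samples, x, y, power=2.0):
    """Estimate a value at (x, y) from (xi, yi, value) samples by inverse distance weighting."""
    weighted_sum = 0.0
    weight_total = 0.0
    for xi, yi, value in samples:
        distance = math.hypot(x - xi, y - yi)
        if distance == 0.0:
            return value                      # exactly on a measurement point
        weight = 1.0 / distance ** power      # nearer samples count more
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


def interpolate_grid(samples, width, height):
    """Fill a height-by-width raster (list of rows) from the sparse samples."""
    return [[idw_interpolate(samples, col, row) for col in range(width)]
            for row in range(height)]


# Invented measurements at the four corners of a 5 x 5 grid.
samples = [(0, 0, 10.0), (4, 0, 20.0), (0, 4, 30.0), (4, 4, 40.0)]
grid = interpolate_grid(samples, 5, 5)
```

The resulting raster reproduces each measurement at its own location and blends them elsewhere, giving the kind of dense surface to which a physics-based model and image processing steps could then be applied.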



Tutorial 9: MPICH-GX Implementation: Grid Enabled MPI Implementation to Support the Private IP and the Fault Tolerance

Oh-kyoung Kwon
KISTI, Korea

MPICH-GX is a patch to MPICH-G2 that extends it with functionality required in the Grid. MPICH-G2 is a well-defined MPI implementation for the Grid, but it must be modified to support some requirements of Grid applications. MPICH-GX provides the following functions: private IP support and fault tolerance support. We will cover the details of MPICH-GX and some of our experiences with it. In addition, we will demonstrate the submission of an MPICH-GX job.


