Scalable Knowledge Discovery for Hydroclimatological Studies
Award year: 2001-2002
The objective of this research is to develop a knowledge discovery system for hydroclimatological studies using a scalable architecture. The proposed work is aimed at developing a system for knowledge discovery, i.e., identification of implicit relationships and patterns between data elements, in spatial databases, particularly those with a raster representation. The proposed research rests on three premises: (1) techniques for exploring large datasets are now becoming available but have not been extensively applied to the exploration of scientific data, and in particular to hydroclimatological studies; (2) scientific inquiry methods developed for small datasets or "few variable" problems may not be effective for large datasets or "many variable" problems; and (3) there are pressing scientific questions that need answers and can be answered by effectively exploring the available observational data.
The research will be accomplished through the following steps:
- Development of a data warehouse: This will consist of database development for variables that do not change over time, such as elevation and soil composition, and for variables that evolve over time, namely the periodically observed quantities. ArcGIS will be used as the frontend database engine for the KDD (knowledge discovery in databases) applications.
- Development of concept hierarchies: Concept hierarchies, which are analogous to taxonomic classification, will provide the ability to study the data at different levels of granularity. They will consist of hydrologic constraints, scale definitions, and clusters obtained from the analysis of observational data.
- Implementation of KDD: KDD techniques using concept hierarchies for spatial datasets will be implemented under this task. We will identify anomalous events in the observations for detailed study, provide quantitative estimates of the uncertainty resulting from data errors, perform non-linear analysis, and implement search techniques such as genetic algorithms.
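As an illustration of how a concept hierarchy lets the same observations be studied at different levels of granularity, the following minimal Python sketch rolls up a time-varying quantity over a static variable. Everything in it is an assumption made for illustration and is not taken from the project: the elevation thresholds, the labels, the names `ELEVATION_BANDS`, `BAND_TO_RELIEF`, `elevation_band` and `roll_up`, and the sample cell values.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical two-level concept hierarchy for elevation (metres).
# Thresholds and labels are illustrative assumptions only.
ELEVATION_BANDS = [
    (500.0, "lowland"),
    (1500.0, "upland"),
    (float("inf"), "mountain"),
]
# Coarser level of the hierarchy: each band maps to a relief class.
BAND_TO_RELIEF = {
    "lowland": "low relief",
    "upland": "high relief",
    "mountain": "high relief",
}


def elevation_band(elev_m):
    """Return the finest-level concept (band) for an elevation value."""
    for upper, label in ELEVATION_BANDS:
        if elev_m < upper:
            return label
    raise ValueError(f"elevation out of range: {elev_m!r}")


def roll_up(cells, level):
    """Average the observed value of raster cells grouped at one hierarchy level.

    cells: iterable of (elevation_m, observed_value) pairs, one per raster cell.
    level: "band" for the finer grouping, "relief" for the coarser one.
    """
    if level not in ("band", "relief"):
        raise ValueError(f"unknown level: {level!r}")
    groups = defaultdict(list)
    for elev_m, value in cells:
        band = elevation_band(elev_m)
        key = band if level == "band" else BAND_TO_RELIEF[band]
        groups[key].append(value)
    return {key: mean(values) for key, values in groups.items()}


# Made-up cells: (elevation in metres, annual precipitation in mm).
cells = [(120.0, 800.0), (450.0, 900.0), (900.0, 1100.0), (2100.0, 1500.0)]
by_band = roll_up(cells, "band")
by_relief = roll_up(cells, "relief")
```

Here `by_band` summarizes the cells at the finer level and `by_relief` at the coarser one. A real hierarchy for this work would instead be built from the hydrologic constraints, scale definitions, and data-derived clusters described above.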
The proposed work will involve large, complex databases, complicated modeling techniques, substantial computer processing and sophisticated visualization techniques. The primary emphasis of the proposed work is to utilize the parallel processing capability of the Origin2000 system to develop a scalable mining system. Both inter-model parallelism and intra-model parallelism will be explored.
Society at large will directly benefit from the results of this research. The value of data is typically predicated on the ability to extract higher-level information: information useful for decision support, for exploration, and for better understanding of the phenomena generating the data. The ability to derive higher-level information and knowledge from the voluminous observational data being collected on earth surface processes will enhance the value of the investments in these projects.