Exploring and Integrating the Deep Web: Building a Database of Databases
Kevin Chen-Chuan Chang
Award year: 2003-2004
This research aims to enable access to structured information sources on the Internet. Over the past few years, the Web has deepened dramatically: a significant and increasing amount of information is hidden on the "deep" Web, behind the query interfaces of searchable databases. Because current crawlers cannot effectively query databases, such data is mostly invisible to traditional search engines and thus remains largely hidden from users.
As the context of this proposal, we propose to build a metaquery system, the MetaQuerier, to help users find and query these databases uniformly with rich, expressive queries. Our goal is twofold. First, to make the deep Web systematically accessible, the MetaExplorer will help users find online databases that are useful for their queries. Second, to make the deep Web uniformly usable, the MetaIntegrator will help users interact with online databases to ask queries. To open up the deep Web, the MetaQuerier faces new challenges. First, it must deal with large scale, since sources are proliferating rapidly online. Second, it must be dynamic and ad-hoc: each query will dynamically select different ad-hoc sources.
Toward realizing the MetaQuerier, this proposal specifically aims at exploring and representing databases on the Web. We propose to develop a deep-Web source search engine that crawls sources online to construct a database of databases. We will focus on techniques for source discovery, modeling, and extraction. Building this source search engine is crucial. First, this search engine (as an essential module of the MetaQuerier) will crawl online sources, construct a database of databases, and thus enable users to access the deep Web. Second, this search engine is strategically important for guiding our research: it will construct a source repository, a large-scale dataset that will be crucial for realistically conveying both challenges and opportunities. Our motivating survey of the deep-Web frontier reveals that, for integrating massive networked databases, the large scale of proliferating sources is not only a pressing challenge to address but also a unique opportunity to leverage in pursuing "holistic" approaches. Third, its goal of practical deployment (say, as a "Google" for searching online databases) will help bridge our basic research to the more applied focus of NCSA, leading to synergistic and interdisciplinary collaboration.