Uncertain Database Management

by Dr. Reynold Cheng (ckcheng@cs.hku.hk)

Introduction

In applications such as location-based services, natural habitat monitoring, web data integration, and biometric applications, the values of the underlying data are inherently noisy or imprecise. Consider a location-based application that provides querying facilities on geographical objects (e.g., airports, vehicles, and people) extracted from satellite images. Due to the errors incurred during satellite image transmission, the locations of the geographical objects can be imprecise. The data acquired from the Global Positioning System (GPS) and remote sensors can also be inaccurate and outdated, due to measurement error and network delay. As another example, consider a movie rating database integrated from WWW sources (e.g., IMDB movie database and user ratings obtained from the Netflix challenge). Due to the difference in the movie names specified in the sources (e.g., a user may type a wrong movie name), the integrated database may not be completely accurate. In biological applications, the extent of the area of a retina cell extracted from microscopy images can be inexact, due to limited image resolution and measurement accuracy.

In order to satisfy the increasing needs of the above applications, we envision that novel, correct, and scalable methods for managing uncertain data need to be developed. To achieve this goal, our research group is working on two projects:

Project 1: Develop a practical database system that incorporates uncertain data as a first-class citizen, in order to facilitate the development of the above applications; and

Project 2: Investigate the issues of data uncertainty in data mining, ambiguity removal, and data integration.

The following describes how we approach these goals, and a summary of our current results.

 

Project 1: Designing an Uncertain Database System

(a) Location Uncertainty (Gaussian pdf) (b) Location uncertainty of a vehicle in a
      road network (arbitrary pdf)

Figure 1: Attribute uncertainty.

 

To handle the increasing needs of managing a large amount data, our goal is to develop an uncertain database system, which incorporates uncertain data as a "first-class citizen". This means that the database system should provide handy ways of modeling uncertainty, as well as enable querying facilities, which return imprecise answers to users with statistical guarantees.

The ORION database system, which we co-developed with Purdue University in 2006, provides some of these features. For example, it supports the modeling of attribute uncertainty, which is essentially a range of possible values, together with their probability density functions (pdf). This model, as illustrated in Figure 1, is widely used in location-based applications, road networks, sensor networks, and biological applications. The ORION system also provides querying facilities on uncertain data. In particular, it supports the evaluation of probabilistic queries, which augments statistical guarantees on query answers. For instance, if we want to know which sensor in a geographical area yields a temperature between 10°C and 30°C, then a probabilistic query returns answers like: {(A, 0.1), (B, 0.8)}, meaning that sensors A and B collect temperature readings between 10°C and 30°C, with probabilities 0.1 and 0.8, respectively. The technical results about this system can be found here.

(a) The Voronoi Diagram (b) The UV-diagram

Figure 2: The Voronoi Diagram and the UV-diagram.

 

Since obtaining probabilities can be expensive, a probabilistic query can be slow to execute. To address this problem, we develop efficient and scalable query execution algorithms. We studied the evaluation of probabilistic range queries. We also developed the computation of the probabilistic nearest-neighbor (NN) query, which allows questions like: "Which restaurant in Hong Kong is the closest to me?" to be answered. We examined two variants of NN:

(1)  snapshot NN (i.e., the query is issued only once); and

(2)  continuous NN query (i.e., the query is executed for an extensive amount of time).

Recently, we have proposed the Uncertain Voronoi diagram (or UV-diagram in short). The UV-diagram is a generalization of the Voronoi diagram, which facilitates NN query evaluation, for uncertain data. Figures 2(a) and (b) illustrate the Voronoi diagram (for seven precise data points) and the UV-diagram (for seven location uncertain data items). For both diagrams, answering a NN query involves finding the Voronoi cell that contains the query point (i.e., q, q1, and q2). Notice that in a UV-diagram, a Voronoi cell can be associated with two or more objects (e.g., q2 is located in a region where both O6 and O7 can be the closest to it). Since the UV-diagram is expensive to be constructed and stored, we propose a new data structure, called the UV-index, which resembles the UV-diagram. We study how the UV-index can be efficiently constructed and used to answer NN queries. We also plan to incorporate the UV-index into the ORION.

 

Grant Support

1.  UV-Diagram: A Voronoi Diagram for Uncertain Spatial Databases (RGC GRF, Ref: 711110, 2011-13). HKD 746,400.

2.  Scalable Continuous Query Processing on Imprecise Location Data (RGC GRF, Ref: 513508, 2009-10). HKD 645,950.

3.  Adaptive Filters for Continuous Queries over Constantly-Evolving Data Streams (RGC GRF, Ref: 513307, 2008-09). HKD 391,512.

 

Selected Publications (Click here for a full list)

1.  R. Cheng, X. Xie, M. Y. Yiu, J. Chen, and L. Sun. UV-diagram: A Voronoi Diagram for Uncertain Data. In the IEEE Intl. Conf. on Data Engineering (IEEE ICDE 2010), Long Beach, USA, Mar, 2010 (Full paper). Acceptance rate: 12.5%. [Paper][Talk]

2.  J. Chen, R. Cheng, M. Mokbel and C. Chow. Scalable Processing of Snapshot and Continuous Nearest-Neighbor Queries over One-Dimensional Uncertain Data. In Very Large Databases Journal (VLDBJ), Special Issue on Uncertain and Probabilistic Databases, Vol. 18, No. 5, pp. 1219-1240, 2009. (Awarded the Research Output Price in Department of Computer Science, Faculty of Engineering, HKU, 2010) [Paper]

3.  J. Chen and R. Cheng. Efficient Evaluation of Imprecise Location-Dependent Queries. In the IEEE Intl. Conf. on Data Engineering (IEEE ICDE 2007), Istanbul, Turkey, Apr, 2007. Acceptance rate: 18.5%. [Paper][Talk]

4.  R. Cheng, B. Kao, S. Prabhakar, A. Kwan and Y. Tu. Adaptive Stream Filters for Entity-based Queries with Non-value Tolerance. In Very Large Databases Conf. (VLDB 2005), Norway, Aug 2005, pp. 37-48. Acceptance rate: 16.5%. [Paper]

5.  R. Cheng, D. Kalashnikov and S. Prabhakar. Evaluating Probabilistic Queries over Imprecise Data. In Proc. of the ACM Special Interest Group on Management of Data (ACM SIGMOD 2003), pp. 551-562, June 2003. Acceptance rate: 15.2%, 52/342. [Paper][Talk]

 

Project 2: Uncertainty and Database Applications

As discussed before, uncertainty is prevalent in many important applications. In this project, we focus on three major areas, namely, ambiguity removal, uncertain data mining, and data integration.

(a) A probabilistic database.

(b) A partially cleaned database (for (a)).

Figure 3: Cleaning a probabilistic database.

 

   Ambiguity Removal: We study the issues of improving service quality on a probabilistic database. In many applications, service quality can be enhanced by "cleaning" these databases. Consider a probabilistic database used to record the price of products integrated from different web sources (Figure 3(a)). The uncertainty about these ratings can be "sanitized" (or cleaned) by manually contacting the respective users for clarification (Figure 3(b)). In a location-based application that stores the current positions of mobile objects (e.g., people and vehicles) in a probabilistic database, the system can clean the database by requesting the mobile objects to report their most-updated location values (and replace the outdated ones in the database). Cleaning thus reduces the ambiguity of the probabilistic database, as well as improving the quality of the queries that are evaluated on it.

In this work, we plan to develop a sound and universal metric to quantify the quality of query answers. We design efficient algorithms that select the most appropriate set of objects to be cleaned. Since cleaning involves costs (e.g., energy/network bandwidth for data transmission, or manpower used to validate the data), these algorithms have to be performed under a limited cleaning budget. A cleaning engine, illustrated in Figure 4, is being developed on top of an open-source probabilistic database. We further examine how ambiguity can be removed in imperfect but realistic situations (e.g., a cleaning can only be accomplished with a probability).

Figure 4: A probabilistic database cleaning system.

 

   Uncertain Data Mining: We study the issues of developing data mining algorithms (e.g., frequent pattern and classification) for probabilistic databases, which contain a vast amount of interesting information. Discovering frequent patterns and association rules from these data is technically challenging, since a probabilistic database can have an exponential number of possible worlds. We propose two efficient algorithms, which discover frequent patterns in bottom-up and top-down manners. Both algorithms can be easily extended to discover maximal frequent patterns. To achieve higher efficiency, we also examine an approximate (but highly accurate) version of frequent pattern mining. We address the issues of using these patterns to generate association rules. Extensive experiments, using real and synthetic datasets, were conducted to validate the performance of our methods.

 

   Data integration: Despite of advances in machine learning technologies, a schema matching result between two database schemas (e.g., those derived from COMA++) is likely to be imprecise. In particular, numerous instances of "possible mappings" between the schemas may be derived from the matching result. In this project, we study the problem of managing possible mappings between two heterogeneous XML schemas. We observe that for XML schemas, their possible mappings have a high degree of overlap. We hence propose a novel data structure, called the block tree, to capture the commonalities among possible mappings. The block tree is useful for representing the possible mappings in a compact manner, and can be generated efficiently. Moreover, it supports the evaluation of probabilistic twig query (PTQ), which returns the probability of portions of an XML document that match the query pattern. For users who are interested only in answers with k-highest probabilities, we also propose the top-k PTQ, and present an efficient solution for it.

The second challenge we have tackled is to efficiently generate possible mappings for a given schema matching. While this problem can be solved by existing algorithms, we show how to improve the performance of the solution by using a divide-and- conquer approach. An extensive evaluation on realistic datasets show that our approaches significantly improve the efficiency of generating, storing, and querying possible mappings.

 

Grant Support

1. Scalable Cleaning of Probabilistic Databases with Quality Guarantees (RGC GRF, Ref: 711309E, 2010-12). HKD 696,500.

2. Privacy Protection in Location-based Services with Location Cloaking (RGC CERG. Ref: 513806, 2007-09). Amount: HKD 356,000 (Completed).

 

Selected Publications (Click here for a full list)

1. R. Cheng, E. Lo, X. Yang, M. Luk, X. Li, and X. Xie. Explore or Exploit? Effective Strategies for Disambiguating Large Databases. In Very Large Databases Conf. (VLDB 2010), Singapore, Sep, 2010. [Paper]

2. L. Sun, R. Cheng, D. W. Cheung, and J. Cheng. Mining Uncertain Data with Probabilistic Guarantees. In the 16th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (ACM SIGKDD 2010), Washington D.C., USA, Jul, 2010 (Full paper). Acceptance rate: 17%. [Paper]

3. L. Wang, D. Cheung, R. Cheng, and S. Lee. Accelerating Probabilistic Frequent Itemset Mining: A Model-Based Approach. In the ACM 19th Conf. on Information and Knowledge Management (ACM CIKM 2010), Toronto, Canada, Oct 2010. Acceptance rate: 13.4%

4. R. Cheng, J. Gong, and D. Cheung. Managing Uncertainty of XML Schema Matching. In the IEEE Intl. Conf. on Data Engineering (IEEE ICDE 2010), Long Beach, USA, Mar, 2010 (Full paper). Acceptance rate: 12.5%. [Paper][Talk]

5. R. Cheng, J. Chen, and X. Xie. Cleaning Uncertain Data with Quality Guarantees. In Very Large Databases Conf. (VLDB 2008), New Zealand, Aug 2008. Acceptance rate: 16.5%. [Paper][Talk]

 

Contact: For more information, please contact Dr. Reynold Cheng.