The Data Warehouse Configuration Problem
Description
On-line analytical processing (OLAP), business intelligence, and multi-dimensional analysis have been the focus of intense research activity over the last few years. Early OLAP research focused primarily on simple aggregate queries, particularly the evaluation, usage, and maintenance of summary-table views. As data warehousing and analytical applications have gained ground in industry, the challenges facing OLAP technology have increased in scale and complexity. Many applications now require the evaluation of very complex aggregate queries.
This Ph.D. thesis presents a general algebraic operator for the expression and evaluation of complex aggregate queries and considers two research questions within the field of complex OLAP (i.e., aggregation queries that require expressions more complex than simple summary-table views): the evaluation of subquery predicates in the presence of complex aggregation, and the distributed evaluation of complex OLAP queries.
The thesis formalizes the generalized multi-dimensional join (GMD-join), an algebraic operator for complex OLAP, and presents a set of algebraic transformation rules demonstrating how the operator interacts with the other operators of a multi-set algebra. Techniques for evaluating the GMD-join efficiently are considered, and cost formulas for estimating the cost of that evaluation are presented. Together, the algebraic transformations, techniques, and cost model provide a foundation for incorporating the GMD-join, or a similar segmented evaluation operator, into a conventional DBMS.
Subqueries are a common feature of complex OLAP queries. Despite this, no research work has considered the evaluation of subquery predicates in the presence of complex aggregation. The thesis presents a general algorithm that allows subquery predicates to be expressed as GMD-join expressions, thereby enabling them to be evaluated efficiently.
Many of the new applications for complex OLAP involve huge amounts of highly distributed data. Querying such data requires developing and maintaining a distributed data warehouse. This thesis develops a framework and describes a prototype for the distributed processing of complex OLAP queries. A general strategy for the distributed evaluation of complex OLAP queries expressed using GMD-joins is presented, and optimization strategies are developed both for the case where distribution knowledge is available and for the case where no such knowledge is assumed.
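To make the operator more concrete, the following is a minimal Python sketch of one common reading of a GMD-join: given a base-values table B, a detail table R, and a list of (aggregates, condition) components, each row b of B is extended with aggregates computed over the detail rows r that satisfy the component's condition θ(b, r). The names `gmd_join`, `merge_partials`, and the sample sales data are hypothetical and do not come from the thesis. The nested-loop evaluation is deliberately naive and is not the efficient evaluation technique the thesis develops. The second half illustrates, under the assumption that all aggregates are distributive (sums and counts), how per-site partial results could be merged at a coordinator. It is not a description of Skalla's actual algorithm.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Row = Dict[str, Any]
Condition = Callable[[Row, Row], bool]      # theta(b, r)
Aggregate = Callable[[List[Row]], Any]      # function over the matching detail rows
Component = Tuple[Dict[str, Aggregate], Condition]


def gmd_join(base: Sequence[Row], detail: Sequence[Row],
             components: Sequence[Component]) -> List[Row]:
    """Naive semantics: extend each base row with one column per aggregate."""
    result = [dict(b) for b in base]
    for aggs, theta in components:
        for b, out in zip(base, result):
            matching = [r for r in detail if theta(b, r)]
            for name, fn in aggs.items():
                out[name] = fn(matching)
    return result


def merge_partials(base: Sequence[Row], partials: Sequence[List[Row]],
                   agg_names: Sequence[str]) -> List[Row]:
    """Combine per-site results by addition (valid for sums and counts only)."""
    merged = [dict(b) for b in base]
    for i, out in enumerate(merged):
        for name in agg_names:
            out[name] = sum(site_rows[i][name] for site_rows in partials)
    return merged


# Hypothetical sample data: sales per customer.
detail = [
    {"cust": "a", "amount": 50},
    {"cust": "a", "amount": 150},
    {"cust": "b", "amount": 200},
    {"cust": "b", "amount": 30},
    {"cust": "b", "amount": 120},
]
base = [{"cust": "a"}, {"cust": "b"}]

components: List[Component] = [
    # Total sales per customer.
    ({"total": lambda rows: sum(r["amount"] for r in rows)},
     lambda b, r: b["cust"] == r["cust"]),
    # Number of that customer's sales above 100 (a different condition).
    ({"n_big": len},
     lambda b, r: b["cust"] == r["cust"] and r["amount"] > 100),
]

# Centralized evaluation over the full detail table.
central = gmd_join(base, detail, components)

# Distributed evaluation: the detail table is split across two sites, each
# site evaluates the same GMD-join locally, and a coordinator merges them.
sites = [detail[0::2], detail[1::2]]
partials = [gmd_join(base, site, components) for site in sites]
merged = merge_partials(base, partials, ["total", "n_big"])
```

In this sketch the merged result equals the centralized one because sums and counts can be combined by addition. Aggregates such as averages would need to be decomposed into a sum and a count before merging.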
A series of experiments is presented to evaluate the performance of the distributed optimization strategies and to validate the distributed processing algorithm. Finally, the architecture and algorithms of Skalla, a prototype system for the distributed evaluation of complex OLAP queries implemented during the Ph.D. project, are documented.