Managing Aging Data

Nykredit Center for Database Research

 

 



Title: Specification-Based Techniques for The Reduction of Temporal and Multidimensional Data

By: Janne Skyt

Advisor: Christian S. Jensen

Status: Thesis defended September 17, 2001

Description

A wide range of real-world database applications, including financial and medical applications, face severe growth problems, and these problems are relevant to a wide range of database systems. A main challenge in improving the way these problems are currently handled is to create a frame of reference and a reduction strategy that can be applied across different database contexts and different understandings of aging data. The goal of this thesis is to create such a frame of reference and strategy, allowing databases to be reduced through structured and disciplined specifications in different contexts. (Aging data is data that is no longer considered useful, i.e., data that is old according to time, content, characteristics, or interpretation.)

Many real-world database applications face accountability and traceability requirements that lead to the replacement of the usual update-in-place policy by an append-only policy, yielding so-called transaction-time databases. With logical deletions being implemented as insertions at the physical level, these databases retain all previously current states and are ever-growing. Meeting these requirements thus poses severe growth problems. Append-only databases are also prominent in the data warehouse context: by Inmon's definition, a data warehouse is characterized by all its data exhibiting temporal dimensions. Data is typically time-stamped and (bulk) loaded at regular intervals, and is retained in the warehouse for a number of years. The growth problem is thus highly similar in this context; however, the understanding of aging data and the requirements to be met when reducing the data turn out to be different, due to the usual focus on business analysis in data warehousing.
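The append-only policy described above can be illustrated with a minimal sketch. It is not taken from the thesis: the names `LogRecord`, `AppendOnlyTable`, `put`, `delete`, and `as_of` are hypothetical, and a simple counter stands in for transaction time. The point is only that a logical deletion is stored as one more record, so every earlier state stays reconstructible and the physical store never shrinks.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class LogRecord:
    tt: int                 # transaction time (here: a simple counter, an assumption)
    key: str
    value: Optional[str]    # None marks a logical deletion


class AppendOnlyTable:
    """Hypothetical transaction-time table: records are only ever appended."""

    def __init__(self) -> None:
        self._log: List[LogRecord] = []

    def _append(self, key: str, value: Optional[str]) -> int:
        tt = len(self._log) + 1
        self._log.append(LogRecord(tt, key, value))
        return tt

    def put(self, key: str, value: str) -> int:
        """Insert or update: physically an insertion of a new version."""
        return self._append(key, value)

    def delete(self, key: str) -> int:
        """Logical deletion: physically also an insertion."""
        return self._append(key, None)

    def as_of(self, tt: int) -> Dict[str, str]:
        """Reconstruct the state that was current at transaction time tt."""
        state: Dict[str, str] = {}
        for rec in self._log:
            if rec.tt > tt:
                break
            if rec.value is None:
                state.pop(rec.key, None)
            else:
                state[rec.key] = rec.value
        return state

    def physical_size(self) -> int:
        return len(self._log)


table = AppendOnlyTable()
table.put("acct-1", "100")   # tt = 1
table.put("acct-1", "250")   # tt = 2
table.delete("acct-1")       # tt = 3
```

In this sketch `table.as_of(3)` is empty, yet `table.physical_size()` is 3: the logically deleted account still occupies storage, which is the growth problem in miniature.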

Many techniques have been proposed to curb the growth problems, but support for physical data reduction has received precious little attention, even though it is called for by, e.g., the laws of many countries. A variety of physical storage structures and indexing techniques, as well as query languages, have been proposed for transaction-time databases, and many precomputation and materialization techniques have been proposed for data warehousing. Although a few cleaning daemons have been proposed, none of them has the goal of satisfying legal requirements for data reduction or of referring to the aging of data.

This thesis suggests a way to specify data reduction in transaction-time databases and data warehousing. It uses the nature and the structure of data in these database systems to apply the same reduction strategy, accepting the different limitations of the systems and achieving the different goals of accountability and traceability as well as maintenance of the business knowledge. The following challenges are addressed in general.

  • Definition of the concepts and operators to serve as a solid foundation for data reduction in append-only databases
  • Satisfaction of consistency requirements
  • Specification of techniques to maintain statistics and aggregate data in the reduced database, since these may remain valuable even when the detail data no longer is
  • Implementation strategies for the data reduction system in the context of different database systems
  • Maintainability of the data reduction specification system
  • Handling of queries “faithfully” in the context of a reduced database
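The last challenge can be made concrete with a small sketch. The text above does not define "faithfully"; the sketch assumes one possible reading, namely that a query against a reduced database should report whether its answer may have been affected by physical deletion. The class `ReducedLog`, its methods, and the reduction rule (keep only the latest record per key at or before a cutoff) are all hypothetical illustrations, not the thesis's definitions.

```python
from typing import Dict, List, Optional, Tuple

# (transaction time, key, value); a value of None marks a logical deletion
Record = Tuple[int, str, Optional[str]]


class ReducedLog:
    """Hypothetical append-only log that supports physical reduction."""

    def __init__(self, records: List[Record]) -> None:
        self.records = sorted(records, key=lambda r: r[0])
        self.reduced_up_to = 0   # states before this time may be incomplete

    def reduce(self, cutoff: int) -> None:
        """Physically drop records superseded at or before the cutoff."""
        latest: Dict[str, Record] = {}
        for rec in self.records:
            if rec[0] <= cutoff:
                latest[rec[1]] = rec
        self.records = [
            r for r in self.records
            if r[0] > cutoff or latest[r[1]] is r
        ]
        self.reduced_up_to = max(self.reduced_up_to, cutoff)

    def as_of(self, tt: int) -> Tuple[Dict[str, str], bool]:
        """Return the state at time tt and a flag telling whether it is exact."""
        state: Dict[str, str] = {}
        for rec_tt, key, value in self.records:
            if rec_tt > tt:
                break
            if value is None:
                state.pop(key, None)
            else:
                state[key] = value
        exact = tt >= self.reduced_up_to
        return state, exact


log = ReducedLog([(1, "a", "100"), (2, "a", "250"), (3, "b", "7"), (4, "a", None)])
log.reduce(cutoff=2)
```

After the reduction, a query as of time 3 is still answered exactly, while a query as of time 1 comes back flagged as possibly incomplete instead of silently returning a wrong state.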

The structure of the thesis is as follows. First, the challenges of database reduction in different database contexts are stated. Then, in two larger parts, concepts as well as reduction and aggregation strategies are presented for transaction-time databases and data warehousing, respectively. For transaction-time databases, two techniques are presented: vacuuming is a technique that offers specification of physical deletion, and persistent views are views that are immune to physical deletion and thus allow specified data to be retained while physical deletion is in use. For data warehouses, an approach called aggregation-based data reduction is presented. It is a technique that offers gradual aggregation of facts to higher-level granularities in the hierarchical dimensions. Its use is illustrated by a clickstream-analysis case study done in collaboration with Nykredit Data, which analyzes series of clicks on the Nykredit web site.
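The idea of gradually aggregating facts to coarser granularities can be sketched as follows. This is an illustration under stated assumptions, not the specification language of the thesis: the fact layout `(day, page, clicks)`, the day/month/year time hierarchy, the two cutoff dates, and the function names `time_value` and `reduce_facts` are all hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple

DetailFact = Tuple[date, str, int]   # (day, page, clicks) at the finest granularity


def time_value(day: date, month_cutoff: date, year_cutoff: date) -> str:
    """Pick the granularity for a fact's time value based on its age."""
    if day < year_cutoff:
        return day.strftime("%Y")        # oldest facts: year level
    if day < month_cutoff:
        return day.strftime("%Y-%m")     # older facts: month level
    return day.isoformat()               # recent facts: day level


def reduce_facts(
    facts: List[DetailFact], month_cutoff: date, year_cutoff: date
) -> Dict[Tuple[str, str], int]:
    """Sum clicks per (time value, page) after coarsening old time values."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for day, page, clicks in facts:
        totals[(time_value(day, month_cutoff, year_cutoff), page)] += clicks
    return dict(totals)


facts = [
    (date(1998, 5, 2), "/loans", 3),
    (date(1998, 11, 30), "/loans", 4),
    (date(2000, 1, 3), "/loans", 5),
    (date(2000, 1, 20), "/loans", 7),
    (date(2000, 3, 1), "/loans", 4),
]
reduced = reduce_facts(facts, month_cutoff=date(2000, 2, 1), year_cutoff=date(1999, 1, 1))
```

Here five detail facts shrink to three, while the total number of clicks is preserved. A sum is used because it can be re-aggregated from month level to year level without access to the deleted detail facts; other measures would need their own treatment.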

Further reading:

J. Skyt and C. S. Jensen, Persistent Views - A Mechanism for Managing Aging Data, August 2001. Available as TimeCenter Report TR-65 [.pdf]

J. Skyt, C. S. Jensen, and T. B. Pedersen, Specification-Based Data Reduction in Dimensional Data Warehouses, July 2001. Available as TimeCenter Report TR-61 [.pdf]

J. Skyt and C. S. Jensen, Vacuuming Temporal Databases, 1998 [.ps.gz]
(An extended version of this paper is under submission, October 2000)

J. Andersen, A. Giversen, A. H. Jensen, R. S. Larsen, T. B. Pedersen, and J. Skyt, Analyzing Clickstreams Using Subsessions, in "Proceedings of the Third International Workshop on Data Warehousing and OLAP," Washington DC, November 2000 [.ps.gz] 

J. Skyt and C. S. Jensen, Managing Aging Data Using Persistent Views (extended abstract), in "Proceedings of the Fifth IFCIS International Conference on Cooperative Information Systems," Eilat, Israel, September 2000 [.ps.gz]

J. Skyt, Managing Aging Data in Temporal Databases and Data Warehouses--Vacuuming and Persistent Views, in "Proceedings of EDBT 2000 Ph.D. Workshop," Konstanz, Germany, March 2000 [.ps.gz] 

 

Copyright © 1998 - 2000.  All rights reserved.