Managing Aging Data
Description

A wide range of
real-world database applications, including financial and medical
applications, face severe growth problems that are relevant to a wide
range of database systems. A main challenge of improving
the current way of handling these problems is to create a frame of reference
and a reduction strategy that can be applied to the different database
contexts and different understandings of aging data. The goal of this
thesis is to create such a frame of reference and strategy, allowing
databases to be reduced by structured and disciplined specifications in
different contexts. (Aging data is data that is no longer considered useful,
i.e., data that is old according to time, content, characteristics, or
interpretation.)

Many real-world
database applications face accountability and traceability requirements that
lead to the replacement of the usual update-in-place policy by an append-only
policy, yielding so-called transaction-time databases. With logical
deletions being implemented as insertions at the physical level, these
databases retain all previously current states and are ever-growing. Thus,
meeting these requirements poses severe growth problems. Append-only
databases are also prominent in the data warehouse context: by Inmon's
definition, a data warehouse is characterized by all its data exhibiting
temporal dimensions. Data is typically time stamped and (bulk) loaded at
regular intervals, and is retained in the warehouse for a number of years.
Thus the growth problem is highly similar in this context; however, the
understanding of aging data and the requirements for reducing the data
turn out to be different because of the usual focus on business analysis in
data warehousing.

To suppress the
growth problems, many techniques have been proposed, but the support for
physical data reduction has received precious little attention even though
it is called for by, e.g., the laws of many countries. A variety of physical
storage structures and indexing techniques as well as query languages are
proposed for transaction-time databases, and many precomputation and
materialization techniques are proposed for data warehousing. However,
although a few cleaning daemons have been proposed, none of them aims to
satisfy legal requirements for data reduction or to address the aging
of data.

This thesis suggests a way to specify data reduction
in transaction-time databases and data warehousing. It uses the nature and
the structure of data in these database systems to apply the same reduction
strategy, accepting the different limitations of the systems and achieving
the different goals of accountability and traceability as well as maintenance
of the business knowledge. The following challenges are addressed in general.

The structure of the thesis is as follows. First, the challenges of database
reduction in different database contexts are stated. Then, in two larger
parts, concepts as well as reduction and aggregation strategies are presented
for transaction-time databases and data warehousing, respectively. For
transaction-time databases, two techniques are presented: vacuuming, a
technique for specifying physical deletion, and persistent views, which are
views that are immune to physical deletion and thus allow specified data to
be retained while physical deletion is used. For data
warehouses an approach called aggregation-based data reduction is
presented and its application is illustrated. Aggregation-based data
reduction is a technique that offers gradual aggregation of facts to
higher-level granularities in the hierarchical dimensions. Its use is
illustrated by a clickstream-analysis case study, done in collaboration with
Nykredit Data, that analyzes series of clicks on the Nykredit website.
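
As a rough illustration of the idea behind aggregation-based data reduction, the following minimal Python sketch rolls older click facts up to coarser levels of a day/month/year time hierarchy while keeping recent facts at full detail. It is not the specification technique developed in the thesis: the fact layout, the age thresholds, and the names `granularity` and `reduce_facts` are all assumptions made for this example, and the data is invented.

```python
from collections import defaultdict
from datetime import date

# Hypothetical click facts: (day, page, clicks). Invented for illustration,
# not taken from the Nykredit case study.
facts = [
    (date(2001, 1, 3), "/loans", 4),
    (date(2001, 1, 17), "/loans", 6),
    (date(2001, 2, 5), "/loans", 1),
    (date(2002, 1, 15), "/loans", 5),
    (date(2002, 1, 20), "/loans", 2),
    (date(2002, 5, 30), "/loans", 2),
    (date(2002, 6, 2), "/loans", 3),
]


def granularity(day, today, month_after_days=90, year_after_days=365):
    """Pick the level of the time hierarchy a fact is kept at, by its age.

    The two thresholds are assumed values, not ones prescribed by the thesis.
    """
    age = (today - day).days
    if age > year_after_days:
        return ("year", day.year)
    if age > month_after_days:
        return ("month", day.year, day.month)
    return ("day", day.year, day.month, day.day)


def reduce_facts(facts, today):
    """Sum the clicks of all facts that map to the same (time level, page)."""
    totals = defaultdict(int)
    for day, page, clicks in facts:
        totals[(granularity(day, today), page)] += clicks
    return dict(totals)


reduced = reduce_facts(facts, today=date(2002, 6, 10))
for key, clicks in sorted(reduced.items(), key=lambda item: str(item[0])):
    print(key, clicks)
```

In this sketch the seven detailed facts shrink to four rows, and the total click count is preserved; only the time resolution of the older facts is given up.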