Autonomous Database Partitioning Using Data Mining for High Performance Computing

This material is based upon work supported by the National Science Foundation under Grant No. 0954310. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

Project Summary

Query response time and system throughput are the primary metrics of database and file access performance. Because of data proliferation, efficient access methods and data storage techniques have become increasingly critical to maintaining acceptable query response time and system throughput. Retrieving data from disk is several orders of magnitude slower than retrieving it from memory. One common way to reduce disk I/O, and therefore improve query response time, is database clustering, a process that partitions the database/file vertically (attribute clustering) and/or horizontally (record clustering). To exploit parallelism and improve system throughput, clusters can be placed on different nodes of a cluster machine. A clustering result is optimized for a given set of queries. In dynamic systems, however, the queries change over time, the clustering in place becomes obsolete, and the database/file needs to be re-clustered dynamically.
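As a minimal illustration of the two partitioning directions described above, the following Python sketch splits a small in-memory table vertically (by attribute cluster) and horizontally (by a record predicate). The table, the attribute names, the key column `id`, and both helper functions are hypothetical and chosen only for this example; they are not part of the proposed system.

```python
def vertical_partition(rows, attribute_clusters, key):
    """Attribute clustering: build one fragment per attribute cluster.

    The key is repeated in every fragment so records can be rejoined.
    """
    return [
        [{key: row[key], **{attr: row[attr] for attr in cluster}} for row in rows]
        for cluster in attribute_clusters
    ]


def horizontal_partition(rows, predicate):
    """Record clustering: split records into those that match and the rest."""
    matching = [row for row in rows if predicate(row)]
    rest = [row for row in rows if not predicate(row)]
    return matching, rest


# Hypothetical table used only for illustration.
employees = [
    {"id": 1, "name": "Ann", "salary": 50, "dept": "A"},
    {"id": 2, "name": "Bob", "salary": 70, "dept": "B"},
    {"id": 3, "name": "Cy", "salary": 60, "dept": "A"},
]

pay_fragment, dept_fragment = vertical_partition(
    employees, [("name", "salary"), ("dept",)], key="id"
)
dept_a, other_depts = horizontal_partition(employees, lambda r: r["dept"] == "A")
```

A query that reads only `name` and `salary` would then touch `pay_fragment` alone, which is the intended source of the I/O savings.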

This project proposes to develop an efficient algorithm for database/file clustering that dynamically and automatically generates attribute and record clusters based on closed item sets mined from the attribute and record sets found in the queries running against the database/files. It will then develop ways to implement the algorithm using the cluster computing paradigm to further reduce query response time and increase system throughput through parallelism and data redundancy. The developed algorithms will be prototyped on a cluster computer with 486 compute nodes available at the University of Oklahoma. Performance studies will be conducted using the decision support system database benchmark (TPC-H) and real data, recorded in database and file formats, collected from applications such as meteorology, microbiology and healthcare.
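To make the notion of closed item sets concrete, the following Python sketch treats the set of attributes referenced by each query as one transaction and finds the frequent item sets that are closed, that is, those with no proper superset of equal support. This is an illustrative brute-force enumeration, not the algorithm the project proposes; the query log, the attribute names, and the support threshold are all assumptions made for the example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for every itemset meeting min_support.

    Brute-force level-wise enumeration, suitable only for small examples.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            itemset = frozenset(candidate)
            support = sum(1 for t in transactions if itemset <= t)
            if support >= min_support:
                result[itemset] = support
                found_any = True
        if not found_any:
            break  # support is anti-monotone: no larger itemset can be frequent
    return result


def closed_itemsets(frequent):
    """Keep only itemsets that have no proper superset with the same support."""
    return {
        itemset: support
        for itemset, support in frequent.items()
        if not any(
            itemset < other and support == other_support
            for other, other_support in frequent.items()
        )
    }


# Hypothetical query log: the attributes each query accesses.
query_attribute_sets = [
    frozenset({"name", "salary"}),
    frozenset({"name", "salary", "dept"}),
    frozenset({"name", "salary"}),
    frozenset({"dept", "location"}),
    frozenset({"dept", "location"}),
]

closed = closed_itemsets(frequent_itemsets(query_attribute_sets, min_support=2))
```

In this example the closed item sets are {name, salary}, {dept, location} and {dept}. One plausible reading is that such sets serve as candidate attribute clusters, since their attributes are repeatedly accessed together; how the proposed algorithm actually derives clusters from them is not specified in this summary.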

Intellectual Merits:

While much work has been published on indexing, buffering, clustering and parallelism as techniques for improving system performance, little has been done to automate these processes, especially the automatic and dynamic clustering of data on storage media for high-end computing. This is an important area because, with data proliferation, human attention has become a precious and expensive resource. It is therefore important to automate this process in order to minimize operating cost. The intellectual merits of this proposal lie in three contributions: 1) a database/file clustering technique that uses data mining to automatically and dynamically cluster and re-cluster a database/file with little intervention from a database/system administrator, 2) an approach that integrates the proposed clustering technique with the cluster computing architecture to improve query response time and system throughput, and 3) comprehensive performance studies by means of prototyping that use not only a popular database benchmark, but also real database and file datasets from data- and computation-intensive applications. This proposal is high risk and high payoff and is suitable for EAGER because the proposed ideas for integrating autonomous database/file clustering with cluster computing, although novel and potentially transformative, are at an early stage. Details need to be developed and tested to prove their feasibility on cluster computers with at least a few hundred nodes. Once feasibility is proved, a full proposal that addresses both autonomous attribute and record clustering for high performance computers running on local and wide area networks will be developed and submitted.

Broader Impacts:

The research results will benefit many applications, as they are expected to improve query response time and system throughput. Collaboration will be carried out with scientists in the Oklahoma Center for Analysis and Prediction of Storms (CAPS) and domain experts in other application areas. The research results, including the proposed prototype and the datasets used for performance studies, will be published in journals and conference proceedings, and will be posted on the website of the PI's Database Group (http://www.cs.ou.edu/~database) for public use. One graduate student and one undergraduate student will be supported by the project as research assistants (RAs) to conduct research and build the prototype for performance evaluations. The RAs will be trained in database and file management, and in high-end computing and applications. The PI will work with the minority engineering program in the College of Engineering and the ACM-W Chapter at the University of Oklahoma to identify minority and female students for RA recruitment. The PI has supervised and graduated ten minority and female PhD and Master's students.