Dedup Workload Modeling, Synthetic Datasets, and Scalable Benchmarking


STONY BROOK, NY, March 12, 2012 

Electronic data volumes keep growing at rapid rates, costing users precious space and increasing the total cost of ownership over the lifetime of the data (energy, performance, etc.). Data deduplication is a popular technique for reducing the amount of data that must actually be retained. Several vendors offer dedup-based products, and many publications are available. Alas, there is a serious lack of comparable results across systems. Often the problem is a lack of realistic data sets that can be shared without violating privacy; moreover, good data sets can be very large and difficult to share. Many papers publish results using small or non-representative data sets (e.g., successive Linux kernel source tarballs). Lastly, there is no agreement on what constitutes a “realistic” data set.

In this project we are developing tools and techniques to produce realistic, scalable, dedupable data sets that take actual workloads into account. We are analyzing the dedupability properties of several data sets we have access to, and we are developing and releasing tools that let anyone analyze their own data sets without violating privacy. Next, we are building models that describe the important inherent properties of those data sets. Finally, we generate synthetic data that follows these models, producing data sets far larger than the originals while still faithfully modeling the original data.
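To illustrate the kind of privacy-preserving analysis such a tool can perform, the Python sketch below reduces a directory tree to a trace of chunk hashes only, so the trace reveals duplication structure but no file contents. This is a simplified illustration, not the project's released tool; the 4 KB fixed chunk size and the use of SHA-256 are assumptions made for the example.

# Sketch: reduce a file tree to an anonymized trace of chunk hashes.
# Fixed-size chunking and SHA-256 are illustrative choices only.
import hashlib
import os
import sys

CHUNK_SIZE = 4096  # bytes; an assumed chunk size

def chunk_hashes(path, chunk_size=CHUNK_SIZE):
    """Yield the hash of each fixed-size chunk of one file."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield hashlib.sha256(chunk).hexdigest()

def scan_tree(root):
    """Walk a directory and emit only chunk hashes -- never file contents."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                for h in chunk_hashes(full):
                    print(h)
            except OSError:
                pass  # skip unreadable files

if __name__ == "__main__":
    scan_tree(sys.argv[1])

Because only hashes leave the machine, such a trace can be shared and compared without exposing the underlying data.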

Our preliminary prototype work is promising: we have begun developing tools that chunk and hash backup and online data sets and then extract key properties, such as the distribution of duplicate hashes. We are currently building Markov models to represent the dedupability of backup data sets over time.
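The following sketch (again simplified Python, not the actual prototype) shows one way a chunk-hash trace could be reduced to two of the properties mentioned above: the distribution of duplicate hashes, and the transition counts of a simple two-state (unique/duplicate) Markov chain. The project's real models are more detailed; this is only a toy illustration of the idea.

# Sketch: from a chunk-hash trace (one hash per line on stdin), compute
# the duplicate-count distribution and estimate a two-state Markov chain
# over unique ('U') vs. duplicate ('D') chunks.
import sys
from collections import Counter

def analyze(hashes):
    """hashes: ordered list of chunk hashes from a trace."""
    counts = Counter(hashes)

    # Duplicate-hash distribution: number of distinct hashes seen k times.
    dup_dist = Counter(counts.values())

    # Label each chunk 'U' (first occurrence) or 'D' (repeat) in trace
    # order and count state transitions.
    seen = set()
    transitions = Counter()
    prev = None
    for h in hashes:
        state = 'D' if h in seen else 'U'
        seen.add(h)
        if prev is not None:
            transitions[(prev, state)] += 1
        prev = state

    # Normalize counts into transition probabilities P(next | current).
    probs = {}
    for (a, b), n in transitions.items():
        total = sum(v for (x, _), v in transitions.items() if x == a)
        probs[(a, b)] = n / total
    return dup_dist, probs

if __name__ == "__main__":
    trace = [line.strip() for line in sys.stdin if line.strip()]
    dist, probs = analyze(trace)
    print("duplicate-count distribution:", dict(dist))
    print("transition probabilities:", probs)

A trace produced by the earlier sketch can be piped directly into this analysis; the resulting distribution and transition probabilities are the kind of parameters a synthetic-data generator could then replay at much larger scale.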