Data Deduplication

Frequency Based Chunking for Data De-Duplication

Future Work:

 

Refine our FBC algorithm by setting the escape frequency for coarse-grained CDC chunks.

 

Design a one-pass algorithm that performs frequency estimation and chunking simultaneously.

 

Develop an efficient Bloom filter design for parallel filtering.

 

This work was partially supported by grants from the NSF (NSF Awards 0960833 and 0934396).

A predominant portion of Internet services, such as content delivery networks, news broadcasting, blog sharing, and social networks, is data centric. These services generate a significant amount of new data each day, and efficiently storing and maintaining backups of that data is a challenging task for current data storage systems. Chunking-based deduplication (dedup) methods are widely used to eliminate redundant data and hence reduce the total storage space required. In this paper, we propose a novel Frequency Based Chunking (FBC) algorithm. Unlike the most popular Content-Defined Chunking (CDC) algorithm, which divides the data stream randomly according to the content, FBC explicitly utilizes chunk frequency information in the data stream to enhance the deduplication gain, especially when metadata overhead is taken into consideration. The FBC algorithm consists of two components: a statistical chunk frequency estimation algorithm that identifies globally frequent chunks, and a two-stage chunking algorithm that uses these chunk frequencies to obtain a better chunking result. To evaluate the effectiveness of the proposed FBC algorithm, we conducted extensive experiments on heterogeneous datasets. In all experiments, the FBC algorithm consistently outperforms the CDC algorithm, either achieving a better dedup gain or producing far fewer chunks. In particular, our experiments show that FBC produces 2.5 to 4 times fewer chunks than a baseline CDC that achieves the same Duplicate Elimination Ratio (DER). Another benefit of FBC over CDC is that FBC with an average chunk size greater than or equal to that of CDC achieves up to 50% higher DER than the CDC algorithm.
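To make the CDC baseline and the DER metric mentioned above more concrete, here is a minimal Python sketch. It is not the FBC algorithm and not the implementation used in the paper. The rolling hash is a toy shift-and-add hash rather than a production fingerprint, the function names `cdc_chunks` and `der` are hypothetical, and the metadata model (a fixed `metadata_per_chunk` cost for every chunk in the stream) is an assumption made only to show why producing fewer chunks can improve DER once metadata is counted.

```python
import hashlib


def cdc_chunks(data: bytes, mask: int = 0x3F, min_size: int = 16, max_size: int = 256):
    """Split data at content-defined boundaries.

    Toy sketch: a boundary is declared where the low bits of a simple
    shift-and-add hash are all zero, subject to min/max chunk sizes.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # toy rolling hash (assumption)
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def der(chunks, metadata_per_chunk: int = 20) -> float:
    """Duplicate Elimination Ratio: input bytes / (stored bytes + metadata bytes).

    Assumed metadata model: a fixed cost for every chunk in the stream.
    """
    total = sum(len(c) for c in chunks)
    unique = {hashlib.sha1(c).digest(): len(c) for c in chunks}
    stored = sum(unique.values())
    metadata = metadata_per_chunk * len(chunks)
    return total / (stored + metadata)


# Usage: two copies of the same pseudo-random block, shifted by one inserted byte.
block = b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(128))
data = block + b"X" + block
chunks = cdc_chunks(data)
print(len(chunks), der(chunks, metadata_per_chunk=0), der(chunks))
```

Because boundaries depend on content rather than on byte offsets, the chunker tends to realign on the second copy shortly after the inserted byte, so most of its chunks are detected as duplicates. Raising `metadata_per_chunk` lowers the DER for the same chunking, which is the overhead that a chunker producing fewer chunks can reduce.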

 

Publications:

 

Guanlin Lu, Yu Jin, David H.C. Du, Frequency Based Chunking Algorithm for Data Deduplication, in the 18th Annual Meeting of the IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS 2010), Miami, Florida, August 2010. (Extended paper; in the top 16% of submitted papers)

 

Guanlin Lu, Yu Jin, David H.C. Du, Frequency-Based Chunking for Backup Streams, in the poster session of the 8th USENIX Conference on File and Storage Technologies (FAST 2010), San Jose, CA, February 2010.

 

The views and opinions expressed in this page are strictly those of the page author.
The contents of this page have not been reviewed or approved by the University of Minnesota.