Multi-Order Data Deduplication with GPGPUs for Data-Intensive Computing
Data deduplication is widely recognized as a critical technique for reducing the data volume that must be transferred over the interconnect and stored on storage devices in data-intensive high-performance computing, Cloud computing, and Big Data computing. Current data deduplication solutions, however, suffer from costly byte-by-byte comparisons in the case of hash collisions. In this research, we propose a Multi-Order Data Deduplication with GPGPUs method (MODD for short) to address the costly comparison issue in conventional data deduplication solutions. The idea is to reduce hash collisions by leveraging multiple fingerprinting algorithms, thereby reducing the need for, and the cost of, byte-by-byte comparisons. MODD also leverages GPGPUs to compute the multiple hashing algorithms in a highly concurrent, massively parallel fashion, avoiding the delay that multi-order deduplication would otherwise introduce. We have performed investigations to identify the desired properties of complementary hashing algorithms and have evaluated the reductions in hash collisions and the savings in byte-by-byte comparisons. We are also in the process of prototyping and evaluating the use of GPGPUs to speed up the hash computations. The proposed MODD method considerably reduces the costly byte-by-byte comparisons in data deduplication and further trades computation capability for data access capability (further reducing data movement), in the same spirit as the original data deduplication technique. It holds promise for wide applicability in high-performance computing, Cloud computing, and Big Data computing.
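The core idea of multi-order fingerprinting can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes two independent hash algorithms (SHA-256 and BLAKE2b are arbitrary stand-ins) and declares a chunk a duplicate only when all fingerprints agree, so a collision in one hash alone no longer forces a byte-by-byte comparison.

```python
import hashlib

def fingerprints(chunk: bytes) -> tuple:
    # First- and second-order fingerprints from two independent hash
    # algorithms (hypothetical choices, for illustration only).
    return (hashlib.sha256(chunk).digest(),
            hashlib.blake2b(chunk).digest())

class MultiOrderDedupIndex:
    """Sketch of a multi-order deduplication index.

    A chunk is treated as a duplicate only when ALL fingerprint orders
    match an indexed chunk, drastically lowering the probability that a
    false match slips through on a single-hash collision.
    """
    def __init__(self):
        # Map: first-order digest -> (second-order digest, chunk id)
        self.index = {}

    def insert(self, chunk_id, chunk: bytes) -> bool:
        """Index the chunk; return True if it duplicates a stored chunk."""
        h1, h2 = fingerprints(chunk)
        hit = self.index.get(h1)
        if hit is None:
            self.index[h1] = (h2, chunk_id)
            return False
        stored_h2, _stored_id = hit
        if stored_h2 == h2:
            # Both orders agree: accept as a duplicate without reading
            # the stored chunk back for a byte-by-byte comparison.
            return True
        # First-order collision only: in a full system this is where a
        # byte-by-byte comparison (or a third-order hash) would be used.
        return False
```

In the MODD setting, the per-order hash computations are independent, which is what makes them amenable to concurrent evaluation on GPGPUs.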