“From 2005 to 2020, the digital universe will grow by a factor of 300, from 130 exabytes to 40,000 exabytes, or 40 trillion gigabytes (more than 5,200 gigabytes for every man, woman, and child in 2020). From now until 2020, the digital universe will about double every two years.” (IDC study for EMC, 2012)
Data deduplication is one of the most effective ways to reduce the size of data stored in large-scale systems, and is widely used today. The process of deduplication consists of identifying duplicate chunks of data in different files (including backup versions, virtual machine images, etc.), storing a single copy of each unique chunk, and replacing the duplicate chunks with pointers to this copy.
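To make this process concrete, the following is a minimal sketch of hash-based deduplication, assuming fixed-size chunking, SHA-256 fingerprints, and an in-memory chunk store; the names (chunk_store, dedup_file, restore_file) are illustrative only, and production systems typically use content-defined chunking and persistent, indexed metadata instead.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunking; real systems often use content-defined chunking

# Illustrative in-memory chunk store: fingerprint -> unique chunk data
chunk_store = {}

def dedup_file(data: bytes) -> list[str]:
    """Split data into chunks, store each unique chunk once, and return the
    file 'recipe': a list of chunk fingerprints, i.e. the pointers that
    replace duplicate chunks."""
    recipe = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()   # duplicate identification by fingerprint
        if fp not in chunk_store:                # store only the first copy of each chunk
            chunk_store[fp] = chunk
        recipe.append(fp)                        # duplicates become pointers to that copy
    return recipe

def restore_file(recipe: list[str]) -> bytes:
    """Rebuild the original data by following the pointers in the recipe."""
    return b"".join(chunk_store[fp] for fp in recipe)

# Two "backup versions" that share most of their content
v1 = b"A" * 10000 + b"B" * 10000
v2 = b"A" * 10000 + b"C" * 10000
r1, r2 = dedup_file(v1), dedup_file(v2)
assert restore_file(r1) == v1 and restore_file(r2) == v2
print(f"chunks referenced: {len(r1) + len(r2)}, unique chunks stored: {len(chunk_store)}")
```

Running the sketch shows fewer unique chunks stored than chunks referenced, since the two versions share their identical prefix; the gap between the two numbers is the space saved by deduplication.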
The goal of this seminar is to study the basic concepts of deduplication, the challenges and tradeoffs it introduces in the design of large-scale storage systems, and the most recent advances in addressing these challenges. We will cover the various aspects of system design, including chunking, duplicate identification, metadata and reference management, inline vs. background deduplication, centralized vs. distributed deduplication, hard disk vs. SSD storage, and more.