# Cloud Storage and Data De-Duplication

December 4, 2017

Efficient and secure data storage is essential in the cloud. As the volume of data produced worldwide grows rapidly, networked and multi-user storage systems are becoming increasingly popular. However, concerns over data security still prevent many users from migrating their data to remote storage.

Data deduplication is a technique for eliminating redundant data in a data set. During deduplication, extra copies of the same data are removed so that only one copy is stored. The data is analysed for duplicate byte patterns to confirm that the retained instance really is identical to the copies it replaces. Each duplicate is then replaced with a reference that points to the stored chunk.

### Data deduplication

Data deduplication is a technique for reducing storage space. It identifies redundant data by comparing the hash values of data chunks, stores only one copy, and creates logical pointers to that copy instead of storing the redundant data again. Because deduplication reduces data volume, it lowers disk space and network bandwidth requirements, which in turn reduces the cost and energy consumption of running storage systems.

Figure 1: Data de-duplication View
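The idea of keeping one copy and pointing every duplicate at it can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the class name `SingleInstanceStore`, its methods, and the choice of SHA-256 as the hash are assumptions made for the example.

```python
import hashlib


class SingleInstanceStore:
    """Keeps one copy of each distinct payload, keyed by its SHA-256 digest."""

    def __init__(self):
        self.blobs = {}      # digest -> bytes (the single stored copy)
        self.pointers = {}   # file name -> digest (the logical pointer)

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        # Store the payload only if this digest has not been seen before.
        self.blobs.setdefault(digest, data)
        # Every name, duplicate or not, gets a pointer to the stored copy.
        self.pointers[name] = digest
        return digest

    def get(self, name):
        return self.blobs[self.pointers[name]]


store = SingleInstanceStore()
store.put("report.doc", b"quarterly numbers")
store.put("report-copy.doc", b"quarterly numbers")  # duplicate content
store.put("notes.txt", b"meeting notes")

stored_copies = len(store.blobs)      # distinct payloads actually stored
logical_files = len(store.pointers)   # file names visible to users
```

Here three logical files occupy the space of two payloads, because the duplicate is represented only by a pointer.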

The objective of data deduplication is to improve storage efficiency. Traditional deduplication systems identify duplicated data chunks and keep only one replica of each in storage; logical pointers are created for the other copies instead of storing redundant data. This can reduce both storage space and network bandwidth. However, such techniques can have a negative impact on system fault tolerance. Because many files refer to the same data chunk, the loss of that one chunk through a failure makes all of those files unavailable, which reduces reliability. To address this problem, many approaches have been proposed that aim not only to achieve storage efficiency but also to improve fault tolerance.
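One way to reason about the fault-tolerance concern is to count how many files point at each chunk: the higher the count, the more files are lost if that chunk fails. The sketch below does this and then applies a hypothetical policy that keeps an extra copy of heavily shared chunks. The function names, the `threshold` and `extra_copies` parameters, and the sample digests are all assumptions for illustration, not a method taken from any specific system.

```python
from collections import Counter


def reference_counts(file_recipes):
    """Count how many files point at each chunk digest."""
    counts = Counter()
    for digests in file_recipes.values():
        # A file may use a chunk more than once; count it once per file.
        counts.update(set(digests))
    return counts


def replica_plan(counts, threshold=2, extra_copies=1):
    """Hypothetical policy: chunks shared by `threshold` or more files get extra copies."""
    return {
        digest: 1 + (extra_copies if n >= threshold else 0)
        for digest, n in counts.items()
    }


# Each file is described by the list of chunk digests it points to.
recipes = {
    "a.vhd": ["c1", "c2", "c3"],
    "b.vhd": ["c1", "c2", "c4"],
    "c.vhd": ["c1", "c5"],
}
counts = reference_counts(recipes)
plan = replica_plan(counts)
```

In this example chunk `c1` is referenced by all three files, so losing it would affect every file, and the policy assigns it a second copy; chunks used by a single file stay at one copy.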

### Applications

Data deduplication, as implemented in Windows Server, provides practical ways to improve storage efficiency, including:

• Capacity optimization. Data deduplication stores more data in less physical space. It achieves greater storage efficiency than was possible by using features such as Single Instance Storage (SIS) or NTFS compression. Data deduplication uses subfile variable-size chunking and compression, which deliver optimization ratios of 2:1 for general file servers and up to 20:1 for virtualization data.
• Scale and performance. Data deduplication is highly scalable, resource efficient, and nonintrusive. It can process up to 50 MB per second in Windows Server 2012 R2, and about 20 MB of data per second in Windows Server 2012. It can run on multiple volumes simultaneously without affecting other workloads on the server.
• Reliability and data integrity. When data deduplication is applied, the integrity of the data is maintained. Data deduplication uses checksum, consistency, and identity validation to ensure data integrity. For all metadata and the most frequently referenced data, data deduplication maintains redundancy to ensure that the data is recoverable in the event of data corruption.
• Bandwidth efficiency with BranchCache. Through integration with BranchCache, the same optimization techniques are applied to data transferred over the WAN to a branch office. The result is faster file download times and reduced bandwidth consumption.
• Optimization management with familiar tools. Data deduplication has optimization functionality built into Server Manager and Windows PowerShell. Default settings can provide savings immediately, or administrators can fine-tune the settings to see more gains.
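The optimization ratios quoted in the list translate into space savings with simple arithmetic: a ratio of N:1 means the data occupies 1/N of its original size. The helper below is a small sketch of that calculation; the function name `space_saved` and the 10 TB figure are assumptions chosen for the example.

```python
def space_saved(ratio):
    """Fraction of physical space saved for an optimization ratio of `ratio`:1."""
    if ratio < 1:
        raise ValueError("ratio must be at least 1")
    return 1 - 1 / ratio


general_servers = space_saved(2)    # 2:1 ratio, half of the space is saved
virtualization = space_saved(20)    # 20:1 ratio, about 95% of the space is saved

# Physical space needed for a hypothetical 10 TB of logical data at 20:1.
physical_tb = 10 / 20
```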

Data deduplication identifies duplicate data, removes redundancies, and reduces the overall volume of data transferred and stored. There are two main methods: block-level and byte-level deduplication. Both optimize storage capacity. Before selecting one approach over the other, review when, where, and how each process works against your data backup environment and its specific requirements.

### Block-level Approaches

Block-level data deduplication segments data streams into blocks, inspecting the blocks to determine if each has been encountered before (typically by generating a digital signature or unique identifier via a hash algorithm for each block). If the block is unique, it is written to disk and its unique identifier is stored in an index; otherwise, only a pointer to the original, unique block is stored. By replacing repeated blocks with much smaller pointers rather than storing the block again, disk storage space is saved.
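The block-level procedure described above can be sketched as follows. This is a simplified illustration that assumes fixed-size blocks and SHA-256 identifiers; the names `dedup_blocks`, `rebuild`, and `BLOCK_SIZE` are hypothetical, and the block size is deliberately tiny so the result is easy to follow.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose; real systems use far larger blocks


def dedup_blocks(stream, index, block_size=BLOCK_SIZE):
    """Split `stream` into fixed-size blocks and return a list of block identifiers.

    `index` maps identifier -> block bytes and is updated in place.
    """
    recipe = []
    for start in range(0, len(stream), block_size):
        block = stream[start:start + block_size]
        block_id = hashlib.sha256(block).hexdigest()
        if block_id not in index:
            index[block_id] = block  # unique block: write it once
        recipe.append(block_id)      # always record a pointer to the block
    return recipe


def rebuild(recipe, index):
    """Reassemble the original stream by following the pointers in order."""
    return b"".join(index[block_id] for block_id in recipe)


index = {}
original = b"AAAABBBBAAAACCCCAAAA"
recipe = dedup_blocks(original, index)
restored = rebuild(recipe, index)
```

The stream above contains five blocks but only three distinct ones, so the index holds three blocks while the recipe holds five pointers.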

### Byte-level data de-duplication

Analyzing data streams at the byte level is another approach to deduplication. Performing a byte-by-byte comparison of new data streams against previously stored ones can deliver a higher level of accuracy. Deduplication products that use this method share one assumption: the incoming backup data stream has probably been seen before, so it is reviewed to see whether it matches similar data received in the past.
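As a rough illustration of byte-level comparison, the sketch below compares a new stream with a previously stored one byte by byte and keeps only the bytes that differ. It is a deliberate simplification that handles a single changed region between a shared prefix and a shared suffix; actual products use more elaborate matching. The names `byte_delta` and `apply_delta` are hypothetical.

```python
def byte_delta(old, new):
    """Compare `new` with `old` byte by byte and keep only the differing middle."""
    limit = min(len(old), len(new))
    # Length of the common prefix.
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    # Length of the common suffix, never overlapping the prefix.
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    middle = new[prefix:len(new) - suffix]
    return prefix, suffix, middle


def apply_delta(old, delta):
    """Rebuild the new stream from the stored one plus the saved difference."""
    prefix, suffix, middle = delta
    tail = old[len(old) - suffix:] if suffix else b""
    return old[:prefix] + middle + tail


old = b"backup of monday: 100 files"
new = b"backup of tuesday: 100 files"
delta = byte_delta(old, new)
restored = apply_delta(old, delta)
```

Only the differing bytes in the middle need to be stored alongside the two match lengths; everything else is recovered from the earlier stream.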

### References

[1] “How Does Data Deduplication Work?”, available online at: http://www.enterprisestorageguide.com/how-data-deduplication-works

[2] “Data Deduplication Overview”, available online at: https://technet.microsoft.com/en-us/library/hh831602(v=ws.11).aspx

[3] Leesakul, Waraporn, Paul Townend, and Jie Xu. “Dynamic data deduplication in cloud storage.” Service Oriented System Engineering (SOSE), 2014 IEEE 8th International Symposium on, IEEE, 2014.
