Novell Articles:
Data Deduplication
By Donna Moyer
The Challenge
As data stores continue to grow and the need for retaining more
and more organizational data for legal reasons increases, IT
professionals are working to determine if their current backup
strategies can keep up. Tapes – while offering easy transferability
to an off-site location – can be extremely costly to store.
It also can be very time-consuming to restore data from tapes.
Alternatively, the cost of disk has decreased to the point where
using disk-to-disk backup is a viable option. For customers using
a combination of disk and tape backup solutions, data deduplication
can help that cost come down even more, plus save valuable time
at every level.
What is Data Deduplication?
Wikipedia defines data deduplication as “a specific form
of compression where redundant data is eliminated.” Take
the example of a 50 MB PowerPoint presentation emailed to 10 people.
If each person stores the presentation in their home directory,
we now have 500 MB allocated to storing the same data! If each
person then forwards the presentation to one other individual and
those people also store the presentation, we have 1 GB of storage
dedicated to a single file! Incremental and differential backups
aside, this one file will take up 1 GB of storage for its initial
backup.
Data deduplication takes care of this redundancy by recognizing
that the data in each of these individual files is the same. It
therefore stores one copy of the file and creates pointers to the
rest. Now, instead of using 1 GB of storage, 20 people have used
a total of only 50 MB of disk space.
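The arithmetic in this example can be checked with a short Python sketch. It assumes whole-file deduplication, where identical files are recognized by comparing a SHA-256 digest of their contents. The function name and the scaled-down file size are illustrative assumptions, not a description of any vendor's product.

```python
import hashlib

def stored_bytes(files, deduplicate):
    """Return the bytes needed to store `files` (a list of bytes objects)."""
    if not deduplicate:
        return sum(len(data) for data in files)
    unique = {}  # digest -> size of the single stored copy
    for data in files:
        digest = hashlib.sha256(data).hexdigest()
        unique.setdefault(digest, len(data))
    return sum(unique.values())

# Stand-in for the 50 MB presentation, scaled down to 50,000 bytes
# so the sketch runs quickly (an assumption made for illustration).
presentation = b"slide" * 10_000
copies = [presentation] * 20  # 10 recipients plus 10 forwarded copies

without = stored_bytes(copies, deduplicate=False)
with_dedup = stored_bytes(copies, deduplicate=True)
print(without, with_dedup)
```

The same 20-to-1 ratio holds at full scale: 20 copies of 50 MB come to 1 GB without deduplication and 50 MB with it.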
However, let's assume that each person makes a change in one slide.
Now the data across all the files is not the same. Some data deduplication
products are smart enough to work on the subfile level: they locate
the blocks of data that are the same, store those one time, and
then store the differing blocks separately. Because of the pointers
the data deduplication product creates, each person can retrieve
their unique version of the file, even though it has been stored
in separate blocks.
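A minimal sketch of the subfile approach described above, assuming fixed-size blocks and an in-memory store. The class name, block size, and file names are hypothetical. Real products may use variable-size chunks, since fixed-size blocks stop lining up when data is inserted in the middle of a file.

```python
BLOCK_SIZE = 4  # tiny block size, chosen only to keep the example readable

class BlockStore:
    def __init__(self):
        self.blocks = []    # unique blocks, each stored once
        self.index = {}     # block contents -> position in self.blocks
        self.recipes = {}   # file name -> list of block positions (the "pointers")

    def put(self, name, data):
        """Store a file, writing only blocks that are not already present."""
        recipe = []
        for start in range(0, len(data), BLOCK_SIZE):
            block = data[start:start + BLOCK_SIZE]
            if block not in self.index:
                self.index[block] = len(self.blocks)
                self.blocks.append(block)
            recipe.append(self.index[block])
        self.recipes[name] = recipe

    def get(self, name):
        """Rebuild a file by following its pointers."""
        return b"".join(self.blocks[i] for i in self.recipes[name])

store = BlockStore()
original = b"AAAABBBBCCCCDDDD"
edited = b"AAAABBBBXXXXDDDD"  # one block changed, like one edited slide
store.put("alice.ppt", original)
store.put("bob.ppt", edited)
```

Here the two files share three of their four blocks, so the store holds five blocks rather than eight, and `store.get` still returns each person's own version.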
How Does It Work?
Deduplication technology works by comparing chunks of data and
searching for duplicates. It does this by assigning an identifier
to each chunk, calculated by a cryptographic hash function, so that
chunks with identical contents receive the same identifier. When
a duplicate is found, the duplicate file or block is removed and a
link to the first copy is created. If the data is later changed, a copy
of the changed file or block is written to disk during the next backup.
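The hash-and-link process can be sketched as follows. This is an illustration under stated assumptions: SHA-256 as the hash function, 8-byte fixed-size chunks, and a plain dictionary standing in for the backup target. The function names are hypothetical.

```python
import hashlib

def chunk_ids(data, size=8):
    """Split data into fixed-size chunks and return (sha256 id, chunk) pairs."""
    return [
        (hashlib.sha256(data[i:i + size]).hexdigest(), data[i:i + size])
        for i in range(0, len(data), size)
    ]

def backup(data, chunk_store):
    """Write only chunks whose id is not already in chunk_store.

    Returns the ids (links) that describe this backup and the number
    of chunks that actually had to be written.
    """
    links, written = [], 0
    for cid, chunk in chunk_ids(data):
        if cid not in chunk_store:
            chunk_store[cid] = chunk
            written += 1
        links.append(cid)
    return links, written

chunk_store = {}
monday = b"header..payload1payload2trailer."
tuesday = b"header..payload1PAYLOAD2trailer."  # one chunk changed
_, first_run = backup(monday, chunk_store)
links, second_run = backup(tuesday, chunk_store)
restored = b"".join(chunk_store[cid] for cid in links)
```

The first backup writes all four chunks. The second writes only the one chunk that changed, and the list of links is enough to restore the newer version.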
Types of Deduplication Technology
There are two types of data deduplication technology currently
in use:
- Post-process deduplication: As the name implies, post-process
deduplication runs after the data is sent to the target device.
The advantage of this is that since the deduplication process
can be slow, time for backup is not lost waiting for deduplication
to occur. The disadvantage is that it is impossible to predict
how long the deduplication process will take. Also, since the
data needs to be written to the target first, more disk space
will be required until the process finishes.
- In-line deduplication: With in-line deduplication, the hash calculations are performed
on the target device as the data is written. If a duplicate is
found, the new block of data is not stored. This method requires
less storage on the target, but backups can be slower because the hash
calculations and lookups take time. Performance varies across vendors.
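The storage trade-off between the two types can be illustrated with a small sketch. It counts blocks rather than bytes, uses SHA-256 to detect duplicates, and ignores timing entirely. All names are hypothetical and the model is deliberately simplified.

```python
import hashlib

def inline_backup(blocks):
    """Deduplicate as blocks arrive; return (final_blocks, peak_blocks)."""
    seen = set()
    for block in blocks:
        # Hashing and lookup happen on the write path, before anything is stored.
        seen.add(hashlib.sha256(block).digest())
    # Duplicates are never written, so the peak equals the final count.
    return len(seen), len(seen)

def post_process_backup(blocks):
    """Land every block first, then deduplicate; return (final_blocks, peak_blocks)."""
    landed = list(blocks)  # the full stream is written to the target first
    peak = len(landed)
    unique = {hashlib.sha256(block).digest() for block in landed}
    return len(unique), peak

stream = [b"A", b"B", b"A", b"A", b"C", b"B"]
inline_result = inline_backup(stream)
post_result = post_process_backup(stream)
print(inline_result, post_result)
```

Both approaches end up storing the same three unique blocks, but the post-process path briefly needs room for all six.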
What Are the Advantages?
Data deduplication brings a wide variety of benefits to organizations:
- Save on storage space for disk-to-disk backups: According
to the Enterprise Strategy Group's report by Tony Asaro and Heidi
Biggar entitled Data De-duplication and Disk-to-Disk Backup Systems
(July 2007), “Through hands-on testing, ESG has found that data
deduplication technologies can provide 10 times, 20 times, 30 times
and even greater reduction in capacity needed for backup.” Thus,
companies can see savings not only in the disk needed for the
primary backup, but also in the cost of disk for a secondary
site, or in monthly charges for an off-site backup service.
- Save on heating and cooling: By decreasing the amount
of disk needed, organizations can see a reduction in heating
and cooling costs.
- Save on space: With less disk needed, organizations also
save on the amount of floor/rack space needed to house the backup
solution.
- Save on bandwidth: Less data going across the wire means
lowered bandwidth costs.
- Decrease time and costs for data restoration: Recovery
from disk is typically much faster than recovery from tape. If the
tape needed is in off-site storage, more time and costs will be
incurred.
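To put the reduction ratios quoted from ESG into capacity terms, the sketch below converts a ratio into stored capacity and percentage saved. The 10 TB starting figure is an assumption chosen for illustration, and the function name is hypothetical.

```python
def capacity_after_dedup(logical_tb, ratio):
    """Return (TB actually stored, percent of capacity saved) for a reduction ratio."""
    stored = logical_tb / ratio
    saved_pct = (1 - 1 / ratio) * 100
    return stored, saved_pct

for ratio in (10, 20, 30):
    stored, saved = capacity_after_dedup(10.0, ratio)
    print(f"{ratio}x: {stored:.2f} TB stored, {saved:.1f}% saved")
```

Note the diminishing returns: a 10 times reduction already saves 90 percent of the capacity, and moving to 30 times adds fewer than 7 more percentage points.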
What Backup Vendors Support This Technology?
A host of vendors offer this technology, including ExaGrid, EMC
Data Domain, and Barracuda Backup (formerly BitLeap, until Barracuda
acquired it last year).
Where Can I Learn More?
Check http://www.datadomain.com for whitepapers (like the one
mentioned in this article) and a deduplication calculator. ESG's
report contains some great information, including questions to
ask vendors when selecting a solution.
Conclusion
If you are considering a new backup strategy for your organization,
taking a look at what data deduplication can do for you is a
must. We feel that development of this technology is just getting
started, and can only improve as more products hit the marketplace.
© Copyright 2010, Uptime NetManagement, Inc.
Article Source: http://www.uptimenmi.com/
You have my permission to reprint and distribute this article as long as it
is distributed in its entirety, including all links and copyright information.
This article is not to be sold or included with anything that is sold.
About the Author:
Donna Moyer is Principal/Senior Network Consultant of Uptime NetManagement,
Inc. (http://www.uptimenmi.com/). Uptime is a Novell Gold Solutions partner
providing technology solutions, customized training, and consulting services.
If you are interested in finding out exactly what Novell can do for your
business, or are seeking to maximize the benefits from your current Novell
systems, call us today at 610-621-1244!