Cloud Based Deduplication Quick Start Guide



It is now possible to store dedup chunks in the Amazon S3, Google, or Azure cloud storage services. This allows you to store virtually unlimited amounts of data without the need for equivalent local storage. AES 256-bit encryption and compression (default) are provided for data stored in the cloud.

The purpose of deduplicating data before sending it to cloud storage is to minimize network and storage usage and to maximize write performance. The concept behind deduplication is to store only unique blocks of data. If only unique data is sent to cloud storage, bandwidth can be optimized and cloud storage consumption can be reduced. Opendedup approaches cloud storage differently than a traditional cloud-based file system. Volume data such as the namespace and file metadata are stored locally on the system where the SDFS volume is mounted. SDFS stores all unique data in the cloud and uses a local write-back cache for performance. This means the most recently accessed data is cached locally and only the unique chunks of data are stored at the cloud storage provider. This ensures maximum performance by allowing all file system functions to be performed locally except for data reads and writes. In addition, local read and write caching should make writing smaller files transparent to the user or service writing to the volume.

General features of SDFS for cloud storage are as follows:

  1. In-Line Deduplication to a Cloud Storage Backend - SDFS can send all of its data to AWS, Azure, or Google.
  2. Performance Improvements - Compressed, multi-threaded upload and download of data to and from the cloud.
  3. Local Cache - SDFS caches the most recently accessed data locally. This is configurable but set to 10 GB by default.
  4. Security - All data can be encrypted using AES-CBC 256 when sent to the cloud.
  5. Throttling - Upload and download speeds can be throttled.
  6. Cloud Recovery/Replication - All local metadata is replicated to the cloud and can be recovered back to a local volume.
  7. Glacier Support - Supports S3 lifecycle policies and retrieving data from Glacier.
  8. AWS Region Support - Supports all AWS regions.

 

Requirements:

Read the quickstart guide first!

To set up AWS-enabled deduplication volumes, follow these steps:

1. Go to http://aws.amazon.com and create an account.
2. Sign up for S3 data storage.
3. Get your Access Key ID and Secret Access Key.
4. Make an SDFS volume using the following parameters:

Linux

mkfs.sdfs --volume-name=<volume name> --volume-capacity=<volume capacity> --aws-enabled=true --cloud-access-key=<the aws assigned access key> --cloud-bucket-name=<a universally unique bucket name such as the aws-access-key> --cloud-secret-key=<assigned aws secret key> --chunk-store-encrypt=true

Windows

 mksdfs --volume-name=<volume name> --volume-capacity=<volume capacity> --aws-enabled=true --cloud-access-key=<the aws assigned access key> --cloud-bucket-name=<a universally unique bucket name such as the aws-access-key> --cloud-secret-key=<assigned aws secret key> --chunk-store-encrypt=true

 
5. Mount the volume and go to town!

Linux

mount -t sdfs <volume name> <mount point>

Windows

mountsdfs -v <volume name> -m <mount point>
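
For example, assuming a hypothetical 256 GB volume named awspool0, a bucket named awspool0-dedup, and a mount point of /media/awspool0 (the access and secret keys below remain placeholders), the volume could be created and mounted on Linux as follows:

mkfs.sdfs --volume-name=awspool0 --volume-capacity=256GB --aws-enabled=true --cloud-access-key=<the aws assigned access key> --cloud-bucket-name=awspool0-dedup --cloud-secret-key=<assigned aws secret key> --chunk-store-encrypt=true

mount -t sdfs awspool0 /media/awspool0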

To set up Azure-enabled deduplication volumes, follow these steps:

1. Go to https://manage.windowsazure.com and create an account.
2. Create a new storage account with a unique name; this will be your access key ID.
3. Create a storage bucket within the newly created storage account. This will be the bucket name.
4. Get your secret key by clicking on "Manage Keys"; the secret key is shown as the "PRIMARY ACCESS KEY".
5. Make an SDFS volume using the following parameters:

Linux

mkfs.sdfs --volume-name=<volume name> --volume-capacity=<volume capacity> --azure-enabled=true --cloud-access-key=<storage account> --cloud-bucket-name=<the bucket name> --cloud-secret-key=<primary access key> --chunk-store-encrypt=true

Windows

mksdfs --volume-name=<volume name> --volume-capacity=<volume capacity> --azure-enabled=true --cloud-access-key=<storage account> --cloud-bucket-name=<the bucket name> --cloud-secret-key=<primary access key> --chunk-store-encrypt=true
6. Mount the volume and go to town!

Linux

mount -t sdfs <volume name> <mount point>

Windows

mountsdfs -v <volume name> -m <mount point>
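
Similarly, for a hypothetical Azure-backed volume named azpool0 with a bucket named azpool0-dedup, mounted as drive s on Windows (the storage account and primary access key remain placeholders), the commands would look like this:

mksdfs --volume-name=azpool0 --volume-capacity=256GB --azure-enabled=true --cloud-access-key=<storage account> --cloud-bucket-name=azpool0-dedup --cloud-secret-key=<primary access key> --chunk-store-encrypt=true

mountsdfs -v azpool0 -m s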

Advanced Setup

There are many advanced features associated with cloud storage. These include local caching, bandwidth throttling, encryption, and remote recovery.

These advanced options are changed through the extended-config tag in the volume's XML configuration. The volume must be unmounted before changing the configuration.

Below is a sample extended config.

<extended-config allow-sync="false" block-size="10 MB" delete-unclaimed="true" io-threads="16" local-cache-size="10 GB" map-cache-size="200" read-speed="0" sync-check-schedule="4 59 23 * * ?" sync-files="true" upload-thread-sleep-time="6000" write-speed="0"/>

Bandwidth Throttling

read-speed and write-speed throttle the download and upload speeds to and from the cloud storage provider. Both are set in KB/s.
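
For example, to throttle reads to 10240 KB/s and writes to 5120 KB/s, the sample configuration above could be modified as follows:

<extended-config allow-sync="false" block-size="10 MB" delete-unclaimed="true" io-threads="16" local-cache-size="10 GB" map-cache-size="200" read-speed="10240" sync-check-schedule="4 59 23 * * ?" sync-files="true" upload-thread-sleep-time="6000" write-speed="5120"/>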

Upload Block Performance

block-size sets the maximum block size of batch uploads. This can be set smaller or larger to accommodate the I/O characteristics required by the application.

io-threads sets the number of simultaneous uploads to the cloud storage provider. This can be set to a larger value for better upload performance over faster connections. It has been tested up to 64 threads and performance seems to peak at 48 threads.

upload-thread-sleep-time sets the interval after which a block will be uploaded regardless of its size.
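
For example, over a fast connection where upload throughput matters most, the attribute might be raised from the sample value of 16 to:

io-threads="48"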

Garbage Collection

delete-unclaimed sets whether to delete blocks in the cloud once no chunks are associated with them. Chunks are dereferenced from blocks as they become orphaned. When the number of chunks referencing a block reaches 0, the block will be deleted if this value is set to true.

Local Caching

local-cache-size sets the size of the local cache, i.e. the amount of data that will be cached locally. It can be specified in GB or TB.
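
For example, to allow 100 GB of data to be cached locally, the attribute could be set to:

local-cache-size="100 GB"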

Syncing Metadata to the Cloud

All metadata is synced to the cloud storage provider by default. This makes recovery of the entire volume possible as long as the XML configuration is available. To recover all of the metadata from the cloud, mount the SDFS volume with the "-cfr" option. The following options are associated with syncing metadata to the cloud.

sync-files specifies whether file metadata will be uploaded to the cloud. Uploading file metadata to the cloud can impact performance, but it should not be disabled (by setting this to false) unless you understand the risk.

sync-check-schedule specifies the cron schedule for verifying that data is synced correctly to the cloud.
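
The schedule appears to use the Quartz cron format (seconds, minutes, hours, day of month, month, day of week), so the sample value of "4 59 23 * * ?" runs the check at 23:59:04 every day. To run the check at 03:00:00 instead, the attribute could be set to:

sync-check-schedule="0 0 3 * * ?"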

 

