Cloud-Based Deduplication Quick Start Guide
It is now possible to store dedup chunks in the Amazon S3, Google Cloud Storage, or Microsoft Azure cloud storage services. This allows you to store virtually unlimited amounts of data without the need for local storage. AES-256 encryption and compression (default) are provided for storing data to cloud storage.
The purpose of deduplicating data before sending it to cloud storage is to minimize network and storage usage and to maximize write performance. The concept behind deduplication is to store only unique blocks of data. If only unique data is sent to cloud storage, bandwidth use is optimized and cloud storage is reduced. Opendedup approaches cloud storage differently than a traditional cloud-based file system. Volume data such as the namespace and file metadata are stored locally on the system where the SDFS volume is mounted. SDFS stores all unique data in the cloud and uses a local writeback cache for performance. This means the most recently accessed data is cached locally and only the unique chunks of data are stored at the cloud storage provider, which ensures maximum performance by allowing all file system functions to be performed locally except for data reads and writes. In addition, local read and write caching should make writing smaller files transparent to the user or service writing to the volume.
General features of SDFS for cloud storage include AES-256 encryption, compression, local caching, bandwidth throttling, and remote volume recovery; the advanced options behind these features are described later in this guide.
To set up AWS-enabled deduplication volumes, follow these steps:
1. Go to http://aws.amazon.com and create an account.
2. Sign up for the S3 data storage service.
3. Get your Access Key ID and Secret Access Key.
4. Make an SDFS volume using the following parameters:
mkfs.sdfs --volume-name=<volume name> --volume-capacity=<volume capacity> --aws-enabled=true --cloud-access-key=<the aws assigned access key> --cloud-bucket-name=<a universally unique bucket name such as the aws-access-key> --cloud-secret-key=<assigned aws secret key> --chunk-store-encrypt=true
The same command is also available under the name mksdfs:
mksdfs --volume-name=<volume name> --volume-capacity=<volume capacity> --aws-enabled=true --cloud-access-key=<the aws assigned access key> --cloud-bucket-name=<a universally unique bucket name such as the aws-access-key> --cloud-secret-key=<assigned aws secret key> --chunk-store-encrypt=true
5. Mount the volume and go to town! (A complete hypothetical example follows this list.)
mount -t sdfs <volume name> <mount point>
mountsdfs -v <volume name> -m <mount point>
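As a worked example of steps 4 and 5, the commands below create and mount a hypothetical 256 GB volume named pool0. The access key, secret key, bucket name, and mount point are placeholders rather than real values, and the exact capacity syntax may vary by release:
mkfs.sdfs --volume-name=pool0 --volume-capacity=256GB --aws-enabled=true --cloud-access-key=AKIAEXAMPLEKEY --cloud-bucket-name=akiaexamplekey-pool0 --cloud-secret-key=EXAMPLESECRETKEY --chunk-store-encrypt=true
mkdir -p /media/pool0
mount -t sdfs pool0 /media/pool0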
To set up Azure-enabled deduplication volumes, follow these steps:
mkfs.sdfs --volume-name=<volume name> --volume-capacity=<volume capacity> --azure-enabled=true --cloud-access-key=<storage account> --cloud-bucket-name=<the bucket name> --cloud-secret-key=<primary access key> --chunk-store-encrypt=true
mount -t sdfs <volume name> <mount point>
mountsdfs -v <volume name> -m <mount point>
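For example, a hypothetical Azure-backed volume named azpool could be created and mounted as follows. The storage account name, container name, primary access key, and mount point below are placeholders only:
mkfs.sdfs --volume-name=azpool --volume-capacity=256GB --azure-enabled=true --cloud-access-key=examplestorageacct --cloud-bucket-name=azpool-container --cloud-secret-key=EXAMPLEPRIMARYKEY --chunk-store-encrypt=true
mkdir -p /media/azpool
mount -t sdfs azpool /media/azpool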
There are many advanced features associated with cloud storage. These include local caching, bandwidth throttling, encryption, and remote recovery.
These advanced options are changed through the extended-config tag of the volume's XML configuration file. The volume must be unmounted before changing the configuration.
Below is a sample extended config.
<extended-config allow-sync="false" block-size="10 MB" delete-unclaimed="true" io-threads="16" local-cache-size="10 GB" map-cache-size="200" read-speed="0" sync-check-schedule="4 59 23 * * ?" sync-files="true" upload-thread-sleep-time="6000" write-speed="0"/>
read-speed and write-speed throttle the download and upload speeds, respectively, between the volume and the cloud storage provider. Both are set in KB/s; in the sample above both are 0, which leaves transfers unthrottled.
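For example, to cap downloads at roughly 10 MB/s and uploads at roughly 5 MB/s, the sample configuration could be adjusted as follows (the speed values are illustrative only):
<extended-config allow-sync="false" block-size="10 MB" delete-unclaimed="true" io-threads="16" local-cache-size="10 GB" map-cache-size="200" read-speed="10240" sync-check-schedule="4 59 23 * * ?" sync-files="true" upload-thread-sleep-time="6000" write-speed="5120"/>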
Upload Block Performance
block-size sets the maximum block size of batch uploads. This can be set to smaller or larger sizes to accommodate the I/O characteristics required by the application.
io-threads sets the number of simultaneous uploads to the cloud storage provider. This can be set to a larger value for better upload performance over faster connections. It has been tested up to 64 threads, and performance appears to peak at 48 threads.
upload-thread-sleep-time sets the interval at which a block will be uploaded regardless of its size.
delete-unclaimed sets whether to delete blocks in the cloud once no chunks are associated with them. Chunks are dereferenced from blocks as they become orphaned; when a block's chunk count reaches 0, the block will be deleted if this value is set to true.
local-cache-size sets the size of the local cache. This can be set in GB or TB and specifies the amount of data to be cached locally.
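As an illustrative (and untested) tuning for a host with a fast connection and plenty of spare disk, the upload and caching attributes discussed above might be raised as follows; all values here are assumptions, not recommendations:
<extended-config allow-sync="false" block-size="30 MB" delete-unclaimed="true" io-threads="48" local-cache-size="100 GB" map-cache-size="200" read-speed="0" sync-check-schedule="4 59 23 * * ?" sync-files="true" upload-thread-sleep-time="6000" write-speed="0"/>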
Syncing Metadata to the Cloud
All metadata is synced to the cloud storage provider by default. This makes recovery of the entire volume possible as long as the XML configuration file is available. To recover all of the metadata from the cloud, mount the SDFS volume with the "-cfr" option. The following options are associated with syncing metadata to the cloud.
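A minimal recovery sketch, assuming the "-cfr" flag is passed through mountsdfs (the volume name and mount point are placeholders):
mountsdfs -v <volume name> -m <mount point> -cfr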
sync-files specifies whether metadata will be uploaded to the cloud. Uploading file metadata to the cloud can impact performance, but it should not be disabled (by setting this to false) unless you understand the risk.
sync-check-schedule specifies the cron schedule for verifying that data is synced correctly to the cloud.
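The schedule appears to follow the Quartz cron format (seconds minutes hours day-of-month month day-of-week), so the default "4 59 23 * * ?" runs the sync check daily at 23:59:04. To run the check at the top of every hour instead, the attribute could be set as follows (illustrative; the other attributes are left at their sample values):
<extended-config allow-sync="false" block-size="10 MB" delete-unclaimed="true" io-threads="16" local-cache-size="10 GB" map-cache-size="200" read-speed="0" sync-check-schedule="0 0 * * * ?" sync-files="true" upload-thread-sleep-time="6000" write-speed="0"/>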