AWS S3 Copy – Improving Speed

In Amazon AWS, S3 is an object storage service that helps customers store thousands or even millions of objects, whether to serve their end customers or for backup purposes.

Problem Statement

Recently, I received an interesting requirement from one of my customers: he wanted to copy all the objects from bucket-1 to bucket-2 so that he could use bucket-2 for his development and testing purposes, and he wanted it done within a few days. When I looked at the source bucket, it held around a million objects with an average size of 1-2 MB.

The S3 service doesn't guarantee any particular throughput (IOPS) for uploading/downloading objects from an EC2 instance or for copying between buckets. I also took a look at the S3 Transfer Acceleration feature, but it doesn't help in this scenario.

Solution

In general, we can copy a bucket recursively with the aws s3 cp command (aws s3 cp s3://bucket-1/ s3://bucket-2/ --recursive). However, this copies the objects serially, which takes a long time to move everything to the destination bucket.

To increase the speed of the copy process, we can use the approach below.

This solution gave me very good control over the speed of copying objects to the destination S3 bucket. I listed all the objects in the source bucket to a file, used the Linux split command to break that list into multiple smaller files, and then ran the AWS S3 copy command over each split file in parallel from one or more Linux instances. This improved the overall copy speed significantly.

Implementation Steps

Please follow these implementation steps to test it on your own.

  1. Choose an instance type with plenty of CPU and high network bandwidth, because the AWS copy operation is CPU-intensive and generates a lot of network-out traffic. I would recommend M4 or C4 instances with high networking performance for this kind of requirement.
  2. Enable an S3 VPC endpoint so the instances reach the S3 buckets within the AWS network rather than going over the Internet (a sample command is sketched after this list).
  3. List all the object keys in the source bucket into a single file:
     aws s3 ls s3://bucket-1/ --recursive | awk '{print $4}' > /opt/list-objects
  4. Split list-objects into multiple files, choosing the split factor based on how many parallel copy processes you want to run. I used 5, which creates 5 split files (/opt/xaa through /opt/xae). The l/5 form splits on line boundaries so no object key is cut across two files.
    split -n l/5 /opt/list-objects /opt/x
  5. Run the copy loops in parallel, either from different machines or from the same machine (I used the same instance in this case). The trailing & runs each loop in the background so all five loops execute concurrently.
    for i in `cat /opt/xaa`; do aws s3 cp s3://bucket-1/$i s3://bucket-2/$i; done &
    for i in `cat /opt/xab`; do aws s3 cp s3://bucket-1/$i s3://bucket-2/$i; done &
    for i in `cat /opt/xac`; do aws s3 cp s3://bucket-1/$i s3://bucket-2/$i; done &
    for i in `cat /opt/xad`; do aws s3 cp s3://bucket-1/$i s3://bucket-2/$i; done &
    for i in `cat /opt/xae`; do aws s3 cp s3://bucket-1/$i s3://bucket-2/$i; done &
    wait    # block until all background copy jobs finish
  6. The same process also applies to downloading bucket content from S3 to an EC2 instance locally, or to uploading content from an EC2 instance to an S3 bucket (a download sketch follows this list).
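
For step 2, a gateway VPC endpoint for S3 can be created with the AWS CLI roughly as below; the VPC ID, route table ID and region here are placeholder values to be replaced with your own.

    # Gateway VPC endpoint for S3 so copy traffic stays inside the AWS network
    # vpc-0abc123, rtb-0abc123 and us-east-1 are placeholder values
    aws ec2 create-vpc-endpoint --vpc-id vpc-0abc123 --service-name com.amazonaws.us-east-1.s3 --route-table-ids rtb-0abc123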
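
For step 6, here is a rough sketch of the download variant, assuming a local directory such as /opt/s3-data on the instance (only two of the five split files are shown; the rest follow the same pattern):

    # Parallel download variant of step 5; /opt/s3-data is an assumed local path
    for i in `cat /opt/xaa`; do aws s3 cp s3://bucket-1/$i /opt/s3-data/$i; done &
    for i in `cat /opt/xab`; do aws s3 cp s3://bucket-1/$i /opt/s3-data/$i; done &
    wait    # block until the background downloads finish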
