Migrating 50TB data from local Hadoop cluster to Google Cloud Storage

I am trying to migrate existing data (JSON) in my Hadoop cluster to Google Cloud Storage.

I have explored GSUtil, and it seems to be the recommended option for moving big data sets to GCS and to handle huge datasets well. However, it appears that GSUtil can only move data from a local machine to GCS (or between S3 and GCS), not from a local Hadoop cluster.

  1. What is the recommended way of moving data from a local Hadoop cluster to GCS?

  2. In the case of GSUtil, can it directly move data from the local Hadoop cluster (HDFS) to GCS, or do I first need to copy the files onto the machine running GSUtil and then transfer them to GCS?

  3. What are the pros and cons of using the Google client-side (Java API) libraries vs. GSUtil?

Thanks a lot,


Question 1: The recommended way of moving data from a local Hadoop cluster to GCS is to use the Google Cloud Storage connector for Hadoop. The connector's documentation is mostly geared toward running Hadoop on Google Compute Engine VMs, but you can also download the GCS connector directly: gcs-connector-1.2.8-hadoop1.jar if you're using Hadoop 1.x or Hadoop 0.20.x, or gcs-connector-1.2.8-hadoop2.jar for Hadoop 2.x or Hadoop 0.23.x.

Simply copy the jarfile into your hadoop/lib dir or $HADOOP_COMMON_LIB_JARS_DIR in the case of Hadoop 2:

cp ~/Downloads/gcs-connector-1.2.8-hadoop1.jar /your/hadoop/dir/lib/

You may also need to add the following to your hadoop/conf/hadoop-env.sh file if you're running 0.20.x:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/your/hadoop/dir/lib/gcs-connector-1.2.8-hadoop1.jar

Then you'll likely want to use service-account "keyfile" authentication, since you're on an on-premise Hadoop cluster. Visit cloud.google.com/console, find APIs & auth on the left-hand side, and click Credentials. If you don't already have one, click Create new Client ID, select Service account, and click Create Client ID. For now, the connector requires a ".p12" type of keypair, so click Generate new P12 key and keep track of the .p12 file that gets downloaded. It may be convenient to rename it before placing it in a directory more easily accessible from Hadoop, e.g.:

cp ~/Downloads/*.p12 /path/to/hadoop/conf/gcskey.p12

Add the following entries to your core-site.xml file in your Hadoop conf dir:

<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>your-ascii-google-project-id</value>
</property>
<property>
  <name>fs.gs.system.bucket</name>
  <value>some-bucket-your-project-owns</value>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
</property>
<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>your-service-account-email@developer.gserviceaccount.com</value>
</property>
<property>
  <name>fs.gs.auth.service.account.keyfile</name>
  <value>/path/to/hadoop/conf/gcskey.p12</value>
</property>

The fs.gs.system.bucket generally won't be used except in some cases for mapred temp files; you may want to just create a new one-off bucket for that purpose. With those settings on your master node, you should already be able to test hadoop fs -ls gs://the-bucket-you-want-to-list. At this point, you can already try to funnel all the data out of the master node with a simple hadoop fs -cp hdfs://yourhost:yourport/allyourdata gs://your-bucket.
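Since the connector plugs in as a regular Hadoop FileSystem for the gs:// scheme, you can also drive the same check and copy from Java. Here's a minimal sketch using the generic Hadoop FileSystem API; the host, port, bucket, and paths are placeholders for your own values:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class HdfsToGcsCopy {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml (including the fs.gs.* entries above) from the classpath.
    Configuration conf = new Configuration();

    FileSystem hdfs = FileSystem.get(URI.create("hdfs://yourhost:yourport/"), conf);
    FileSystem gcs = FileSystem.get(URI.create("gs://your-bucket/"), conf);

    // Sanity check, equivalent to "hadoop fs -ls gs://your-bucket".
    for (FileStatus status : gcs.listStatus(new Path("gs://your-bucket/"))) {
      System.out.println(status.getPath());
    }

    // Single-node copy out of HDFS, equivalent to "hadoop fs -cp ...".
    FileUtil.copy(hdfs, new Path("hdfs://yourhost:yourport/allyourdata"),
        gcs, new Path("gs://your-bucket/allyourdata"),
        false /* deleteSource */, conf);
  }
}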

If you want to speed it up using Hadoop's distcp, sync the lib/gcs-connector-1.2.8-hadoop1.jar and conf/core-site.xml to all your Hadoop nodes, and it should all work as expected. Note that there's no need to restart datanodes or namenodes.
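For reference, the command-line form is hadoop distcp hdfs://yourhost:yourport/allyourdata gs://your-bucket/allyourdata. If you'd rather launch that distributed copy from Java, here's a rough sketch assuming the Hadoop 2.x distcp API (DistCpOptions/DistCp); the paths and bucket name are placeholders:

import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class DistCpToGcs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Copy everything under /allyourdata in HDFS into the destination bucket.
    DistCpOptions options = new DistCpOptions(
        Collections.singletonList(new Path("hdfs://yourhost:yourport/allyourdata")),
        new Path("gs://your-bucket/allyourdata"));

    // Launches the copy as a MapReduce job and, by default, blocks until it finishes.
    new DistCp(conf, options).execute();
  }
}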

Question 2: While the GCS connector for Hadoop is able to copy directly from HDFS without ever needing an extra disk buffer, GSUtil cannot, since it has no way of interpreting the HDFS protocol; it only knows how to deal with actual local filesystem files or, as you said, GCS/S3 files.

Question 3: The benefit of using the Java API is flexibility: you can choose how to handle errors, retries, buffer sizes, etc., but it takes more work and planning. Using gsutil is good for quick use cases, and you inherit a lot of error handling and testing from the Google teams. The GCS connector for Hadoop is actually built directly on top of the Java API, and since it's all open source, you can see what kinds of things it takes to make it work smoothly in its source code on GitHub: https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageImpl.java
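To give a feel for the Java-API route, here's a minimal sketch that uploads a single local JSON file using the google-cloud-storage client library (a newer, higher-level client than the low-level API the connector is built on); the bucket name, object name, and local path are placeholders:

import java.nio.file.Files;
import java.nio.file.Paths;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class UploadOneFile {
  public static void main(String[] args) throws Exception {
    // Uses Application Default Credentials, e.g. a service-account key pointed to
    // by the GOOGLE_APPLICATION_CREDENTIALS environment variable.
    Storage storage = StorageOptions.getDefaultInstance().getService();

    BlobId blobId = BlobId.of("your-bucket", "data/part-00000.json");
    BlobInfo blobInfo = BlobInfo.newBuilder(blobId)
        .setContentType("application/json")
        .build();

    // Error handling, retries, buffering, and parallelism are all up to you;
    // that's the flexibility (and the extra work) mentioned above.
    storage.create(blobInfo, Files.readAllBytes(Paths.get("/local/path/part-00000.json")));
  }
}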


It looks like a few property names have changed in recent versions:

String serviceAccount = "your-service-account-email@developer.gserviceaccount.com";
String keyfile = "/path/to/local/keyfile.p12";

// Configuration.set() only accepts String values, so use setBoolean for the flag.
hadoopConfiguration.setBoolean("google.cloud.auth.service.account.enable", true);
hadoopConfiguration.set("google.cloud.auth.service.account.email", serviceAccount);
hadoopConfiguration.set("google.cloud.auth.service.account.keyfile", keyfile);