Migrating 50TB data from local Hadoop cluster to Google Cloud Storage


Problem Description


I am trying to migrate existing data (JSON) in my Hadoop cluster to Google Cloud Storage.

I have explored GSUtil and it seems that it is the recommended option for moving big data sets to GCS. It seems that it can handle huge datasets. It seems, though, that GSUtil can only move data from a local machine to GCS or between S3 and GCS; it cannot move data from a local Hadoop cluster.

  1. What is a recommended way of moving data from a local Hadoop cluster to GCS?

  2. In the case of GSUtil, can it directly move data from the local Hadoop cluster (HDFS) to GCS, or does one first need to copy the files onto the machine running GSUtil and then transfer them to GCS?

  3. What are the pros and cons of using Google Client Side (Java API) libraries vs GSUtil?

Thanks a lot,

Solution

Question 1: The recommended way of moving data from a local Hadoop cluster to GCS is to use the Google Cloud Storage connector for Hadoop. The instructions on that site are mostly for running Hadoop on Google Compute Engine VMs, but you can also download the GCS connector directly, either gcs-connector-1.2.8-hadoop1.jar if you're using Hadoop 1.x or Hadoop 0.20.x, or gcs-connector-1.2.8-hadoop2.jar for Hadoop 2.x or Hadoop 0.23.x.

Simply copy the jarfile into your hadoop/lib dir or $HADOOP_COMMON_LIB_JARS_DIR in the case of Hadoop 2:

cp ~/Downloads/gcs-connector-1.2.8-hadoop1.jar /your/hadoop/dir/lib/

You may also need to add the following to your hadoop/conf/hadoop-env.sh file if you're running 0.20.x:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/your/hadoop/dir/lib/gcs-connector-1.2.8-hadoop1.jar

Then, you'll likely want to use service-account "keyfile" authentication since you're on an on-premise Hadoop cluster. Visit cloud.google.com/console, find APIs & auth on the left-hand side, and click Credentials. If you don't already have one, click Create new Client ID and select Service account before clicking Create client id. For now, the connector requires a ".p12" type of keypair, so click Generate new P12 key and keep track of the .p12 file that gets downloaded. It may be convenient to rename it before placing it in a directory more easily accessible from Hadoop, e.g.:

cp ~/Downloads/*.p12 /path/to/hadoop/conf/gcskey.p12
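If you prefer the command line over the web console, a more recent gcloud CLI can generate an equivalent key. This is only a hedged alternative sketch; the service-account name and email below are placeholders, not values from the original setup:

# create a dedicated service account (skip if you already have one)
gcloud iam service-accounts create hadoop-gcs-migration --display-name "Hadoop GCS migration"

# generate a .p12 key for it and place it in the Hadoop conf dir
gcloud iam service-accounts keys create /path/to/hadoop/conf/gcskey.p12 \
    --iam-account hadoop-gcs-migration@your-ascii-google-project-id.iam.gserviceaccount.com \
    --key-file-type p12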

Add the following entries to your core-site.xml file in your Hadoop conf dir:

<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>your-ascii-google-project-id</value>
</property>
<property>
  <name>fs.gs.system.bucket</name>
  <value>some-bucket-your-project-owns</value>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
</property>
<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>your-service-account-email@developer.gserviceaccount.com</value>
</property>
<property>
  <name>fs.gs.auth.service.account.keyfile</name>
  <value>/path/to/hadoop/conf/gcskey.p12</value>
</property>

The fs.gs.system.bucket generally won't be used except in some cases for mapred temp files; you may want to just create a new one-off bucket for that purpose. With those settings on your master node, you should already be able to test hadoop fs -ls gs://the-bucket-you-want-to-list. At this point, you can already try to funnel all the data out of the master node with a simple hadoop fs -cp hdfs://yourhost:yourport/allyourdata gs://your-bucket.
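As a concrete sketch of that smoke test and the single-node copy (the bucket name and the namenode host/port below are placeholders):

# verify that the connector and the service-account credentials work
hadoop fs -ls gs://your-bucket/

# copy everything in a single stream, driven entirely from the master node
hadoop fs -cp hdfs://namenode-host:8020/allyourdata gs://your-bucket/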

If you want to speed it up using Hadoop's distcp, sync the lib/gcs-connector-1.2.8-hadoop1.jar and conf/core-site.xml to all your Hadoop nodes, and it should all work as expected. Note that there's no need to restart datanodes or namenodes.
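A distcp run for that layout could look roughly like the following; the host, port, bucket, and the -m value (which caps the number of parallel copy tasks) are illustrative only:

hadoop distcp -m 100 hdfs://namenode-host:8020/allyourdata gs://your-bucket/allyourdata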

Question 2: While the GCS connector for Hadoop is able to copy directly from HDFS without ever needing an extra disk buffer, GSUtil cannot, since it has no way of interpreting the HDFS protocol; it only knows how to deal with actual local filesystem files or, as you said, GCS/S3 files.
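To make that contrast concrete, a gsutil-only route would have to stage the data on local disk first, roughly like this (paths and bucket name are illustrative only):

# pull the data out of HDFS onto a local filesystem
hadoop fs -get hdfs://namenode-host:8020/allyourdata /mnt/staging/allyourdata

# then upload it with gsutil; -m enables parallel transfers
gsutil -m cp -r /mnt/staging/allyourdata gs://your-bucket/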

Question 3: The benefit of using the Java API is flexibility; you can choose how to handle errors, retries, buffer sizes, etc., but it takes more work and planning. Using gsutil is good for quick use cases, and you inherit a lot of error-handling and testing from the Google teams. The GCS connector for Hadoop is actually built directly on top of the Java API, and since it's all open-source, you can see what kinds of things it takes to make it work smoothly in its source code on GitHub: https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageImpl.java
