Migrating 50TB data from local Hadoop cluster to Google Cloud Storage

Problem Description

I am trying to migrate existing data (JSON) in my Hadoop cluster to Google Cloud Storage.

I have explored GSUtil, and it seems to be the recommended option for moving big data sets to GCS and appears able to handle huge datasets. However, GSUtil seems to only move data between a local machine and GCS (or between S3 and GCS); it cannot move data from a local Hadoop cluster.
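
For reference, a typical GSUtil copy from a local filesystem into GCS looks like the sketch below; the local path and bucket name are hypothetical placeholders, and -m just enables parallel uploads:

# Parallel (-m), recursive (-r) copy of a local directory into a GCS bucket.
gsutil -m cp -r /local/json-data gs://my-target-bucket/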

  1. What is the recommended way of moving data from a local Hadoop cluster to GCS?

  2. In the case of GSUtil, can it move data directly from the local Hadoop cluster (HDFS) to GCS, or do I first need to copy the files onto the machine running GSUtil and then transfer them to GCS?

  3. What are the pros and cons of using the Google client-side (Java API) libraries vs. GSUtil?

Thanks a lot,

Recommended Answer

Question 1: The recommended way of moving data from a local Hadoop cluster to GCS is to use the Google Cloud Storage connector for Hadoop. The instructions on that site are mostly for running Hadoop on Google Compute Engine VMs, but you can also download the GCS connector directly, either gcs-connector-1.2.8-hadoop1.jar if you're using Hadoop 1.x or Hadoop 0.20.x, or gcs-connector-1.2.8-hadoop2.jar for Hadoop 2.x or Hadoop 0.23.x.

Simply copy the jarfile into your hadoop/lib dir or $HADOOP_COMMON_LIB_JARS_DIR in the case of Hadoop 2:

cp ~/Downloads/gcs-connector-1.2.8-hadoop1.jar /your/hadoop/dir/lib/
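
If you're running Hadoop 2.x with $HADOOP_COMMON_LIB_JARS_DIR set, the equivalent copy uses the hadoop2 build of the connector mentioned above (the download path is just an example):

# Hadoop 2.x: copy the hadoop2 build of the connector instead.
cp ~/Downloads/gcs-connector-1.2.8-hadoop2.jar $HADOOP_COMMON_LIB_JARS_DIR/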

You may also need to add the following to your hadoop/conf/hadoop-env.sh file if you're running 0.20.x:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/your/hadoop/dir/lib/gcs-connector-1.2.8-hadoop1.jar

Then, you'll likely want to use service-account "keyfile" authentication, since you're on an on-premises Hadoop cluster. Visit cloud.google.com/console, find APIs & auth on the left-hand side, and click Credentials. If you don't already have one, click Create new Client ID and select Service account before clicking Create client id. For now, the connector requires a ".p12" type of keypair, so click Generate new P12 key and keep track of the .p12 file that gets downloaded. It may be convenient to rename it before placing it in a directory more easily accessible from Hadoop, e.g.:

cp ~/Downloads/*.p12 /path/to/hadoop/conf/gcskey.p12

Add the following entries to your core-site.xml file in your Hadoop conf dir:

<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>your-ascii-google-project-id</value>
</property>
<property>
  <name>fs.gs.system.bucket</name>
  <value>some-bucket-your-project-owns</value>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
</property>
<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>your-service-account-email@developer.gserviceaccount.com</value>
</property>
<property>
  <name>fs.gs.auth.service.account.keyfile</name>
  <value>/path/to/hadoop/conf/gcskey.p12</value>
</property>

The fs.gs.system.bucket generally won't be used except in some cases for mapred temp files; you may want to just create a new one-off bucket for that purpose. With those settings on your master node, you should already be able to test hadoop fs -ls gs://the-bucket-you-want-to-list. At this point, you can already try to funnel all the data out of the master node with a simple hadoop fs -cp hdfs://yourhost:yourport/allyourdata gs://your-bucket.
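
Spelled out as commands (the bucket name, host, and port are the placeholders from above):

# List a bucket to confirm the connector and the service-account credentials work.
hadoop fs -ls gs://the-bucket-you-want-to-list

# Copy the data out of HDFS straight into GCS from the master node.
hadoop fs -cp hdfs://yourhost:yourport/allyourdata gs://your-bucket/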

If you want to speed it up using Hadoop's distcp, sync the lib/gcs-connector-1.2.8-hadoop1.jar and conf/core-site.xml to all your Hadoop nodes, and it should all work as expected. Note that there's no need to restart datanodes or namenodes.
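
A minimal distcp invocation might look like the sketch below; the HDFS path and bucket name are the same placeholders as above:

# Distributed copy: the work is spread across the cluster's task slots
# instead of being funneled through the master node.
hadoop distcp hdfs://yourhost:yourport/allyourdata gs://your-bucket/allyourdata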

Question 2: While the GCS connector for Hadoop is able to copy directly from HDFS without ever needing an extra disk buffer, GSUtil cannot, since it has no way of interpreting the HDFS protocol; it only knows how to deal with actual local filesystem files or, as you said, GCS/S3 files.
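
If you did go the GSUtil route anyway, the two-hop workflow you describe would look roughly like this sketch (the staging directory and bucket name are hypothetical), which is exactly the extra disk hop the connector avoids:

# Stage the data from HDFS onto the local disk of the machine running GSUtil...
hadoop fs -get hdfs://yourhost:yourport/allyourdata /tmp/staging
# ...then push the staged copy up to GCS.
gsutil -m cp -r /tmp/staging gs://your-bucket/allyourdata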

Question 3: The benefit of using the Java API is flexibility; you can choose how to handle errors, retries, buffer sizes, etc., but it takes more work and planning. Using gsutil is good for quick use cases, and you inherit a lot of error handling and testing from the Google teams. The GCS connector for Hadoop is actually built directly on top of the Java API, and since it's all open source, you can see what kinds of things it takes to make it work smoothly in its source code on GitHub: https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageImpl.java
