What is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with flume?

Question

I would like to write data from flume-ng to Google Cloud Storage. It is a little bit complicated, because I observed a very strange behavior. Let me explain:

I've launched a Hadoop cluster on Google Cloud (one-click deployment), set up to use a bucket.

When I SSH into the master and add a file with the hdfs command, I can see it immediately in my bucket:

$ hadoop fs -ls /
14/11/27 15:01:41 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.9-hadoop2
Found 1 items
-rwx------   3 hadoop hadoop         40 2014-11-27 13:45 /test.txt

But when I try to add and then read a file from my own computer, it seems to use some other HDFS. Here I added a file called jp.txt, and it doesn't show my previous file test.txt:

$ hadoop fs -ls hdfs://ip.to.my.cluster/
Found 1 items
-rw-r--r--   3 jp supergroup          0 2014-11-27 14:57 hdfs://ip.to.my.cluster/jp.txt

This is also what I see at http://ip.to.my.cluster:50070/explorer.html#/.

When I list files in my bucket with the web console (https://console.developers.google.com/project/my-project-id/storage/my-bucket/), I can only see test.txt and not jp.txt.

I read Hadoop cannot connect to Google Cloud Storage and configured my Hadoop client accordingly (pretty hard stuff), and now I can see items in my bucket. But for that, I need to use a gs:// URI:

$ hadoop fs -ls gs://my-bucket/
14/11/27 15:57:46 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.3.0-hadoop2
Found 1 items
-rwx------   3 jp jp         40 2014-11-27 14:45 gs://my-bucket/test.txt

Observation / Intermediate conclusion

So it seems there are two different storage engines in the same cluster: "traditional HDFS" (paths starting with hdfs://) and a Google Cloud Storage bucket (paths starting with gs://).

Users and rights are different, depending on where you are listing files from.

The main question is: what is the minimal setup needed to write to HDFS/GS on Google Cloud Storage with Flume?

  • Do I need to launch a Hadoop cluster on Google Cloud to reach my goal?
  • Is it possible to write directly to a Google Cloud Storage bucket? If so, how should Flume be configured (additional jars, redefining the classpath...)?
  • Why are there two storage engines in the same cluster (classic HDFS / a GS bucket)?

Here is my Flume configuration:
a1.sources = http
a1.sinks = hdfs_sink
a1.channels = mem

# Describe/configure the source
a1.sources.http.type =  org.apache.flume.source.http.HTTPSource
a1.sources.http.port = 9000

# Describe the sink
a1.sinks.hdfs_sink.type = hdfs
a1.sinks.hdfs_sink.hdfs.path = hdfs://ip.to.my.cluster:8020/%{env}/%{tenant}/%{type}/%y-%m-%d
a1.sinks.hdfs_sink.hdfs.filePrefix = %H-%M-%S_
a1.sinks.hdfs_sink.hdfs.fileSuffix = .json
a1.sinks.hdfs_sink.hdfs.round = true
a1.sinks.hdfs_sink.hdfs.roundValue = 10
a1.sinks.hdfs_sink.hdfs.roundUnit = minute

# Use a channel which buffers events in memory
a1.channels.mem.type = memory
a1.channels.mem.capacity = 1000
a1.channels.mem.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.http.channels = mem
a1.sinks.hdfs_sink.channel = mem

Does the a1.sinks.hdfs_sink.hdfs.path line accept a gs:// path?

What setup would be needed in that case (additional jars, classpath)?

Thanks!

Answer

As you observed, it's actually fairly common to be able to access different storage systems from the same Hadoop cluster, based on the scheme:// of the URI you use with hadoop fs. The cluster you deployed on Google Compute Engine also has both filesystems available; it just happens to have the "default" set to gs://your-configbucket.
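
For instance (assuming the config bucket really is named your-configbucket, as above), the following two commands would list the same contents on the one-click cluster, since schemeless paths resolve against the default filesystem:

$ hadoop fs -ls /
$ hadoop fs -ls gs://your-configbucket/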

The reason you had to include the gs://configbucket/file instead of just plain /file on your local cluster is that in your one-click deployment, we additionally included a key in your Hadoop's core-site.xml, setting fs.default.name to be gs://configbucket/. You can achieve the same effect on your local cluster to make it use GCS for all the schemeless paths; in your one-click cluster, check out /home/hadoop/hadoop-install/core-site.xml for a reference of what you might carry over to your local setup.
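
As a rough illustration (property name as used above; on Hadoop 2 the equivalent key is fs.defaultFS, and your actual bucket name will differ), the core-site.xml entry would look along these lines:

<property>
  <name>fs.default.name</name>
  <value>gs://configbucket/</value>
  <description>Use the GCS bucket as the default filesystem for schemeless paths.</description>
</property>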

To explain the internals of Hadoop a bit, the reason hdfs:// paths work normally is actually because there is a configuration key which in theory can be overridden in Hadoop's core-site.xml file, which by default sets:

<property>
  <name>fs.hdfs.impl</name>
  <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
  <description>The FileSystem for hdfs: uris.</description>
</property>

Similarly, you may have noticed that to get gs:// to work on your local cluster, you provided fs.gs.impl. This is because DistributedFileSystem and GoogleHadoopFileSystem both implement the same Hadoop Java interface, FileSystem, and Hadoop is built to be agnostic to how an implementation chooses to actually implement the FileSystem methods. This also means that at the most basic level, anywhere you could normally use hdfs:// you should be able to use gs://.
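
For comparison, the gs:// mapping added on the local cluster looks roughly like this (class names as shipped with the GCS connector; the second key is only consulted on Hadoop 2, so treat this as a sketch rather than a complete connector configuration):

<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  <description>The FileSystem for gs: uris.</description>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  <description>The AbstractFileSystem for gs: uris (Hadoop 2).</description>
</property>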

So, to answer your questions:

  1. The same minimal setup you'd use to get Flume working with a typical HDFS-based setup should work for using GCS as a sink.
  2. You don't need to launch the cluster on Google Compute Engine, though it'd be easier, as you experienced with the more difficult manual instructions for using the GCS connector on your local setup. But since you already have a local setup running, it's up to you whether Google Compute Engine will be an easier place to run your Hadoop/Flume cluster.
  3. Yes, as mentioned above, you should experiment with replacing hdfs:// paths with gs:// paths instead, and/or setting fs.default.name to be your root gs://configbucket path (see the sketch after this list).
  4. Having the two storage engines allows you to more easily switch between the two in case of incompatibilities. There are some minor differences in supported features, for example GCS won't have the same kinds of posix-style permissions you have in HDFS. Also, it doesn't support appends to existing files or symlinks.
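
To make points 1 and 3 concrete, here is a hedged sketch of the changes to the Flume agent from the question. The jar name and paths are illustrative; the GCS connector jar plus a Hadoop conf directory containing the gs:// settings need to be visible on Flume's classpath (for example via FLUME_CLASSPATH in flume-env.sh):

# flume-env.sh (illustrative paths): expose the GCS connector and the Hadoop
# configuration that defines fs.gs.impl / fs.default.name to the Flume agent.
# FLUME_CLASSPATH="/path/to/gcs-connector-1.3.0-hadoop2.jar:/etc/hadoop/conf"

# flume.conf: point the HDFS sink at the bucket instead of the namenode.
a1.sinks.hdfs_sink.hdfs.path = gs://my-bucket/%{env}/%{tenant}/%{type}/%y-%m-%d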
