Accessing Google Cloud Storage using the Hadoop FileSystem API

Question

From my machine, I've configured the Hadoop core-site.xml to recognize the gs:// scheme and added gcs-connector-1.2.8.jar as a Hadoop lib. I can run hadoop fs -ls gs://mybucket/ and get the expected results. However, if I try to do the analogue from Java using:

Configuration conf = new Configuration();
// FileSystem.get(conf) returns the default filesystem configured in core-site.xml
FileSystem fs = FileSystem.get(conf);
FileStatus[] status = fs.listStatus(new Path("gs://mybucket/"));

I get the files under root in my local HDFS instead of the contents of gs://mybucket/, but with gs://mybucket prepended to those file paths. If I modify the conf with conf.set("fs.default.name", "gs://mybucket"); before obtaining the fs, then I can see the files on GCS.
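
For reference, a minimal sketch of that workaround, assuming gcs-connector-1.2.8.jar is on the classpath and the gs:// scheme is wired up in core-site.xml (on Hadoop 2.x the key is fs.defaultFS, with fs.default.name kept as a deprecated alias):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// Make gs://mybucket the default filesystem so FileSystem.get() returns the GCS connector
conf.set("fs.default.name", "gs://mybucket");
FileSystem fs = FileSystem.get(conf);
// Paths now resolve against the bucket instead of the local HDFS root
FileStatus[] status = fs.listStatus(new Path("gs://mybucket/"));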

My questions are:
1. Is this expected behavior?
2. Is there a disadvantage to using the Hadoop FileSystem API as opposed to the Google Cloud Storage client API?

Answer

As to your first question, "expected" is questionable, but I think I can at least explain. When FileSystem.get() is used, the default FileSystem is returned, and by default that is HDFS. My guess is that the HDFS client (DistributedFileSystem) has code to automatically prepend scheme + authority to all files in the filesystem.
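
A related observation: you can sidestep the default filesystem entirely by asking for a filesystem by URI. A short sketch using the standard FileSystem.get(URI, Configuration) overload:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
// Resolve the FileSystem implementation from the gs:// scheme rather than from fs.default.name
FileSystem gcsFs = FileSystem.get(URI.create("gs://mybucket/"), conf);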

Try using:

// Path.getFileSystem resolves the filesystem from the path's scheme
FileSystem gcsFs = new Path("gs://mybucket/").getFileSystem(conf);
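
Once you have gcsFs, operations go through the GCS connector regardless of what fs.default.name points at. A quick usage sketch (the bucket name is the one from your question):

// List the bucket through the explicitly resolved GCS filesystem
for (FileStatus f : gcsFs.listStatus(new Path("gs://mybucket/"))) {
    System.out.println(f.getPath());
}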

On disadvantages, I could probably argue that if you end up needing to access the object store directly, then you'll end up writing code to interact with the storage APIs directly anyway (and there are things that do not translate very well to the Hadoop FS API, e.g., object composition, or complex object-write preconditions beyond simple object overwrite protection).

I am admittedly biased (I work on the team), but if you're intending to use GCS from Hadoop Map/Reduce, from Spark, etc., the GCS connector for Hadoop should be a fairly safe bet.
