Hadoop 2.4.1 and Google Cloud Storage connector for Hadoop


Question


I am trying to run Oryx on top of Hadoop using Google's Cloud Storage Connector for Hadoop: https://cloud.google.com/hadoop/google-cloud-storage-connector

I prefer to use Hadoop 2.4.1 with Oryx, so I use the hadoop2_env.sh set-up for the hadoop cluster I create on google compute engine, e.g.:

./bdutil -b <BUCKET_NAME> -n 2 --env_var_files hadoop2_env.sh \
--default_fs gs --prefix <PREFIX_NAME> deploy

I face two main problems when I try to run oryx using hadoop.

1) Despite confirming that my hadoop conf directory matches what is expected for the google installation on compute engine, e.g.:

$ echo $HADOOP_CONF_DIR
/home/hadoop/hadoop-install/etc/hadoop

I still find something is looking for a /conf directory, e.g.:

Caused by: java.lang.IllegalStateException: Not a directory: /etc/hadoop/conf

My understanding is that ../etc/hadoop should serve as the /conf directory (see, e.g., "hadoop: configuration files").

And while I shouldn't need to make any changes, this problem is only resolved when I copy the config files into a newly created directory, e.g.:

sudo mkdir /etc/hadoop/conf
sudo cp /home/hadoop/hadoop-install/etc/hadoop/* /etc/hadoop/conf
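
In hindsight, a symlink would presumably serve the same purpose as the copy (and avoid the two sets of files drifting apart); I note it here only as an alternative I could have used, not something from the Google docs:

# alternative to copying: point /etc/hadoop/conf at the installed config directory
sudo mkdir -p /etc/hadoop
sudo ln -s /home/hadoop/hadoop-install/etc/hadoop /etc/hadoop/conf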

So why is this? Is this a result of using the google hadoop connector?

2) After "resolving" the issue above, I find additional errors which seem (to me) to be related to communication between the hadoop cluster and the google file system:

Wed Oct 01 20:18:30 UTC 2014 WARNING Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Wed Oct 01 20:18:30 UTC 2014 INFO Namespace prefix: hdfs://BUCKET_NAME

Wed Oct 01 20:18:30 UTC 2014 SEVERE Unexpected error in execution
java.lang.ExceptionInInitializerError
    at com.cloudera.oryx.common.servcomp.StoreUtils.listGenerationsForInstance(StoreUtils.java:50)
    at com.cloudera.oryx.computation.PeriodicRunner.run(PeriodicRunner.java:173)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: resistance-prediction
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:258)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:153)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:602)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:547)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:139)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2625)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at com.cloudera.oryx.common.servcomp.Store.<init>(Store.java:76)
    at com.cloudera.oryx.common.servcomp.Store.<init>(Store.java:57)
    ... 9 more

Caused by: java.net.UnknownHostException: BUCKET_NAME
    ... 22 more

What seems relevant to me is that the namespace prefix is hdfs:// even though I set the default file system to gs://.

Perhaps this is leading to the UnknownHostException?

Note that I have "confirmed" the hadoop cluster is connected to the google file system: hadoop fs -ls yields the contents of my google cloud storage bucket and all the expected contents of the gs://BUCKET_NAME directory. However, I am not familiar with how hadoop manifests itself through the google connector, and the traditional way I usually check whether the hadoop cluster is running, i.e. jps, only yields 6440 Jps rather than listing all the nodes. I am running this command from the master node of the hadoop cluster (i.e. PREFIX_NAME-m), and I am not sure what output to expect when using the google cloud storage connector for hadoop.
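
One thing I have not ruled out (purely an assumption on my part) is that the daemons are simply running under a different user account, in which case jps from my own account would only show Jps itself. Something like the following would check for that:

# look for Hadoop daemon processes regardless of which user owns them
ps aux | grep -E 'NameNode|DataNode|ResourceManager|NodeManager' | grep -v grep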

So, how can I resolve these errors and have my oryx job (via hadoop) successfully access data in my gs://BUCKET_NAME directory?

Thanks in advance for any insights or suggestions.

UPDATE: Thanks for the very detailed response. As a work-around I "hard coded" gs:// into oryx by changing:

  prefix = "hdfs://" + host + ':' + port;
} else {
  prefix = "hdfs://" + host;

to:

  prefix = "gs://" + host + ':' + port;
} else {
  prefix = "gs://" + host;

I now get the following errors:

Tue Oct 14 20:24:50 UTC 2014 SEVERE Unexpected error in execution
java.lang.ExceptionInInitializerError
    at com.cloudera.oryx.common.servcomp.StoreUtils.listGenerationsForInstance(StoreUtils.java:50)
    at com.cloudera.oryx.computation.PeriodicRunner.run(PeriodicRunner.java:173)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1905)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2573)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2586)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2625)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2607)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
    at com.cloudera.oryx.common.servcomp.Store.<init>(Store.java:76)
    at com.cloudera.oryx.common.servcomp.Store.<init>(Store.java:57)

As per the instructions here: https://cloud.google.com/hadoop/google-cloud-storage-connector#classpath I believe I have added the connector jar to Hadoop's classpath; I added:

HADOOP_CLASSPATH=$HADOOP_CLASSPATH:'https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.9-hadoop2.jar'

to /home/rich/hadoop-env-setup.sh, and echo $HADOOP_CLASSPATH yields:

/contrib/capacity-scheduler/*.jar:/home/hadoop/hadoop-install/share/hadoop/common/lib/gcs-connector-1.2.9-hadoop2.jar:/contrib/capacity-scheduler/*.jar:/home/hadoop/hadoop-install/share/hadoop/common/lib/gcs-connector-1.2.9-hadoop2.jar

Do I need to add more to the class path?
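
For reference, a generic sanity check I can run (not something from the connector docs) to see what actually ends up on the runtime classpath, and that the jar is a readable local file:

# list the runtime classpath entries and look for the connector
hadoop classpath | tr ':' '\n' | grep gcs-connector
# confirm the jar itself exists locally
ls -l /home/hadoop/hadoop-install/share/hadoop/common/lib/gcs-connector-1.2.9-hadoop2.jar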

I also note (perhaps related) that I still get the /etc/hadoop/conf error even with the export commands. I have been using sudo mkdir /etc/hadoop/conf as a temporary workaround. I mention it here in case it is leading to additional issues.

Solution

There appear to be a couple of problems. The first is that normally, when things are run under hadoop jar, hadoop injects the various system environment variables, classpaths, etc., into the program being run; in your case, since Oryx runs without using hadoop jar, instead using something like:

java -Dconfig.file=oryx.conf -jar computation/target/oryx-computation-x.y.z.jar

then $HADOOP_CONF_DIR doesn't actually make it into the environment, so System.getenv in OryxConfiguration.java fails to pick it up and falls back to the default /etc/hadoop/conf value. This is solved simply with the export command, which you can test by checking whether the variable makes it into a subshell:

echo $HADOOP_CONF_DIR
bash -c 'echo $HADOOP_CONF_DIR'
export HADOOP_CONF_DIR
bash -c 'echo $HADOOP_CONF_DIR'
java -Dconfig.file=oryx.conf -jar computation/target/oryx-computation-x.y.z.jar
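
If exporting it does the trick, one way to make it persist across sessions (just a sketch, assuming a bash login shell and the install path you quoted above) is to append it to your profile before launching Oryx:

# persist HADOOP_CONF_DIR for future shells
echo 'export HADOOP_CONF_DIR=/home/hadoop/hadoop-install/etc/hadoop' >> ~/.bashrc
source ~/.bashrc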

The second, and more unfortunate, issue is that Oryx appears to hard-code 'hdfs' rather than allowing any filesystem scheme set by the user:

private Namespaces() {
  Config config = ConfigUtils.getDefaultConfig();
  boolean localData;
  if (config.hasPath("model.local")) {
    log.warn("model.local is deprecated; use model.local-data");
    localData = config.getBoolean("model.local");
  } else {
    localData = config.getBoolean("model.local-data");
  }
  if (localData) {
    prefix = "file:";
  } else {
    URI defaultURI = FileSystem.getDefaultUri(OryxConfiguration.get());
    String host = defaultURI.getHost();
    Preconditions.checkNotNull(host,
        "Hadoop FS has no host? Did you intent to set model.local-data=true?");
    int port = defaultURI.getPort();
    if (port > 0) {
      prefix = "hdfs://" + host + ':' + port;
    } else {
      prefix = "hdfs://" + host;
    }
  }
  log.info("Namespace prefix: {}", prefix);
}

It all depends on whether Oryx intends to add support for other filesystem schemes in the future, but in the meantime, you would either have to change the Oryx code yourself and recompile, or you could attempt to hack around it (but with potential for pieces of Oryx which have a hard dependency on HDFS to fail).

The change to Oryx should theoretically just be:

    String scheme = defaultURI.getScheme();
    if (port > 0) {
      prefix = scheme + "://" + host + ':' + port;
    } else {
      prefix = scheme + "://" + host;
    }

However, if you do go this route, keep in mind the eventually consistent list semantics of GCS, where multi-stage workflows must not rely on "list" operations to immediately find all the outputs of a previous stage; Oryx may or may not have such a dependency.

The most reliable solution in your case would be to deploy with --default_fs hdfs, where bdutil will still install the gcs-connector so that you can run hadoop distcp to move your data from GCS to HDFS temporarily, run Oryx, and then once finished, copy it back out into GCS.
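
A rough sketch of that workflow (the input/output paths below are placeholders of my own, adjust them to however your data is laid out):

# stage the input data from GCS into HDFS (hypothetical paths)
hadoop distcp gs://BUCKET_NAME/input hdfs:///user/hadoop/input

# run Oryx against HDFS as usual
java -Dconfig.file=oryx.conf -jar computation/target/oryx-computation-x.y.z.jar

# copy the results back out to GCS once the job finishes
hadoop distcp hdfs:///user/hadoop/output gs://BUCKET_NAME/output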
