"No Filesystem for Scheme: gs" when running spark job locally


Problem description

I am running a Spark job (version 1.2.0), and the input is a folder inside a Google Cloud Storage bucket (i.e. gs://mybucket/folder).

When running the job locally on my Mac machine, I am getting the following error:

5932 [main] ERROR com.doit.customer.dataconverter.Phase1 - Job for date: 2014_09_23 failed with error: No FileSystem for scheme: gs

I know that two things need to be done in order for gs paths to be supported. One is to install the GCS connector, and the other is to have the following setup in the core-site.xml of the Hadoop installation:

<property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>The FileSystem for gs: (GCS) uris.</description>
</property>
<property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    <description>
     The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.
    </description>
</property>
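
The connector jar itself also needs to be on the local classpath; one way is to pull it in through Maven as well. A minimal sketch, assuming the com.google.cloud.bigdataoss:gcs-connector artifact from Maven Central (the version shown is illustrative; it should be a hadoop1 build matching Hadoop 1.2.1):

<dependency> <!-- GCS connector; version is illustrative, check Maven Central -->
    <groupId>com.google.cloud.bigdataoss</groupId>
    <artifactId>gcs-connector</artifactId>
    <version>1.3.0-hadoop1</version>
</dependency>

An already-downloaded gcs-connector jar can equally just be added to the project classpath by hand.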

I think my problem comes from the fact that I am not sure where exactly each piece needs to be configured in this local mode. In the IntelliJ project, I am using Maven, so I imported the Spark library as follows:

<dependency> <!-- Spark dependency -->
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.0</version>
    <exclusions>
        <exclusion>  <!-- declare the exclusion here -->
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </exclusion>
    </exclusions>
</dependency>

and Hadoop 1.2.1 as follows:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>1.2.1</version>
</dependency>

The thing is, I am not sure where the Hadoop location is configured for Spark, and also where the Hadoop conf is configured. Therefore, I may be adding the settings to the wrong Hadoop installation. In addition, is there something that needs to be restarted after modifying the files? As far as I can tell, there is no Hadoop service running on my machine.

Solution

There are a couple of ways to help Spark pick up the relevant Hadoop configuration, both involving modifying ${SPARK_INSTALL_DIR}/conf:

  1. Copy or symlink your ${HADOOP_HOME}/conf/core-site.xml into ${SPARK_INSTALL_DIR}/conf/core-site.xml. For example, when bdutil installs onto a VM, it runs:

    ln -s ${HADOOP_CONF_DIR}/core-site.xml ${SPARK_INSTALL_DIR}/conf/core-site.xml
    

Older Spark docs explain that this automatically includes the XML file on Spark's classpath: https://spark.apache.org/docs/0.9.1/hadoop-third-party-distributions.html

  2. Add an entry to ${SPARK_INSTALL_DIR}/conf/spark-env.sh with:

    export HADOOP_CONF_DIR=/full/path/to/your/hadoop/conf/dir
    

Newer Spark docs seem to indicate this as the preferred method going forward: https://spark.apache.org/docs/1.1.0/hadoop-third-party-distributions.html
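
When there is no local Spark installation at all (e.g. running straight from IntelliJ in local mode), the same settings can also be supplied programmatically instead of through conf files. A minimal Scala sketch, assuming the GCS connector jar is on the classpath and reusing the property names from the question's core-site.xml; SparkConf keys prefixed with spark.hadoop. should be copied into the Hadoop Configuration that Spark builds:

    import org.apache.spark.{SparkConf, SparkContext}

    // Local SparkContext; the spark.hadoop.* entry is forwarded into the
    // Hadoop Configuration, registering the gs: filesystem implementation.
    val conf = new SparkConf()
      .setAppName("Phase1-local")
      .setMaster("local[*]")
      .set("spark.hadoop.fs.gs.impl",
           "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")

    val sc = new SparkContext(conf)

    // Remaining connector settings from core-site.xml (project id, auth, etc.)
    // can be set directly on the context's Hadoop configuration as well:
    sc.hadoopConfiguration.set("fs.gs.project.id", "<your-project-id>")

Whichever approach is used, sc.hadoopConfiguration.get("fs.gs.impl") in spark-shell is a quick way to check that the GoogleHadoopFileSystem mapping was actually picked up.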
