How do I set an environment variable in a YARN Spark job?


Problem description


I'm attempting to access Accumulo 1.6 from an Apache Spark job (written in Java) by using an AccumuloInputFormat with newAPIHadoopRDD. In order to do this, I have to tell the AccumuloInputFormat where to locate ZooKeeper by calling the setZooKeeperInstance method. This method takes a ClientConfiguration object which specifies various relevant properties.

I'm creating my ClientConfiguration object by calling the static loadDefault method. This method is supposed to look in various places for a client.conf file to load its defaults from. One of the places it's supposed to look is $ACCUMULO_CONF_DIR/client.conf.

Therefore, I am attempting to set the ACCUMULO_CONF_DIR environment variable in such a way that it will be visible when Spark runs the job (for reference, I'm attempting to run in the yarn-cluster deployment mode). I have not yet found a way to do that successfully.

So far, I've tried:

  • Calling setExecutorEnv("ACCUMULO_CONF_DIR", "/etc/accumulo/conf") on the SparkConf
  • Exporting ACCUMULO_CONF_DIR in spark-env.sh
  • Setting spark.executorEnv.ACCUMULO_CONF_DIR in spark-defaults.conf

None of them have worked. When I print the environment before calling setZooKeeperInstance, ACCUMULO_CONF_DIR does not appear.
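For reference, the environment check described above can be done with a loop over `System.getenv()`. This is a minimal standalone sketch (the surrounding Spark job is omitted); in the failing case, `ACCUMULO_CONF_DIR` simply does not appear in the printed list:

```java
import java.util.Map;

public class EnvCheck
{
    public static void main(String[] args)
    {
        // Print every environment variable visible to this JVM.
        for (Map.Entry<String, String> entry : System.getenv().entrySet()) {
            System.out.println(entry.getKey() + "=" + entry.getValue());
        }
        // The specific variable the job needs:
        System.out.println("ACCUMULO_CONF_DIR set: "
            + System.getenv().containsKey("ACCUMULO_CONF_DIR"));
    }
}
```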

If it's relevant, I'm using the CDH5 versions of everything.

Here's an example of what I'm trying to do (imports and exception handling left out for brevity):

public class MySparkJob
{
    public static void main(String[] args)
    {
        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("MySparkJob");
        sparkConf.setExecutorEnv("ACCUMULO_CONF_DIR", "/etc/accumulo/conf");
        JavaSparkContext sc = new JavaSparkContext(sparkConf);
        Job accumuloJob = Job.getInstance(sc.hadoopConfiguration());
        // Foreach loop to print environment, shows no ACCUMULO_CONF_DIR
        ClientConfiguration accumuloConfiguration = ClientConfiguration.loadDefault();
        AccumuloInputFormat.setZooKeeperInstance(accumuloJob, accumuloConfiguration);
        // Other calls to AccumuloInputFormat static functions to configure it properly.
        JavaPairRDD<Key, Value> accumuloRDD =
            sc.newAPIHadoopRDD(accumuloJob.getConfiguration(),
                               AccumuloInputFormat.class,
                               Key.class,
                               Value.class);
    }
}

Solution

So I discovered the answer to this while writing the question (sorry, reputation seekers). The problem is that CDH5 uses Spark 1.0.0, and that I was running the job via YARN. Apparently, YARN mode does not pay any attention to the executor environment and instead uses the environment variable SPARK_YARN_USER_ENV to control its environment. So ensuring SPARK_YARN_USER_ENV contains ACCUMULO_CONF_DIR=/etc/accumulo/conf works, and makes ACCUMULO_CONF_DIR visible in the environment at the indicated point in the question's source example.
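Concretely, that means exporting the variable in the shell that launches the driver, before invoking `spark-submit`. A sketch (the jar name is illustrative; `SPARK_YARN_USER_ENV` takes comma-separated `KEY=VALUE` pairs):

```shell
# Under Spark 1.0.0 on YARN, this variable -- not setExecutorEnv -- controls
# the environment of the YARN containers.
export SPARK_YARN_USER_ENV="ACCUMULO_CONF_DIR=/etc/accumulo/conf"

spark-submit \
  --master yarn-cluster \
  --class MySparkJob \
  my-spark-job.jar   # illustrative jar name
```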

This difference in how standalone mode and YARN mode work resulted in SPARK-1680, which is reported as fixed in Spark 1.1.0.
