How to set PYTHONHASHSEED on AWS EMR

Problem description

Is there any way to set an environment variable on all nodes of an EMR cluster?

I am getting an error about the hash seed when trying to use reduceByKey() in Python3 PySpark. I can see this is a known issue, and that the environment variable PYTHONHASHSEED needs to be set to the same value on all nodes of the cluster, but I haven't had any luck with it.
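For context, the failure can be reproduced with something along these lines (a minimal, assumed sketch, not taken from my actual job); with Python 3 each worker picks its own random hash seed, so PySpark aborts when string keys need to be partitioned:

from pyspark import SparkContext

sc = SparkContext(appName="hashseed-repro")

# reduceByKey() hashes the string keys to decide which partition each pair goes to
pairs = sc.parallelize([("foo", 1), ("bar", 2), ("foo", 3)])
print(pairs.reduceByKey(lambda a, b: a + b).collect())

# On EMR with Python 3 and no PYTHONHASHSEED set, this fails with an error similar to:
# Exception: Randomness of hash of string should be disabled via PYTHONHASHSEED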

I have tried adding a variable to spark-env through the cluster configuration:

[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3",
          "PYTHONHASHSEED": "123"
        }
      }
    ]
  },
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
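(Such a configuration is normally supplied when the cluster is created, e.g. with the AWS CLI; the sketch below is only an illustration, and the instance settings and file name are placeholders rather than my exact setup:)

aws emr create-cluster \
  --release-label emr-4.8.2 \
  --applications Name=Spark \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://./spark-config.json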

but this doesn't work. I have also tried adding a bootstrap script:

#!/bin/bash
export PYTHONHASHSEED=123
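(For completeness: a script like this has to live in S3 and be attached as a bootstrap action when the cluster is created; the bucket and file names below are placeholders, a sketch rather than my exact commands:)

# upload the script, then reference it at cluster creation time, e.g. by adding
#   --bootstrap-actions Path=s3://my-bucket/bootstrap/set-pythonhashseed.sh
# to the aws emr create-cluster call
aws s3 cp set-pythonhashseed.sh s3://my-bucket/bootstrap/set-pythonhashseed.sh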

but this also doesn't seem to do the trick.

Answer

I believe that /usr/bin/python3 isn't picking up the environment variable PYTHONHASHSEED that you are defining in the cluster configuration under the spark-env scope.

You should use python34 instead of /usr/bin/python3 and set the configuration as follows:

[
   {
      "classification":"spark-defaults",
      "properties":{
         // [...]
      }
   },
   {
      "configurations":[
         {
            "classification":"export",
            "properties":{
               "PYSPARK_PYTHON":"python34",
               "PYTHONHASHSEED":"123"
            }
         }
      ],
      "classification":"spark-env",
      "properties":{
        // [...]
      }
   }
]

Now, let's test it. I defined a bash script that calls both pythons:

#!/bin/bash

echo "using python34"
for i in `seq 1 10`;
  do
    python -c "print(hash('foo'))";
  done
echo "----------------------"
echo "using /usr/bin/python3"
for i in `seq 1 10`;
  do
    /usr/bin/python3 -c "print(hash('foo'))";
  done

The verdict:

[hadoop@ip-10-0-2-182 ~]$ bash test.sh
using python34
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
----------------------
using /usr/bin/python3
8867846273747294950
-7610044127871105351
6756286456855631480
-4541503224938367706
7326699722121877093
3336202789104553110
3462714165845110404
-5390125375246848302
-7753272571662122146
8018968546238984314

PS1: I am using AMI release emr-4.8.2.

PS2: Excerpted from this answer.

EDIT: I have tested the following using pyspark.

16/11/22 07:16:56 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1479798580078_0001
16/11/22 07:16:56 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
SparkContext available as sc, HiveContext available as sqlContext.
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580

Also created a simple application (simple_app.py):

from pyspark import SparkContext

sc = SparkContext(appName = "simple-app")

numbers = [hash('foo') for i in range(10)]

print(numbers)

Which also seems to run perfectly:

[hadoop@ip-*** ~]$ spark-submit --master yarn simple_app.py 

Output (truncated):

[...]
16/11/22 07:28:42 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
[-5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594] // THE RELEVANT LINE IS HERE.
16/11/22 07:28:42 INFO SparkContext: Invoking stop() from shutdown hook
[...]

As you can see, it also works, returning the same hash each time.

EDIT 2: From the comments, it seems like you are trying to compute hashes on the executors and not the driver, so you'll need to set spark.executorEnv.PYTHONHASHSEED in your Spark application configuration so it can be propagated to the executors (this is one way to do it).

Note: Setting environment variables on the executors works the same way as for the YARN client: use spark.executorEnv.[EnvironmentVariableName].

Thus the following minimalist example with simple_app.py:

from pyspark import SparkContext, SparkConf

# Propagate the hash seed to every executor through the executor environment.
conf = SparkConf().set("spark.executorEnv.PYTHONHASHSEED", "123")
sc = SparkContext(appName="simple-app", conf=conf)

# hash() now runs on the executors, not on the driver.
numbers = sc.parallelize(['foo'] * 10).map(lambda x: hash(x)).collect()

print(numbers)

And now let's test it again. Here is the truncated output:

16/11/22 14:14:34 INFO DAGScheduler: Job 0 finished: collect at /home/hadoop/simple_app.py:6, took 14.251514 s
[-5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594]
16/11/22 14:14:34 INFO SparkContext: Invoking stop() from shutdown hook
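The same property can also be passed on the command line instead of being hard-coded in the application, for example (a sketch, assuming the earlier version of simple_app.py without the SparkConf call):

spark-submit --master yarn --conf spark.executorEnv.PYTHONHASHSEED=123 simple_app.py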

I think that this covers everything.
