How to set PYTHONHASHSEED on AWS EMR
Question
Is there any way to set an environment variable on all nodes of an EMR cluster?
I am getting an error about the hash seed when trying to use reduceByKey() in Python3 PySpark. I can see this is a known issue, and that the environment variable PYTHONHASHSEED needs to be set to the same value on all nodes of the cluster, but I haven't had any luck with it.
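Roughly the shape of the job that trips this (a simplified sketch with illustrative names and data, not my real job):

from pyspark import SparkContext

sc = SparkContext(appName="hashseed-repro")
# reduceByKey() partitions records by the hash of the string key; under Python 3
# this errors out unless PYTHONHASHSEED is pinned, because each worker process
# would otherwise use its own random hash seed.
pairs = sc.parallelize([("foo", 1), ("bar", 1), ("foo", 1)])
print(pairs.reduceByKey(lambda a, b: a + b).collect())
sc.stop()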
I have tried adding a variable to spark-env through the cluster configuration:
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "PYSPARK_PYTHON": "/usr/bin/python3",
          "PYTHONHASHSEED": "123"
        }
      }
    ]
  },
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
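For completeness, that JSON gets supplied when the cluster is created, along these lines (a sketch; the cluster name, release label, instance settings, and file path are placeholders, not my exact command):

aws emr create-cluster \
    --name "python3-spark" \
    --release-label emr-4.8.2 \
    --applications Name=Spark \
    --use-default-roles \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --configurations file://./myConfig.json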
but this doesn't work. I have also tried adding a bootstrap script:
#!/bin/bash
export PYTHONHASHSEED=123
but this also doesn't seem to do the trick.
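A quick way to sanity-check whether the variable is actually reaching the worker Python processes (a sketch, not output from my cluster; the app name is arbitrary):

import os
from pyspark import SparkContext

sc = SparkContext(appName="env-check")
# Ask a few tasks what their worker process sees for PYTHONHASHSEED.
seen = sc.parallelize(range(4), 4).map(
    lambda _: os.environ.get("PYTHONHASHSEED", "<not set>")).collect()
print(seen)
sc.stop()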
Answer
I believe that /usr/bin/python3 isn't picking up the environment variable PYTHONHASHSEED that you are defining in the cluster configuration under the spark-env scope.
You ought to use python34 instead of /usr/bin/python3 and set the configuration as follows:
[
  {
    "classification": "spark-defaults",
    "properties": {
      // [...]
    }
  },
  {
    "configurations": [
      {
        "classification": "export",
        "properties": {
          "PYSPARK_PYTHON": "python34",
          "PYTHONHASHSEED": "123"
        }
      }
    ],
    "classification": "spark-env",
    "properties": {
      // [...]
    }
  }
]
Now, let's test it. I defined a bash script that calls both pythons:
#!/bin/bash
echo "using python34"
for i in `seq 1 10`;
do
python -c "print(hash('foo'))";
done
echo "----------------------"
echo "using /usr/bin/python3"
for i in `seq 1 10`;
do
/usr/bin/python3 -c "print(hash('foo'))";
done
Verdict:
[hadoop@ip-10-0-2-182 ~]$ bash test.sh
using python34
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
-4177197833195190597
----------------------
using /usr/bin/python3
8867846273747294950
-7610044127871105351
6756286456855631480
-4541503224938367706
7326699722121877093
3336202789104553110
3462714165845110404
-5390125375246848302
-7753272571662122146
8018968546238984314
PS1: I am using the AMI release emr-4.8.2.
PS2: Taken from this answer.
EDIT: I have tested the following using pyspark.
16/11/22 07:16:56 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1479798580078_0001
16/11/22 07:16:56 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/
Using Python version 3.4.3 (default, Sep 1 2016 23:33:38)
SparkContext available as sc, HiveContext available as sqlContext.
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
>>> print(hash('foo'))
-2457967226571033580
I also created a simple application (simple_app.py):
from pyspark import SparkContext

sc = SparkContext(appName="simple-app")
# Note: hash() here is evaluated in the driver process, not on the executors.
numbers = [hash('foo') for i in range(10)]
print(numbers)
which also seems to run perfectly:
[hadoop@ip-*** ~]$ spark-submit --master yarn simple_app.py
Output (truncated):
[...]
16/11/22 07:28:42 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
[-5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594] // THE RELEVANT LINE IS HERE.
16/11/22 07:28:42 INFO SparkContext: Invoking stop() from shutdown hook
[...]
As you can see, it also works, returning the same hash each time.
EDIT 2: From the comments, it seems like you are trying to compute hashes on the executors and not on the driver, so you'll need to set the spark.executorEnv.PYTHONHASHSEED configuration inside your Spark application so that it can be propagated to the executors (this is one way to do it).
Note: setting environment variables for the executors works the same way with the YARN client; use spark.executorEnv.[EnvironmentVariableName].
Thus the following minimalist example with simple_app.py:
from pyspark import SparkContext, SparkConf

# The executor environment must be set on the SparkConf before the SparkContext is created.
conf = SparkConf().set("spark.executorEnv.PYTHONHASHSEED", "123")
sc = SparkContext(appName="simple-app", conf=conf)
# hash() now runs inside the executors (via map()), which all share the seed set above.
numbers = sc.parallelize(['foo'] * 10).map(lambda x: hash(x)).collect()
print(numbers)
And now let's test it again. Here is the truncated output:
16/11/22 14:14:34 INFO DAGScheduler: Job 0 finished: collect at /home/hadoop/simple_app.py:6, took 14.251514 s
[-5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594, -5869373620241885594]
16/11/22 14:14:34 INFO SparkContext: Invoking stop() from shutdown hook
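Side note: instead of setting it in code, the same property can also be passed on the command line through spark-submit's --conf flag, along these lines (a sketch, not from the run above):

spark-submit --master yarn --conf spark.executorEnv.PYTHONHASHSEED=123 simple_app.py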
I think that this covers it all.