Why doesn't Hadoop respect 'spark.hadoop.fs' properties set in pyspark?
Problem description
There are three properties in my spark-defaults.conf that I want to be able to set dynamically:
- spark.driver.maxResultSize
- spark.hadoop.fs.s3a.access.key
- spark.hadoop.fs.s3a.secret.key
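For reference, the same three properties would look roughly like this when set statically in spark-defaults.conf (the values are placeholders):

spark.driver.maxResultSize        5g
spark.hadoop.fs.s3a.access.key    <access>
spark.hadoop.fs.s3a.secret.key    <secret>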
Here is my attempt:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .setMaster(spark_master)
        .setAppName(app_name)
        .set('spark.driver.maxResultSize', '5g')
        .set('spark.hadoop.fs.s3a.access.key', '<access>')
        .set('spark.hadoop.fs.s3a.secret.key', '<secret>'))

spark = (SparkSession.builder
         .config(conf=conf)
         .getOrCreate())

print(spark.conf.get('spark.driver.maxResultSize'))
print(spark.conf.get('spark.hadoop.fs.s3a.access.key'))
print(spark.conf.get('spark.hadoop.fs.s3a.secret.key'))

spark.stop()
Here is the output I get:
5g
<access>
<secret>
However, when I try to read a CSV file on S3 using this configuration, I get a permission denied error.
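For context, the read attempt looks roughly like this (the bucket and path are hypothetical):

df = spark.read.csv('s3a://my-bucket/some/key.csv', header=True)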
If I set the credentials via environment variables, I am able to read the file.
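By "environment variables" I mean the standard AWS credential variables, set before the SparkSession (and its JVM) is created; a minimal sketch with placeholder values:

import os
os.environ['AWS_ACCESS_KEY_ID'] = '<access>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<secret>'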
Why doesn't Hadoop respect the credentials specified this way?
Update:
I am aware of other Q&As relating to setting Hadoop properties in pyspark.
Here I am trying to record, for posterity, how you can be fooled into thinking that you can set these properties dynamically via spark.hadoop.*, since that is the prefix you use to set them in spark-defaults.conf, and since you don't get an error when you try to set them this way.
Many sites tell you to "set the spark.hadoop.fs.s3a.access.key property", but don't specify that this only works if you set it statically in spark-defaults.conf, not dynamically in pyspark.
Recommended answer
It turns out that you can't specify Hadoop properties via:
spark.conf.set('spark.hadoop.<property>', <value>)
but you must instead use:
spark.sparkContext._jsc.hadoopConfiguration().set('<property>', <value>)
I believe you can only use spark.conf.set() for the properties listed on the Spark Configuration page.
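A minimal sketch of the working approach, using the same placeholder credentials and a hypothetical s3a path:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master(spark_master)
         .appName(app_name)
         .config('spark.driver.maxResultSize', '5g')
         .getOrCreate())

# Hadoop properties are set on the Hadoop configuration itself,
# without the 'spark.hadoop.' prefix.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.access.key', '<access>')
hadoop_conf.set('fs.s3a.secret.key', '<secret>')

df = spark.read.csv('s3a://my-bucket/some/key.csv', header=True)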