Why doesn't Hadoop respect 'spark.hadoop.fs' properties set in pyspark?


Problem description

There are three properties in my spark-defaults.conf that I want to be able to set dynamically:

  • spark.driver.maxResultSize
  • spark.hadoop.fs.s3a.access.key
  • spark.hadoop.fs.s3a.secret.key

Here is my attempt:

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .setMaster(spark_master)
        .setAppName(app_name)
        .set('spark.driver.maxResultSize', '5g')
        .set('spark.hadoop.fs.s3a.access.key', '<access>')
        .set('spark.hadoop.fs.s3a.secret.key', '<secret>')
        )

spark = SparkSession.builder.\
    config(conf=conf).\
    getOrCreate()

print(spark.conf.get('spark.driver.maxResultSize'))
print(spark.conf.get('spark.hadoop.fs.s3a.access.key'))
print(spark.conf.get('spark.hadoop.fs.s3a.secret.key'))

spark.stop()

Here is the output I get:

5g
<access>
<secret>

However, when I try to read a CSV file on S3 using this configuration, I get a permission-denied error.

If I set the credentials via environment variables, I am able to read the file.
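(For context, by "environment variables" I mean the standard AWS variable names that the S3A connector's default credential provider chain can pick up. A minimal sketch, assuming the variables are set before the driver JVM starts and are also visible to the executors; the bucket path is a hypothetical placeholder:)

import os

# Standard AWS credential variable names (not specific to this question);
# they must be set before the SparkSession (and its JVM) is created.
os.environ['AWS_ACCESS_KEY_ID'] = '<access>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<secret>'

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 's3a://my-bucket/data.csv' is a placeholder path for illustration.
df = spark.read.csv('s3a://my-bucket/data.csv', header=True)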

Why doesn't Hadoop respect the credentials specified this way?

Update:

I am aware of other Q&As relating to setting Hadoop properties in pyspark.

Here I am trying to record for posterity how you can be fooled into thinking that you can set them dynamically via spark.hadoop.*, since that is the name used to set these properties in spark-defaults.conf, and since you don't get an immediate error when you try to set them this way.

Many sites tell you to "set the spark.hadoop.fs.s3a.access.key property", but don't specify that this is only the case if you set it statically in spark-defaults.conf, not dynamically in pyspark.
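For comparison, the static form that does work is an entry in spark-defaults.conf, along the lines of (values are placeholders):

spark.driver.maxResultSize        5g
spark.hadoop.fs.s3a.access.key    <access>
spark.hadoop.fs.s3a.secret.key    <secret>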

Answer

It turns out that you can't specify Hadoop properties via:

spark.conf.set('spark.hadoop.<property>', <value>)

but instead you must use:

spark.sparkContext._jsc.hadoopConfiguration().set('<property>', <value>)

I believe you can only use spark.conf.set() for the properties listed on the Spark Configuration page.
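Putting it together, a minimal sketch of this approach (the S3 path is a hypothetical placeholder, and spark_master / app_name are assumed to be defined as in the question):

from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (SparkConf()
        .setMaster(spark_master)
        .setAppName(app_name)
        .set('spark.driver.maxResultSize', '5g'))

spark = (SparkSession.builder
         .config(conf=conf)
         .getOrCreate())

# Hadoop properties go on the Hadoop configuration itself,
# without the 'spark.hadoop.' prefix.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3a.access.key', '<access>')
hadoop_conf.set('fs.s3a.secret.key', '<secret>')

# 's3a://my-bucket/data.csv' is a placeholder path for illustration.
df = spark.read.csv('s3a://my-bucket/data.csv', header=True)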
