How to start sparksession in pyspark


Problem description


I want to change the default memory, executor, and core settings of a Spark session. The first code cell in my pyspark notebook on an HDInsight cluster in Jupyter looks like this:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("Juanita_Smith")\
    .config("spark.executor.instances", "2")\
    .config("spark.executor.cores", "2")\
    .config("spark.executor.memory", "2g")\
    .config("spark.driver.memory", "2g")\
    .getOrCreate()

On completion, I read the parameters back, and it looks like the statement worked.
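
For reference, a minimal sketch of one way to read the parameters back from the live session, using the standard PySpark runtime-config API (assumes the spark variable from the cell above):

# Read the settings back from the running session.
for key in ("spark.executor.instances",
            "spark.executor.cores",
            "spark.executor.memory",
            "spark.driver.memory"):
    print(key, "=", spark.conf.get(key, "<not set>"))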

However, if I look in YARN, the settings have in fact not taken effect.

Which settings or commands do I need to use to make the session configuration take effect?

Thank you in advance for your help.

Solution

By the time your notebook kernel has started, the SparkSession is already created with parameters defined in a kernel configuration file. To change this, you will need to update or replace the kernel configuration file, which I believe is usually somewhere like <jupyter home>/kernels/<kernel name>/kernel.json.
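
As a minimal sketch of why the notebook code above appears to work: getOrCreate() returns the session the kernel already started, and the builder's values are recorded in that session's conf without changing the resources YARN already allocated (assumes a kernel that pre-creates a session):

from pyspark.sql import SparkSession

# getOrCreate() does not launch a new application when a session
# already exists; it returns the running one. The .config() values
# are recorded in the session conf (so reading them back looks
# right), but the application's executors were sized at startup.
configured = (
    SparkSession.builder
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)
print(configured is SparkSession.builder.getOrCreate())  # True: same session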

Update

If you have access to the machine hosting your Jupyter server, you can find the location of the current kernel configurations using jupyter kernelspec list. You can then either edit one of the pyspark kernel configurations, or copy it to a new file and edit that. For your purposes, you will need to add the following arguments to the PYSPARK_SUBMIT_ARGS:

"PYSPARK_SUBMIT_ARGS": "--conf spark.executor.instances=2 --conf spark.executor.cores=2 --conf spark.executor.memory=2g --conf spark.driver.memory=2g"
