How to profile PySpark jobs
Problem description
I want to understand profiling in PySpark code.
Following this PR: https://github.com/apache/spark/pull/2351
>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
284 function calls (276 primitive calls) in 0.001 seconds
Ordered by: internal time, cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
4 0.000 0.000 0.000 0.000 serializers.py:198(load_stream)
4 0.000 0.000 0.000 0.000 {reduce}
12/4 0.000 0.000 0.001 0.000 rdd.py:2092(pipeline_func)
4 0.000 0.000 0.000 0.000 {cPickle.loads}
4 0.000 0.000 0.000 0.000 {cPickle.dumps}
104 0.000 0.000 0.000 0.000 rdd.py:852(<genexpr>)
8 0.000 0.000 0.000 0.000 serializers.py:461(read_int)
12 0.000 0.000 0.000 0.000 rdd.py:303(func)
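(Side note: rather than printing to stdout, the collected stats can also be written to disk for later analysis with the standard pstats module. A minimal sketch, assuming SparkContext.dump_profiles is available in your Spark version; the output directory is hypothetical and the per-RDD file name pattern may vary between versions:)
>>> sc.dump_profiles("/tmp/pyspark_profiles")  # writes one pstats file per profiled RDD
>>> import pstats
>>> stats = pstats.Stats("/tmp/pyspark_profiles/rdd_1.pstats")
>>> stats.sort_stats("cumulative").print_stats(10)  # top 10 entries by cumulative time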
The above works great. But if I do something like the following:
from pyspark.sql import HiveContext
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf().setAppName("myapp").set("spark.python.profile","true")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
df = sqlContext.sql("select * from myhivetable")
df.count()
sc.show_profiles()
This does not give me anything. I get the count, but show_profiles() gives me None.
Any help is appreciated.
Answer
There is no Python code to profile when you use Spark SQL: the only Python involved is the thin call into the Scala engine, and everything else executes on the Java Virtual Machine, outside the reach of the Python profiler.
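To get profile output from a SQL job, you have to route rows through the Python workers again, for example by dropping down to the RDD API. A minimal sketch building on the question's code (the lambda is a hypothetical stand-in for real per-row Python logic):

# Forcing rows through a Python function gives the profiler something to measure
df = sqlContext.sql("select * from myhivetable")
df.rdd.map(lambda row: row.asDict()).count()  # rows are deserialized and processed in Python workers
sc.show_profiles()  # now prints a profile for the newly created RDD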