How to profile PySpark jobs
Problem description
I want to understand profiling in PySpark code.
Following this PR: https://github.com/apache/spark/pull/2351
>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
284 function calls (276 primitive calls) in 0.001 seconds
Ordered by: internal time, cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
4 0.000 0.000 0.000 0.000 serializers.py:198(load_stream)
4 0.000 0.000 0.000 0.000 {reduce}
12/4 0.000 0.000 0.001 0.000 rdd.py:2092(pipeline_func)
4 0.000 0.000 0.000 0.000 {cPickle.loads}
4 0.000 0.000 0.000 0.000 {cPickle.dumps}
104 0.000 0.000 0.000 0.000 rdd.py:852(<genexpr>)
8 0.000 0.000 0.000 0.000 serializers.py:461(read_int)
12 0.000 0.000 0.000 0.000 rdd.py:303(func)
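(Side note: rather than printing to stdout, the collected stats can also be written to disk for later analysis with the standard pstats module. A minimal sketch, assuming SparkContext.dump_profiles is available in your Spark version; the output directory is hypothetical and the per-RDD file name pattern may vary between versions:)
>>> sc.dump_profiles("/tmp/pyspark_profiles")  # writes one pstats file per profiled RDD
>>> import pstats
>>> stats = pstats.Stats("/tmp/pyspark_profiles/rdd_1.pstats")
>>> stats.sort_stats("cumulative").print_stats(10)  # top 10 entries by cumulative time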
The above works great. But if I do something like the following:
from pyspark.sql import HiveContext
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf().setAppName("myapp").set("spark.python.profile","true")
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)
df = sqlContext.sql("select * from myhivetable")
df.count()
sc.show_profiles()
This does not give me anything. I get the count, but show_profiles() gives me None.
Any help is appreciated.
Answer
There is no Python code to profile when you use Spark SQL: the only Python involved is the thin call into the Scala engine, and everything else executes on the Java Virtual Machine, outside the reach of the Python profiler.
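To get profile output from a SQL job, you have to route rows through the Python workers again, for example by dropping down to the RDD API. A minimal sketch building on the question's code (the lambda is a hypothetical stand-in for real per-row Python logic):

# Forcing rows through a Python function gives the profiler something to measure
df = sqlContext.sql("select * from myhivetable")
df.rdd.map(lambda row: row.asDict()).count()  # rows are deserialized and processed in Python workers
sc.show_profiles()  # now prints a profile for the newly created RDD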