Performance Between SQL and withColumn
Question
Suppose I create the following dataframe:
import numpy as np
import pandas as pd
dt = pd.DataFrame(np.array([[1,5],[2,12],[4,17]]),columns=['a','b'])
df = spark.createDataFrame(dt)  # assumes an active SparkSession named `spark`
I want to create a third column, c, that is the sum of these two columns. I have the following two ways to do so.
The withColumn() method in Spark:
df1 = df.withColumn('c', df.a + df.b)
Or using SQL:
df.createOrReplaceTempView('mydf')
df2 = spark.sql('select *, a + b as c from mydf')
While both yield the same results, which method is computationally faster?
Also, how does sql compare to a spark user defined function?
Recommended answer
While both yield the same results, which method is computationally faster?
Look at the execution plans:
df1.explain()
#== Physical Plan ==
#*(1) Project [a#0L, b#1L, (a#0L + b#1L) AS c#4L]
#+- Scan ExistingRDD[a#0L,b#1L]
df2.explain()
#== Physical Plan ==
#*(1) Project [a#0L, b#1L, (a#0L + b#1L) AS c#8L]
#+- Scan ExistingRDD[a#0L,b#1L]
Since these are the same, the two methods are identical.
Generally speaking, there is no computational advantage to using either withColumn or spark-sql over the other. Both go through the same Catalyst optimizer, so if the code is written properly, the underlying computations will be identical.
There may be some cases where it's easier to express something using spark-sql, for example if you want to use a column value as a parameter to a Spark function.
Also, how does sql compare to a spark user defined function?
Take a look at this post: Spark functions vs UDF performance?