Best way to get the max value in a Spark dataframe column

Problem description

I'm trying to figure out the best way to get the largest value in a Spark dataframe column.

Consider the following example:

df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
df.show()

which creates:

+---+---+
|  A|  B|
+---+---+
|1.0|4.0|
|2.0|5.0|
|3.0|6.0|
+---+---+

My goal is to find the largest value in column A (by inspection, this is 3.0). Using PySpark, here are four approaches I can think of:

# Method 1: Use describe()
float(df.describe("A").filter("summary = 'max'").select("A").first().asDict()['A'])

# Method 2: Use SQL
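# (Note: registerTempTable is deprecated since Spark 2.0;
#  createOrReplaceTempView is the modern replacement.)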
df.registerTempTable("df_table")
spark.sql("SELECT MAX(A) as maxval FROM df_table").first().asDict()['maxval']

# Method 3: Use groupby()
df.groupby().max('A').first().asDict()['max(A)']

# Method 4: Convert to RDD
df.select("A").rdd.max()[0]

Each of the above gives the right answer, but in the absence of a Spark profiling tool I can't tell which is best.

Any ideas from either intuition or empiricism on which of the above methods is most efficient in terms of Spark runtime or resource usage, or whether there is a more direct method than the ones above?
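
One crude way to compare these without a profiler is to wall-clock each expression on the driver. A minimal sketch, assuming the df, spark, and df_table objects defined above; this measures end-to-end driver latency only, not cluster resource usage:

import time

def timed(label, fn):
    # Wall-clock a single action; results are indicative only, since
    # JVM warm-up and caching can skew one-off measurements.
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {result} ({time.perf_counter() - start:.3f}s)")

timed("describe", lambda: float(
    df.describe("A").filter("summary = 'max'").select("A").first()[0]))
timed("sql", lambda: spark.sql(
    "SELECT MAX(A) AS maxval FROM df_table").first()["maxval"])
timed("groupby", lambda: df.groupby().max("A").first()[0])
timed("rdd", lambda: df.select("A").rdd.max()[0])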

Answer

>>> df1.show()
+-----+--------------------+--------+----------+-----------+
|floor|           timestamp|     uid|         x|          y|
+-----+--------------------+--------+----------+-----------+
|    1|2014-07-19T16:00:...|600dfbe2| 103.79211|71.50419418|
|    1|2014-07-19T16:00:...|5e7b40e1| 110.33613|100.6828393|
|    1|2014-07-19T16:00:...|285d22e4|110.066315|86.48873585|
|    1|2014-07-19T16:00:...|74d917a1| 103.78499|71.45633073|

>>> row1 = df1.agg({"x": "max"}).collect()[0]
>>> print(row1)
Row(max(x)=110.33613)
>>> print(row1["max(x)"])
110.33613

This is almost the same as method 3, but it seems the asDict() in method 3 can be removed.
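
Indeed, a Row can be indexed by field name or position, so the asDict() is unnecessary. A minimal sketch against the df from the question; the pyspark.sql.functions variant at the end is an extra option not mentioned in the original answer:

from pyspark.sql import functions as F

# Method 3 without asDict(): Row supports lookup by field name
max_a = df.groupby().max("A").first()["max(A)"]

# A more direct equivalent using the built-in aggregate function
max_a = df.select(F.max("A")).first()[0]

print(max_a)  # 3.0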
