Best way to get the max value in a Spark dataframe column

Problem Description

I'm trying to figure out the best way to get the largest value in a Spark dataframe column.

Consider the following example:

df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
df.show()

which creates:

+---+---+
|  A|  B|
+---+---+
|1.0|4.0|
|2.0|5.0|
|3.0|6.0|
+---+---+

My goal is to find the largest value in column A (by inspection, this is 3.0). Using PySpark, here are four approaches I can think of:

# Method 1: Use describe()
float(df.describe("A").filter("summary = 'max'").select("A").first().asDict()['A'])

# Method 2: Use SQL
df.registerTempTable("df_table")  # deprecated since Spark 2.0; createOrReplaceTempView is the newer API
spark.sql("SELECT MAX(A) as maxval FROM df_table").first().asDict()['maxval']

# Method 3: Use groupby()
df.groupby().max('A').first().asDict()['max(A)']

# Method 4: Convert to RDD
df.select("A").rdd.max()[0]

Each of the above gives the right answer, but in the absence of a Spark profiling tool I can't tell which is best.

Any ideas from either intuition or empiricism on which of the above methods is most efficient in terms of Spark runtime or resource usage, or whether there is a more direct method than the ones above?
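
In the absence of a Spark profiling tool, one rough empirical check is simply to wall-clock each approach from the driver. Below is a minimal sketch (time_it is an ad-hoc helper, not a Spark API), assuming the df and df_table defined above; note that JVM warm-up and caching make single runs noisy, so averaging over repeated runs is more informative:

import time

def time_it(label, fn, n=10):
    # Run the action n times and report the average wall-clock seconds per run.
    start = time.time()
    for _ in range(n):
        fn()
    print("%s: %.4f s/run" % (label, (time.time() - start) / n))

time_it("describe", lambda: float(df.describe("A").filter("summary = 'max'")
                                  .select("A").first().asDict()['A']))
time_it("sql", lambda: spark.sql("SELECT MAX(A) as maxval FROM df_table")
                            .first().asDict()['maxval'])
time_it("groupby", lambda: df.groupby().max('A').first().asDict()['max(A)'])
time_it("rdd", lambda: df.select("A").rdd.max()[0])

Intuitively, Methods 2-4 each reduce to a single aggregation job over column A, while Method 1 computes every describe() statistic (count, mean, stddev, min, max), so it does extra work.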

Recommended Answer

>>> df1.show()
+-----+--------------------+--------+----------+-----------+
|floor|           timestamp|     uid|         x|          y|
+-----+--------------------+--------+----------+-----------+
|    1|2014-07-19T16:00:...|600dfbe2| 103.79211|71.50419418|
|    1|2014-07-19T16:00:...|5e7b40e1| 110.33613|100.6828393|
|    1|2014-07-19T16:00:...|285d22e4|110.066315|86.48873585|
|    1|2014-07-19T16:00:...|74d917a1| 103.78499|71.45633073|

>>> row1 = df1.agg({"x": "max"}).collect()[0]
>>> print(row1)
Row(max(x)=110.33613)
>>> print(row1["max(x)"])
110.33613

The answer is almost the same as Method 3, but it seems the asDict() in Method 3 can be removed.
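
As a quick check against the example df from the question, a Row can be indexed by field name directly, so Method 3 does work without the asDict() call; the equivalent column-function spelling via pyspark.sql.functions gives the same value:

from pyspark.sql import functions as F

# Method 3 without asDict(): a Row can be indexed by field name.
df.groupby().max('A').first()['max(A)']   # 3.0

# Equivalent agg() spellings give the same result.
df.agg({"A": "max"}).first()[0]           # 3.0
df.agg(F.max('A')).first()[0]             # 3.0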
