PySpark: adding a new column with values per group

Views: 694 | Published: 2017/3/26 4:13:37 | Tags: apache-spark, dataframe, group-by, pyspark

Question

Suppose I have the following dataset:

a | b   
1 | 0.4 
1 | 0.8 
1 | 0.5 
2 | 0.4
2 | 0.1

I would like to add a new column called "label" where the values are determined locally for each group of values in a. The highest value of b in a group a is labeled 1 and all others are labeled 0.

The output would look like this:

a | b   | label
1 | 0.4 | 0
1 | 0.8 | 1
1 | 0.5 | 0
2 | 0.4 | 1
2 | 0.1 | 0

How can I do this efficiently using PySpark?

Recommended answer

You can do it with window functions. First you'll need a couple of imports:

from pyspark.sql.functions import desc, row_number, when
from pyspark.sql.window import Window

and a window definition:

w = Window.partitionBy("a").orderBy(desc("b"))

Finally, you use these:

df.withColumn("label", when(row_number().over(w) == 1, 1).otherwise(0))
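
To make the semantics concrete, here is a rough plain-Python sketch (not Spark code) of what the expression above computes: rows are partitioned by a, each partition is sorted by b in descending order, rows are numbered from 1, and only row number 1 gets the label 1. The helper name label_top_per_group is hypothetical and exists only for this illustration.

```python
from itertools import groupby
from operator import itemgetter


def label_top_per_group(rows):
    """Emulate row_number() over (partition by a, order by b desc) == 1."""
    labeled = []
    # Partition: sort by a so that groupby sees each group contiguously.
    by_a = sorted(rows, key=itemgetter(0))
    for _, group in groupby(by_a, key=itemgetter(0)):
        # Order the partition by b descending and number its rows from 1.
        ordered = sorted(group, key=itemgetter(1), reverse=True)
        for row_number, (a, b) in enumerate(ordered, start=1):
            labeled.append((a, b, 1 if row_number == 1 else 0))
    return labeled


rows = [(1, 0.4), (1, 0.8), (1, 0.5), (2, 0.4), (2, 0.1)]
result = label_top_per_group(rows)
```

Spark does the same thing per partition in a distributed fashion, so a driver-side loop like this is only useful for reasoning about the result, not for real data.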

For example, for the data:

df = sc.parallelize([
    (1, 0.4), (1, 0.8), (1, 0.5), (2, 0.4), (2, 0.1)
]).toDF(["a", "b"])

the result is:

+---+---+-----+
|  a|  b|label|
+---+---+-----+
|  1|0.8|    1|
|  1|0.5|    0|
|  1|0.4|    0|
|  2|0.4|    1|
|  2|0.1|    0|
+---+---+-----+
