Replace values in a PySpark Dataframe group with max row values


Problem Description

We have this PySpark Dataframe:

+---+--------+-----------+
| id|language|    summary|
+---+--------+-----------+
|  2|    Java|      Great|
|  4|  Python|    Awesome|
|  7|  Python|    Amazing|
|  9|  Python| Incredible|
|  3|   Scala|       Good|
|  6|   Scala|  Fantastic|
+---+--------+-----------+

This issue is a bit convoluted, so please bear with me. For rows with the same language column value, I want to adjust the summary column using the id as a tie breaker: the rows sharing a language should all take on the summary of the row with the max id for that language. So, for example, for Python I want to replace all the summaries with "Incredible", since the row with "Incredible" has the highest id for Python. Same for Scala. It would result in this:

+---+--------+-----------+
| id|language|    summary|
+---+--------+-----------+
|  2|    Java|      Great|
|  4|  Python| Incredible|
|  7|  Python| Incredible|
|  9|  Python| Incredible|
|  3|   Scala|  Fantastic|
|  6|   Scala|  Fantastic|
+---+--------+-----------+

We can assume that the ids are always going to be unique for each language group. We will never have the same id twice for one language, although we may see the same id for different languages.
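Before turning to Spark, the intended logic can be sketched in plain Python (a minimal illustration of the transformation, not the PySpark answer itself):

```python
# Sample rows mirroring the question's Dataframe: (id, language, summary).
rows = [
    (2, "Java", "Great"),
    (4, "Python", "Awesome"),
    (7, "Python", "Amazing"),
    (9, "Python", "Incredible"),
    (3, "Scala", "Good"),
    (6, "Scala", "Fantastic"),
]

# For each language, remember the (id, summary) of the max-id row seen so far.
best = {}
for id_, lang, summary in rows:
    if lang not in best or id_ > best[lang][0]:
        best[lang] = (id_, summary)

# Rewrite every row's summary with its language group's winning summary.
result = [(id_, lang, best[lang][1]) for id_, lang, _ in rows]
```

`result` then matches the expected output table above, with every Python row carrying "Incredible" and every Scala row carrying "Fantastic".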

Solution

You can get the summary corresponding to the maximum id for each language using a window function:

from pyspark.sql import functions as F, Window

# For each language partition, take the max over (id, summary) structs.
# Structs compare field by field, so the max is the struct from the row
# with the highest id; ['summary'] then extracts that row's summary.
df2 = df.withColumn(
    'summary',
    F.max(F.struct('id', 'summary')).over(Window.partitionBy('language'))['summary']
)

df2.show()
+---+--------+----------+
| id|language|   summary|
+---+--------+----------+
|  3|   Scala| Fantastic|
|  6|   Scala| Fantastic|
|  4|  Python|Incredible|
|  7|  Python|Incredible|
|  9|  Python|Incredible|
|  2|    Java|     Great|
+---+--------+----------+
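The struct trick works because structs (like Python tuples) compare element by element: the max over (id, summary) pairs is decided by id first, so the summary that rides along in the winning struct is exactly the max-id row's summary. A tiny tuple analogy:

```python
# (id, summary) pairs for the Python language group from the example.
python_rows = [(4, "Awesome"), (7, "Amazing"), (9, "Incredible")]

# Tuples compare lexicographically, so max() is decided by id first;
# the paired summary belongs to the max-id row.
winner = max(python_rows)
group_summary = winner[1]  # the summary to broadcast across the group
```

This is the same comparison semantics Spark applies when `F.max` is given a struct column.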

