Apache Spark update a row in an RDD or Dataset based on another row


Question

I'm trying to figure out how I can update some rows based on another row.

For example, I have some data like:

Id | username | rating | city
--------------------------------
1, philip, 2.0, montreal, ...
2, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...

I want to update the users in the same city to the same groupId (either 1 or 2):

Id | username | rating | city
--------------------------------
1, philip, 2.0, montreal, ...
1, john, 4.0, montreal, ...
3, charles, 2.0, texas, ...

How can I achieve this in my RDD or Dataset?

So, just for the sake of completeness: what if the Id is a String? The dense rank won't work then, will it? For example:

Id | username | rating | city
--------------------------------
a, philip, 2.0, montreal, ...
b, john, 4.0, montreal, ...
c, charles, 2.0, texas, ...

With the following result:

grade | username | rating | city
--------------------------------
a, philip, 2.0, montreal, ...
a, john, 4.0, montreal, ...
c, charles, 2.0, texas, ...

Answer

A clean way to do this would be to use dense_rank() from the Window functions. It enumerates the unique values in your Window column. Because city is a String column, these ranks will increase alphabetically.

import org.apache.spark.sql.functions.dense_rank
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // enables the $"colName" column syntax

val df = spark.createDataFrame(Seq(
  (1, "philip", 2.0, "montreal"),
  (2, "john", 4.0, "montreal"),
  (3, "charles", 2.0, "texas"))).toDF("Id", "username", "rating", "city")

// Note: a Window with orderBy but no partitionBy moves all rows to a
// single partition, so Spark will emit a performance warning on big data.
val w = Window.orderBy($"city")

// dense_rank() gives rows with the same city the same rank, with no gaps
// between consecutive ranks (montreal -> 1, texas -> 2).
df.withColumn("id", dense_rank().over(w)).show()

+---+--------+------+--------+
| id|username|rating|    city|
+---+--------+------+--------+
|  1|  philip|   2.0|montreal|
|  1|    john|   4.0|montreal|
|  2| charles|   2.0|   texas|
+---+--------+------+--------+
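As for the follow-up about String Ids: dense_rank() always produces integers, so if the goal is to reuse an existing Id as the group label (a, a, c rather than 1, 1, 2), one option is to take the minimum Id per city instead of a rank. Below is a minimal sketch of that idea; the names df2, byCity, and grade are illustrative, and it assumes the lexicographically smallest Id in each city is an acceptable group label.

import org.apache.spark.sql.functions.min
import org.apache.spark.sql.expressions.Window
import spark.implicits._

val df2 = spark.createDataFrame(Seq(
  ("a", "philip", 2.0, "montreal"),
  ("b", "john", 4.0, "montreal"),
  ("c", "charles", 2.0, "texas"))).toDF("Id", "username", "rating", "city")

// Partitioning by city lets min() pick the smallest Id within each city,
// so every row in a city gets the same String label.
val byCity = Window.partitionBy($"city")
df2.withColumn("grade", min($"Id").over(byCity)).show()

Unlike the orderBy window above, partitionBy does not funnel all rows through a single partition, and min() compares String columns lexicographically, so both montreal rows should get grade a and the texas row should get grade c.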

