Merge Multiple Records in a Dataframe based on a key in scala spark
Question
I have a dataframe whose records are identified by a key, but the same key can appear in multiple records. My goal is to merge all the records that share a key, as follows.
Let's suppose my input dataframe looks something like this:
key | value1 | value2 | value3
-------------------------------
a | 1 | null | null
a | null | 2 | null
a | null | null | 3
and I want the output after merging on 'a' to look like this:
key | value1 | value2 | value3
-------------------------------
a | 1 | 2 | 3
One thing I am sure of: each record for the key 'a' carries exactly one of the three values, so there are no conflicts to resolve.
Thanks
Recommended Answer
If you know there is only one non-null record per group (or you don't care which one you'll get), you can use first:
import org.apache.spark.sql.functions.{first, last}
import spark.implicits._  // assumes a SparkSession named spark is in scope (e.g. spark-shell)

val df = Seq(
  ("a", Some(1), None, None), ("a", None, Some(2), None),
  ("a", None, None, Some(3))
).toDF("key", "value1", "value2", "value3")

df.groupBy("key").agg(
  first("value1", ignoreNulls = true) as "value1",
  first("value2", ignoreNulls = true) as "value2",
  first("value3", ignoreNulls = true) as "value3"
).show
// +---+------+------+------+
// |key|value1|value2|value3|
// +---+------+------+------+
// | a| 1| 2| 3|
// +---+------+------+------+
or last:
df.groupBy("key").agg(
  last("value1", ignoreNulls = true) as "value1",
  last("value2", ignoreNulls = true) as "value2",
  last("value3", ignoreNulls = true) as "value3"
).show
// +---+------+------+------+
// |key|value1|value2|value3|
// +---+------+------+------+
// | a| 1| 2| 3|
// +---+------+------+------+
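If the number of value columns grows, the same aggregation can be built programmatically instead of listing each column by hand. The following is a sketch, not part of the original answer; it assumes a locally built SparkSession and that every non-key column should be collapsed with first(ignoreNulls = true):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.first

val spark = SparkSession.builder.master("local[1]").appName("merge-by-key").getOrCreate()
import spark.implicits._

val df = Seq(
  ("a", Some(1), None, None), ("a", None, Some(2), None),
  ("a", None, None, Some(3))
).toDF("key", "value1", "value2", "value3")

// Build one first(..., ignoreNulls = true) aggregate per non-key column,
// so the merge does not need to name value1/value2/value3 explicitly.
val valueCols = df.columns.filterNot(_ == "key")
val aggs = valueCols.map(c => first(c, ignoreNulls = true).as(c))

val merged = df.groupBy("key").agg(aggs.head, aggs.tail: _*)
merged.show  // same single merged row as the hand-written version above
```

Because agg takes a head column plus varargs, the head/tail split is the usual idiom for passing a dynamically built list of aggregates.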