Aggregate rows of Spark DataFrame to String after groupby
Problem description
I'm quite new to both Spark and Scala and could really use a hint to solve my problem. I have two DataFrames, A (columns id and name) and B (columns id and text), and would like to join them, group by id, and combine all rows of text into a single String:
A
+--------+--------+
|      id|    name|
+--------+--------+
|       0|       A|
|       1|       B|
+--------+--------+
B
+--------+--------+
|      id|    text|
+--------+--------+
|       0|     one|
|       0|     two|
|       1|   three|
|       1|    four|
+--------+--------+
Desired result:
+--------+--------+----------+
|      id|    name|     texts|
+--------+--------+----------+
|       0|       A|   one two|
|       1|       B|three four|
+--------+--------+----------+
So far I'm trying the following:
var C = A.join(B, "id")
var D = C.groupBy("id", "name").agg(collect_list("text") as "texts")
This works quite well, except that my texts column is an Array of Strings instead of a String. I would very much appreciate some help.
Recommended answer
I'm just adding a couple of small function calls to your code to give the right solution, which is
A.join(B, Seq("id"), "left").orderBy("id").groupBy("id", "name").agg(concat_ws(" ", collect_list("text")) as "texts")
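For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The object name AggregateTexts, the helper aggregateTexts, the local SparkSession settings, and the sample rows are all assumptions made for illustration; only the join, groupBy, collect_list, and concat_ws calls come from the answer above. The sketch leaves out the orderBy before the groupBy, because as far as I know collect_list does not guarantee element order after a shuffle, so "one two" may come back as "two one".

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{collect_list, concat_ws}

object AggregateTexts {
  // Hypothetical helper: joins a (id, name) with b (id, text) and
  // concatenates all text values per (id, name) into one space-separated String.
  def aggregateTexts(a: DataFrame, b: DataFrame): DataFrame =
    a.join(b, Seq("id"), "left")
      .groupBy("id", "name")
      // collect_list builds an array of strings; concat_ws joins it with " ".
      .agg(concat_ws(" ", collect_list("text")).as("texts"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("aggregate-texts-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample data mirroring tables A and B from the question.
    val A = Seq((0, "A"), (1, "B")).toDF("id", "name")
    val B = Seq((0, "one"), (0, "two"), (1, "three"), (1, "four")).toDF("id", "text")

    // Sort only the final rows for display; this does not affect
    // the order of words inside each texts value.
    aggregateTexts(A, B).orderBy("id").show()

    spark.stop()
  }
}
```

Because this is a left join, an id in A with no matching row in B should still appear in the result, with an empty texts string, since collect_list skips nulls. If the word order inside texts matters, you would need an explicit ordering column and something like sorting the collected values before joining them; the data in the question has no such column, so that part is left open here.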