How to merge rows in a Spark dataset to combine a string column


Problem description

I need to merge two or more rows in a dataset into one. The grouping has to be done based on an id column. The column to be merged is a string. I need to get a comma-separated string in the merged column. How do I achieve this in Java? Input rows:

col1,col2  
1,abc  
2,pqr  
1,abc1  
3,xyz
2,pqr1

Expected output:

col1, col2  
1, "abc,abc1"  
2, "pqr,pqr1"  
3, xyz  

Recommended answer

To aggregate two separate columns:

your_data_frame
    .withColumn("aggregated_column", concat_ws(",", col("col1"), col("col2")));

Just in case, here is what to import besides the usual stuff:

import static org.apache.spark.sql.functions.*;
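
For reference, a minimal sketch of what this produces (the frame and column names come from the snippet above; Dataset&lt;Row&gt; and show() are standard Spark API). One detail worth knowing: concat_ws skips null values instead of nulling out the whole result.

// Given a row (1, "abc"), aggregated_column holds "1,abc":
// concat_ws joins the listed columns' values within each row,
// skipping nulls rather than returning null.
Dataset<Row> result = your_data_frame
    .withColumn("aggregated_column",
        concat_ws(",", col("col1"), col("col2")));
result.show();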

Edit

If you want to aggregate an arbitrary number of columns that you know by name, you can do it this way:

import java.util.Arrays;
import java.util.stream.Collectors;
import org.apache.spark.sql.Column;

String[] column_names = {"c1", "c2", "c3"};
// Turn the column names into a Column[] that concat_ws accepts as varargs.
Column[] columns = Arrays.asList(column_names)
            .stream().map(x -> col(x))
            .collect(Collectors.toList())
            .toArray(new Column[0]);
data_frame
    .withColumn("agg", concat_ws(",", columns));

Edit #2: group and aggregate

In case you want to group by a column "ID" and aggregate another column, you can do it this way:

dataframe
    .groupBy("ID")
    // collect_list gathers each group's values; concat_ws joins them with commas.
    .agg(concat_ws(",", collect_list(col("col1"))));
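
Putting it together for the question's data, here is a minimal end-to-end sketch (the MergeRows class name, the local master, and the schema construction are illustrative assumptions; note also that collect_list does not guarantee the order of values within a group):

import static org.apache.spark.sql.functions.*;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class MergeRows {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("merge-rows")
                .master("local[*]")
                .getOrCreate();

        // The question's input rows.
        List<Row> rows = Arrays.asList(
                RowFactory.create(1, "abc"),
                RowFactory.create(2, "pqr"),
                RowFactory.create(1, "abc1"),
                RowFactory.create(3, "xyz"),
                RowFactory.create(2, "pqr1"));
        StructType schema = new StructType()
                .add("col1", DataTypes.IntegerType)
                .add("col2", DataTypes.StringType);
        Dataset<Row> df = spark.createDataFrame(rows, schema);

        // Group by the id column and join each group's strings with commas.
        Dataset<Row> merged = df
                .groupBy("col1")
                .agg(concat_ws(",", collect_list(col("col2"))).alias("col2"));

        // Expected rows (order within each group is not guaranteed):
        // 1 -> "abc,abc1", 2 -> "pqr,pqr1", 3 -> "xyz"
        merged.show();
        spark.stop();
    }
}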

