Spark Data set transformation to array


Problem Description

I have a dataset like the one below, with values of col1 repeating multiple times and unique values of col2. This original dataset can have almost a billion rows, so I do not want to use collect or collect_list because it will not scale for my use case.

Original dataset:

+----+----+
|col1|col2|
+----+----+
|  AA|  11|
|  BB|  21|
|  AA|  12|
|  AA|  13|
|  BB|  22|
|  CC|  33|
+----+----+

I want to transform the dataset into the following array format, with newColumn holding an array of the col2 values.

Transformed dataset:

+----+----------+
|col1| newColumn|
+----+----------+
|  AA|[11,12,13]|
|  BB|   [21,22]|
|  CC|      [33]|
+----+----------+

I have seen this solution, but it uses collect_list and will not scale out on big datasets.
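For reference, the kind of solution being referred to typically boils down to a plain groupBy with collect_list; a minimal sketch, assuming the original DataFrame is named df and uses the columns from the example above:

from pyspark.sql import functions as F

# Straightforward approach: group by col1 and collect all col2 values into an array.
df_result = df.groupBy("col1").agg(F.collect_list("col2").alias("newColumn"))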

Recommended Answer

Using Spark's built-in functions is always the best approach, and I see no problem with using the collect_list function. As long as you have sufficient memory, this is the best way. One way to optimize your job is to save your data as a Parquet table bucketed by the grouping column (col1). Better still, also partition it by some column that distributes the data evenly.

For example:

from pyspark.sql import functions as F
from pyspark.sql.functions import col

df_stored = ...  # load your data from CSV, Parquet, or any other format
spark.catalog.setCurrentDatabase(database_name)
# Write the data as a partitioned, bucketed Parquet table so rows with the same col1 land in the same bucket files.
df_stored.write.mode("overwrite").format("parquet").partitionBy(part_col).bucketBy(10, "col1").option("path", savepath).saveAsTable(tablename)
df_analysis = spark.table(tablename)
df_aggreg = df_analysis.groupby("col1").agg(F.collect_list(col("col2")))

This will speed up the aggregation and avoid a lot of shuffle. Try it out.
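As a quick sanity check (reusing the df_aggreg name from the snippet above), you can inspect the physical plan; if Spark picks up the bucketing on col1, there should be no Exchange (shuffle) step before the aggregation:

# If the bucketing on col1 is used, the plan printed here should contain
# no Exchange (shuffle) step before the partial/final aggregation.
df_aggreg.explain()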
