Spark Dataset transformation to array
Problem description
I have a dataset like the one below, where values of col1 repeat multiple times and values of col2 are unique. The original dataset can have almost a billion rows, so I do not want to use collect or collect_list, as it will not scale out for my use case.
Original dataset:
+------+------+
| col1 | col2 |
+------+------+
| AA   | 11   |
| BB   | 21   |
| AA   | 12   |
| AA   | 13   |
| BB   | 22   |
| CC   | 33   |
+------+------+
I want to transform the dataset into the following format, where newColumn is an array of the col2 values for each col1.
Transformed dataset:
+------+------------+
| col1 | newColumn  |
+------+------------+
| AA   | [11,12,13] |
| BB   | [21,22]    |
| CC   | [33]       |
+------+------------+
I have seen this solution, but it uses collect_list and will not scale out on big datasets.
Recommended answer
Using Spark's built-in functions is generally the best approach. I see no problem with using the collect_list function; as long as you have sufficient memory, it would be the best way. One way to optimize your job is to save your data as Parquet, bucket it by col1, and save it as a table. Better still, also partition it by a column that distributes the data evenly.
For example,
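from pyspark.sql import functions as F

# Load your data from CSV, Parquet, or any other format (source_path is a placeholder)
df_stored = spark.read.parquet(source_path)
spark.catalog.setCurrentDatabase(database_name)
# Save as a Parquet table, partitioned by part_col and bucketed by col1 into 10 buckets
df_stored.write.mode("overwrite").format("parquet").partitionBy(part_col).bucketBy(10, "col1").option("path", savepath).saveAsTable(tablename)
df_analysis = spark.table(tablename)
# Group on the bucketing column and collect col2 into an array per col1
df_aggreg = df_analysis.groupBy("col1").agg(F.collect_list(F.col("col2")).alias("newColumn"))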
This should speed up the aggregation and avoid a lot of shuffling. Try it out.
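To illustrate why bucketing by col1 helps, here is a minimal plain-Python sketch of the idea, not of Spark's actual implementation. It assumes a stand-in hash (zlib.crc32) and an illustrative bucket count of 10; Spark uses its own hash function. The point is that every row for a given col1 lands in the same bucket, so each bucket can be aggregated on its own without moving rows between buckets.

```python
import zlib
from collections import defaultdict

NUM_BUCKETS = 10  # mirrors bucketBy(10, "col1"); the value is illustrative

# Sample rows from the question: (col1, col2)
rows = [("AA", 11), ("BB", 21), ("AA", 12), ("AA", 13), ("BB", 22), ("CC", 33)]

def bucket_of(key: str) -> int:
    # Stand-in hash for illustration only; Spark uses its own hash function
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS

# "Write" phase: assign each row to a bucket based on col1
buckets = defaultdict(list)
for col1, col2 in rows:
    buckets[bucket_of(col1)].append((col1, col2))

# "Aggregate" phase: each bucket is processed independently
result = {}
for bucket_rows in buckets.values():
    local = defaultdict(list)
    for col1, col2 in bucket_rows:
        local[col1].append(col2)
    # A key never appears in two buckets, so no lists need to be merged
    assert not (local.keys() & result.keys())
    result.update(local)

print(dict(sorted(result.items())))
```

The per-bucket loop plays the role of a task that only reads its own bucket; because the grouping key and the bucketing key are the same column, no cross-bucket exchange is needed to build each array.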