如何使用 pyspark 管理跨集群的数据帧的物理数据放置? [英] How to manage physical data placement of a dataframe across the cluster with pyspark?

查看：21 发布时间：2021/12/22 21:31:51 apache-spark pyspark database-partitioning

本文介绍了如何使用 pyspark 管理跨集群的数据帧的物理数据放置?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！

问题描述

假设我有一个 pyspark 数据框数据"，如下所示.我想按期间"对数据进行分区.相反，我希望每个时期的数据都存储在它自己的分区上(请参阅下面数据"数据框下方的示例).

Say I have a pyspark dataframe 'data' as follows. I want to partition the data by "Period". Rather I want each period of data to be stored on it's own partition (see the example below the 'data' dataframe below).

data = sc.parallelize([[1,1,0,14277.4,0], 
[1,2,0,14277.4,0], 
[2,1,0,4741.91,0], 
[2,2,0,4693.03,0], 
[3,1,2,9565.93,0], 
[3,2,2,9566.05,0], 
[4,2,0,462.68,0], 
[5,1,1,3549.66,0], 
[5,2,5,3549.66,1], 
[6,1,1,401.52,0], 
[6,2,0,401.52,0], 
[7,1,0,1886.24,0], 
[7,2,0,1886.24,0]]) 
.toDF(("Acct","Period","Status","Bal","CloseFlag"))

data.show(100)

+----+------+------+-------+---------+
|Acct|Period|Status|    Bal|CloseFlag|
+----+------+------+-------+---------+
|   1|     1|     0|14277.4|        0|
|   1|     2|     0|14277.4|        0|
|   2|     1|     0|4741.91|        0|
|   2|     2|     0|4693.03|        0|
|   3|     1|     2|9565.93|        0|
|   3|     2|     2|9566.05|        0|
|   4|     2|     0| 462.68|        0|
|   5|     1|     1|3549.66|        0|
|   5|     2|     5|3549.66|        1|
|   6|     1|     1| 401.52|        0|
|   6|     2|     0| 401.52|        0|
+----+------+------+-------+---------+

例如

第 1 部分:

+----+------+------+-------+---------+
|Acct|Period|Status|    Bal|CloseFlag|
+----+------+------+-------+---------+
|   1|     1|     0|14277.4|        0|
|   2|     1|     0|4741.91|        0|
|   3|     1|     2|9565.93|        0|
|   5|     1|     1|3549.66|        0|
|   6|     1|     1| 401.52|        0|
+----+------+------+-------+---------+

第 2 部分:

+----+------+------+-------+---------+
|Acct|Period|Status|    Bal|CloseFlag|
+----+------+------+-------+---------+
|   1|     2|     0|14277.4|        0|
|   2|     2|     0|4693.03|        0|
|   3|     2|     2|9566.05|        0|
|   4|     2|     0| 462.68|        0|
|   5|     2|     5|3549.66|        1|
|   6|     2|     0| 401.52|        0|
+----+------+------+-------+---------+

推荐答案

方法应该是先重新分区，得到正确的分区数(唯一周期数)，然后按Period列分区再保存.

The approach should be to repartition first, to have the right number of partitions(number of unique periods), and then partition by the Period column before saving it.

from pyspark.sql import functions as F
n = data.select(F.col('Period')).distinct().count()

data.repartition(n)
     .write 
     .partitionBy("Period")
     .mode("overwrite")
     .format("parquet")
     .saveAsTable("testing")

这篇关于如何使用 pyspark 管理跨集群的数据帧的物理数据放置?的文章就介绍到这了，希望我们推荐的答案对大家有所帮助，也希望大家多多支持IT屋！

查看全文

如何使用 pyspark 管理跨集群的数据帧的物理数据放置? [英] How to manage physical data placement of a dataframe across the cluster with pyspark?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录关闭

如何使用 pyspark 管理跨集群的数据帧的物理数据放置? [英] How to manage physical data placement of a dataframe across the cluster with pyspark?

问题描述

推荐答案

相关文章

其他开发最新文章

热门教程

热门工具

登录 关闭

登录关闭