How to split a dataframe into dataframes with same column values?


Question

Using Scala, how can I split a DataFrame into multiple DataFrames (be it an array or a collection) with the same column value? For example, I want to split the following DataFrame:

ID  Rate    State
1   24  AL
2   35  MN
3   46  FL
4   34  AL
5   78  MN
6   99  FL

Into:

Dataset 1

ID  Rate    State
1   24  AL  
4   34  AL

Dataset 2

ID  Rate    State
2   35  MN
5   78  MN

Dataset 3

ID  Rate    State
3   46  FL
6   99  FL
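
For reference, a minimal sketch that builds the input DataFrame above (assuming an active SparkSession named spark; all names here are illustrative):

import spark.implicits._

// Sample data matching the table in the question
val df = Seq(
  (1, 24, "AL"),
  (2, 35, "MN"),
  (3, 46, "FL"),
  (4, 34, "AL"),
  (5, 78, "MN"),
  (6, 99, "FL")
).toDF("ID", "Rate", "State")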

Answer

You can collect the unique state values and simply map over the resulting array:

// Collect the distinct State values to the driver as a local Array[Any]
val states = df.select("State").distinct.collect.flatMap(_.toSeq)
// Build one filtered DataFrame per state (<=> is null-safe equality)
val byStateArray = states.map(state => df.where($"State" <=> state))

Or a map:

val byStateMap = states
    .map(state => (state -> df.where($"State" <=> state)))
    .toMap
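
Either structure can then be used directly. For example (a usage sketch, assuming the sample data above):

// Look up and display the group for a single state
byStateMap("AL").show()

// Or iterate over all groups
byStateArray.foreach(_.show())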

The same thing in Python:

from itertools import chain
from pyspark.sql.functions import col, lit

# Collect the distinct state values into a flat Python iterable
states = chain(*df.select("state").distinct().collect())

# PySpark 2.3 and later; in 2.2 and before, col("state") == state
# should give the same outcome, ignoring NULLs.
# If NULLs matter, use:
# (lit(state).isNull() & col("state").isNull()) | (col("state") == state)
df_by_state = {state:
    df.where(col("state").eqNullSafe(state)) for state in states}

The obvious problem here is that it requires a full data scan for each level, so it is an expensive operation. If you're looking for a way to just split the output, see also How do I split an RDD into two or more RDDs?

In particular, you can write a Dataset partitioned by the column of interest:

val path: String = ???
df.write.partitionBy("State").parquet(path)

and read it back when needed:

// Depends on partition pruning
for { state <- states } yield spark.read.parquet(path).where($"State" === state)

// or explicitly read the partition
for { state <- states } yield spark.read.parquet(s"$path/State=$state")

Depending on the size of the data, the number of levels of the splitting, and the storage and persistence level of the input, it might be faster or slower than multiple filters.
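
If you take the multiple-filters route and the input fits in cluster memory, caching it first avoids rescanning the source for every state. A minimal sketch, reusing the df and states defined earlier:

import org.apache.spark.storage.StorageLevel

// Cache the input so each per-state filter reuses the same scan
df.persist(StorageLevel.MEMORY_AND_DISK)
val byState = states.map(state => state -> df.where($"State" <=> state)).toMap
// ... materialize the groups as needed, then release the cache
df.unpersist()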
