How to split a DataFrame into DataFrames with the same column values?


Problem description

Using Scala, how can I split a DataFrame into multiple DataFrames (be it an array or a collection) with the same column value? For example, I want to split the following DataFrame:

ID  Rate    State
1   24  AL
2   35  MN
3   46  FL
4   34  AL
5   78  MN
6   99  FL

into:

Dataset 1

ID  Rate    State
1   24  AL  
4   34  AL

Dataset 2

ID  Rate    State
2   35  MN
5   78  MN

Dataset 3

ID  Rate    State
3   46  FL
6   99  FL

Answer

You can collect the unique state values and simply map over the resulting array:

import spark.implicits._   // for the $"State" syntax; assumes a SparkSession named `spark`

val states = df.select("State").distinct.collect.flatMap(_.toSeq)
val byStateArray = states.map(state => df.where($"State" <=> state))

Or as a map:

val byStateMap = states
    .map(state => (state -> df.where($"State" <=> state)))
    .toMap
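
A usage sketch (hypothetical; it assumes the sample data above, so "AL" is one of the collected values):

// Look up the split for a single state and inspect it
byStateMap("AL").show()   // rows with State = AL (IDs 1 and 4 in the sample)

// Count the rows in each split; note this triggers a separate Spark job per state
byStateMap.foreach { case (state, stateDf) => println(s"$state -> ${stateDf.count()}") }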

The same thing in Python:

from itertools import chain
from pyspark.sql.functions import col, lit

states = chain(*df.select("state").distinct().collect())

# PySpark 2.3 and later: eqNullSafe is the null-safe equality (<=>)
# In 2.2 and before, col("state") == state should give the same outcome,
# ignoring NULLs; if NULLs are important, use:
# (lit(state).isNull() & col("state").isNull()) | (col("state") == state)
df_by_state = {state: df.where(col("state").eqNullSafe(state))
               for state in states}

The obvious problem here is that it requires a full data scan for each level, so it is an expensive operation. If you are only looking for a way to split the output, see also How do I split an RDD into two or more RDDs?

In particular, you can write the Dataset partitioned by the column of interest:

val path: String = ???
df.write.partitionBy("State").parquet(path)

and read it back when needed:

// Depends on partition pruning
for { state <- states } yield spark.read.parquet(path).where($"State" === state)

// or explicitly read the partition
for { state <- states } yield spark.read.parquet(s"$path/State=$state")

Depending on the size of the data, the number of levels in the split, and the storage and persistence level of the input, this might be faster or slower than multiple filters.
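
If you stick with the multi-filter approach, one way to avoid rescanning the source for every state is to persist the input first. A minimal sketch, assuming the same df and a SparkSession named spark as above:

import spark.implicits._

// Persist the input so each per-state filter reads the cached data
// instead of rescanning the underlying source
df.persist()

val states = df.select("State").distinct.collect.flatMap(_.toSeq)
val byState = states.map(state => state -> df.where($"State" <=> state)).toMap

// ... work with the splits ...

df.unpersist()   // release the cached data when done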
