如何将数据帧拆分为具有相同列值的数据帧? [英] How to split a dataframe into dataframes with same column values?
问题描述
使用 Scala,如何将数据帧拆分为具有相同列值的多个数据帧(无论是数组还是集合).例如我想拆分以下数据帧:
Using Scala, how can I split dataFrame into multiple dataFrame (be it array or collection) with same column value. For example I want to split the following DataFrame:
ID Rate State
1 24 AL
2 35 MN
3 46 FL
4 34 AL
5 78 MN
6 99 FL
到:
数据集 1
ID Rate State
1 24 AL
4 34 AL
数据集 2
ID Rate State
2 35 MN
5 78 MN
数据集 3
ID Rate State
3 46 FL
6 99 FL
推荐答案
您可以收集唯一的状态值并简单地映射结果数组:
You can collect unique state values and simply map over resulting array:
val states = df.select("State").distinct.collect.flatMap(_.toSeq)
val byStateArray = states.map(state => df.where($"State" <=> state))
或映射:
val byStateMap = states
.map(state => (state -> df.where($"State" <=> state)))
.toMap
Python 中的相同内容:
The same thing in Python:
from itertools import chain
from pyspark.sql.functions import col
states = chain(*df.select("state").distinct().collect())
# PySpark 2.3 and later
# In 2.2 and before col("state") == state)
# should give the same outcome, ignoring NULLs
# if NULLs are important
# (lit(state).isNull() & col("state").isNull()) | (col("state") == state)
df_by_state = {state:
df.where(col("state").eqNullSafe(state)) for state in states}
这里明显的问题是它需要对每一级进行全数据扫描,所以这是一个昂贵的操作.如果您正在寻找一种仅拆分输出的方法,另请参阅如何将一个 RDD 拆分为两个或多个 RDD?
The obvious problem here is that it requires a full data scan for each level, so it is an expensive operation. If you're looking for a way to just split the output see also How do I split an RDD into two or more RDDs?
特别是您可以编写按感兴趣的列分区的Dataset
:
In particular you can write Dataset
partitioned by the column of interest:
val path: String = ???
df.write.partitionBy("State").parquet(path)
并在需要时回读:
// Depend on partition prunning
for { state <- states } yield spark.read.parquet(path).where($"State" === state)
// or explicitly read the partition
for { state <- states } yield spark.read.parquet(s"$path/State=$state")
根据数据的大小、分割的级别数、输入的存储和持久级别,它可能比多个过滤器更快或更慢.
Depending on the size of the data, number of levels of the splitting, storag and persistence level of the input it might faster or slower than multiple filters.
这篇关于如何将数据帧拆分为具有相同列值的数据帧?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!