Spark: write several JSON files from a DataFrame based on separation by column value

Question
Suppose I have this DataFrame (df):
user food affinity
'u1' 'pizza' 5
'u1' 'broccoli' 3
'u1' 'ice cream' 4
'u2' 'pizza' 1
'u2' 'broccoli' 3
'u2' 'ice cream' 1
Namely, each user has a certain (computed) affinity to a series of foods. The DataFrame is built in several ways. What I need to do is create a JSON file for each user, with their affinities. For instance, for user 'u1', I want a file containing
[
  {"food": "pizza", "affinity": 5},
  {"food": "broccoli", "affinity": 3},
  {"food": "ice cream", "affinity": 4}
]
This would entail splitting the DataFrame by user, and I cannot think of a way to do it, since writing a JSON file for the full DataFrame would be achieved with
df.write.json(<path_to_file>)
Answer
You can use partitionBy (it will give you a single directory and possibly multiple files per user):
df.write.partitionBy("user").json(<path_to_file>)
or repartition and partitionBy (it will give you a single directory and a single file per user):
from pyspark.sql.functions import col

df.repartition(col("user")).write.partitionBy("user").json(<path_to_file>)
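Spark's JSON writer emits one JSON object per line (JSON Lines), and partitionBy routes rows into user=<value>/ subdirectories, dropping the partition column from the written records. That behaviour can be illustrated with a plain-Python sketch (no Spark required; the rows, the user= directory naming, and the part-0.json file name here just mimic Spark's conventions):

```python
import json
import os
import tempfile
from collections import defaultdict

rows = [
    {"user": "u1", "food": "pizza", "affinity": 5},
    {"user": "u1", "food": "broccoli", "affinity": 3},
    {"user": "u1", "food": "ice cream", "affinity": 4},
    {"user": "u2", "food": "pizza", "affinity": 1},
    {"user": "u2", "food": "broccoli", "affinity": 3},
    {"user": "u2", "food": "ice cream", "affinity": 1},
]

def write_partitioned(rows, base_dir):
    """Mimic df.write.partitionBy("user").json(base_dir):
    one user=<value> directory per user, JSON Lines inside,
    and the partition column removed from each record."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["user"]].append({k: v for k, v in row.items() if k != "user"})
    for user, recs in groups.items():
        part_dir = os.path.join(base_dir, f"user={user}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-0.json"), "w") as f:
            for rec in recs:
                f.write(json.dumps(rec) + "\n")  # one object per line, not an array

base = tempfile.mkdtemp()
write_partitioned(rows, base)
print(sorted(os.listdir(base)))  # ['user=u1', 'user=u2']
```

Note that each part file holds newline-delimited objects, not a JSON array, which is exactly the limitation discussed next.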
Unfortunately none of the above will give you a JSON array.
If you use Spark 2.0 you can try with collect_list first:
from pyspark.sql.functions import col, collect_list, struct

df.groupBy(col("user")).agg(
    collect_list(struct(col("food"), col("affinity"))).alias("affinities")
)
and write with partitionBy as before.
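The shape that aggregation produces (one record per user holding an array of food/affinity structs) can be sketched in plain Python, again without Spark; the helper name collect_affinities is made up for illustration:

```python
import json
from itertools import groupby
from operator import itemgetter

rows = [
    ("u1", "pizza", 5),
    ("u1", "broccoli", 3),
    ("u1", "ice cream", 4),
    ("u2", "pizza", 1),
    ("u2", "broccoli", 3),
    ("u2", "ice cream", 1),
]

def collect_affinities(rows):
    """Mimic groupBy("user") + collect_list(struct("food", "affinity")):
    one record per user, with its rows collapsed into an array."""
    out = []
    # groupby needs its input sorted by the grouping key
    for user, grp in groupby(sorted(rows, key=itemgetter(0)), key=itemgetter(0)):
        out.append({
            "user": user,
            "affinities": [{"food": f, "affinity": a} for _, f, a in grp],
        })
    return out

records = collect_affinities(rows)
print(json.dumps(records[0]["affinities"]))
# [{"food": "pizza", "affinity": 5}, {"food": "broccoli", "affinity": 3}, {"food": "ice cream", "affinity": 4}]
```

Each written record then carries the JSON array the question asks for under the affinities key.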
Prior to 2.0 you'll have to use RDD API, but it is language specific.