Spark: write several JSON files from a DataFrame based on separation by column value

Question
Suppose I have this DataFrame (df):
user food affinity
'u1' 'pizza' 5
'u1' 'broccoli' 3
'u1' 'ice cream' 4
'u2' 'pizza' 1
'u2' 'broccoli' 3
'u2' 'ice cream' 1
Namely, each user has a certain (computed) affinity to a series of foods. The DataFrame is built in several ways. What I need to do is create a JSON file for each user, with their affinities. For instance, for user 'u1', I want a file containing
[
  {"food": "pizza", "affinity": 5},
  {"food": "broccoli", "affinity": 3},
  {"food": "ice cream", "affinity": 4}
]
This would entail splitting the DataFrame by user, and I cannot think of a way to do it, since writing a JSON file for the full DataFrame would be achieved with
df.write.json(<path_to_file>)
Answer
You can use partitionBy (it will give you a single directory and possibly multiple files per user):
df.write.partitionBy("user").json(<path_to_file>)
or repartition and partitionBy (it will give you a single directory and a single file per user):
from pyspark.sql.functions import col

df.repartition(col("user")).write.partitionBy("user").json(<path_to_file>)
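Spark's JSON writer emits one JSON object per line (JSON Lines), and partitionBy routes rows into user=<value>/ subdirectories, dropping the partition column from the written records. That behaviour can be illustrated with a plain-Python sketch (no Spark required; the rows, the user= directory naming, and the part-0.json file name here just mimic Spark's conventions):

```python
import json
import os
import tempfile
from collections import defaultdict

rows = [
    {"user": "u1", "food": "pizza", "affinity": 5},
    {"user": "u1", "food": "broccoli", "affinity": 3},
    {"user": "u1", "food": "ice cream", "affinity": 4},
    {"user": "u2", "food": "pizza", "affinity": 1},
    {"user": "u2", "food": "broccoli", "affinity": 3},
    {"user": "u2", "food": "ice cream", "affinity": 1},
]

def write_partitioned(rows, base_dir):
    """Mimic df.write.partitionBy("user").json(base_dir):
    one user=<value> directory per user, JSON Lines inside,
    and the partition column removed from each record."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["user"]].append({k: v for k, v in row.items() if k != "user"})
    for user, recs in groups.items():
        part_dir = os.path.join(base_dir, f"user={user}")
        os.makedirs(part_dir, exist_ok=True)
        with open(os.path.join(part_dir, "part-0.json"), "w") as f:
            for rec in recs:
                f.write(json.dumps(rec) + "\n")  # one object per line, not an array

base = tempfile.mkdtemp()
write_partitioned(rows, base)
print(sorted(os.listdir(base)))  # ['user=u1', 'user=u2']
```

Note that each part file holds newline-delimited objects, not a JSON array, which is exactly the limitation discussed next.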
Unfortunately none of the above will give you a JSON array.
If you use Spark 2.0 you can try with collect_list first:
from pyspark.sql.functions import col, collect_list, struct

df.groupBy(col("user")).agg(
    collect_list(struct(col("food"), col("affinity"))).alias("affinities")
)
and write with partitionBy as before.
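The shape that aggregation produces (one record per user holding an array of food/affinity structs) can be sketched in plain Python, again without Spark; the helper name collect_affinities is made up for illustration:

```python
import json
from itertools import groupby
from operator import itemgetter

rows = [
    ("u1", "pizza", 5),
    ("u1", "broccoli", 3),
    ("u1", "ice cream", 4),
    ("u2", "pizza", 1),
    ("u2", "broccoli", 3),
    ("u2", "ice cream", 1),
]

def collect_affinities(rows):
    """Mimic groupBy("user") + collect_list(struct("food", "affinity")):
    one record per user, with its rows collapsed into an array."""
    out = []
    # groupby needs its input sorted by the grouping key
    for user, grp in groupby(sorted(rows, key=itemgetter(0)), key=itemgetter(0)):
        out.append({
            "user": user,
            "affinities": [{"food": f, "affinity": a} for _, f, a in grp],
        })
    return out

records = collect_affinities(rows)
print(json.dumps(records[0]["affinities"]))
# [{"food": "pizza", "affinity": 5}, {"food": "broccoli", "affinity": 3}, {"food": "ice cream", "affinity": 4}]
```

Each written record then carries the JSON array the question asks for under the affinities key.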
Prior to 2.0 you'll have to use RDD API, but it is language specific.