Write each row of a Spark DataFrame as a separate file
Question
I have a Spark DataFrame with a single column, where each row is a long string (actually an XML file). I want to go through the DataFrame and save the string from each row as a text file; the files could simply be named 1.xml, 2.xml, and so on.
I cannot seem to find any information or examples on how to do this, and I am just starting to work with Spark and PySpark. Maybe I could map a function over the DataFrame, but that function would have to write a string to a text file, and I can't find how to do that.
Answer
When saving a DataFrame with Spark, one file is created per partition. Hence, one way to get a single row per file is to first repartition the data into as many partitions as there are rows.
There is a library on GitHub for reading and writing XML files with Spark. However, the DataFrame needs to have a special format to produce correct XML. In this case, since everything is a string in a single column, the easiest way to save is probably as CSV.
The repartitioning and saving can be done as follows:
rows = df.count()
df.repartition(rows).write.csv('save-dir')
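Note that the approach above produces part-files with Spark's own naming scheme, not 1.xml, 2.xml. If you specifically need numbered file names and the data is small enough to collect to the driver, you could pull the rows back and write them with plain Python. A minimal sketch, where the list `rows` stands in for the result of `[r[0] for r in df.collect()]` (collecting a real DataFrame is assumed, not shown):

```python
import os

def write_rows_as_files(rows, out_dir):
    """Write each string in `rows` to its own numbered .xml file."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, content in enumerate(rows, start=1):
        path = os.path.join(out_dir, f"{i}.xml")
        with open(path, "w", encoding="utf-8") as f:
            f.write(content)
        paths.append(path)
    return paths

# In place of df.collect(), a small stand-in list of XML strings:
rows = ["<a>first</a>", "<b>second</b>"]
write_rows_as_files(rows, "xml-out")
```

This only works when the full dataset fits in driver memory; for large data, the repartition-and-save approach above (or writing from each partition with `foreachPartition`) is the scalable option.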