Spark 2.0.x dump a csv file from a dataframe containing one array of type string
Question
I have a dataframe df that contains one column of type array.
df.show() looks like:
+--+-------------+---+------+
|ID|ArrayOfString|Age|Gender|
+--+-------------+---+------+
|1 | [A,B,D]     |22 | F    |
|2 | [A,Y]       |42 | M    |
|3 | [X]         |60 | F    |
+--+-------------+---+------+
I try to dump that df to a csv file as follows:
val dumpCSV = df.write.csv(path="/home/me/saveDF")
It is not working because of the column ArrayOfString. I get the error:
CSV data source does not support array<string> data type
The code works if I remove the column ArrayOfString. But I need to keep ArrayOfString!
What would be the best way to dump the dataframe to a csv file, keeping ArrayOfString as a single column in the CSV file?
Answer
You are getting this error because the CSV file format doesn't support array types; you'll need to express the array as a string to be able to save it.
Try the following:
import org.apache.spark.sql.functions._
val stringify = udf((vs: Seq[String]) => vs match {
  case null => null
  case _    => s"""[${vs.mkString(",")}]"""
})
df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
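With either version of stringify, each array becomes one plain string field. Since the joined value contains commas (the CSV delimiter), Spark's CSV writer should quote that field by default, so an output row would look roughly like:

```
1,"[A,B,D]",22,F
```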
or
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{concat, concat_ws, lit}

def stringify(c: Column) = concat(lit("["), concat_ws(",", c), lit("]"))
df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
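If the file needs to be read back later, the bracketed string can be turned back into an array column. A sketch under stated assumptions (this is not part of the original answer; the session value spark, the positional column name _c1, and the path are assumed to match the question's setup):

```scala
import org.apache.spark.sql.functions.{regexp_replace, split}
import spark.implicits._

// Read the headerless CSV back; Spark names positional columns _c0, _c1, ...
val restored = spark.read.csv("/home/me/saveDF")
  .withColumnRenamed("_c1", "ArrayOfString")
  // Strip the surrounding brackets, then split on "," to rebuild the array.
  .withColumn("ArrayOfString",
    split(regexp_replace($"ArrayOfString", "^\\[|\\]$", ""), ","))
```

Note that this round trip is lossy if the array elements themselves contain commas or brackets; for such data a different join delimiter (or a format like Parquet that supports arrays natively) would be safer.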