Spark 2.0.x dump a csv file from a dataframe containing one array of type string


Problem description

I have a dataframe df that contains one column of type array

df.show() looks like:

+--+-------------+---+------+
|ID|ArrayOfString|Age|Gender|
+--+-------------+---+------+
|1 | [A,B,D]     |22 | F    |
|2 | [A,Y]       |42 | M    |
|3 | [X]         |60 | F    |
+--+-------------+---+------+

I try to dump that df to a csv file as follows:

val dumpCSV = df.write.csv(path="/home/me/saveDF")

It is not working because of the column ArrayOfString. I get the error:

CSV data source does not support array<string> data type

The code works if I remove the column ArrayOfString. But I need to keep ArrayOfString!

What would be the best way to dump this dataframe to a CSV file while keeping ArrayOfString (ArrayOfString should be dumped as one column of the CSV file)?

Recommended answer

The reason you are getting this error is that the csv file format doesn't support array types; you'll need to express the array as a string to be able to save it.

Try the following:

import org.apache.spark.sql.functions._

// UDF that renders the array as a single bracketed, comma-separated string.
val stringify = udf((vs: Seq[String]) => vs match {
  case null => null
  case _    => s"""[${vs.mkString(",")}]"""
})

df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)

Alternatively, you can skip the UDF and build the same string with built-in column functions:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{concat, concat_ws, lit}

// Wraps the comma-joined array elements in brackets, e.g. [A,B,D].
def stringify(c: Column) = concat(lit("["), concat_ws(",", c), lit("]"))

df.withColumn("ArrayOfString", stringify($"ArrayOfString")).write.csv(...)
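Not part of the original answer, but as a sanity check, here is a minimal sketch of how such a CSV could be read back and the array column rebuilt. The output path and column names are taken from the question, and an existing SparkSession named spark (with its implicits imported) is assumed:

import org.apache.spark.sql.functions.{regexp_replace, split}
import spark.implicits._  // assumes an existing SparkSession named `spark`

// Read the files written above (no header by default) and restore the column names.
val restored = spark.read.csv("/home/me/saveDF")
  .toDF("ID", "ArrayOfString", "Age", "Gender")
  // Strip the surrounding [ ] and split on commas to get back an array<string>.
  .withColumn("ArrayOfString",
    split(regexp_replace($"ArrayOfString", "^\\[|\\]$", ""), ","))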
