Adding a custom delimiter adds double quotes in the final Spark data frame CSV output


Question

I have a data frame where I am replacing the default delimiter , with |^|. It works fine and I get the expected result, except where a , appears in the records. For example, I have one such record, shown below:

4295859078|^|914|^|INC|^|Balancing Item - Non Operating Income/(Expense),net|^||^||^|IIII|^|False|^||^||^||^||^|False|^||^||^||^||^|505096|^|505074|^|505074|^|505096|^|505096|^||^|505074|^|True|^||^|3014960|^||^|I|!|

The 4th field contains a comma.

Now I am replacing the comma like this:

 val dfMainOutputFinal = dfMainOutput.na.fill("").select(
   $"DataPartition",
   $"StatementTypeCode",
   concat_ws("|^|", dfMainOutput.schema.fieldNames
     .filter(_ != "DataPartition")
     .map(c => col(c)): _*).as("concatenated"))

val headerColumn = df.columns.filter(v => (!v.contains("^") && !v.contains("_c"))).toSeq

val header = headerColumn.dropRight(1).mkString("", "|^|", "|!|")

val dfMainOutputFinalWithoutNull = dfMainOutputFinal.withColumn("concatenated", regexp_replace(col("concatenated"), "null", "")).withColumnRenamed("concatenated", header)


dfMainOutputFinalWithoutNull.repartition(1).write.partitionBy("DataPartition","StatementTypeCode")
  .format("csv")
  .option("nullValue", "")
  .option("header", "true")
  .option("codec", "gzip")
  .save("s3://trfsmallfffile/FinancialLineItem/output")

And I get output like this in the saved output part file:

"4295859078|^|914|^|INC|^|Balancing Item - Non Operating Income/(Expense),net|^||^||^|IIII|^|false|^||^||^||^||^|false|^||^||^||^||^|505096|^|505074|^|505074|^|505096|^|505096|^||^|505074|^|true|^||^|3014960|^||^|I|!|"

My problem is the " at the start and end of the result.

If I remove the comma, then I get the correct result, like below:

4295859078|^|914|^|INC|^|Balancing Item - Non Operating Income/(Expense)net|^||^||^|IIII|^|false|^||^||^||^||^|false|^||^||^||^||^|505096|^|505074|^|505074|^|505096|^|505096|^||^|505074|^|true|^||^|3014960|^||^|I|!|

Answer

This is standard CSV behavior. If the delimiter occurs in the actual data (referred to as delimiter collision), the field is enclosed in quotes.
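Delimiter collision can be reproduced outside Spark. The sketch below uses Python's standard csv module purely to illustrate the behavior described above; the shortened record is a hypothetical stand-in for the question's single concatenated column:

```python
import csv
import io

# One-field row mimicking the question's "|^|"-joined column,
# with a comma inside the data (hypothetical shortened record).
field = "4295859078|^|914|^|INC|^|Balancing Item,net|^|I|!|"

# Default comma delimiter: the comma in the data collides with the
# delimiter, so the writer wraps the whole field in double quotes.
buf = io.StringIO()
csv.writer(buf).writerow([field])
with_comma = buf.getvalue().strip()
print(with_comma)   # "4295859078|^|914|^|INC|^|Balancing Item,net|^|I|!|"

# A delimiter that never appears in the data (";" here): no
# collision, so no quotes are added.
buf = io.StringIO()
csv.writer(buf, delimiter=";").writerow([field])
with_semicolon = buf.getvalue().strip()
print(with_semicolon)  # 4295859078|^|914|^|INC|^|Balancing Item,net|^|I|!|
```

The same logic explains the question's output: because the 4th field contains a comma and the output delimiter is a comma, the whole concatenated line gets quoted.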

You can try

df.write.option("delimiter" , somechar)

where somechar should be a character that does not occur in your data.
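Before committing to a delimiter, it is worth verifying that assumption against the data. A minimal sketch (in Python, with hypothetical sample values; in Spark you would run the equivalent filter over the actual columns):

```python
# Hypothetical sample of field values drawn from the data frame.
rows = [
    "Balancing Item - Non Operating Income/(Expense),net",
    "505096",
    "I|!|",
]

def is_safe_delimiter(char, values):
    """True if `char` appears in none of the values."""
    return all(char not in v for v in values)

print(is_safe_delimiter(",", rows))  # False: a comma occurs in the data
print(is_safe_delimiter(";", rows))  # True: ";" would not collide
```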

A more robust solution would be to disable quoteMode entirely, since you are writing a data frame with only one column.

dfMainOutputFinalWithoutNull.repartition(1)
  .write.partitionBy("DataPartition","StatementTypeCode")
  .format("csv")
  .option("nullValue", "")
  .option("quoteMode", "NONE")
  //.option("delimiter", ";")         // assuming ";" is not present in data
  .option("header", "true")
  .option("codec", "gzip")
  .save("s3://trfsmallfffile/FinancialLineItem/output")

