Converting pipe-delimited file to spark dataframe to CSV file


Question

I have a CSV file with a single column, and the rows are defined as follows:

123 || food || fruit
123 || food || fruit || orange 
123 || food || fruit || apple

I want to create a CSV file with a single column containing the distinct row values:

orange
apple

I tried the following code:

 val data = sc.textFile("fruits.csv")
 val rows = data.map(_.split("||"))
 val rddnew = rows.flatMap( arr => {
 val text = arr(0) 
 val words = text.split("||")
 words.map( word => ( word, text ) )
 } )

But this code is not giving me the correct result.
Can anyone please help me with this?

Answer

You need to escape the special characters when splitting, since `split` takes a regular expression:

.split("\\|\\|")
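Because `String.split` treats its argument as a regex, an unescaped `"||"` is parsed as an empty-string alternation that matches at every position, so the string is split into individual characters. A minimal Scala sketch of the difference:

```scala
object SplitDemo extends App {
  val row = "123 || food || fruit || orange"

  // Unescaped: "||" is the regex (empty)|(empty), which matches between
  // every character, so split returns one element per character.
  val wrong = row.split("||")
  println(wrong.length)

  // Escaped: "\\|\\|" matches the literal two-character "||" delimiter.
  val right = row.split("\\|\\|").map(_.trim)
  println(right.mkString(","))  // 123,food,fruit,orange
}
```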

Converting to CSV is tricky because data strings may contain your delimiter (in quotes), newlines, or other parse-sensitive characters, so I'd recommend using spark-csv:

 val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", "||")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("words.csv")

 // Write the DataFrame out; use a path different from the input,
 // since Spark cannot overwrite a file it is reading from.
 df.write
  .format("com.databricks.spark.csv")
  .option("delimiter", "||")
  .option("header", "true")
  .save("words-out.csv")
