Spark 2.0 read CSV with JSON


Question

I have a CSV file that looks like:

"a","b","c","{""x"":""xx"",""y"":""yy""}"

When I use the Java CSV reader (au.com.bytecode.opencsv.CSVParser), it manages to parse the string when I indicate defaultEscapeChar = '\u0000'.

When I tried to read it with the Spark 2.2 CSV reader, it failed and wasn't able to split it into 4 columns. This is what I tried:

val df = spark.read.format("csv")
              .option("quoteMode","ALL")
              .option("quote", "\u0000")
              .load("s3://...")

I also tried it with option("escape", "\u0000"), but with no luck.

Which CSV options do I need to choose in order to parse this file correctly?

Answer

You actually were close; the right option is option("escape", "\""). So, given a recent Spark version (2.2+, or maybe even earlier), the snippet below

import org.apache.spark.sql.{Dataset, SparkSession}

object CsvJsonMain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CsvJsonExample").master("local").getOrCreate()

    import spark.sqlContext.implicits._

    // A one-record Dataset[String] holding the sample CSV line from the question
    val csvData: Dataset[String] = spark.sparkContext.parallelize(List(
      """
        |"a","b","c","{""x"":""xx"",""y"":""yy""}"
      """.stripMargin)).toDS()

    // The key option: use '"' as the escape character, so the doubled quotes
    // ("") inside the quoted JSON field are unescaped correctly
    val frame = spark.read.option("escape", "\"").csv(csvData)
    frame.show()
  }
}

produces

+---+---+---+-------------------+
|_c0|_c1|_c2|                _c3|
+---+---+---+-------------------+
|  a|  b|  c|{"x":"xx","y":"yy"}|
+---+---+---+-------------------+
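If you additionally want the JSON column as structured fields rather than a raw string, here is a minimal follow-up sketch (run in the same main as above; jsonSchema is a hypothetical schema I'm assuming matches the {"x":"xx","y":"yy"} payload) using from_json, which is available since Spark 2.1:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical schema for the sample payload {"x":"xx","y":"yy"}
val jsonSchema = StructType(Seq(
  StructField("x", StringType),
  StructField("y", StringType)
))

// Parse the JSON string in _c3 into a struct and flatten it into columns
val parsed = frame.withColumn("json", from_json($"_c3", jsonSchema))
parsed.select($"_c0", $"_c1", $"_c2", $"json.x", $"json.y").show()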

The reason why Spark fails to parse such CSV out of the box is that the default escape value is the '\' character, as can be seen on line 91 of CSVOptions, and that obviously won't work with the default JSON quote escaping.
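To reproduce the failure mode, a quick sketch (reusing csvData from above) is to read without overriding the escape option:

// Same data, but parsed with Spark's default escape character ('\')
val broken = spark.read.csv(csvData)
broken.show()
// Per the question, the doubled quotes ("") inside the quoted JSON field
// are not unescaped, so the line does not split into the expected 4 columns.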

The underlying reason why this used to work before Spark 2.0 with the databricks-csv library is that the underlying CSV engine was commons-csv, where the escape character defaulted to null, which allowed the library to detect the JSON and its way of escaping. Since 2.0, CSV functionality is part of Spark itself and uses the uniVocity CSV parser, which doesn't provide such "magic" but is apparently faster.

P.S. Don't forget to specify escaping when writing CSV files, if you want to preserve the JSON data as it is:

frame.write.option("quoteAll","true").option("escape", "\"").csv("csvFileName") 
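As a round-trip sanity check (the output path below is hypothetical; it reuses frame from the snippet above), writing with those options and reading back with the same escape setting should return the JSON column intact:

// Hypothetical output path; adjust for your environment
val outPath = "/tmp/csv-json-roundtrip"
frame.write.option("quoteAll", "true").option("escape", "\"").mode("overwrite").csv(outPath)

// Read it back with the matching escape option; _c3 still holds the JSON string
spark.read.option("escape", "\"").csv(outPath).show()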
