How to parse a csv that uses ^A (i.e. \001) as the delimiter with spark-csv?

Views: 2359 | Posted: 2018/6/12 13:56:11 | Tags: scala, apache-spark, hive, delimiter, spark-csv

Problem description

I'm terribly new to Spark, Hive, big data, Scala and all the rest. I'm trying to write a simple function that takes an sqlContext, loads a CSV file from S3 and returns a DataFrame. The problem is that this particular CSV uses the ^A (i.e. \001) character as the delimiter, and the dataset is huge, so I can't just run "s/\001/,/g" on it. Besides, the fields might contain commas or other characters I might otherwise use as a delimiter.

I know that the spark-csv package I'm using has a delimiter option, but I don't know how to set it so that \001 is read as one character and not as something like an escaped 0, 0 and 1. Perhaps I should use hiveContext or something?

Solution

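If you check the spark-csv GitHub page, there is a delimiter parameter (as you also noted). Write the delimiter as the Scala Unicode escape "\u0001", which the compiler turns into the single character U+0001 (^A, octal \001) rather than a backslash followed by digits. Use it like this: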

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .option("inferSchema", "true") // Automatically infer data types
    .option("delimiter", "\u0001")
    .load("cars.csv")
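
To see why this addresses the "one character, not an escaped 0, 0 and 1" concern, here is a minimal sketch in plain Scala with no Spark dependency. The object name `CtrlADelimiterDemo`, the helper `splitFields` and the sample record are hypothetical and exist only for illustration. It checks that the escape yields a single character with code point 1, and that a field containing a comma stays intact when a line is split on that character.

```scala
object CtrlADelimiterDemo {
  // The Unicode escape below compiles to ONE character: code point 1 (Ctrl-A, octal 001).
  val CtrlA: String = "\u0001"

  // Hypothetical sample record: three fields, the second one containing a comma.
  val sampleLine: String = Seq("1", "Ford, Inc.", "E350").mkString(CtrlA)

  // Split a line on the Ctrl-A character; the -1 limit keeps trailing empty fields.
  def splitFields(line: String): Array[String] = line.split(CtrlA, -1)

  def main(args: Array[String]): Unit = {
    println(s"delimiter length: ${CtrlA.length}")
    println(s"delimiter code point: ${CtrlA.head.toInt}")
    splitFields(sampleLine).foreach(println)
  }
}
```

This only illustrates the delimiter character itself. For the real dataset the parsing should still go through spark-csv as in the answer, since splitting lines by hand ignores headers, quoting and schema inference.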
