Spark 2.0 Scala - Read csv files with escaped delimiters


Problem description

I'm trying to read a CSV file that uses backslash to escape delimiters instead of using quotes. I've tried constructing the DataFrameReader without quotes and with an escape character, but it doesn't work. It seems the "escape" option can only be used to escape quote characters. Is there any way around this other than creating a custom input format?

Here are the options that I'm using for now:

  spark.read.options(Map(
    "sep" -> ",",
    "encoding" -> "utf-8",
    "quote" -> "",
    "escape" -> "\\",
    "mode" -> "PERMISSIVE",
    "nullValue" -> ""
  ))

For example let's say we have the following sample data:

Schema: Name,City

    Joe Bloggs,Dublin\,Ireland
    Joseph Smith,Salt Lake City\,\
    Utah

This should return 2 records:

Name         | City
-------------|----------------
Joe Bloggs   | Dublin,Ireland
Joseph Smith | Salt Lake City,
             | Utah

Being able to escape newlines would be a nice-to-have, but escaping the column delimiter is required. For now I'm thinking about reading the lines with spark.textFile, then using some CSV library to parse the individual lines. That will fix my escaped column delimiter problem, but not escaped row delimiters.
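For the row-delimiter part, one option is to pre-join the physical lines before any CSV parsing. Here is a minimal sketch in plain Scala (no Spark; `LineMerger` and `mergeEscapedNewlines` are hypothetical names, not a library API) that merges any line ending in a single unescaped backslash with the line that follows it:

```scala
object LineMerger {
  // Merge physical lines where the previous line ends with a single
  // (unescaped) backslash, i.e. an escaped row delimiter. The trailing
  // backslash is dropped and the newline is spliced back into the field.
  // Simplification: this only checks the last two characters, so it
  // assumes fields never end in longer runs of backslashes.
  def mergeEscapedNewlines(lines: Seq[String]): Seq[String] =
    lines.foldLeft(List.empty[String]) {
      case (prev :: rest, line) if prev.endsWith("\\") && !prev.endsWith("\\\\") =>
        (prev.dropRight(1) + "\n" + line) :: rest
      case (acc, line) =>
        line :: acc
    }.reverse
}
```

With Spark this could in principle be applied after spark.textFile, though a record split across a partition boundary would still break, which is the usual caveat with multi-line records.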

Answer

It seems like this is not supported in the CSV reader (see https://github.com/databricks/spark-csv/issues/390).

I'm going to guess that the easiest way around this is to parse your rows manually; not at all ideal but still functional and not too hard.

You can split your lines using a negative lookbehind regex, e.g. (?<!\\), - this will match any comma not preceded by a backslash.
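As a sketch of that approach in plain Scala (`EscapedCsvLine` and `parseLine` are hypothetical helper names, not a Spark API): split each line on the lookbehind regex, then turn the remaining escaped delimiters back into literal commas:

```scala
object EscapedCsvLine {
  // Matches a comma NOT preceded by a backslash (negative lookbehind).
  private val unescapedComma = """(?<!\\),""".r

  // Split one physical line into fields, then unescape "\," -> ",".
  def parseLine(line: String): Seq[String] =
    unescapedComma.split(line).toSeq.map(_.replace("\\,", ","))
}
```

This assumes backslash is only ever used to escape delimiters; a literal backslash immediately before a comma (written "\\,") would defeat the lookbehind and need extra handling.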
