Spark CSV reader: garbled Japanese text and handling multilines


Problem Description

In my Spark job (Spark 2.4.1), I am reading CSV files on S3. These files contain Japanese characters. They can also contain the ^M character (\u000D), so I need to parse them as multiline.

First I used the following code to read the CSV files:

  import org.apache.spark.sql.{DataFrame, DataFrameReader}
  import org.apache.spark.sql.types.StructType

  implicit class DataFrameReadImplicits(dataFrameReader: DataFrameReader) {
    def readTeradataCSV(schema: StructType, s3Path: String): DataFrame = {
      dataFrameReader
        .option("delimiter", "\u0001")   // Teradata-style \u0001 field delimiter
        .option("header", "false")
        .option("inferSchema", "false")
        .option("multiLine", "true")     // allow records to span line breaks
        .option("encoding", "UTF-8")
        .option("charset", "UTF-8")
        .schema(schema)
        .csv(s3Path)
    }
  }
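For context, since the implicit enriches DataFrameReader, it is invoked through spark.read. A minimal usage sketch; the schema fields and the S3 path below are placeholders, not values from the original post:

  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  // Placeholder schema and path for illustration only.
  val schema = StructType(Seq(
    StructField("id", StringType),
    StructField("name", StringType)
  ))
  val df = spark.read.readTeradataCSV(schema, "s3://some-bucket/some-prefix/")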

But when I read the DataFrame using this method, all the Japanese characters are garbled.

After doing some tests, I found that if I read the same S3 file using spark.sparkContext.textFile(path), the Japanese characters are decoded properly.

So I tried it this way:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.types.StructType

  implicit class SparkSessionImplicits(spark: SparkSession) {
    def readTeradataCSV(schema: StructType, s3Path: String) = {
      import spark.sqlContext.implicits._   // for .toDS() on RDD[String]
      spark.read
        .option("delimiter", "\u0001")
        .option("header", "false")
        .option("inferSchema", "false")
        .option("multiLine", "true")
        .schema(schema)
        // read as plain text first (decodes correctly), strip ^M, then parse as CSV
        .csv(spark.sparkContext.textFile(s3Path).map(str => str.replaceAll("\u000D", " ")).toDS())
    }
  }

Now the encoding issue is fixed. However, multiline does not work properly and lines still break near the ^M character, even though I tried to replace ^M using str.replaceAll("\u000D", " ") (presumably because textFile already splits records on \r before the map ever runs).

Any tips on how to read the Japanese characters correctly using the first method, or handle multi-lines using the second method?

UPDATE: This encoding issue happens when the app runs on the Spark cluster. When I ran the app locally, reading the same S3 file, the encoding works just fine.
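One way to narrow down a cluster-only encoding problem (a diagnostic sketch, not part of the original post) is to compare the JVM default charset on the driver and on the executors, since executors launched with a non-UTF-8 file.encoding decode bytes differently from a local run:

  import java.nio.charset.Charset

  println(s"driver default charset: ${Charset.defaultCharset()}")

  // Run a few tiny tasks and collect the charset each executor JVM reports.
  val executorCharsets = spark.sparkContext
    .parallelize(1 to 4, 4)
    .map(_ => Charset.defaultCharset().toString)
    .collect()
    .distinct
  println(s"executor default charset(s): ${executorCharsets.mkString(", ")}")

If they differ, submitting with --conf spark.executor.extraJavaOptions=-Dfile.encoding=UTF-8 is a common remedy, though whether it applies here is an assumption.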

Recommended Answer

Some things are in the code but not (yet) in the docs. Did you try setting your line separator explicitly, thus avoiding the "multiline" workaround caused by ^M?
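As a concrete illustration of that suggestion, here is a minimal sketch, assuming a Spark version whose CSV reader accepts a single-character lineSep (the branch-3.0 code quoted below), and reusing the schema and s3Path from the question. With records split on \n only, an embedded \u000D no longer forces the multiLine workaround:

  // Sketch only: requires CSV "lineSep" support (present in branch-3.0 sources).
  val df = spark.read
    .option("delimiter", "\u0001")
    .option("header", "false")
    .option("encoding", "UTF-8")
    .option("lineSep", "\n")   // treat only \n as the record separator
    .schema(schema)
    .csv(s3Path)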

From the unit tests for Spark "TextSuite", branch 2.4:
https://github.com/apache/spark/blob/branch-2.4/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/text/TextSuite.scala

def testLineSeparator(lineSep: String): Unit = {
  test(s"SPARK-23577: Support line separator - lineSep: '$lineSep'") {
    ...
  }
}
// scalastyle:off nonascii
Seq("|", "^", "::", "!!!@3", 0x1E.toChar.toString, "아").foreach { lineSep =>
  testLineSeparator(lineSep)
}
// scalastyle:on nonascii
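The text datasource did gain lineSep in 2.4 (that is what the test above exercises), so one hedged variant of the asker's second method is to split records on \n only and strip the embedded ^M before handing the lines to the CSV parser:

  import spark.implicits._   // Encoder for Dataset[String].map

  // Sketch: split on \n only, so \u000D survives inside each record
  // and replaceAll can actually reach it.
  val lines = spark.read
    .option("lineSep", "\n")
    .textFile(s3Path)
    .map(_.replaceAll("\u000D", " "))

  val df = spark.read
    .option("delimiter", "\u0001")
    .option("header", "false")
    .schema(schema)
    .csv(lines)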

From the source code for CSV options parsing, branch 3.0:
https://github.com/apache/spark/blob/branch-3.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVOptions.scala

val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>
  require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
  require(sep.length == 1, "'lineSep' can contain only 1 character.")
  sep
}
val lineSeparatorInRead: Option[Array[Byte]] = lineSeparator.map { lineSep =>
  lineSep.getBytes(charset)
}

So it looks like CSV does not support strings as line delimiters, only single characters, because it relies on some Hadoop library. I hope that's fine in your case.
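To make the restriction in the quoted require() concrete, a hypothetical illustration:

  // Hypothetical: per the require() above, a two-character separator
  // such as "\r\n" fails option parsing with
  // java.lang.IllegalArgumentException:
  //   requirement failed: 'lineSep' can contain only 1 character.
  spark.read.option("lineSep", "\r\n").csv(s3Path)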


The matching JIRAs are:

SPARK-21289 Text based formats do not support custom end-of-line delimiters ...
SPARK-23577 specific to the text datasource > fixed in V2.4.0
