Spark: reading wholeTextFiles with a non-UTF-8 encoding

Question

I want to read whole text files in a non-UTF-8 encoding via

val df = spark.sparkContext.wholeTextFiles(path, 12).toDF

into Spark. How can I change the encoding? I want to read ISO-8859 encoded text, but it is not CSV; it is something similar to XML: SGML.

Maybe a custom Hadoop file input format should be used?

  • https://dzone.com/articles/implementing-hadoops-input-format-and-output-forma
  • http://henning.kropponline.de/2016/10/23/custom-matlab-inputformat-for-apache-spark/

Recommended answer

It's simple. Here is the source code:

import java.nio.charset.Charset

import org.apache.hadoop.io.{Text, LongWritable}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object TextFile {
  val DEFAULT_CHARSET = Charset.forName("UTF-8")

  def withCharset(context: SparkContext, location: String, charset: String): RDD[String] = {
    if (Charset.forName(charset) == DEFAULT_CHARSET) {
      context.textFile(location)
    } else {
      // Can't pass a Charset object here because it's not serializable,
      // so the charset name is passed as a String instead.
      // TODO: maybe use mapPartitions instead?
      context.hadoopFile[LongWritable, Text, TextInputFormat](location).map(
        pair => new String(pair._2.getBytes, 0, pair._2.getLength, charset)
      )
    }
  }
}

It was copied from here.

To use it:
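The usage example the answer pointed to is not reproduced here, so the following is only a minimal sketch of how the TextFile object above might be called. The local master, the application name, the input path, and the choice of "ISO-8859-1" (the question only says ISO-8859) are all assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object TextFileUsage {
  // Assumption: the question's "ISO-8859" is taken to mean ISO-8859-1 (Latin-1).
  val charsetName = "ISO-8859-1"

  def main(args: Array[String]): Unit = {
    // Hypothetical local session, for illustration only.
    val spark = SparkSession.builder()
      .appName("text-file-with-charset")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical input path. Relies on the TextFile object defined above.
    val lines = TextFile.withCharset(spark.sparkContext, "/path/to/input", charsetName)

    // Each RDD element is one decoded line of text.
    lines.take(5).foreach(println)

    spark.stop()
  }
}
```

Note that this yields one record per line, like textFile, rather than one record per file as wholeTextFiles does.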

If you need whole text files, here is the actual source of the implementation:

def wholeTextFiles(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[(String, String)] = withScope {
    assertNotStopped()
    val job = NewHadoopJob.getInstance(hadoopConfiguration)
    // Use setInputPaths so that wholeTextFiles aligns with hadoopFile/textFile in taking
    // comma separated files as input. (see SPARK-7155)
    NewFileInputFormat.setInputPaths(job, path)
    val updateConf = job.getConfiguration
    new WholeTextFileRDD(
      this,
      classOf[WholeTextFileInputFormat],
      classOf[Text],
      classOf[Text],
      updateConf,
      minPartitions).map(record => (record._1.toString, record._2.toString)).setName(path)
  }

Try changing:

.map(record => (record._1.toString, record._2.toString))

to (probably):

.map(record => (record._1.toString, new String(record._2.getBytes, 0, record._2.getLength, "myCustomCharset")))
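The wholeTextFiles method shown above is defined inside Spark's own SparkContext, so changing that line means maintaining a modified copy of it. As a different technique from the one the answer proposes, the sketch below reads each file as raw bytes with SparkContext.binaryFiles and decodes them with an explicit charset. The object name WholeTextFilesWithCharset and its methods decode and read are hypothetical names introduced for illustration.

```scala
import java.nio.charset.Charset

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object WholeTextFilesWithCharset {
  // Decode raw bytes using the named charset.
  // The name is passed as a String because Charset is not serializable.
  def decode(bytes: Array[Byte], charsetName: String): String =
    new String(bytes, Charset.forName(charsetName))

  // Returns (file path, file content) pairs, like wholeTextFiles,
  // but with the content decoded using charsetName instead of UTF-8.
  def read(
      sc: SparkContext,
      path: String,
      minPartitions: Int,
      charsetName: String): RDD[(String, String)] =
    sc.binaryFiles(path, minPartitions)
      .mapValues(stream => decode(stream.toArray(), charsetName))
}
```

With `import spark.implicits._` in scope, something like `WholeTextFilesWithCharset.read(spark.sparkContext, path, 12, "ISO-8859-1").toDF` should give a DataFrame comparable to the one in the question. As with wholeTextFiles, each file is loaded into memory in full, so this suits many small files rather than a few very large ones.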
