Setting textinputformat.record.delimiter in Spark


Question

In Spark, it is possible to set some Hadoop configuration settings, e.g.

System.setProperty("spark.hadoop.dfs.replication", "1")

This works; the replication factor is set to 1. Assuming that this is the case, I thought that this pattern (prepending "spark.hadoop." to a regular Hadoop configuration property) would also work for textinputformat.record.delimiter:

System.setProperty("spark.hadoop.textinputformat.record.delimiter", "\n\n")

However, Spark seems to just ignore this setting. Am I setting textinputformat.record.delimiter in the correct way? Is there a simpler way of setting it? I would like to avoid writing my own InputFormat, since I really only need to obtain records delimited by two newlines.
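For comparison, a minimal sketch of the same setting carried through a SparkConf, which is where later Spark releases pick up the "spark.hadoop." prefix and copy the suffixed keys into the Hadoop Configuration handed to input formats (the application name here is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Assumed sketch: keys prefixed with "spark.hadoop." on the SparkConf are
// propagated into the Hadoop Configuration that Spark passes to input formats.
val sparkConf = new SparkConf()
  .setAppName("delimiter-example") // illustrative name
  .set("spark.hadoop.textinputformat.record.delimiter", "\n\n")
val sc = new SparkContext(sparkConf)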

Answer

I got this working with plain uncompressed files with the function below.

import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.Text
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Reads a file as an RDD[String], splitting records on the delimiter set in
// this job's Hadoop Configuration instead of relying on sc.textFile.
def nlFile(path: String) = {
    val conf = new Configuration
    conf.set("textinputformat.record.delimiter", "\n")
    sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString) // keep only the record text, dropping the byte-offset key
}
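Since the question asks for records separated by two newlines, the same function works with "\n\n" as the delimiter; a minimal sketch (the helper name and sample path are made up for illustration):

// Same approach as nlFile, but records are separated by a blank line.
def paragraphFile(path: String) = {
    val conf = new Configuration
    conf.set("textinputformat.record.delimiter", "\n\n")
    sc.newAPIHadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString)
}

// Illustrative usage: each RDD element is one two-newline-delimited record.
paragraphFile("/data/records.txt").take(5).foreach(println)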
