Spark: Reading files using a different delimiter than new line

Views: 288 | Published: 2016/5/22 15:19:20 | Tag: apache-spark

Question

I'm using Apache Spark 1.0.1. I have many files whose records are delimited with the UTF-8 character \u0001 rather than the usual newline \n. How can I read such files in Spark? That is, the default record delimiter of sc.textFile("hdfs:///myproject/*") is \n, and I want to change it to \u0001.

Recommended answer

In the Spark shell, I extracted the data by following "Setting textinputformat.record.delimiter in spark" (http://stackoverflow.com/questions/17692857/setting-textinputformat-record-delimiter-in-spark):

$ spark-shell
...
scala> import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.LongWritable

scala> import org.apache.hadoop.io.Text
import org.apache.hadoop.io.Text

scala> import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.conf.Configuration

scala> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

scala> val conf = new Configuration
conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml

scala> conf.set("textinputformat.record.delimiter", "\u0001")

scala> val data = sc.newAPIHadoopFile("mydata.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf).map(_._2.toString)
data: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at map at <console>:19

sc.newAPIHadoopFile("mydata.txt", ...) returns an RDD[(LongWritable, Text)]. The first element of each pair is the byte offset at which the record starts in the file, and the second is the record's text, with records split on "\u0001" instead of "\n". The trailing .map(_._2.toString) keeps only the text, so data is an RDD[String].
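Outside the shell, the same steps can be wrapped in a small helper. The following is only a sketch: DelimitedFiles and its read method are hypothetical names, not part of Spark, and it assumes Spark and the Hadoop client libraries are on the classpath. It copies sc.hadoopConfiguration instead of creating an empty Configuration, on the assumption that you want the filesystem settings already held by the context to carry over.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical helper object (illustration only, not a Spark API).
object DelimitedFiles {

  // Reads `path` and returns one String per record, splitting on `delimiter`.
  def read(sc: SparkContext, path: String, delimiter: String): RDD[String] = {
    // Copy the context's Hadoop settings so the override below stays local to this read.
    val conf = new Configuration(sc.hadoopConfiguration)
    conf.set("textinputformat.record.delimiter", delimiter)

    sc.newAPIHadoopFile(path, classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text], conf)
      // Hadoop reuses the Text instance between records, so convert it
      // to an immutable String before doing anything else with it.
      .map { case (_, text) => text.toString }
  }
}
```

With this in place, the read from the question would look like DelimitedFiles.read(sc, "hdfs:///myproject/*", "\u0001"). If the byte offsets are needed as well, map each pair to (offset.get, text.toString) rather than dropping the key.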
