Spark: Reading files using a delimiter other than newline
Question
I'm using Apache Spark 1.0.1. I have many files delimited with UTF-8 \u0001 rather than the usual newline \n. How can I read such files in Spark? That is, the default record delimiter of sc.textFile("hdfs:///myproject/*") is \n, and I want to change it to \u0001.
Answer
In the Spark shell, I extracted the data following "Setting textinputformat.record.delimiter in spark":
$ spark-shell
...
scala> import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.io.LongWritable
scala> import org.apache.hadoop.io.Text
import org.apache.hadoop.io.Text
scala> import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.conf.Configuration
scala> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
scala> val conf = new Configuration
conf: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml
scala> conf.set("textinputformat.record.delimiter", "\u0001")
scala> val data = sc.newAPIHadoopFile("mydata.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
data: org.apache.spark.rdd.RDD[(org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text)] = NewHadoopRDD[0] at newAPIHadoopFile at <console>:19
sc.newAPIHadoopFile("mydata.txt", ...) returns an RDD[(LongWritable, Text)], where the first element of each pair is the byte offset at which the record starts and the second is the actual record text, delimited by "\u0001". The text alone can be extracted with data.map(_._2.toString).