Spark - Obtaining file name in RDDs
Question
I am trying to process 4 directories of text files that keep growing every day. What I need to do is: if somebody searches for an invoice number, I should give them the list of files which contain it.
I was able to map and reduce the values in the text files by loading them as an RDD. But how can I obtain the file name and other file attributes?
Answer
If your text files are too large for SparkContext.wholeTextFiles, you can use a (simple) custom InputFormat and then call SparkContext.hadoopRDD.
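Conversely, when the files are small enough, SparkContext.wholeTextFiles alone is sufficient, since it already yields (fileName, fileContent) pairs. A minimal sketch, assuming `sc` is an active SparkContext; the directory paths and the `invoiceNumber` value are placeholders:

```scala
// Hypothetical sketch: find which files contain a given invoice number.
// The paths and invoiceNumber below are placeholder assumptions.
val invoiceNumber = "INV-12345"

val matchingFiles = sc
  .wholeTextFiles("dir1,dir2,dir3,dir4")   // each record is (fileName, entireFileContent)
  .filter { case (_, content) => content.contains(invoiceNumber) }
  .map { case (fileName, _) => fileName }
  .collect()
```

Note that wholeTextFiles reads each file into memory as a single record, which is exactly why the custom-InputFormat route is needed once the files grow large.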
The InputFormat would need to return a tuple of (filename, line) rather than just the line. You could then filter using a predicate that looks at the content of each line, deduplicate, and collect the file names.
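One way to build such an InputFormat is to wrap Hadoop's LineRecordReader and substitute the split's file name for the key. This is a sketch, not the answerer's actual code: the class bodies below are assumptions, matching only the FileNamerInputFormat name used in the answer.

```scala
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit, LineRecordReader}

// Hypothetical implementation: emits (fileName, line) instead of (byteOffset, line).
class FileNamerInputFormat extends FileInputFormat[String, String] {
  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
      : RecordReader[String, String] = new FileNameRecordReader
}

class FileNameRecordReader extends RecordReader[String, String] {
  private val lineReader = new LineRecordReader  // delegates the actual line splitting
  private var fileName: String = _

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    // Remember which file this split belongs to; use it as the key for every line.
    fileName = split.asInstanceOf[FileSplit].getPath.getName
    lineReader.initialize(split, context)
  }

  override def nextKeyValue(): Boolean = lineReader.nextKeyValue()
  override def getCurrentKey: String = fileName
  override def getCurrentValue: String = lineReader.getCurrentValue.toString
  override def getProgress: Float = lineReader.getProgress
  override def close(): Unit = lineReader.close()
}
```

Because the key repeats per line, the downstream distinct() is what collapses a file with many matching lines down to a single file name.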
From Spark, the code would look something like this:
import org.apache.hadoop.conf.Configuration

val ft = classOf[FileNamerInputFormat]  // custom InputFormat returning (fileName, line)
val kt = classOf[String]
val vt = classOf[String]

val hadoopConfig = new Configuration(sc.hadoopConfiguration)
sc.newAPIHadoopFile(path, ft, kt, vt, hadoopConfig)
  .filter { case (_, line) => isInteresting(line) }  // e.g. the line contains the invoice number
  .map { case (fileName, _) => fileName }
  .distinct()
  .collect()