Spark - Obtaining file name in RDDs


Question

I am trying to process 4 directories of text files that keep growing every day. What I need to do is: if somebody searches for an invoice number, I should give them the list of files that contain it.

I was able to map and reduce the values in the text files by loading them as an RDD. But how can I obtain the file name and other file attributes?

Answer

If your text files are too large for SparkContext.wholeTextFiles, you can use a (simple) custom InputFormat and then call SparkContext.hadoopRDD (the code below uses its new-API counterpart, SparkContext.newAPIHadoopFile).
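
For reference, when the files are small enough, SparkContext.wholeTextFiles already yields (fileName, fileContent) pairs directly; a minimal sketch of that route (the glob path and the invoice-number check are placeholders, not from the original answer):

val matches = sc.wholeTextFiles("/data/invoices/*")                 // RDD[(fileName, fileContent)]
  .filter { case (_, content) => content.contains("INV-12345") }    // assumed search term
  .keys
  .collect()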

The InputFormat would need to return a tuple (filename, line) rather than just the line. You could then filter with a predicate that looks at the content of each line, make the results unique, and collect the file names.
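
The answer does not show FileNamerInputFormat itself; here is a minimal sketch of what it might look like (the class name comes from the code below, but the body is an assumption). It delegates line reading to Hadoop's LineRecordReader and swaps the key, normally a byte offset, for the file's path:

import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit, LineRecordReader}

class FileNamerInputFormat extends FileInputFormat[String, String] {
  override def createRecordReader(split: InputSplit,
                                  context: TaskAttemptContext): RecordReader[String, String] =
    new RecordReader[String, String] {
      private val lines = new LineRecordReader           // does the actual line parsing
      private var fileName: String = _

      override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
        fileName = split.asInstanceOf[FileSplit].getPath.toString  // full path of this split's file
        lines.initialize(split, context)
      }

      override def nextKeyValue(): Boolean = lines.nextKeyValue()
      override def getCurrentKey: String = fileName                // file name instead of byte offset
      override def getCurrentValue: String = lines.getCurrentValue.toString
      override def getProgress: Float = lines.getProgress
      override def close(): Unit = lines.close()
    }
}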

From Spark, the code would look something like this:

import org.apache.hadoop.conf.Configuration

val ft = classOf[FileNamerInputFormat]   // the custom InputFormat sketched above
val kt = classOf[String]                 // key type: file name
val vt = classOf[String]                 // value type: one line of text

val hadoopConfig = new Configuration(sc.hadoopConfiguration)
sc.newAPIHadoopFile(path, ft, kt, vt, hadoopConfig)
  .filter { case (f, l) => isInteresting(l) }   // keep lines that match the search
  .map { case (f, _) => f }                     // keep only the file name
  .distinct()                                   // list each file once
  .collect()
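
The snippet leaves path and isInteresting undefined; plausible stand-ins (the glob and the invoice number are assumptions, adjust to your directories and search term):

val path = "/data/invoices/*"            // glob covering the growing directories of text files
def isInteresting(line: String): Boolean =
  line.contains("INV-12345")             // the invoice number being searched for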
