Spark - Obtaining file name in RDDs
Question
I am trying to process 4 directories of text files that keep growing every day. What I need to do is: if somebody searches for an invoice number, I should give them the list of files which contain it.
I was able to map and reduce the values in the text files by loading them as an RDD. But how can I obtain the file name and other file attributes?
Answer
If your text files are too large for SparkContext.wholeTextFiles, you can use a (simple) custom InputFormat and then call SparkContext.hadoopRDD.
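Conversely, when the files are small enough, SparkContext.wholeTextFiles alone is sufficient, since it already yields (fileName, fileContent) pairs. A minimal sketch, assuming `sc` is an active SparkContext; the directory paths and the `invoiceNumber` value are placeholders:

```scala
// Hypothetical sketch: find which files contain a given invoice number.
// The paths and invoiceNumber below are placeholder assumptions.
val invoiceNumber = "INV-12345"

val matchingFiles = sc
  .wholeTextFiles("dir1,dir2,dir3,dir4")   // each record is (fileName, entireFileContent)
  .filter { case (_, content) => content.contains(invoiceNumber) }
  .map { case (fileName, _) => fileName }
  .collect()
```

Note that wholeTextFiles reads each file into memory as a single record, which is exactly why the custom-InputFormat route is needed once the files grow large.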
The InputFormat would need to return a tuple of (filename, line) rather than just the line. You could then filter using a predicate that looks at the content of each line, deduplicate, and collect the file names.
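One way to build such an InputFormat is to wrap Hadoop's LineRecordReader and substitute the split's file name for the key. This is a sketch, not the answerer's actual code: the class bodies below are assumptions, matching only the FileNamerInputFormat name used in the answer.

```scala
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit, LineRecordReader}

// Hypothetical implementation: emits (fileName, line) instead of (byteOffset, line).
class FileNamerInputFormat extends FileInputFormat[String, String] {
  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
      : RecordReader[String, String] = new FileNameRecordReader
}

class FileNameRecordReader extends RecordReader[String, String] {
  private val lineReader = new LineRecordReader  // delegates the actual line splitting
  private var fileName: String = _

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    // Remember which file this split belongs to; use it as the key for every line.
    fileName = split.asInstanceOf[FileSplit].getPath.getName
    lineReader.initialize(split, context)
  }

  override def nextKeyValue(): Boolean = lineReader.nextKeyValue()
  override def getCurrentKey: String = fileName
  override def getCurrentValue: String = lineReader.getCurrentValue.toString
  override def getProgress: Float = lineReader.getProgress
  override def close(): Unit = lineReader.close()
}
```

Because the key repeats per line, the downstream distinct() is what collapses a file with many matching lines down to a single file name.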
From Spark, the code would look something like this:
import org.apache.hadoop.conf.Configuration

val ft = classOf[FileNamerInputFormat]  // custom InputFormat returning (fileName, line)
val kt = classOf[String]
val vt = classOf[String]

val hadoopConfig = new Configuration(sc.hadoopConfiguration)
sc.newAPIHadoopFile(path, ft, kt, vt, hadoopConfig)
  .filter { case (_, line) => isInteresting(line) }  // e.g. the line contains the invoice number
  .map { case (fileName, _) => fileName }
  .distinct()
  .collect()