How to read multiple text files into a single RDD?
Question
I want to read a bunch of text files from an HDFS location and perform a mapping over them in an iteration using Spark.
The following call is capable of reading only one file at a time:
JavaRDD<String> records = ctx.textFile(args[1], 1);
I want to read more than one file and process them as a single RDD. How can I do that?
Recommended answer
You can specify whole directories, use wildcards, and even pass a comma-separated list of directories and wildcards. For example:
sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
As Nick Chammas points out, this is Hadoop's FileInputFormat (http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html) being exposed, so the same path syntax also works with Hadoop (and Scalding).
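When the paths are held in a collection rather than typed out by hand, the comma-separated argument can be built before calling textFile. The sketch below uses only the Java standard library; the class name MultiPathExample and the helper joinPaths are hypothetical names introduced for illustration, and the Spark call appears only as a comment because it assumes a JavaSparkContext named ctx, as in the question.

```java
import java.util.Arrays;
import java.util.List;

final class MultiPathExample {

    // Hypothetical helper (not part of Spark): joins directories, files and
    // glob patterns into the single comma-separated string that textFile accepts.
    static String joinPaths(List<String> paths) {
        return String.join(",", paths);
    }

    public static void main(String[] args) {
        // The same four paths used in the answer above.
        List<String> paths = Arrays.asList(
                "/my/dir1",
                "/my/paths/part-00[0-5]*",
                "/another/dir",
                "/a/specific/file");

        String combined = joinPaths(paths);
        System.out.println(combined);
        // prints: /my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file

        // Assuming a JavaSparkContext named ctx, as in the question,
        // the string would then be passed as:
        // JavaRDD<String> records = ctx.textFile(combined);
    }
}
```

Because the comma acts as the separator, this simple join would not be suitable for a path that itself contains a comma.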