How to read multiple text files into a single RDD?

Published: 2021/11/12 5:25:36 · Tag: apache-spark

I want to read a bunch of text files from an HDFS location and apply a map operation to them iteratively using Spark.

JavaRDD<String> records = ctx.textFile(args[1], 1); can read only one file at a time.

I want to read more than one file and process them as a single RDD. How can I do that?

You can specify whole directories, use wildcards, and even pass a comma-separated list of directories and wildcards. For example:

sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")

As Nick Chammas points out, this behavior is exposed from Hadoop's FileInputFormat, so the same path syntax also works with Hadoop (and Scalding).
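Since the question uses the Java API, here is a minimal sketch of how the comma-separated argument could be assembled when the paths arrive as a list. MultiPathArg and joinPaths are hypothetical names introduced for illustration and are not part of Spark; the Spark call itself appears only in a comment, assuming a JavaSparkContext named ctx as in the question.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Spark): builds the single
// comma-separated path string that textFile accepts.
final class MultiPathArg {
    private MultiPathArg() {}

    // Joins directories, wildcards and file paths into one argument.
    static String joinPaths(List<String> paths) {
        if (paths.isEmpty()) {
            throw new IllegalArgumentException("at least one path is required");
        }
        return String.join(",", paths);
    }

    public static void main(String[] args) {
        // Same paths as in the answer's example.
        List<String> paths = Arrays.asList(
                "/my/dir1",
                "/my/paths/part-00[0-5]*",
                "/another/dir",
                "/a/specific/file");
        String joined = joinPaths(paths);
        System.out.println(joined);

        // Assuming ctx is the JavaSparkContext from the question,
        // all matching files would then be read into one RDD:
        // JavaRDD<String> records = ctx.textFile(joined);
    }
}
```

Building the string in one place keeps the call site identical to the single-file version from the question; only the path argument changes.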
