Spark Context Textfile: load multiple files
Question
I need to process multiple files scattered across various directories. I would like to load all these up in a single RDD and then perform map/reduce on it. I see that SparkContext is able to load multiple files from a single directory using wildcards. I am not sure how to load up files from multiple folders.
The following code snippet fails:
retval = None
for fileEntry in files:
    fileName = basePath + "/" + fileEntry
    lines = sc.textFile(fileName)
    if retval == None:
        retval = lines
    else:
        retval = sc.union(retval, lines)
This fails on the third loop with the following error message:
retval = sc.union(retval, lines)
TypeError: union() takes exactly 2 arguments (3 given)
Which is bizarre given I am providing only 2 arguments. Any pointers appreciated.
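The "3 given" comes from the implicit `self` argument of a bound method: PySpark's `SparkContext.union` is defined to take a single list of RDDs, so `sc.union(retval, lines)` passes three arguments (`self`, `retval`, `lines`) to a method that accepts two. A minimal sketch of the mechanics, using a hypothetical stub class standing in for `SparkContext`:

```python
# Hypothetical stub mimicking the shape of PySpark's SparkContext.union,
# which expects ONE argument: a list of RDDs.
class FakeContext:
    def union(self, rdds):  # one positional parameter besides self
        return [x for rdd in rdds for x in rdd]

sc = FakeContext()
a, b = [1, 2], [3, 4]

try:
    sc.union(a, b)          # wrong: two separate arguments -> self + 2 = 3 given
except TypeError as e:
    print("TypeError:", e)

print(sc.union([a, b]))     # right: a single list of "RDDs" -> [1, 2, 3, 4]
```

With the real API, the equivalent fixes are `sc.union([retval, lines])` or the RDD method `retval.union(lines)`.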
Answer
How about phrasing it like this instead?
sc.union([sc.textFile(basepath + "/" + f) for f in files])
In Scala, SparkContext.union() has two variants: one that takes vararg arguments, and one that takes a list. Only the second one exists in Python (since Python does not have method overloading).
Update
You can use a single textFile call to read multiple files:
sc.textFile(','.join(files))
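This works because `textFile` accepts a comma-separated string of paths (glob patterns included), so all the directories can be read into one RDD in a single call. A small sketch of building that path string (the file names below are made up for illustration):

```python
# Hypothetical paths spread over several directories.
files = ["/data/dir1/part-0", "/data/dir2/part-1", "/data/dir3/*.txt"]

# Join them into the comma-separated form that textFile understands.
paths = ",".join(files)
print(paths)  # /data/dir1/part-0,/data/dir2/part-1,/data/dir3/*.txt

# With a live SparkContext this would be:
# lines = sc.textFile(paths)   # one RDD over all the files
```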