Spark Context Textfile: load multiple files
Question
I need to process multiple files scattered across various directories. I would like to load all these up in a single RDD and then perform map/reduce on it. I see that SparkContext is able to load multiple files from a single directory using wildcards. I am not sure how to load up files from multiple folders.
The following code snippet fails:
retval = None
for fileEntry in files:
    fileName = basePath + "/" + fileEntry
    lines = sc.textFile(fileName)
    if retval == None:
        retval = lines
    else:
        retval = sc.union(retval, lines)
This fails on the third loop with the following error message:
retval = sc.union(retval, lines)
TypeError: union() takes exactly 2 arguments (3 given)
Which is bizarre given I am providing only 2 arguments. Any pointers appreciated.
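The "3 given" comes from the implicit `self` argument of a bound method: PySpark's `SparkContext.union` is defined to take a single list of RDDs, so `sc.union(retval, lines)` passes three arguments (`self`, `retval`, `lines`) to a method that accepts two. A minimal sketch of the mechanics, using a hypothetical stub class standing in for `SparkContext`:

```python
# Hypothetical stub mimicking the shape of PySpark's SparkContext.union,
# which expects ONE argument: a list of RDDs.
class FakeContext:
    def union(self, rdds):  # one positional parameter besides self
        return [x for rdd in rdds for x in rdd]

sc = FakeContext()
a, b = [1, 2], [3, 4]

try:
    sc.union(a, b)          # wrong: two separate arguments -> self + 2 = 3 given
except TypeError as e:
    print("TypeError:", e)

print(sc.union([a, b]))     # right: a single list of "RDDs" -> [1, 2, 3, 4]
```

With the real API, the equivalent fixes are `sc.union([retval, lines])` or the RDD method `retval.union(lines)`.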
Answer
How about phrasing it like this instead?
sc.union([sc.textFile(basepath + "/" + f) for f in files])
In Scala, SparkContext.union() has two variants: one that takes vararg arguments, and one that takes a list. Only the second one exists in Python (since Python does not have method overloading).
Update
You can use a single textFile call to read multiple files:
sc.textFile(','.join(files))
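This works because `textFile` accepts a comma-separated string of paths (glob patterns included), so all the directories can be read into one RDD in a single call. A small sketch of building that path string (the file names below are made up for illustration):

```python
# Hypothetical paths spread over several directories.
files = ["/data/dir1/part-0", "/data/dir2/part-1", "/data/dir3/*.txt"]

# Join them into the comma-separated form that textFile understands.
paths = ",".join(files)
print(paths)  # /data/dir1/part-0,/data/dir2/part-1,/data/dir3/*.txt

# With a live SparkContext this would be:
# lines = sc.textFile(paths)   # one RDD over all the files
```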