pyspark Loading multiple partitioned files in a single load
Question
I am trying to load multiple files in a single load. They are all partitioned files. It works when I try it with 1 file, but when I list 24 files it gives me the error below, and I could not find any documentation of this limitation, or a workaround other than doing a union after the load. Are there any alternatives?
Code below to reproduce the problem:
basePath = '/file/'
paths = ['/file/df201601.orc', '/file/df201602.orc', '/file/df201603.orc',
'/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
'/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
'/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
'/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
'/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
'/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
'/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc', ]
df = sqlContext.read.format('orc') \
    .options(header='true', inferschema='true', basePath=basePath) \
    .load(*paths)
Error received:
TypeError Traceback (most recent call last)
<ipython-input-43-7fb8fade5e19> in <module>()
---> 37 df = sqlContext.read.format('orc') .options(header='true', inferschema='true',basePath=basePath) .load(*paths)
38
TypeError: load() takes at most 4 arguments (24 given)
Answer
As explained in the official documentation, to read multiple files you should pass a list:
path – optional string or a list of string for file-system backed data sources.
So in your case:
(sqlContext.read
.format('orc')
.options(basePath=basePath)
.load(path=paths))
Argument unpacking (`*`) would make sense only if `load` were defined with variadic arguments, for example:
def load(this, *paths):
...
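The failure is plain Python, not Spark-specific. A minimal sketch (the `Reader` class below is a stand-in for `DataFrameReader`, not the real API) shows why passing the list as one argument works while unpacking it does not:

```python
# Stand-in for pyspark's DataFrameReader.load, whose signature accepts a
# single `path` (string or list of strings) plus a few optional arguments.
# It is NOT variadic, so unpacking a 24-element list overflows its parameters.
class Reader:
    def load(self, path=None, format=None, schema=None, **options):
        return path

paths = ['/file/df2016%02d.orc' % m for m in range(1, 25)]  # 24 paths
reader = Reader()

# Passing the whole list as one argument works:
assert reader.load(path=paths) == paths

# Unpacking it into 24 positional arguments raises TypeError,
# matching "load() takes at most 4 arguments (24 given)":
try:
    reader.load(*paths)
except TypeError as e:
    print(type(e).__name__)  # TypeError
```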