pyspark Loading multiple partitioned files in a single load


Question


I am trying to load multiple files in a single load; they are all partitioned files. It works when I try it with one file, but when I listed 24 files it gave me the error below. I could not find any documentation of this limitation, or a workaround other than doing a union after the load. Are there any alternatives?

Code below to reproduce the problem:

basePath = '/file/'
paths = ['/file/df201601.orc', '/file/df201602.orc', '/file/df201603.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc',
         '/file/df201604.orc', '/file/df201605.orc', '/file/df201606.orc', ]

df = sqlContext.read.format('orc') \
               .options(header='true', inferschema='true', basePath=basePath) \
               .load(*paths)

The error received:

 TypeError                                 Traceback (most recent call last)
 <ipython-input-43-7fb8fade5e19> in <module>()

---> 37 df = sqlContext.read.format('orc')                .options(header='true', inferschema='true',basePath=basePath)                .load(*paths)
     38 

TypeError: load() takes at most 4 arguments (24 given)

Answer

As explained in the official documentation, to read multiple files, you should pass a list:


path – optional string or a list of string for file-system backed data sources.

So in your case:

(sqlContext.read
    .format('orc') 
    .options(basePath=basePath)
    .load(path=paths))


Argument unpacking (*) would make sense only if load were defined with variadic arguments, for example:

def load(this, *paths):
    ...

