Reading parquet files from multiple directories in PySpark

Views: 76 | Posted: 2021/6/14 19:23:10 | Tags: pyspark, parquet


Question

I need to read parquet files from multiple paths that are not parent or child directories of one another.

For example:

dir1 ---
       |
       ------- dir1_1
       |
       ------- dir1_2
dir2 ---
       |
       ------- dir2_1
       |
       ------- dir2_2

sqlContext.read.parquet(dir1) reads the parquet files from both dir1_1 and dir1_2.

Right now I read each directory separately and merge the DataFrames using unionAll. Is there a way to read the parquet files from dir1_2 and dir2_1 without unionAll, or is there a neater way of using unionAll?

Thanks.

Recommended answer

A little late, but I found this while I was searching and it may help someone else...

You might also try unpacking a list of paths into the arguments of spark.read.parquet(), which accepts any number of paths:

paths = ['foo', 'bar']
# Unpack the list so that each path is passed as a separate argument.
df = spark.read.parquet(*paths)
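
Applied to the directories in the question, the same idea could be wrapped in a small helper. This is only a sketch: `read_parquet_dirs` is a hypothetical name (it is not part of PySpark), `spark` is assumed to be an existing SparkSession, and the relative paths `dir1/dir1_2` and `dir2/dir2_1` are assumed from the tree shown in the question.

```python
from typing import Iterable


def read_parquet_dirs(spark, dirs: Iterable[str]):
    """Read several unrelated parquet directories into one DataFrame.

    `spark` is assumed to be an existing SparkSession. The helper name
    is hypothetical and used here for illustration only.
    """
    paths = list(dirs)
    if not paths:
        # Guard added for this sketch: fail early on an empty list.
        raise ValueError("at least one path is required")
    # DataFrameReader.parquet takes a variable number of paths, so the
    # list is unpacked into separate positional arguments.
    return spark.read.parquet(*paths)
```

With a live session this would be called as `df = read_parquet_dirs(spark, ["dir1/dir1_2", "dir2/dir2_1"])`, which reads both directories in a single call and needs no unionAll.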

This is convenient if you want to pass a few glob patterns as the paths:

basePath = 's3://bucket/'
paths = ['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
         's3://bucket/partition_value1=*/partition_value2=2017-05-*']
# basePath tells Spark where partition discovery starts.
df = spark.read.option("basePath", basePath).parquet(*paths)

This is handy because you don't need to list every file under basePath, and you still get partition inference: with basePath set, partition_value1 and partition_value2 are still available as columns of the resulting DataFrame.
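
When there are more than a couple of months, the glob list can be generated rather than typed out. The following is a minimal sketch under the same layout as the example above. `monthly_globs` and `read_months` are hypothetical helper names, `spark` is assumed to be an existing SparkSession, and the `partition_value1=.../partition_value2=YYYY-MM-DD` directory naming is taken from the example paths rather than from any real bucket.

```python
from typing import List, Sequence


def monthly_globs(base_path: str, months: Sequence[str]) -> List[str]:
    """Build one glob pattern per month string such as '2017-04'.

    The partition names mirror the example layout and are assumptions.
    """
    root = base_path.rstrip("/")
    return [
        f"{root}/partition_value1=*/partition_value2={month}-*"
        for month in months
    ]


def read_months(spark, base_path: str, months: Sequence[str]):
    """Read only the selected months while keeping partition columns."""
    paths = monthly_globs(base_path, months)
    # The basePath option tells Spark where partition discovery starts,
    # so the partition directories still show up as columns.
    return spark.read.option("basePath", base_path).parquet(*paths)
```

For instance, `read_months(spark, "s3://bucket/", ["2017-04", "2017-05"])` would pass the same two patterns as the hand-written list above.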
