Reading a bucketed directory with .mani/ manifest files

Views: 18 | Published: 2021/11/14 23:21:46 | Tags: apache-spark, pyspark, apache-spark-sql


Question

I have a directory laid out as shown below. I need to read it with spark.read.parquet('car_data') so that year is kept as a column, without reading the .mani (manifest) files. I can read the data with the wildcard 'car_data/year=*/*.parquet', but then year is not added as a column.

The problem is that if I load the whole directory, as you normally would with bucketed data, I get an error because Spark tries to read the .mani files as Parquet, and when loading the directory that way I can't use wildcards to skip them. Is there another way of doing this?

I have now also tried spark.read.load('/car_data/', format='parquet', pathGlobFilter='*.parquet') and I still get the same error. From what I can find, pathGlobFilter is only available in Spark 3.0 and I'm on 2.4, but there must be another way.

Thanks in advance, peeps!

car_data
 |---year=2018
    |---xxx.snappy.parquet
    |---xxx.snappy.parquet
    |---xxx.snappy.parquet.mani
 |---year=2019
    |---xxx.snappy.parquet
    |---xxx.snappy.parquet
    |---xxx.snappy.parquet.mani
 |---year=2020
    |---xxx.snappy.parquet
    |---xxx.snappy.parquet.mani

Recommended answer

You can build a list of the files you need and pass only those files to spark.read.parquet():

spark.read.parquet("/path/to/dir/part_1.gz", "/path/to/dir/part_2.gz")
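
One way to build that list is sketched below. It assumes the data sits on a filesystem that Python's os.walk can see (local disk or a mounted path; HDFS or S3 would need a different listing call). The helper name list_parquet_files is hypothetical and not part of Spark. It keeps only files whose names end in .parquet, so the .parquet.mani manifests are skipped.

```python
import os


def list_parquet_files(base_dir):
    """Return sorted paths of all files under base_dir ending in '.parquet'.

    Hypothetical helper, not part of Spark. Names such as
    'xxx.snappy.parquet.mani' end in '.mani', so they are left out.
    """
    matches = []
    for root, _dirs, names in os.walk(base_dir):
        for name in names:
            if name.endswith(".parquet"):
                matches.append(os.path.join(root, name))
    return sorted(matches)


# Usage, assuming an existing SparkSession named `spark`:
# files = list_parquet_files("car_data")
# df = spark.read.parquet(*files)
```

Passing explicit file paths on their own will most likely not infer year as a column, which is what the basePath option described in this answer addresses.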

Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths by default. For the example directory structure below, if users pass path/to/table/gender=male to either SparkSession.read.parquet or SparkSession.read.load, gender will not be considered a partitioning column.

If users need to specify the base path that partition discovery should start from, they can set basePath in the data source options. For example, when path/to/table/gender=male is the path of the data and users set basePath to path/to/table/, gender will be a partitioning column.

path
└── to
    └── table
        ├── gender=male
        │   ├── ...
        │   │
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── gender=female
            ├── ...
            │
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...

Please refer to https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#partition-discovery for more information. The directory structure above is taken from there.
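
Applied to the car_data layout from the question, the basePath option could be combined with an explicit file list roughly as follows. This is a sketch, not a tested recipe for Spark 2.4: the function name read_keeping_partitions is hypothetical, `spark` is assumed to be an existing SparkSession, and the file names are the placeholders from the question.

```python
def read_keeping_partitions(spark, base_path, files):
    """Read the given Parquet files, with partition columns (e.g. year)
    discovered relative to base_path.

    Hypothetical helper: `spark` is assumed to be an existing SparkSession
    and `files` a list of data file paths located under `base_path`.
    """
    return spark.read.option("basePath", base_path).parquet(*files)


# Usage sketch with the layout from the question (file names are placeholders):
# df = read_keeping_partitions(
#     spark,
#     "car_data",
#     ["car_data/year=2018/xxx.snappy.parquet",
#      "car_data/year=2019/xxx.snappy.parquet"],
# )
# df.printSchema()  # the schema should now include the year column
```

The wildcard the asker already uses may work the same way, for example spark.read.option("basePath", "car_data").parquet("car_data/year=*/*.parquet"), since the glob matches only names ending in .parquet; it is worth verifying this on the Spark version in use.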
