Reading a bucketed directory with .mani (manifest) files


Question

I have the directory shown below and need to read it with spark.read.parquet('car_data') so that year is kept as a column, without reading the .mani (manifest) files. I can read the data with the wildcard 'car_data/year=*/*.parquet', but that does not keep year as a column.

The problem is that if I load the whole directory, as you would with bucketed data, I get an error because Spark tries to read the .mani files as Parquet, and loading the directory that way leaves me no wildcard to skip them. Is there another way of doing this?

I have now also tried spark.read.load('/car_data/', format='parquet', pathGlobFilter='*.parquet') and I still get the same error. From what I can find, pathGlobFilter is only available in Spark 3.0 and I'm on 2.4, but there must be another way.

Thanks in advance, peeps!

car_data
 |---year=2018
    |---xxx.snappy.parquet
    |---xxx.snappy.parquet
    |---xxx.snappy.parquet.mani
 |---year=2019
    |---xxx.snappy.parquet
    |---xxx.snappy.parquet
    |---xxx.snappy.parquet.mani
 |---year=2020
    |---xxx.snappy.parquet
    |---xxx.snappy.parquet.mani

Recommended answer

You can build a list of the files you need and pass only those files to spark.read.parquet():

spark.read.parquet("/path/to/dir/part_1.gz", "/path/to/dir/part_2.gz")
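One way to build that list is sketched below. It is only an illustration under stated assumptions: the data is assumed to sit on a local filesystem that Python can walk (on HDFS or S3 the listing would have to come from that store's own API), and the helper names list_parquet_files and read_parquet_files are hypothetical, not part of Spark.

```python
from pathlib import Path


def list_parquet_files(root):
    """Return sorted paths of files under `root` whose names end in .parquet.

    Files such as xxx.snappy.parquet.mani are skipped because their
    names do not end in ".parquet".
    Assumption: `root` is a local directory, not an HDFS or S3 URI.
    """
    return sorted(str(p) for p in Path(root).rglob("*.parquet") if p.is_file())


def read_parquet_files(spark, root):
    """Read only the listed Parquet files.

    `spark` is an existing SparkSession supplied by the caller.
    """
    files = list_parquet_files(root)
    # spark.read.parquet accepts several paths as separate arguments.
    return spark.read.parquet(*files)
```

Calling read_parquet_files(spark, '/car_data') would hand Spark only the data files. Passing individual files does not by itself bring back the year column, which is what the basePath option described in the rest of this answer is for.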

Starting from Spark 1.6.0, partition discovery by default only finds partitions under the given paths. For the example directory structure below, if users pass path/to/table/gender=male to either SparkSession.read.parquet or SparkSession.read.load, gender will not be considered a partitioning column.

If users need to specify the base path that partition discovery should start from, they can set basePath in the data source options. For example, when path/to/table/gender=male is the path of the data and users set basePath to path/to/table/, gender will be a partitioning column.

path
└── to
    └── table
        ├── gender=male
        │   ├── ...
        │   │
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── gender=female
            ├── ...
            │
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...

Please refer to https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#partition-discovery for more information. The directory structure above is taken from there.
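Applied to the car_data layout from the question, the basePath option can be combined with the wildcard that already skips the .mani files. The sketch below is an assumption-laden illustration rather than a tested fix: the function name read_partitioned and its default pattern are hypothetical, and it has not been run against the asker's data on Spark 2.4.

```python
def read_partitioned(spark, base_path, pattern="year=*/*.parquet"):
    """Read files matching `pattern` under `base_path`, keeping partition columns.

    `spark` is an existing SparkSession supplied by the caller.
    `pattern` is a glob relative to `base_path`; the default matches only
    names ending in .parquet, so *.parquet.mani files are not selected.
    """
    base = base_path.rstrip("/")
    # basePath tells partition discovery where to start, so directory names
    # such as year=2018 below it can be interpreted as a `year` column.
    return spark.read.option("basePath", base).parquet(base + "/" + pattern)
```

With this in place, read_partitioned(spark, '/car_data') would read '/car_data/year=*/*.parquet' with basePath set to '/car_data', which according to the documentation quoted above should make year a partitioning column.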
