Reading a bucketed directory with .mani manifest files
Question
I have a directory laid out as shown below. I need to read it with spark.read.parquet('car_data'), keeping year as a column, without reading the .mani (manifest) files. I can read the data using the wildcard 'car_data/year=*/*.parquet', but that does not keep year as a column.
The problem is that if I load the directory the way you normally would with bucketed data, I get an error because Spark tries to read the .mani files as Parquet. But if I use the wildcard to skip them, I lose the year column. Is there another way of doing this?
I have now also tried spark.read.load('/car_data/', format='parquet', pathGlobFilter='*.parquet') and I still get the same error. From what I can find, pathGlobFilter is only available in Spark 3.0 and I am on 2.4, but there must be another way.
Thanks in advance!
car_data
|---year=2018
|   |---xxx.snappy.parquet
|   |---xxx.snappy.parquet
|   |---xxx.snappy.parquet.mani
|---year=2019
|   |---xxx.snappy.parquet
|   |---xxx.snappy.parquet
|   |---xxx.snappy.parquet.mani
|---year=2020
    |---xxx.snappy.parquet
    |---xxx.snappy.parquet.mani
Recommended answer
You can build the list of files yourself and pass only the files you need to read to spark.read.parquet():
spark.read.parquet("/path/to/dir/part_1.gz", "/path/to/dir/part_2.gz")
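As a rough sketch of how such a file list could be built for the car_data layout in the question: the helper names list_parquet_files and read_listed_files are hypothetical, and the listing assumes the data sits on a local filesystem that Python's glob module can see (for HDFS or S3 the listing step would need a different client). On its own this only skips the .mani files. It does not bring back the year column, which is what the basePath option described in the rest of this answer is for.

```python
import glob
import os


def list_parquet_files(root):
    """Return sorted paths of files under root/year=*/ whose names end in .parquet."""
    pattern = os.path.join(root, "year=*", "*.parquet")
    # The pattern requires the file name to END in ".parquet", so
    # "xxx.snappy.parquet.mani" is never matched.
    return sorted(glob.glob(pattern))


def read_listed_files(spark, files):
    """Pass explicit file paths to spark.read.parquet.

    `spark` is assumed to be an existing SparkSession.
    """
    return spark.read.parquet(*files)
```

Usage would be along the lines of read_listed_files(spark, list_parquet_files("car_data")).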
Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths by default. For the example directory structure below, if users pass path/to/table/gender=male to either SparkSession.read.parquet or SparkSession.read.load, gender will not be considered a partitioning column.
If users need to specify the base path that partition discovery should start with, they can set basePath in the data source options. For example, when path/to/table/gender=male is the path of the data and users set basePath to path/to/table/, gender will be a partitioning column.
path
└── to
    └── table
        ├── gender=male
        │   ├── ...
        │   │
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── gender=female
            ├── ...
            │
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...
Please refer to https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#partition-discovery for more information. The directory structure above is taken from there.
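Applied to the car_data layout from the question, the basePath option might be combined with the wildcard roughly as follows. This is a minimal sketch, not a tested recipe for that exact cluster: the function name read_with_year_column is hypothetical, `spark` is assumed to be an existing SparkSession, and the year=*/*.parquet pattern is taken from the question.

```python
def read_with_year_column(spark, base_path):
    """Read only *.parquet files under base_path/year=*/ and keep `year` as a column.

    The glob skips the .mani files because their names do not end in ".parquet".
    Setting basePath tells partition discovery to start at base_path, so the
    "year=..." directory component is parsed into a partition column.
    """
    data_glob = base_path.rstrip("/") + "/year=*/*.parquet"
    return (
        spark.read
        .option("basePath", base_path)
        .parquet(data_glob)
    )
```

For example, read_with_year_column(spark, "car_data/") would be expected to return a DataFrame whose schema includes year alongside the columns stored in the Parquet files.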