Reading partitioned parquet in Spark
Problem description
I have a directory of staged data as shown below, and I want to be able to read the 2018 and 2019 data into one dataframe without reading them separately and unioning.
From my understanding, I should be able to point Spark at the car_data directory and apply a filter that Spark would push down. When I try to do this, it says the schema can't be inferred, so it has to be defined manually.
Note: I need to do this without renaming the year folders to year=2018.
- How do I specify a schema for the data below? I have tried researching this but could not find an answer.
- How can I load the data as spark.parquet('car_data').filter('year > 2019') so that the filter is pushed down and only the 2019-20 data is loaded?
- Does anyone know what the .mani files are for?
Thanks in advance!
car_data
|---2018
|   |---xxx.snappy.parquet
|   |---xxx.snappy.parquet
|   |---xxx.snappy.parquet.mani
|---2019
|   |---xxx.snappy.parquet
|   |---xxx.snappy.parquet
|   |---xxx.snappy.parquet.mani
|---2020
|   |---xxx.snappy.parquet
|   |---xxx.snappy.parquet.mani
Recommended answer
After experimenting a bit with Hive tables, I realized there is a solution for you: you can alter the location of a table partition.
So the first thing you want to do is create a table with the full schema, including all possible partitions, and insert some dummy data into each partition to trigger partition creation:
create table test (a string) partitioned by (date_part String);
insert into test partition (date_part='2020-01-01') values ('A');
-- note the current location of this partition is
-- `/apps/hive/warehouse/default.db/test/date_part=2020-01-01`
You can now alter the partition's location:
alter table test partition (date_part="2020-01-01")
set location "/apps/hive/warehouse/default.db/test/2020-01-01";
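Applied to the question's layout, the same idea can be sketched more directly with ALTER TABLE ... ADD PARTITION, which registers a partition at an existing folder without the dummy-insert step. The table and column names here, and the /path/to/ prefix, are assumptions to be replaced with the real schema and storage path:

```sql
-- Hypothetical adaptation: table name, columns, and paths are assumptions.
create table car_data (make string) partitioned by (year string);

-- Register each existing year folder as a partition; no renaming to year=2018 needed.
alter table car_data add partition (year='2018') location '/path/to/car_data/2018';
alter table car_data add partition (year='2019') location '/path/to/car_data/2019';

-- Queries such as: select * from car_data where year > '2018'
-- can then prune partitions instead of scanning every folder.
```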
And you're all set!
select * from test where date_part="2020-01-01"
-- +----+-------------+
-- | a | date_part |
-- +----+-------------+
-- | A | 2020-01-01 |
-- +----+-------------+
This concludes the article on reading partitioned parquet in Spark. We hope the recommended answer helps, and thank you for supporting IT屋!