Reading partitioned parquet in Spark


Question

I have a directory of staged data, shown below, and I want to be able to read the 2018 and 2019 data into one dataframe without reading them separately and unioning.

From my understanding, I should be able to point Spark at the car_data directory and apply a filter, which Spark would push down. When I try this, it says the schema can't be inferred, so it has to be defined manually.

Note: I need to do this without renaming the year folders to the Hive-style year=2018.

  1. How do I specify the schema for the data below? I've tried researching this but couldn't find an answer.
  2. How can I load the data as spark.parquet('car_data').filter('year > 2018') so that the filter is pushed down and only the 2019-20 data is loaded?
  3. Does anyone know what the .mani files are for?

Thanks in advance!

car_data
 |---2018
    |---xxx.snappy.parquet
    |---xxx.snappy.parquet
    |---xxx.snappy.parquet.mani
 |---2019
    |---xxx.snappy.parquet
    |---xxx.snappy.parquet
    |---xxx.snappy.parquet.mani
 |---2020
    |---xxx.snappy.parquet
    |---xxx.snappy.parquet.mani
                

Answer

After experimenting a bit with Hive tables, I realized there is a solution for you: you can alter the location of a table's partitions.

So the first thing to do is create a table with the full schema, including all possible partitions, and insert some dummy data into each partition to trigger partition creation:

create table test (a string) partitioned by (date_part String);
insert into test values ('A', '2020-01-01');
-- note the current location of this partition is
-- `/apps/hive/warehouse/default.db/test/date_part=2020-01-01`

You can now alter the partition's location:

alter table test partition (date_part="2020-01-01")
set location "/apps/hive/warehouse/default.db/test/2020-01-01";
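If you want to confirm the change took effect before querying, Hive's describe formatted command reports the partition's storage location:

describe formatted test partition (date_part="2020-01-01");
-- check that the Location field now points at .../test/2020-01-01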

All set!

select * from test where date_part="2020-01-01"
-- +----+-------------+
-- | a  |  date_part  |
-- +----+-------------+
-- | A  | 2020-01-01  |
-- +----+-------------+
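Applied to the layout in the question, an equivalent sketch (untested, and with a placeholder column list and paths, since the actual parquet schema isn't shown) is to create an external table and point each year folder at its own partition explicitly, which skips the dummy-insert step:

create external table car_data_tbl (
  -- hypothetical column: replace with the real schema of your parquet files
  vin string
)
partitioned by (year int)
stored as parquet;

-- map each existing, non-Hive-style folder to a partition
alter table car_data_tbl add partition (year=2018) location 'car_data/2018';
alter table car_data_tbl add partition (year=2019) location 'car_data/2019';
alter table car_data_tbl add partition (year=2020) location 'car_data/2020';

-- queries on the partition column can now prune partitions
select * from car_data_tbl where year > 2018;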

