Reading partitioned parquet in Spark
Problem description
I have a directory of staged data as shown below, and I want to be able to read the 2018 and 2019 data into one dataframe without reading them separately and unioning.
From my understanding, I should be able to point Spark at the car_data directory and apply a filter that Spark would push down. When I try to do this, it says the schema can't be inferred, so it has to be defined manually.
Note: I need to do this without renaming the year folders to year=2018.
- How do I specify a schema for the data below? I have tried researching this but could not find an answer.
- How can I load the data as spark.parquet('car_data').filter('year > 2019') so that the filter is pushed down and only the 2019-20 data is loaded?
- Does anyone know what the .mani files are for?
Thanks in advance!
car_data
|---2018
|   |---xxx.snappy.parquet
|   |---xxx.snappy.parquet
|   |---xxx.snappy.parquet.mani
|---2019
|   |---xxx.snappy.parquet
|   |---xxx.snappy.parquet
|   |---xxx.snappy.parquet.mani
|---2020
|   |---xxx.snappy.parquet
|   |---xxx.snappy.parquet.mani
Recommended answer
After experimenting a bit with Hive tables, I realized there is a solution for you: you can alter the location of a table partition.
So the first thing you want to do is create a table with the full schema, including all possible partitions, and insert some dummy data into each partition to trigger partition creation:
create table test (a string) partitioned by (date_part String);
insert into test partition (date_part='2020-01-01') values ('A');
-- note the current location of this partition is
-- `/apps/hive/warehouse/default.db/test/date_part=2020-01-01`
You can now alter the partition's location:
alter table test partition (date_part="2020-01-01")
set location "/apps/hive/warehouse/default.db/test/2020-01-01";
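Applied to the question's layout, the same idea can be sketched more directly with ALTER TABLE ... ADD PARTITION, which registers a partition at an existing folder without the dummy-insert step. The table and column names here, and the /path/to/ prefix, are assumptions to be replaced with the real schema and storage path:

```sql
-- Hypothetical adaptation: table name, columns, and paths are assumptions.
create table car_data (make string) partitioned by (year string);

-- Register each existing year folder as a partition; no renaming to year=2018 needed.
alter table car_data add partition (year='2018') location '/path/to/car_data/2018';
alter table car_data add partition (year='2019') location '/path/to/car_data/2019';

-- Queries such as: select * from car_data where year > '2018'
-- can then prune partitions instead of scanning every folder.
```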
And you're all set!
select * from test where date_part="2020-01-01"
-- +----+-------------+
-- | a | date_part |
-- +----+-------------+
-- | A | 2020-01-01 |
-- +----+-------------+
This concludes the article on reading partitioned parquet in Spark. We hope the recommended answer helps, and thank you for supporting IT屋!