Caching dataframes while keeping partitions


Problem description

I'm on Spark 2.2.0, running on EMR.

I have a big dataframe df (40G or so in compressed snappy files) which is partitioned by keys k1 and k2.

When I query by k1 === v1 or (k1 === v1 && k2 === v2), I can see that it's only querying the files in the partition (about 2% of the files).
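The pruning behaviour described above can be reproduced with a small sketch. The path, data, and column values below are made up for illustration, not taken from the question:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Write a small dataframe partitioned by k1 and k2 (hypothetical path).
Seq((1, "a", 10L), (1, "b", 20L), (2, "a", 30L))
  .toDF("k1", "k2", "value")
  .write.partitionBy("k1", "k2").mode("overwrite").parquet("/tmp/pruning_demo")

// Filtering on the partition keys lets Spark read only the matching files;
// explain() shows them under PartitionFilters in the FileScan node.
val pruned = spark.read.parquet("/tmp/pruning_demo")
  .filter($"k1" === 1 && $"k2" === "a")
pruned.explain()
```

Running `explain()` on the filtered dataframe is the easiest way to confirm which files a query will actually touch.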

However, if I cache or persist df, suddenly those queries hit all the partitions, and they either blow up memory or become much less performant.
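This change is visible directly in the query plan. A sketch, again with a made-up path and toy data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical partitioned dataset, as in the question.
Seq((1, "a", 10L), (2, "b", 20L))
  .toDF("k1", "k2", "value")
  .write.partitionBy("k1", "k2").mode("overwrite").parquet("/tmp/cache_demo")

val df = spark.read.parquet("/tmp/cache_demo").cache()
df.count()  // materializing the cache scans every partition once

// The filter now runs against an InMemoryTableScan instead of a FileScan
// with PartitionFilters, so file-level pruning no longer applies.
df.filter($"k1" === 1).explain()
```

Note that even asking for a single partition of a cached dataframe forces the first action to read the entire input in order to build the cache.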

This is a big surprise - is there any way to do caching which preserves the partitioning information?

Recommended answer

This is to be expected. The Spark internal columnar format used for caching is input-format agnostic. Once you have loaded the data, the connection to the original input is gone.

The exception here is the new data source API ([SPARK-22389][SQL] Data source v2 partitioning reporting interface), which allows for persisting partitioning information, but it is new in 2.3 and still experimental.
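A pragmatic workaround on 2.2, not from the answer itself but a common pattern: apply the partition-key filter before caching, so the cache is built only from the pruned file set rather than the whole input. The path and data below are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical partitioned dataset standing in for the 40G input.
Seq((1, "a", 10L), (1, "b", 20L), (2, "a", 30L))
  .toDF("k1", "k2", "value")
  .write.partitionBy("k1", "k2").mode("overwrite").parquet("/tmp/subset_demo")

// Filter on the partition keys first, then cache: Spark prunes files
// during materialization, so only the needed partitions are read and held.
val subset = spark.read.parquet("/tmp/subset_demo")
  .filter($"k1" === 1)
  .cache()
subset.count()  // materializes just the k1=1 partitions
```

The trade-off is that each distinct key combination you query needs its own cached subset, instead of one cached dataframe serving every query.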
