Caching dataframes while keeping partitions
Question
I'm on Spark 2.2.0, running on EMR.
I have a big dataframe `df` (40G or so in compressed snappy files) which is partitioned by keys `k1` and `k2`.
When I query by `k1 === v1` or (`k1 === v1 && k2 === v2`), I can see that it's only querying the files in the partition (about 2% of the files).
However, if I cache or persist `df`, suddenly those queries hit all the partitions and either blow up memory or become much less performant.
This is a big surprise. Is there any way to do caching which preserves the partitioning information?
Answer
This is to be expected. The Spark internal columnar format used for caching is input-format agnostic. Once the data is loaded, the connection to the original input is gone.
The exception here is the new Data Source API ([SPARK-22389][SQL] Data Source V2 partitioning reporting interface), which allows for preserving partitioning information, but it is new in 2.3 and still experimental.
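One common workaround on 2.2 (a sketch, not something from the answer above: column names and values are the hypothetical ones from the question) is to cache only the partition slice you actually query, so the filter is pushed down to the file scan first and only the small result is materialized in memory:

```scala
// Filter first so partition pruning happens at read time,
// then cache only the ~2% slice instead of the whole 40G table.
val slice = df.filter($"k1" === "v1" && $"k2" === "v2").cache()
slice.count()  // force materialization of the cached subset

// Subsequent queries run against the small in-memory slice.
slice.groupBy($"someColumn").count().show()
```

This trades generality for memory: each (k1, k2) slice you need has to be cached separately, but none of the queries touches the full dataset.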