Caching dataframes while keeping partitions
Question
I'm on Spark 2.2.0, running on EMR.
I have a big dataframe `df` (40G or so in compressed snappy files) which is partitioned by keys `k1` and `k2`.
When I query by `k1 === v1` or (`k1 === v1 && k2 === v2`), I can see that it's only querying the files in the partition (about 2% of the files).
However, if I cache or persist `df`, suddenly those queries hit all the partitions and either blow up memory or become much less performant.
This is a big surprise. Is there any way to do caching which preserves the partitioning information?
Answer
This is to be expected. The Spark internal columnar format used for caching is input-format agnostic. Once the data is loaded, the connection to the original input is gone.
The exception here is the new Data Source API ([SPARK-22389][SQL] Data Source V2 partitioning reporting interface), which allows for preserving partitioning information, but it is new in 2.3 and still experimental.
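One common workaround on 2.2 (a sketch, not something from the answer above: column names and values are the hypothetical ones from the question) is to cache only the partition slice you actually query, so the filter is pushed down to the file scan first and only the small result is materialized in memory:

```scala
// Filter first so partition pruning happens at read time,
// then cache only the ~2% slice instead of the whole 40G table.
val slice = df.filter($"k1" === "v1" && $"k2" === "v2").cache()
slice.count()  // force materialization of the cached subset

// Subsequent queries run against the small in-memory slice.
slice.groupBy($"someColumn").count().show()
```

This trades generality for memory: each (k1, k2) slice you need has to be cached separately, but none of the queries touches the full dataset.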