Where is df.cache() stored?


Question

I would like to understand on which node (driver or worker/executor) the data cached by the code below is stored:

df.cache() //df is a large dataframe (200GB)

Also, which performs better: SQL CACHE TABLE or cache()? My understanding is that one of them is lazy and the other is eager.

Recommended answer

df.cache() calls the persist() method, which uses the MEMORY_AND_DISK storage level by default; you can pick a different level by calling persist() directly. The cached partitions are held by the executors (in memory, spilling to their local disk when needed), not by the driver.

The persist() method calls sparkSession.sharedState.cacheManager.cacheQuery(), and if you look at the code for cacheTable, it calls the same sparkSession.sharedState.cacheManager.cacheQuery().

That means the two are equivalent and both are lazily evaluated (the data is cached only once an action is performed), except that persist() lets you pass a storage level. These are the available storage levels:

  • DISK_ONLY
  • DISK_ONLY_2
  • MEMORY_ONLY
  • MEMORY_ONLY_2
  • MEMORY_ONLY_SER
  • MEMORY_ONLY_SER_2
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_2
  • MEMORY_AND_DISK_SER
  • MEMORY_AND_DISK_SER_2
  • OFF_HEAP
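
As a minimal sketch of choosing one of the levels above, the snippet below marks a DataFrame with an explicit storage level and reads it back. The local session, the app name, and the small `spark.range` DataFrame are assumptions made for illustration; they stand in for the real cluster and the 200GB DataFrame from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Assumption: a local session for illustration only. On a cluster the
// cached blocks are held by the executors, not by the driver.
val spark = SparkSession.builder()
  .appName("storage-level-sketch") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// Small stand-in for the large DataFrame from the question.
val df = spark.range(0, 1000000).toDF("id")

// persist() with an explicit level instead of the cache() default.
df.persist(StorageLevel.DISK_ONLY)

// Caching is lazy: an action is needed to actually materialize the blocks.
df.count()

// Read back the level the DataFrame was marked with.
val level = df.storageLevel
println(level)

// Release the cached blocks when they are no longer needed.
df.unpersist()
```

For a DataFrame that is much larger than the available executor memory, a level that can use disk (MEMORY_AND_DISK or DISK_ONLY) is usually the safer starting point; which one performs best depends on the workload.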

You can also use the SQL CACHE TABLE statement, which is not lazily evaluated: it stores the whole table in memory as soon as it runs, which may also lead to OOM.

Summary: cache(), persist(), and cacheTable() are lazily evaluated and need an action to take effect, whereas SQL CACHE TABLE is eager.
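
The difference summarized above can be sketched as follows. The view name `events`, the local session, and the small generated DataFrame are hypothetical and only serve the illustration; `CACHE LAZY TABLE` is the SQL variant that defers caching until first use.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration only.
val spark = SparkSession.builder()
  .appName("lazy-vs-eager-cache-sketch") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// Hypothetical temp view standing in for a real table.
spark.range(0, 1000000).toDF("id").createOrReplaceTempView("events")

// Lazy: cacheTable() only registers the table for caching.
spark.catalog.cacheTable("events")
val registered = spark.catalog.isCached("events")

// The first action on the table materializes the cached data.
spark.table("events").count()
spark.catalog.uncacheTable("events")

// Eager: the SQL statement caches the table as soon as it runs.
spark.sql("CACHE TABLE events")
spark.sql("UNCACHE TABLE events")

// SQL also offers a lazy form that behaves like cacheTable().
spark.sql("CACHE LAZY TABLE events")
spark.table("events").count()
spark.sql("UNCACHE TABLE events")

val stillCached = spark.catalog.isCached("events")
```

Note that `isCached` reports whether the table is registered for caching, so it can return true before any data has been materialized. The Storage tab of the Spark UI is the place to check what is actually held by the executors.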

See here for details.

Choose whichever fits your needs.

Hope this helps!
