Where is df.cache() stored?
Question
I would like to understand on which node (driver or worker/executor) the data cached by the code below is stored:
df.cache() //df is a large dataframe (200GB)
Also, which performs better: the SQL cacheTable or cache()? My understanding is that one of them is lazy and the other is eager.
Recommended answer
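df.cache() calls the persist() method, which uses the MEMORY_AND_DISK storage level by default; you can choose a different storage level by calling persist() with an explicit level. In either case the cached partitions are held by the executors (in executor memory, spilling to executor-local disk when needed), not by the driver.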
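To make the default level and the explicit level concrete, here is a minimal sketch. It assumes a local SparkSession and a small stand-in DataFrame in place of the 200GB one from the question; the object and method names are hypothetical. Note that the DataFrame is unpersisted before the level is changed, since asking Spark to cache an already cached plan leaves the existing entry in place.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object, for illustration only.
object CacheVsPersist {
  // Returns the level reported after cache() and after persist(DISK_ONLY).
  def levels(spark: SparkSession): (StorageLevel, StorageLevel) = {
    // Small stand-in for the large DataFrame in the question.
    val df = spark.range(0, 1000000).toDF("id")

    // cache() is shorthand for persist() with the default storage level.
    df.cache()
    val defaultLevel = df.storageLevel

    // Drop the cache entry before asking for a different level.
    df.unpersist(blocking = true)
    df.persist(StorageLevel.DISK_ONLY)
    val explicitLevel = df.storageLevel

    df.unpersist(blocking = true)
    (defaultLevel, explicitLevel)
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption; a real job would run against a cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-vs-persist")
      .getOrCreate()

    val (defaultLevel, explicitLevel) = levels(spark)
    println(s"after cache(): $defaultLevel")
    println(s"after persist(DISK_ONLY): $explicitLevel")

    spark.stop()
  }
}
```

For a DataFrame much larger than the cluster's memory, a disk-backed level such as DISK_ONLY or MEMORY_AND_DISK is usually the safer starting point, though the right choice depends on the workload.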
The persist() method calls sparkSession.sharedState.cacheManager.cacheQuery(), and if you look at the code for cacheTable, it also calls the same sparkSession.sharedState.cacheManager.cacheQuery().
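The table-based route can be sketched through the public catalog API, which is one way to reach cacheTable from user code. The view name "numbers" and the object name are hypothetical, and the session is assumed to be local.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, for illustration only.
object CacheTableSketch {
  // Returns isCached right after cacheTable and again after uncacheTable.
  def run(spark: SparkSession): (Boolean, Boolean) = {
    val df = spark.range(0, 1000).toDF("id")
    df.createOrReplaceTempView("numbers") // hypothetical view name

    // Registers the view's plan with the cache manager; no job runs yet.
    spark.catalog.cacheTable("numbers")
    val cachedAfterCache = spark.catalog.isCached("numbers")

    // Removes the cache entry again.
    spark.catalog.uncacheTable("numbers")
    val cachedAfterUncache = spark.catalog.isCached("numbers")

    (cachedAfterCache, cachedAfterUncache)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run
      .appName("cache-table-sketch")
      .getOrCreate()

    val (afterCache, afterUncache) = run(spark)
    println(s"isCached after cacheTable: $afterCache")
    println(s"isCached after uncacheTable: $afterUncache")

    spark.stop()
  }
}
```

isCached only reports that a cache entry is registered for the plan; it does not tell you whether the data has been materialized yet.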
That means both are the same and both are lazily evaluated (the data is only cached once an action is performed), except that persist() lets you specify the storage level. These are the available storage levels:
- NONE
- DISK_ONLY
- DISK_ONLY_2
- MEMORY_ONLY
- MEMORY_ONLY_2
- MEMORY_ONLY_SER
- MEMORY_ONLY_SER_2
- MEMORY_AND_DISK
- MEMORY_AND_DISK_2
- MEMORY_AND_DISK_SER
- MEMORY_AND_DISK_SER_2
- OFF_HEAP
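One way to observe the lazy behaviour is to count how many rows are actually computed. The sketch below uses an accumulator incremented inside a UDF; this is an illustration device rather than something from the answer, and accumulator updates made inside transformations are not guaranteed to be exact if tasks are retried. All names are hypothetical and a local session is assumed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.storage.StorageLevel

// Hypothetical helper object, for illustration only.
object LazyCacheSketch {
  // Returns the accumulator value before any action, after the first
  // action, and after the second action.
  def run(spark: SparkSession): (Long, Long, Long) = {
    val evaluated = spark.sparkContext.longAccumulator("rows evaluated")

    // Side-effecting UDF used only to make row evaluation visible.
    val tag = udf { (id: Long) => evaluated.add(1); id }

    val df = spark.range(0, 1000).toDF("id")
      .select(tag(col("id")).as("id"))
      .persist(StorageLevel.MEMORY_AND_DISK)

    // persist() only marks the plan for caching; nothing has run yet.
    val beforeAction: Long = evaluated.value

    // The first action computes the rows and fills the cache.
    df.count()
    val afterFirstAction: Long = evaluated.value

    // A second action should be served from the cache.
    df.count()
    val afterSecondAction: Long = evaluated.value

    df.unpersist(blocking = true)
    (beforeAction, afterFirstAction, afterSecondAction)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run
      .appName("lazy-cache-sketch")
      .getOrCreate()

    val (before, first, second) = run(spark)
    println(s"rows evaluated before any action: $before")
    println(s"rows evaluated after first count: $first")
    println(s"rows evaluated after second count: $second")

    spark.stop()
  }
}
```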
You can also use the SQL statement CACHE TABLE, which is not lazily evaluated (unless you write CACHE LAZY TABLE): it caches the whole table as soon as the statement runs, which may also lead to OOM for a large table.
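The SQL statements can be sketched as follows. The view name "events" and the object name are hypothetical, a local session is assumed, and the OPTIONS clause for choosing a storage level assumes Spark 3.0 or later.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, for illustration only.
object SqlCacheTableSketch {
  // Returns whether the view is reported as cached after the eager statement.
  def run(spark: SparkSession): Boolean = {
    spark.range(0, 1000).toDF("id")
      .createOrReplaceTempView("events") // hypothetical view name

    // Eager: the table is materialized when this statement runs.
    spark.sql("CACHE TABLE events")
    val cached = spark.catalog.isCached("events")
    spark.sql("UNCACHE TABLE events")

    // Lazy variant: materialized on the first query that reads the table.
    spark.sql("CACHE LAZY TABLE events")
    spark.sql("SELECT COUNT(*) FROM events").collect()
    spark.sql("UNCACHE TABLE events")

    // Explicit storage level (assumes Spark 3.0+ syntax).
    spark.sql("CACHE TABLE events OPTIONS ('storageLevel' 'DISK_ONLY')")
    spark.sql("UNCACHE TABLE events")

    cached
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run
      .appName("sql-cache-table-sketch")
      .getOrCreate()

    println(s"cached after CACHE TABLE: ${run(spark)}")

    spark.stop()
  }
}
```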
Summary: cache(), persist(), and cacheTable() are lazily evaluated and need an action to take effect, whereas SQL CACHE TABLE is eager.
You can choose whichever suits your needs. Hope this helps!