Where is df.cache() stored?


Question

I would like to understand on which node (driver or worker/executor) the data cached by the code below is stored:

df.cache() // df is a large DataFrame (200 GB)

Also, which performs better: the SQL CACHE TABLE statement or cache()? My understanding is that one of them is lazy and the other is eager.

Recommended answer

df.cache() calls the persist() method, which uses the MEMORY_AND_DISK storage level by default; you can choose a different level by calling persist() with an explicit StorageLevel. In either case the cached partitions are held by the executors that compute them, not by the driver.
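As a minimal sketch of the point above, the snippet below checks which storage level a DataFrame is registered with after cache() and after persist() with an explicit level. The local session, the app name and the small `df` are assumptions made for illustration (a stand-in for the 200 GB DataFrame in the question); it can be pasted into spark-shell or run as a script with Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Local session for illustration only; on a cluster the cached blocks live on the executors.
val spark = SparkSession.builder().master("local[2]").appName("cache-levels").getOrCreate()

// Hypothetical small stand-in for the large DataFrame in the question.
val df = spark.range(0, 1000).toDF("id")

df.cache()                          // same as df.persist() with the default level
val defaultLevel = df.storageLevel  // level this DataFrame is registered with

// An already cached DataFrame keeps its level, so unpersist before switching.
df.unpersist()
df.persist(StorageLevel.DISK_ONLY)
val customLevel = df.storageLevel

df.unpersist()
val clearedLevel = df.storageLevel  // no longer registered for caching
```

Note that `storageLevel` only reports the level the DataFrame is registered with; it says nothing about whether any partition has actually been computed and stored yet.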

The persist() method calls sparkSession.sharedState.cacheManager.cacheQuery(), and if you look at the code for cacheTable(), it calls the same sparkSession.sharedState.cacheManager.cacheQuery().

This means the two are equivalent and both are lazily evaluated (the data is cached only once an action runs), except that persist() lets you specify the storage level. These are the available storage levels:

  • DISK_ONLY
  • DISK_ONLY_2
  • MEMORY_ONLY
  • MEMORY_ONLY_2
  • MEMORY_ONLY_SER
  • MEMORY_ONLY_SER_2
  • MEMORY_AND_DISK
  • MEMORY_AND_DISK_2
  • MEMORY_AND_DISK_SER
  • MEMORY_AND_DISK_SER_2
  • OFF_HEAP
  • NONE
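The lazy behaviour described above can be observed with a small experiment. This is only a sketch under stated assumptions: the accumulator `evaluated`, the UDF `tag` and the tiny `df` are hypothetical names introduced for illustration, and the run uses a local session where no task is retried.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[2]").appName("cache-lazy").getOrCreate()

// Accumulator that counts how many rows the upstream computation has produced.
val evaluated = spark.sparkContext.longAccumulator("rows evaluated")
val tag = udf { (id: Long) => evaluated.add(1L); id }

// Hypothetical DataFrame whose rows pass through the counting UDF.
val df = spark.range(0, 1000).select(tag(col("id")).as("id"))

df.cache()
val afterCache = evaluated.sum         // cache() alone runs no job

df.count()                             // first action computes and stores the partitions
val afterFirstAction = evaluated.sum

df.count()                             // second action reads the cached partitions
val afterSecondAction = evaluated.sum

df.unpersist()
```

If the counter stays unchanged between the first and the second action, the second one was served from the cache rather than recomputed.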

You can also use the SQL CACHE TABLE statement, which is not lazily evaluated: it stores the whole table in memory as soon as the statement runs, which may also lead to OOM.

Summary: cache(), persist() and cacheTable() are lazily evaluated and need an action to take effect, whereas SQL CACHE TABLE is eager.
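For comparison, here is a hedged sketch of the catalog call next to the SQL statement. The temp view name `events` and the local session are assumptions for illustration. Spark SQL also accepts CACHE LAZY TABLE, which defers the work in the same way as cacheTable().

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").appName("cache-table").getOrCreate()

// Hypothetical temp view standing in for a real table.
spark.range(0, 1000).toDF("id").createOrReplaceTempView("events")

// Lazy: registers the table for caching; data is stored by the first action that scans it.
spark.catalog.cacheTable("events")
val cachedViaCatalog = spark.catalog.isCached("events")
spark.catalog.uncacheTable("events")

// Eager: the statement itself scans the table and fills the cache before returning.
spark.sql("CACHE TABLE events")
val cachedViaSql = spark.catalog.isCached("events")
spark.sql("UNCACHE TABLE events")

// Explicitly lazy SQL form, followed by an action that materializes the cache.
spark.sql("CACHE LAZY TABLE events")
spark.sql("SELECT COUNT(*) FROM events").collect()
spark.sql("UNCACHE TABLE events")
val cachedAtEnd = spark.catalog.isCached("events")
```

`isCached` reports whether the table is registered for caching, not whether its data has already been materialized, so it returns true for both the lazy and the eager variant.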

You can choose whichever option suits your needs.

Hope this helps!
