Dask: is it safe to pickle a dataframe for later use?


Question

I have a database-like object containing many dask dataframes. I would like to work with the data, save it and reload it on the next day to continue the analysis.

Therefore, I tried saving dask dataframes (not computation results, just the "plan of computation" itself) using pickle. Apparently, it works (at least, if I unpickle the objects on the exact same machine) ... but are there some pitfalls?
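A minimal stdlib sketch of the idea (an analogy, not the dask API itself): pickling a deferred computation serializes the recipe rather than the result, just as pickling a dask dataframe serializes the task graph rather than the computed data:

```python
import operator
import pickle

# A tiny "plan of computation": a function plus its arguments,
# standing in for a dask task graph. Nothing is computed yet.
plan = (operator.mul, (6, 7))

blob = pickle.dumps(plan)        # serialize the plan, not a result
func, args = pickle.loads(blob)  # e.g. reload it the next day

print(func(*args))               # the work happens only now -> 42
```

The pickle stays small because it holds references to functions and inputs, not computed values; the cost of the computation is paid after unpickling, when you finally call (or, with dask, `.compute()`).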

Answer

Generally speaking, it is usually safe. However, there are a few caveats:


  1. If your dask.dataframe contains custom functions, such as with df.apply(lambda x: x), then the internal function will not be pickleable. However, it will still be serializable with cloudpickle.
  2. If your dask.dataframe contains references to files that are only valid on your local computer then, while it will still be serializable, the reloaded version on another machine may no longer be useful.
  3. If your dask.dataframe contains dask.distributed Future objects, such as would occur if you use Executor.persist on a cluster, then these are not currently serializable.
  4. I recommend using a dask version >= 0.11.0.
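The first caveat can be seen with the standard library alone: plain pickle stores functions by importable name, so an anonymous lambda (which has no such name) cannot be pickled, while cloudpickle serializes the function body itself. A small sketch (the `can_pickle` helper is illustrative, not part of any library):

```python
import pickle

def can_pickle(obj):
    """True if plain pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:  # pickle raises PicklingError for anonymous functions
        return False

double = lambda x: 2 * x          # anonymous even when bound to a name

print(can_pickle(len))            # named, importable function -> True
print(can_pickle(double))         # lambda: plain pickle fails -> False
```

This is why a dask graph built with `df.apply(lambda x: x)` survives serialization only via cloudpickle, which dask's own serialization machinery can fall back to.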
