Dask: is it safe to pickle a dataframe for later use?
Problem Description
I have a database-like object containing many dask dataframes. I would like to work with the data, save it and reload it on the next day to continue the analysis.
Therefore, I tried saving dask dataframes (not computation results, just the "plan of computation" itself) using pickle. Apparently, it works (at least, if I unpickle the objects on the exact same machine) ... but are there some pitfalls?
Recommended Answer
Generally speaking it is usually safe. However, there are a few caveats:
- If your dask.dataframe contains custom functions, such as with df.apply(lambda x: x), then the internal function will not be pickleable. However, it will still be serializable with cloudpickle.
- If your dask.dataframe contains references to files that are only valid on your local computer, then, while it will still be serializable, the deserialized version on another machine may no longer be useful.
- If your dask.dataframe contains dask.distributed Future objects, such as would occur if you use Executor.persist on a cluster, then these are not currently serializable.
- I recommend using a version >= 0.11.0.
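The first caveat can be seen without dask at all. A plain lambda, like one you might pass to df.apply(...), cannot be serialized by the standard pickle module, whereas cloudpickle serializes it by value (this sketch assumes the third-party cloudpickle package is installed):

```python
import pickle
import cloudpickle  # third-party: pip install cloudpickle

# A lambda, like one passed to df.apply(...), cannot be pickled
# by the stdlib, which serializes functions by qualified name ...
func = lambda x: x + 1
try:
    pickle.dumps(func)
except (pickle.PicklingError, AttributeError):
    print("stdlib pickle failed on the lambda")

# ... but cloudpickle serializes the function body itself.
blob = cloudpickle.dumps(func)
restored = cloudpickle.loads(blob)
print(restored(41))  # 42
```

So if your saved graph contains lambdas or locally defined functions, use cloudpickle.dumps/cloudpickle.loads in place of the stdlib pickle calls.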