Pandas dataframes too large to append to dask dataframe?
Problem description
I'm not sure what I'm missing here, I thought dask would resolve my memory issues. I have 100+ pandas dataframes saved in .pickle format. I would like to get them all in the same dataframe but keep running into memory issues. I've already increased the memory buffer in jupyter. It seems I may be missing something in creating the dask dataframe as it appears to crash my notebook after completely filling my RAM (maybe). Any pointers?
Below is the basic process I used:
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.read_pickle('first.pickle'), npartitions=8)

for pickle_file in all_pickle_files:
    ddf = ddf.append(pd.read_pickle(pickle_file))

ddf.to_parquet('alldata.parquet', engine='pyarrow')
- I've tried a variety of npartitions, but no number has allowed the code to finish running.
- All in all, there is about 30GB of pickled dataframes I'd like to combine.
- Perhaps this is not the right library, but the docs suggest dask should be able to handle this.
Answer
Have you considered first converting the pickle files to parquet and then loading them into dask? I assume all your data is in a folder called raw and you want to move it to processed.
import pandas as pd
import dask.dataframe as dd
import os

def convert_to_parquet(fn, fldr_in, fldr_out):
    # build the output path: swap the folder and the extension
    fn_out = fn.replace(fldr_in, fldr_out)\
               .replace(".pickle", ".parquet")
    df = pd.read_pickle(fn)
    # eventually change dtypes here
    df.to_parquet(fn_out, index=False)

fldr_in = 'raw'
fldr_out = 'processed'
os.makedirs(fldr_out, exist_ok=True)

# you could use glob if you prefer
fns = os.listdir(fldr_in)
fns = [os.path.join(fldr_in, fn) for fn in fns]
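For reference, here is a minimal sketch of the glob variant mentioned in the comment above; it builds the same list but only picks up .pickle files:

import glob

# collect only the .pickle files, ignoring anything else in the folder
fns = glob.glob(os.path.join(fldr_in, '*.pickle'))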
If you know that no more than one file fits in memory at a time, you should use a loop:
for fn in fns:
    convert_to_parquet(fn, fldr_in, fldr_out)
If you know that more than one file fits in memory at a time, you can parallelize the conversion with delayed:
from dask import delayed, compute

# this is lazy: nothing is executed yet
out = [delayed(convert_to_parquet)(fn, fldr_in, fldr_out) for fn in fns]
# now you are actually converting
out = compute(out)
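If you want feedback while the delayed tasks run, a small optional sketch using dask's built-in local diagnostics (this assumes the default local scheduler and wraps the same compute call):

from dask.diagnostics import ProgressBar

# display a progress bar while the conversions execute
with ProgressBar():
    out = compute(out)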
Now you can use dask to do your analysis.
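For example, a minimal sketch of that follow-up step, assuming all the converted files share the same schema (read_parquet accepts a glob pattern):

import dask.dataframe as dd

# lazily open every converted file as a single dask dataframe
ddf = dd.read_parquet('processed/*.parquet')
print(ddf.npartitions)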