Pandas dataframes too large to append to dask dataframe?
Problem description
I'm not sure what I'm missing here, I thought dask would resolve my memory issues. I have 100+ pandas dataframes saved in .pickle format. I would like to get them all in the same dataframe but keep running into memory issues. I've already increased the memory buffer in jupyter. It seems I may be missing something in creating the dask dataframe as it appears to crash my notebook after completely filling my RAM (maybe). Any pointers?
Below is the basic process I used:
import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.read_pickle('first.pickle'), npartitions=8)

for pickle_file in all_pickle_files:
    ddf = ddf.append(pd.read_pickle(pickle_file))

ddf.to_parquet('alldata.parquet', engine='pyarrow')
- I've tried a variety of npartitions, but no number has allowed the code to finish running.
- All in all, there is about 30GB of pickled dataframes I'd like to combine.
- Perhaps this is not the right library, but the docs suggest dask should be able to handle this.
Answer
Have you considered first converting the pickle files to parquet and then loading them into dask? I assume all your data is in a folder called raw and you want to move it to processed.
import pandas as pd
import dask.dataframe as dd
import os

def convert_to_parquet(fn, fldr_in, fldr_out):
    # build the output path: swap the folder and the extension
    fn_out = fn.replace(fldr_in, fldr_out)\
               .replace(".pickle", ".parquet")
    df = pd.read_pickle(fn)
    # eventually change dtypes here
    df.to_parquet(fn_out, index=False)

fldr_in = 'raw'
fldr_out = 'processed'
os.makedirs(fldr_out, exist_ok=True)

# you could use glob if you prefer
fns = os.listdir(fldr_in)
fns = [os.path.join(fldr_in, fn) for fn in fns]
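For reference, here is a minimal sketch of the glob variant mentioned in the comment above; it builds the same list but only picks up .pickle files:

import glob

# collect only the .pickle files, ignoring anything else in the folder
fns = glob.glob(os.path.join(fldr_in, '*.pickle'))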
If you know that no more than one file fits in memory at a time, you should use a loop:
for fn in fns:
    convert_to_parquet(fn, fldr_in, fldr_out)
If you know that more than one file fits in memory at a time, you can parallelize the conversion with delayed:
from dask import delayed, compute

# this is lazy: nothing is executed yet
out = [delayed(convert_to_parquet)(fn, fldr_in, fldr_out) for fn in fns]
# now you are actually converting
out = compute(out)
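If you want feedback while the delayed tasks run, a small optional sketch using dask's built-in local diagnostics (this assumes the default local scheduler and wraps the same compute call):

from dask.diagnostics import ProgressBar

# display a progress bar while the conversions execute
with ProgressBar():
    out = compute(out)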
Now you can use dask to do your analysis.
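For example, a minimal sketch of that follow-up step, assuming all the converted files share the same schema (read_parquet accepts a glob pattern):

import dask.dataframe as dd

# lazily open every converted file as a single dask dataframe
ddf = dd.read_parquet('processed/*.parquet')
print(ddf.npartitions)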