Pandas dataframes too large to append to dask dataframe?


Problem description

I'm not sure what I'm missing here; I thought dask would resolve my memory issues. I have 100+ pandas dataframes saved in .pickle format. I would like to get them all into the same dataframe but keep running into memory issues. I've already increased the memory buffer in Jupyter. It seems I may be missing something in creating the dask dataframe, as it appears to crash my notebook after completely filling my RAM (maybe). Any pointers?

Below is the basic process I used:

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.read_pickle('first.pickle'),npartitions = 8)
for pickle_file in all_pickle_files:
    ddf = ddf.append(pd.read_pickle(pickle_file))
ddf.to_parquet('alldata.parquet', engine='pyarrow')



  • I've tried a variety of npartitions, but no number has allowed the code to finish running.
  • All in all there are about 30 GB of pickled dataframes I'd like to combine.
  • Perhaps this is not the right library, but the docs suggest dask should be able to handle this.

Recommended answer

Have you considered first converting the pickle files to parquet and then loading them into dask? I assume that all your data is in a single input folder (data in the code below) and that you want to move the converted files to processed.

import pandas as pd
import dask.dataframe as dd
import os

def convert_to_parquet(fn, fldr_in, fldr_out):
    # mirror the input path in the output folder and swap the extension
    fn_out = fn.replace(fldr_in, fldr_out).replace(".pickle", ".parquet")
    df = pd.read_pickle(fn)
    # optionally adjust dtypes here before writing
    df.to_parquet(fn_out, index=False)

fldr_in = 'data'
fldr_out = 'processed'
os.makedirs(fldr_out, exist_ok=True)

# you could use glob if you prefer
fns = os.listdir(fldr_in)
fns = [os.path.join(fldr_in, fn) for fn in fns]
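
The comment above mentions glob; as a small sketch of that variant (not spelled out in the original answer), it also skips anything in the folder that is not a pickle file:

from glob import glob

# collect only the .pickle files, in a stable order
fns = sorted(glob(os.path.join(fldr_in, "*.pickle")))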
      

If you know that no more than one file fits in memory at a time, you should use a plain loop:

for fn in fns:
    convert_to_parquet(fn, fldr_in, fldr_out)
      

If you know that several files fit in memory at once, you can use delayed:

from dask import delayed, compute

# this only builds the lazy task graph; nothing runs yet
out = [delayed(convert_to_parquet)(fn, fldr_in, fldr_out) for fn in fns]
# now the conversions actually run
out = compute(out)
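
If memory is still tight when many conversions run in parallel, one option (my suggestion, not part of the original answer) is to cap the number of workers in the compute call above:

# e.g. run at most 4 conversions at a time in separate processes
out = compute(out, scheduler="processes", num_workers=4)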
      

Now you can use dask to do your analysis.
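
As a minimal sketch of that last step (folder names follow the example above, and writing a single alldata.parquet dataset simply mirrors the original question), the converted files can be loaded lazily:

import dask.dataframe as dd

# lazily read all converted files as one dask dataframe
ddf = dd.read_parquet('processed/*.parquet')

# analyse ddf directly, or write it out as a single parquet dataset
ddf.to_parquet('alldata.parquet', engine='pyarrow')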
