Dask Memory Error when running df.to_csv()


Problem Description

I am trying to index and save large CSVs that cannot be loaded into memory. My code, which loads the CSV, performs a computation, and indexes by the new values, works without issue. A simplified version is:

import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=6, threads_per_worker=1)
client = Client(cluster, memory_limit='1GB')

# read in ~250 MB blocks, derive a new column, then re-index (this triggers a shuffle)
df = dd.read_csv(filepath, header=None, sep=' ', blocksize=25e7)
df['new_col'] = df.map_partitions(lambda x: some_function(x))
df = df.set_index(df.new_col, sorted=False)

However, when I use large files (i.e. > 15 GB), I run into a memory error when saving the dataframe to CSV with:

df.to_csv(os.path.join(save_dir, filename+'_*.csv'), index=False, chunksize=1000000)

I have tried setting chunksize=1000000 to see if this would help, but it didn't.

The full traceback is:

Traceback (most recent call last):
  File "/home/david/data/pointframes/examples/dask_z-order.py", line 44, in <module>
    calc_zorder(fp, save_dir)
  File "/home/david/data/pointframes/examples/dask_z-order.py", line 31, in calc_zorder
    df.to_csv(os.path.join(save_dir, filename+'_*.csv'), index=False, chunksize=1000000)
  File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/core.py", line 1159, in to_csv
    return to_csv(self, filename, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/io/csv.py", line 654, in to_csv
    delayed(values).compute(scheduler=scheduler)
  File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 398, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dask/threaded.py", line 76, in get
    pack_exception=pack_exception, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/dask/local.py", line 459, in get_async
    raise_exception(exc, tb)
  File "/usr/local/lib/python2.7/dist-packages/dask/local.py", line 230, in execute_task
    result = _execute_task(task, data)
  File "/usr/local/lib/python2.7/dist-packages/dask/core.py", line 118, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
  File "/usr/local/lib/python2.7/dist-packages/dask/core.py", line 119, in _execute_task
    return func(*args2)
  File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/shuffle.py", line 426, in collect
    res = p.get(part)
  File "/usr/local/lib/python2.7/dist-packages/partd/core.py", line 73, in get
    return self.get([keys], **kwargs)[0]
  File "/usr/local/lib/python2.7/dist-packages/partd/core.py", line 79, in get
    return self._get(keys, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/partd/encode.py", line 30, in _get
    for chunk in raw]
  File "/usr/local/lib/python2.7/dist-packages/partd/pandas.py", line 175, in deserialize
    for (h, b) in zip(headers[2:], bytes[2:])]
  File "/usr/local/lib/python2.7/dist-packages/partd/pandas.py", line 136, in block_from_header_bytes
    copy=True).reshape(shape)
  File "/usr/local/lib/python2.7/dist-packages/partd/numpy.py", line 126, in deserialize
    result = result.copy()
MemoryError

I am running dask v1.1.0 on an Ubuntu 18.04 system with Python 2.7. My computer has 32 GB of memory. This code works as expected with small files that would fit into memory anyway, but not with larger ones. Is there something I am missing here?

Recommended Answer

I encourage you to try smaller chunks of data. You should control this in the read_csv part of your computation rather than the to_csv part.
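
For illustration, here is a minimal sketch of that suggestion applied to the pipeline from the question. filepath, save_dir, filename and some_function are the question's own placeholders; the blocksize of 25e6 is just an illustrative guess rather than a tuned value, and the per-worker memory limit is set on LocalCluster here instead of on Client:

import os
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# Same cluster shape as in the question, with the per-worker memory limit on LocalCluster.
cluster = LocalCluster(n_workers=6, threads_per_worker=1, memory_limit='1GB')
client = Client(cluster)

# Read smaller partitions (~25 MB instead of ~250 MB) so each block, and the
# shuffle that set_index triggers, stays well under the worker memory limit.
df = dd.read_csv(filepath, header=None, sep=' ', blocksize=25e6)
df['new_col'] = df.map_partitions(lambda x: some_function(x))
df = df.set_index(df.new_col, sorted=False)

# to_csv writes one file per (now smaller) partition; no chunksize tuning needed here.
df.to_csv(os.path.join(save_dir, filename + '_*.csv'), index=False)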
