Dask Memory Error when running df.to_csv()
Problem description
I am trying to index and save large CSVs that cannot be loaded into memory. My code to load the CSV, perform a computation, and index by the new values works without issue. A simplified version is:
import os
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=6, threads_per_worker=1)
client = Client(cluster, memory_limit='1GB')

# Read the CSV in ~250 MB blocks, add the computed column, and index by it
df = dd.read_csv(filepath, header=None, sep=' ', blocksize=25e7)
df['new_col'] = df.map_partitions(lambda x: some_function(x))
df = df.set_index(df.new_col, sorted=False)
However, when I use large files (i.e. > 15 GB), I run into a memory error when saving the dataframe to CSV with:
df.to_csv(os.path.join(save_dir, filename+'_*.csv'), index=False, chunksize=1000000)
I have tried setting chunksize=1000000 to see if this would help, but it didn't.
The full stack trace is:
Traceback (most recent call last):
File "/home/david/data/pointframes/examples/dask_z-order.py", line 44, in <module>
calc_zorder(fp, save_dir)
File "/home/david/data/pointframes/examples/dask_z-order.py", line 31, in calc_zorder
df.to_csv(os.path.join(save_dir, filename+'_*.csv'), index=False, chunksize=1000000)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/core.py", line 1159, in to_csv
return to_csv(self, filename, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/io/csv.py", line 654, in to_csv
delayed(values).compute(scheduler=scheduler)
File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 156, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/base.py", line 398, in compute
results = schedule(dsk, keys, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/threaded.py", line 76, in get
pack_exception=pack_exception, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/dask/local.py", line 459, in get_async
raise_exception(exc, tb)
File "/usr/local/lib/python2.7/dist-packages/dask/local.py", line 230, in execute_task
result = _execute_task(task, data)
File "/usr/local/lib/python2.7/dist-packages/dask/core.py", line 118, in _execute_task
args2 = [_execute_task(a, cache) for a in args]
File "/usr/local/lib/python2.7/dist-packages/dask/core.py", line 119, in _execute_task
return func(*args2)
File "/usr/local/lib/python2.7/dist-packages/dask/dataframe/shuffle.py", line 426, in collect
res = p.get(part)
File "/usr/local/lib/python2.7/dist-packages/partd/core.py", line 73, in get
return self.get([keys], **kwargs)[0]
File "/usr/local/lib/python2.7/dist-packages/partd/core.py", line 79, in get
return self._get(keys, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/partd/encode.py", line 30, in _get
for chunk in raw]
File "/usr/local/lib/python2.7/dist-packages/partd/pandas.py", line 175, in deserialize
for (h, b) in zip(headers[2:], bytes[2:])]
File "/usr/local/lib/python2.7/dist-packages/partd/pandas.py", line 136, in block_from_header_bytes
copy=True).reshape(shape)
File "/usr/local/lib/python2.7/dist-packages/partd/numpy.py", line 126, in deserialize
result = result.copy()
MemoryError
I am running dask v1.1.0 on an Ubuntu 18.04 system in Python 2.7. My computer has 32 GB of memory. This code works as expected with small files that can fit into memory anyway, but not with larger ones. Is there something I am missing here?
Answer
I encourage you to try smaller chunks of data. You should control this in the read_csv part of your computation rather than the to_csv part.
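
For what it's worth, here is a minimal sketch of that suggestion, reusing the filepath, some_function, save_dir and filename placeholders from the question; the only change is a smaller blocksize in read_csv (roughly 25 MB per partition instead of 250 MB), so each partition, and the shuffle behind set_index, handles far less data at once:

import os
import dask.dataframe as dd

# Smaller blocks mean smaller partitions throughout the pipeline,
# including the shuffle triggered by set_index and the final write.
df = dd.read_csv(filepath, header=None, sep=' ', blocksize=25e6)
df['new_col'] = df.map_partitions(lambda x: some_function(x))
df = df.set_index(df.new_col, sorted=False)
df.to_csv(os.path.join(save_dir, filename + '_*.csv'), index=False)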