What is the optimal chunksize in pandas read_csv to maximize speed?

Problem description


I am using a 20GB (compressed) .csv file and I load a couple of columns from it using pandas pd.read_csv() with a chunksize=10,000 parameter.

However, this parameter is completely arbitrary and I wonder whether a simple formula could give me a better chunksize that would speed up the loading of the data.

Any ideas?

Solution

There is no "optimal chunksize" [*]. Because chunksize only tells you the number of rows per chunk, not the memory-size of a single row, hence it's meaningless to try to make a rule-of-thumb on that. ([*] although generally I've only ever seen chunksizes in the range 100..64K)

To get memory size, you'd have to convert that to a memory-size-per-chunk or -per-row...

by looking at your number of columns, their dtypes, and the size of each; use either df.describe(), or for a more in-depth breakdown of memory usage by column:

print('df Memory usage by column...')
print(df.memory_usage(index=False, deep=True) / df.shape[0])
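
If you want to turn that per-row figure into an actual chunksize, one rough approach is to read a small sample, measure its in-memory size, and divide a target per-chunk memory budget by the bytes per row. The sketch below only illustrates that arithmetic and is not part of the original answer: the file name 'data.csv', the 10,000-row sample and the 500 MB budget are all assumptions to adapt to your situation.

import pandas as pd

SAMPLE_ROWS = 10_000            # assumed sample size
MEMORY_BUDGET = 500 * 2**20     # assumed per-chunk budget: 500 MB

# Estimate bytes per row from a small sample (deep=True counts string contents too)
sample = pd.read_csv('data.csv', nrows=SAMPLE_ROWS)
bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)

chunksize = max(1, int(MEMORY_BUDGET / bytes_per_row))
print(f'~{bytes_per_row:.0f} bytes/row -> chunksize {chunksize}')

for chunk in pd.read_csv('data.csv', chunksize=chunksize):
    pass  # process each chunk here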

  • Make sure you're not blowing out all your free memory while reading the csv: use your OS (Unix top/Windows Task Manager/MacOS Activity Monitor/etc) to see how much memory is being used.

  • One pitfall with pandas is that missing/NaN values, Python strs and objects take 32 or 48 bytes, instead of the expected 4 bytes for an np.int32 or 1 byte for an np.int8 column. Even one NaN value in a column will cause that memory blowup on the entire column, and the pandas.read_csv() dtypes, converters and na_values arguments will not prevent the np.nan and will ignore the desired dtype(!). A workaround is to manually post-process each chunk before inserting it in the dataframe (see the sketch after this list).

  • And use all the standard pandas read_csv tricks, like:

    • specify dtypes for each column to reduce memory usage - absolutely avoid having every entry read as a string, especially long unique strings like datetimes, which is terrible for memory usage
    • specify usecols if you only want to keep a subset of columns
    • use date/time-converters rather than pd.Categorical if you want to reduce from 48 bytes to 1 or 4.
    • read large files in chunks. And if you know upfront what you're going to impute NA/missing values with, do as much of that filling as possible while you process each chunk, instead of at the end. If you can't impute the final value yet, you can probably at least replace missing values with a sentinel like -1, 999, -Inf etc., and do the proper imputation later; the sketch below puts these tricks together.
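
Putting the per-chunk workaround and the read_csv tricks together, a minimal sketch might look like the following. None of this is from the original answer: the file 'data.csv', the columns 'id', 'price' and 'qty', the sentinel -1 and the chosen dtypes are made-up assumptions to adapt to your own data.

import numpy as np
import pandas as pd

USECOLS = ['id', 'price', 'qty']   # hypothetical subset of columns to keep
SENTINEL = -1                      # assumed placeholder for missing integer values

def postprocess(chunk):
    # Fill NaNs with a sentinel, then downcast, so that a single NaN
    # doesn't force the whole column to stay 64-bit float / object.
    chunk['qty'] = chunk['qty'].fillna(SENTINEL).astype(np.int32)
    chunk['price'] = chunk['price'].astype(np.float32)
    return chunk

reader = pd.read_csv(
    'data.csv',                 # placeholder path
    usecols=USECOLS,            # only load the columns you need
    dtype={'id': np.int32},     # safe only for columns known to have no NaNs
    chunksize=50_000,           # arbitrary; size it from the bytes-per-row estimate above
)

parts = [postprocess(chunk) for chunk in reader]
df = pd.concat(parts, ignore_index=True)
print(df.dtypes)
print(df.memory_usage(deep=True))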
