What is the optimal chunksize in pandas read_csv to maximize speed?
Problem description
I am using a 20GB (compressed) .csv file and I load a couple of columns from it using pandas pd.read_csv()
with a chunksize=10,000 parameter.
However, this parameter is completely arbitrary and I wonder whether a simple formula could give me a better chunksize that would speed up the loading of the data.
Any ideas?
There is no "optimal chunksize" [*]. chunksize only tells you the number of rows per chunk, not the memory size of a single row, so it makes no sense to try to build a rule of thumb on it. ([*] although generally I've only ever seen chunksizes in the range 100..64K)
To get memory size, you'd have to convert that to a memory-size-per-chunk or -per-row...
by looking at your number of columns, their dtypes, and the size of each. Use df.describe(), or for a more in-depth per-column breakdown of memory usage:
print('df Memory usage by column...')
print(df.memory_usage(index=False, deep=True) / df.shape[0])
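One way to turn that per-row memory figure into a chunksize is to read a small sample first, measure bytes per row, and divide a memory budget by it. A minimal sketch, where the inline CSV, the 500-row sample size, and the 256 MB budget are all illustrative assumptions you would replace with your own file and limits:

```python
import io

import pandas as pd

# A tiny inline CSV stands in for the real large file on disk.
csv_data = "a,b,c\n" + "\n".join(f"{i},{i * 2},name{i}" for i in range(1000))

# Read a small sample to measure the average memory size per row.
sample = pd.read_csv(io.StringIO(csv_data), nrows=500)
bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)

# Pick a chunksize that keeps each chunk under a chosen memory budget.
budget_bytes = 256 * 1024**2  # assumed budget: 256 MB per chunk
chunksize = max(1, int(budget_bytes // bytes_per_row))
print(f"~{bytes_per_row:.0f} bytes/row -> chunksize {chunksize}")
```

The result is only as good as the sample: if later rows have much longer strings than the first 500, the estimate will be optimistic, so leave headroom in the budget.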
Make sure you're not blowing out all your free memory while reading the csv: use your OS tools (Unix top / Windows Task Manager / macOS Activity Monitor, etc.) to see how much memory is being used.
One pitfall with pandas is that missing/NaN values, Python strs and objects take 32 or 48 bytes, instead of the expected 4 bytes for an np.int32 or 1 byte for an np.int8 column. Even one NaN value in an entire column will cause that memory blowup on the entire column, and
the pandas.read_csv() dtypes, converters, na_values arguments will not prevent the np.nan, and will ignore the desired dtype(!). A workaround is to manually post-process each chunk before inserting it in the dataframe.
And use all the standard pandas read_csv tricks, like:
- specify dtypes for each column to reduce memory usage; absolutely avoid every entry being read as a string, especially long unique strings like datetimes, which is terrible for memory usage
- specify usecols if you only want to keep a subset of columns
- use date/time converters, or pd.Categorical, rather than leaving entries as strings, if you want to reduce from 48 bytes down to 1 or 4
- read large files in chunks. And if you know upfront what you're going to impute NA/missing values with, do as much of that filling as possible while you process each chunk, instead of at the end. If you can't impute with the final value, you can probably at least replace with a sentinel value like -1, 999, -Inf etc., and do the proper imputation later.
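The tricks above can be combined in one chunked read: usecols to drop unwanted columns, dtype to avoid everything landing as strings, and per-chunk sentinel imputation so the NaN blowup never reaches the final frame. A minimal sketch; the inline CSV, the column names, and the -1 sentinel are illustrative assumptions:

```python
import io

import pandas as pd

# Inline CSV with one missing value; stands in for a large file on disk.
csv_data = "id,score,label\n1,10,x\n2,,y\n3,30,x\n"

chunks = []
# usecols keeps only the needed columns; score is read as float64
# because the missing entry arrives as NaN, which int dtypes can't hold.
for chunk in pd.read_csv(io.StringIO(csv_data),
                         usecols=["id", "score"],
                         dtype={"id": "int32", "score": "float64"},
                         chunksize=2):
    # Impute per chunk with a sentinel (-1 here, an assumed placeholder),
    # then downcast to a small dtype before the chunk is kept around.
    chunk["score"] = chunk["score"].fillna(-1).astype("int32")
    chunks.append(chunk)

df = pd.concat(chunks, ignore_index=True)
print(df["score"].tolist())  # [10, -1, 30]
```

Because each chunk is downcast before being appended, peak memory stays close to one chunk's worth of float64 plus the accumulated int32 result, rather than the whole file at 48 bytes per missing-value-tainted cell.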