Pandas column-wise chunking
Question
I want to read a large data matrix (currently testing with 90*85000, later 150000*850000) and perform some operations on its columns.
To speed things up I tried chunking. This drastically speeds up the reading (~100x), but since I have to concatenate the chunks for the column-wise operations, I lose all of that speed-up in the later steps.
My questions:
- Is there a way to chunk in the column dimension instead of the row dimension?
- Is there an alternative approach to what I want to achieve?
Edit: some timed runs:
- Reading the small file: ~10 s
- Reading the small file with `chunksize=20`: <0.1 s
- Reading the small file with a hand-rolled column-wise chunking: ~50 s without concatenation, ~4 min with concatenation
- Reading the file line by line, with the same post-processing as Pandas: ~13 s
Answer
You can chunk along both dimensions: pass `usecols` to `pd.read_csv` to select a group of columns, and `chunksize` to stream the rows within that group. Given the file's column names in `columns` and chunking both columns and rows:

```python
import pandas as pd

def chunks(lst, chunksize):
    """Yield successive chunksize-sized slices of lst."""
    for i in range(0, len(lst), chunksize):
        yield lst[i:i + chunksize]

col_chunksize, row_chunksize = 1000, 1000
for use_cols in chunks(columns, col_chunksize):
    # Read only the current group of columns, streaming rows in chunks.
    for chunk in pd.read_csv(file_path, chunksize=row_chunksize, usecols=use_cols):
        process_chunk(chunk)  # e.g. pd.concat() to assemble all rows of use_cols
```
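As a minimal, runnable sketch of the pattern above: this reads an in-memory CSV (via `io.StringIO` instead of a file on disk) two columns at a time, streams the rows of each column group in chunks, and reassembles them with `pd.concat` to compute per-column means. The column names, chunk sizes, and the mean computation are illustrative assumptions, not part of the original answer.

```python
import io
import pandas as pd

def chunks(lst, chunksize):
    """Yield successive chunksize-sized slices of lst."""
    for i in range(0, len(lst), chunksize):
        yield lst[i:i + chunksize]

# Illustrative stand-in for the large file on disk.
csv_text = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11,12\n"
columns = ["a", "b", "c", "d"]

col_means = {}
for use_cols in chunks(columns, 2):            # 2 columns at a time
    parts = pd.read_csv(io.StringIO(csv_text),
                        usecols=use_cols,      # note: the keyword is usecols
                        chunksize=2)           # 2 rows at a time
    block = pd.concat(parts)                   # reassemble this column group
    col_means.update(block.mean().to_dict())

print(col_means)  # {'a': 5.0, 'b': 6.0, 'c': 7.0, 'd': 8.0}
```

Because each column group is concatenated separately, the peak memory cost is one group of columns rather than the whole matrix.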