Pandas column-wise chunking

Problem description

I want to read a large data matrix (currently testing with 90*85000, later 150000*850000) and do some operations on the columns.

In order to speed things up I tried chunking. This drastically speeds up (~100x) the reading process, but since I have to concatenate the chunks for column-wise operations, I am losing all the speed-up in later steps.
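
For reference, a minimal sketch of the row-chunked read and concatenation described above (the file name, chunk size, and column operation here are hypothetical):

import pandas as pd

# Row-wise chunked read: iterating the chunks is fast, but a column-wise
# operation needs the whole column, so the chunks must be stitched back together.
reader = pd.read_csv("data.csv", chunksize=1000)  # hypothetical file and chunk size
df = pd.concat(reader)       # this concatenation eats the speed-up from chunking
col_means = df.mean(axis=0)  # example of a column-wise operation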

My questions:
- Is there a way to chunk in the column dimension instead of the row dimension?
- Is there an alternative approach to what I want to achieve?

Edit: Some timed runs:

  • Reading the small file: ~10 s
  • Reading the small file with 'chunksize=20': <0.1 s
  • Reading the small file with a manually implemented column-wise chunking: ~50 s without concatenation, ~4 min with concatenation
  • Reading the file line by line, with the same post-processing as with Pandas: ~13 s

Answer

You can chunk over the columns (via usecols) as well as the rows (via chunksize):

import pandas as pd

def chunks(lst, chunksize):
    # Yield successive chunksize-sized slices of lst.
    for i in range(0, len(lst), chunksize):
        yield lst[i:i + chunksize]

col_chunksize, row_chunksize = 1000, 1000
for use_cols in chunks(columns, col_chunksize):
    for chunk in pd.read_csv(file_path, chunksize=row_chunksize, usecols=use_cols):
        process_chunk(chunk)  # e.g. pd.concat() to process all rows of use_cols
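
Continuing the snippet above, a hypothetical way to fill in process_chunk, assuming the goal is a per-column aggregate such as the mean: concatenate only the row chunks of the current column block, so no concatenation ever spans the full width of the matrix.

# Hypothetical usage, reusing chunks(), columns, file_path and the chunk
# sizes from above: compute a per-column mean for each block of columns.
results = {}
for use_cols in chunks(columns, col_chunksize):
    block = pd.concat(
        pd.read_csv(file_path, chunksize=row_chunksize, usecols=use_cols)
    )  # concatenates the rows of this column block only
    results.update(block.mean(axis=0).to_dict())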
