Pandas column-wise chunking
Question
I want to read a large data matrix (currently testing with 90*85000, later 150000*850000) and perform some operations on its columns.
To speed things up I tried chunking. This drastically speeds up the reading (~100x), but since I have to concatenate the chunks for the column-wise operations, I lose all of that speed-up in the later steps.
My questions:
- Is there a way to chunk in the column dimension instead of the row dimension?
- Is there an alternative approach to what I want to achieve?
Edit: some timed runs:
- Reading the small file: ~10 s
- Reading the small file with `chunksize=20`: <0.1 s
- Reading the small file with a hand-rolled column-wise chunking: ~50 s without concatenation, ~4 min with concatenation
- Reading the file line by line, with the same post-processing as Pandas: ~13 s
Answer
You can chunk along both dimensions: pass `usecols` to `pd.read_csv` to select a group of columns, and `chunksize` to stream the rows within that group. Given the file's column names in `columns` and chunking both columns and rows:

```python
import pandas as pd

def chunks(lst, chunksize):
    """Yield successive chunksize-sized slices of lst."""
    for i in range(0, len(lst), chunksize):
        yield lst[i:i + chunksize]

col_chunksize, row_chunksize = 1000, 1000
for use_cols in chunks(columns, col_chunksize):
    # Read only the current group of columns, streaming rows in chunks.
    for chunk in pd.read_csv(file_path, chunksize=row_chunksize, usecols=use_cols):
        process_chunk(chunk)  # e.g. pd.concat() to assemble all rows of use_cols
```
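As a minimal, runnable sketch of the pattern above: this reads an in-memory CSV (via `io.StringIO` instead of a file on disk) two columns at a time, streams the rows of each column group in chunks, and reassembles them with `pd.concat` to compute per-column means. The column names, chunk sizes, and the mean computation are illustrative assumptions, not part of the original answer.

```python
import io
import pandas as pd

def chunks(lst, chunksize):
    """Yield successive chunksize-sized slices of lst."""
    for i in range(0, len(lst), chunksize):
        yield lst[i:i + chunksize]

# Illustrative stand-in for the large file on disk.
csv_text = "a,b,c,d\n1,2,3,4\n5,6,7,8\n9,10,11,12\n"
columns = ["a", "b", "c", "d"]

col_means = {}
for use_cols in chunks(columns, 2):            # 2 columns at a time
    parts = pd.read_csv(io.StringIO(csv_text),
                        usecols=use_cols,      # note: the keyword is usecols
                        chunksize=2)           # 2 rows at a time
    block = pd.concat(parts)                   # reassemble this column group
    col_means.update(block.mean().to_dict())

print(col_means)  # {'a': 5.0, 'b': 6.0, 'c': 7.0, 'd': 8.0}
```

Because each column group is concatenated separately, the peak memory cost is one group of columns rather than the whole matrix.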