What is a more efficient way to load 1 column with 1 000 000+ rows than pandas read_csv()?


Question

I'm trying to import large files (.tab/.txt, 300+ columns and 1 000 000+ rows) into Python. The files are tab separated and the columns are filled with integer values. One of my goals is to compute the sum of each column. However, the files are too large to import with pandas.read_csv() as it consumes too much RAM.

Sample data: [sample table not shown]

Therefore I wrote the following code to import one column at a time, compute the sum of that column, store the result in a dataframe (summed_cols), delete the column, and move on to the next column of the file:

import pandas as pd

# Columns I'm interested in start at column 11 (0-based index 10)
x = 10

# Empty dataframe to fill with one row per sample
summed_cols = pd.DataFrame(columns=["sample", "read sum"])

while x < 352:
    x = x + 1
    # Read a single column from the file
    sample_col = pd.read_csv("file.txt", sep="\t", usecols=[x])
    # Append the column name and its sum as a new row
    summed_cols = summed_cols.append(
        pd.DataFrame({"sample": [sample_col.columns[0]],
                      "read sum": [sample_col[sample_col.columns[0]].sum()]}))
    del sample_col

Each column represents a sample and the "read sum" is the sum of that column. So the output of this code is a dataframe with two columns: one sample per row in the first column, and the corresponding read sum in the second column.
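For illustration (the sample names and sums below are made up), summed_cols ends up looking like this:

     sample  read sum
0  sample_A    123456
1  sample_B    654321
...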

This code does exactly what I want to do, but it is not efficient. For this large file it takes about 1-2 hours to complete the calculations. In particular, loading just one column takes quite a long time.

My question: is there a faster way to import just one column of this large tab file and perform the same calculations as I'm doing with the code above?

Answer

You can try something like this:

samples = []
sums = []

with open('file.txt','r') as f:
    for i,line in enumerate(f):
        columns = line.strip().split('\t')[10:] #from column 10 onward
        if i == 0: #supposing the sample_name is the first row of each column
            samples = columns #save sample names
            sums = [0 for s in samples] #init the sums to 0
        else:
            for n,v in enumerate(columns):
                sums[n] += float(v)

result = dict(zip(samples,sums)) #{sample_name:sum, ...}

I am not sure this will work since I don't know the content of your input file, but it describes the general procedure: you open the file only once, iterate over each line, split it to get the columns, and accumulate the sums you need. This avoids re-parsing the entire file for every column, which is what makes the read_csv() loop above so slow. Mind that this code does not deal with missing values.
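If you want the same two-column dataframe your original code produced, the result dict from the snippet above converts directly. A minimal sketch:

import pandas as pd

# Build the final dataframe from the {sample_name: sum} dict
summed_cols = pd.DataFrame(list(result.items()), columns=["sample", "read sum"])
print(summed_cols)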

The else block can be improved using numpy:

import numpy as np
...
else:
    # Convert the string values to floats and add them element-wise
    sums = np.add(sums, np.asarray(columns, dtype=float))
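Another option, if you want to stay in pandas, is to read the file in chunks so only a bounded slice of it is in RAM at any time. This is a sketch under the same assumptions as your loop (tab-separated file, columns of interest at indices 11 through 352; the chunk size is arbitrary), not something I have run on your data:

import pandas as pd

total = None
# Read 100 000 rows at a time to keep memory use bounded
for chunk in pd.read_csv("file.txt", sep="\t", chunksize=100_000,
                         usecols=range(11, 353)):
    chunk_sums = chunk.sum()  # per-column sums for this chunk
    total = chunk_sums if total is None else total + chunk_sums

# total is a Series indexed by column (sample) name;
# reshape it into the two-column dataframe from the question
summed_cols = total.rename("read sum").rename_axis("sample").reset_index()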

