Load CSV then return list of rows

Problem description

I have working code (the csv.reader version shown further down) that reads a CSV file with two columns and ~500 rows, then returns a list of rows with the values converted to float.

I'm reading around 200k files per test case, for a total of ~5M .csv files. Reading 200k files and returning the lists takes around 1.5 min.

I ran a benchmark that only reads the .csv files, and it takes around 5 s, so the bottleneck is the list comprehension plus the float conversion.

Is it possible to speed things up? I have already tried pandas and NumPy's loadtxt and genfromtxt. All of the alternatives I've tried are much slower than what I have so far.

Example of a .csv file's content:

1.000e-08, -1.432e-07
1.001e-08, 7.992e-07
1.003e-08, -1.838e-05
# continues for ~500 more lines

Some benchmarks:

Reading 200k .csv files, each with 500 lines and 2 columns like the example above:

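Using pandas:

import pandas as pd

def read_csv_return_list_of_rows(csv_file, _delimiter):
    # Parse the whole file into a DataFrame, then return it as a float ndarray
    df = pd.read_csv(csv_file, sep=_delimiter, header=None)
    return df.astype('float').values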

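Using NumPy's genfromtxt: 3 min 58 s (238 s)

import numpy as np

def read_csv_return_list_of_rows(csv_file, _delimiter):
    # genfromtxt parses every field to float and returns a 2-D ndarray
    return np.genfromtxt(csv_file, delimiter=_delimiter)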

Using csv.reader from the stdlib: 1 min 31 s (91 s)

import csv

def read_csv_return_list_of_rows(csv_file, _delimiter):
    with open(csv_file, 'r') as f_read:
        csv_reader = csv.reader(f_read, delimiter=_delimiter)
        # Convert every field of every row to float
        csv_file_list = [[float(i) for i in row] for row in csv_reader]
    return csv_file_list

If I remove the float() call from the last implementation, the time decreases significantly, and the same happens if I remove the list comprehension, so these two are the issue here.
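The observation above can be checked in isolation with a small timing harness. The following is only a sketch: the helper names (read_rows_as_strings, read_rows_as_floats, make_sample_file, time_both) and the synthetic data are assumptions made for illustration, not part of the original code, and the numbers it prints will depend on the machine.

```python
import csv
import os
import tempfile
import timeit


def read_rows_as_strings(csv_file, delimiter):
    """Parse with csv.reader but leave every field as a string."""
    with open(csv_file, "r", newline="") as f_read:
        return list(csv.reader(f_read, delimiter=delimiter))


def read_rows_as_floats(csv_file, delimiter):
    """Same parse, plus the nested list comprehension and float() calls."""
    with open(csv_file, "r", newline="") as f_read:
        reader = csv.reader(f_read, delimiter=delimiter)
        return [[float(i) for i in row] for row in reader]


def make_sample_file(n_rows=500):
    """Write a synthetic two-column file shaped like the example (made-up values)."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f_write:
        for k in range(n_rows):
            x = 1e-08 + k * 1e-11
            y = (-1) ** k * 1.432e-07
            f_write.write(f"{x:.3e}, {y:.3e}\n")
    return path


def time_both(path, number=200):
    """Return (seconds without float(), seconds with float()) for `number` reads."""
    as_strings = timeit.timeit(lambda: read_rows_as_strings(path, ","), number=number)
    as_floats = timeit.timeit(lambda: read_rows_as_floats(path, ","), number=number)
    return as_strings, as_floats


if __name__ == "__main__":
    sample_path = make_sample_file()
    try:
        t_str, t_float = time_both(sample_path)
        print(f"strings only: {t_str:.3f} s, with float(): {t_float:.3f} s")
    finally:
        os.remove(sample_path)
```

Because the same file is read repeatedly, it stays in the OS cache, so the difference between the two timings mostly reflects the conversion work rather than disk access.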

Recommended answer

I can't test this, so it is only a suggestion of how I would try it:

def read_csv_return_list_of_rows_gen(csv_file, _delimiter):
    with open(csv_file, 'r') as f_read:
        for line in f_read:
            yield [float(i) for i in line.split(_delimiter)]

result = list(read_csv_return_list_of_rows_gen(filename, ','))
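One detail worth checking with this approach: line.split(',') leaves the leading space and the trailing newline on the fields, which float() tolerates, but a blank line would make float('') raise ValueError. The variant below is a sketch under that assumption; the name iter_float_rows and the blank-line guard are additions for illustration and are not part of the answer as posted.

```python
import os
import tempfile


def iter_float_rows(csv_file, delimiter=","):
    """Yield one list of floats per non-blank line of csv_file."""
    with open(csv_file, "r") as f_read:
        for line in f_read:
            if not line.strip():
                continue  # float('') would raise ValueError on a blank line
            # float() ignores surrounding whitespace, including the trailing '\n'
            yield [float(field) for field in line.split(delimiter)]


def write_temp_csv(text):
    """Write `text` to a temporary .csv file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w") as f_write:
        f_write.write(text)
    return path


if __name__ == "__main__":
    # Rows taken from the example file, with a blank line added on purpose
    demo_path = write_temp_csv(
        "1.000e-08, -1.432e-07\n1.001e-08, 7.992e-07\n\n1.003e-08, -1.838e-05\n"
    )
    try:
        print(list(iter_float_rows(demo_path)))
    finally:
        os.remove(demo_path)
```

Whether this is faster than the csv.reader version has not been measured here; it would need to be timed on the real files.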
