Load CSV then return list of rows
Question
I have the following working code that reads a csv file with two columns and ~500 rows, then returns a list of lists covering both columns, with the values converted to float.
I'm reading around 200k files per test case, for a total of ~5M .csv files. Reading 200k files and returning the lists takes around 1.5 min.
I ran a benchmark that only reads the .csv files, and it takes around 5 s, so the bottleneck is in the list comprehension plus the float conversion.
Is it possible to speed things up? I have already tried pandas, and NumPy's loadtxt and genfromtxt. All of the alternatives I've tried are much slower than what I have so far.
Example of a .csv file's content:
1.000e-08, -1.432e-07
1.001e-08, 7.992e-07
1.003e-08, -1.838e-05
# continues for about 500 more lines
Some benchmarks, reading 200k .csv files with 500 lines and 2 columns each, like the example above:
Using pandas:
import pandas as pd

def read_csv_return_list_of_rows(csv_file, _delimiter):
    # Parse the file into a DataFrame, then return a float ndarray.
    df = pd.read_csv(csv_file, sep=_delimiter, header=None)
    return df.astype('float').values
Using NumPy's genfromtxt: 3 min 58 s (238 s)
import numpy as np

def read_csv_return_list_of_rows(csv_file, _delimiter):
    return np.genfromtxt(csv_file, delimiter=_delimiter)
Using csv.reader from the stdlib: 1 min 31 s (91 s)
import csv

def read_csv_return_list_of_rows(csv_file, _delimiter):
    with open(csv_file, 'r') as f_read:
        csv_reader = csv.reader(f_read, delimiter=_delimiter)
        # Convert every field of every row to float.
        csv_file_list = [[float(i) for i in row] for row in csv_reader]
    return csv_file_list
If I remove float() from the last implementation, the time drops significantly; the same happens if I remove the list comprehension. So these two are the bottleneck.
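One way to check where the time goes is to time each stage separately on a single in-memory sample. The following is a minimal sketch, not the original benchmark: the synthetic data, the helper names (`make_sample`, `parse_strings`, `parse_floats`, `time_stage`), and the repeat count are all assumptions made for illustration.

```python
import csv
import io
import time


def make_sample(n_rows=500):
    """Build synthetic CSV text shaped like the example (two float columns)."""
    return "".join(
        f"{1e-8 + i * 1e-11:.3e}, {-1.432e-7 * (i + 1):.3e}\n"
        for i in range(n_rows)
    )


def parse_strings(text, delimiter=","):
    """csv.reader only: rows stay as lists of strings."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


def parse_floats(text, delimiter=","):
    """csv.reader plus the nested list comprehension with float()."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [[float(v) for v in row] for row in reader]


def time_stage(func, text, repeats=200):
    """Return total wall-clock seconds for `repeats` calls of func(text)."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(text)
    return time.perf_counter() - start


if __name__ == "__main__":
    sample = make_sample()
    print("strings only:", time_stage(parse_strings, sample))
    print("with float():", time_stage(parse_floats, sample))
```

Because the sample lives in memory, the difference between the two numbers reflects parsing and conversion only, with disk access left out.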
Recommended answer
I can't test this, so it is only a suggestion of what I would try:
def read_csv_return_list_of_rows_gen(csv_file, _delimiter):
    with open(csv_file, 'r') as f_read:
        for line in f_read:
            # Split each line manually instead of going through csv.reader.
            yield [float(i) for i in line.split(_delimiter)]

result = list(read_csv_return_list_of_rows_gen(filename, ','))
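A variation on the same idea, offered as an untested sketch rather than a measured improvement: read the whole file in one call, tokenize it once, and convert all tokens with a single `map(float, ...)` before regrouping them into rows. The function name `read_csv_flat` and the `n_cols` parameter are hypothetical, and the code assumes every row has exactly `n_cols` numeric fields.

```python
def read_csv_flat(csv_file, delimiter=",", n_cols=2):
    """Read a small numeric CSV into a list of rows of floats.

    Assumes every row has exactly n_cols numeric fields (an assumption,
    not something the original post guarantees).
    """
    with open(csv_file, "r") as f_read:
        text = f_read.read()
    # Turn delimiters into whitespace so one split() handles both the
    # column separator and the line breaks, and drops empty tokens.
    values = list(map(float, text.replace(delimiter, " ").split()))
    if len(values) % n_cols:
        raise ValueError("token count is not a multiple of n_cols")
    # Regroup the flat list into rows of n_cols values each.
    return [values[i:i + n_cols] for i in range(0, len(values), n_cols)]
```

Whether this beats the generator version depends on the data and the interpreter, so it is worth timing both on a sample of the real files.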