Reading and parsing a TSV file, then manipulating it for saving as CSV (*efficiently*)
Question
My source data is in a TSV file, 6 columns and greater than 2 million rows.
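Here is what I am trying to accomplish:
- I need to read the data in 3 of the columns (3, 4, 5) of this source file.
- The fifth column is an integer. I need to use this integer value to duplicate a row entry built from the data in the third and fourth columns (the entry is repeated that many times).
- I want to write the output of step 2 to an output file in CSV format.
Below is what I came up with.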
My question: is this an efficient way to do it? It seems like it might be intensive when attempted on 2 million rows.
First, I made a sample tab-separated file to work with, and called it 'sample.txt'. It's basic and only has four rows:
Row1_Column1 Row1-Column2 Row1-Column3 Row1-Column4 2 Row1-Column6
Row2_Column1 Row2-Column2 Row2-Column3 Row2-Column4 3 Row2-Column6
Row3_Column1 Row3-Column2 Row3-Column3 Row3-Column4 1 Row3-Column6
Row4_Column1 Row4-Column2 Row4-Column3 Row4-Column4 2 Row4-Column6
Then I have this code:
import csv
with open('sample.txt','r') as tsv:
    AoA = [line.strip().split('\t') for line in tsv]
for a in AoA:
    count = int(a[4])
    while count > 0:
        with open('sample_new.csv','ab') as csvfile:
            csvwriter = csv.writer(csvfile, delimiter=',')
            csvwriter.writerow([a[2], a[3]])
        count = count - 1
Recommended answer
You should use the csv module to read the tab-separated value file. Do not read it into memory in one go. Each row you read has all the information you need to write rows to the output CSV file, after all. Keep the output file open throughout.
import csv
with open('sample.txt','rb') as tsvin, open('new.csv', 'wb') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)
    for row in tsvin:
        count = int(row[4])
        if count > 0:
            csvout.writerows([row[2:4] for _ in xrange(count)])
Or, using the itertools module to do the repeating with itertools.repeat():
from itertools import repeat
import csv
with open('sample.txt','rb') as tsvin, open('new.csv', 'wb') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)
    for row in tsvin:
        count = int(row[4])
        if count > 0:
            csvout.writerows(repeat(row[2:4], count))
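The snippets in this answer use Python 2 idioms (binary file modes for the csv module, and xrange). As a rough sketch only, the same streaming approach might look like this on Python 3, where the csv module expects text-mode files opened with newline=''. The function name expand_tsv_to_csv and its two path parameters are hypothetical, introduced here purely for illustration; they are not part of the original answer.

```python
import csv
from itertools import repeat


def expand_tsv_to_csv(tsv_path, csv_path):
    """Hypothetical helper: stream a TSV file row by row and write
    columns 3 and 4 to a CSV file, repeated as many times as the
    integer in column 5 says."""
    # Python 3: open in text mode with newline='' so the csv module
    # controls line endings itself.
    with open(tsv_path, newline='') as tsvin, \
            open(csv_path, 'w', newline='') as csvout:
        reader = csv.reader(tsvin, delimiter='\t')
        writer = csv.writer(csvout)
        for row in reader:
            count = int(row[4])  # fifth column holds the repeat count
            if count > 0:
                # repeat() yields the same two-column slice count times
                # without building an intermediate list.
                writer.writerows(repeat(row[2:4], count))
```

Calling expand_tsv_to_csv('sample.txt', 'new.csv') would then process the sample file from the question. Only one input row is held in memory at a time, and the output file is opened once rather than once per written row.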