Reading and parsing a TSV file, then manipulating it for saving as CSV (*efficiently*)


Problem description

My source data is in a TSV file, 6 columns and greater than 2 million rows.

Here is what I want to accomplish:


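  1. I need to read the data in 3 columns (3, 4, 5) from this source file.

  2. The fifth column is an integer. I need to use this integer value to duplicate a row entry built from the data in the third and fourth columns (repeated that many times).

  3. I want to write the output of #2 to an output file in CSV format.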

Below is what I came up with.

My question: is this an efficient way to do it? It seems like it might be intensive when attempted on 2 million rows.

First, I made a sample tab-separated file to work with and called it 'sample.txt'. It's basic and has only four rows:

Row1_Column1    Row1-Column2    Row1-Column3    Row1-Column4    2   Row1-Column6
Row2_Column1    Row2-Column2    Row2-Column3    Row2-Column4    3   Row2-Column6
Row3_Column1    Row3-Column2    Row3-Column3    Row3-Column4    1   Row3-Column6
Row4_Column1    Row4-Column2    Row4-Column3    Row4-Column4    2   Row4-Column6

Then I have this code:

import csv 

with open('sample.txt','r') as tsv:
    AoA = [line.strip().split('\t') for line in tsv]

for a in AoA:
    count = int(a[4])
    while count > 0:
        with open('sample_new.csv','ab') as csvfile:
            csvwriter = csv.writer(csvfile, delimiter=',')
            csvwriter.writerow([a[2], a[3]])
        count = count - 1


Recommended answer

You should use the csv module to read the tab-separated value file. Do not read it into memory in one go. Each row you read has all the information you need to write rows to the output CSV file, after all. Keep the output file open throughout.

import csv

with open('sample.txt','rb') as tsvin, open('new.csv', 'wb') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)

    for row in tsvin:
        count = int(row[4])
        if count > 0:
            csvout.writerows([row[2:4] for _ in xrange(count)])
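
The snippet above is written for Python 2 (`xrange`, binary file modes). As a rough sketch of the same streaming approach on Python 3, assuming text-mode files opened with `newline=''` as the `csv` module documentation recommends, something like the following should work. The function name `expand_rows` and its parameters are made up for illustration and are not part of the original answer.

```python
import csv


def expand_rows(src_path, dst_path):
    """Stream a TSV file and write columns 3 and 4 of each row `count` times."""
    # newline='' lets the csv module handle line endings itself (Python 3).
    with open(src_path, newline='') as tsvin, \
         open(dst_path, 'w', newline='') as csvout:
        reader = csv.reader(tsvin, delimiter='\t')
        writer = csv.writer(csvout)
        for row in reader:
            count = int(row[4])  # the fifth column holds the repeat count
            # range() yields nothing for count <= 0, so no separate check is needed
            writer.writerows(row[2:4] for _ in range(count))
```

Calling `expand_rows('sample.txt', 'new.csv')` would read one row at a time and keep the output file open for the whole run, which is the point the answer makes about memory use and repeated file opens.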

or, using the itertools module to do the repeating with itertools.repeat():

from itertools import repeat
import csv

with open('sample.txt','rb') as tsvin, open('new.csv', 'wb') as csvout:
    tsvin = csv.reader(tsvin, delimiter='\t')
    csvout = csv.writer(csvout)

    for row in tsvin:
        count = int(row[4])
        if count > 0:
            csvout.writerows(repeat(row[2:4], count))
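
To check the logic without touching the disk, the `itertools.repeat()` variant can be run against in-memory text. This is only a sketch for Python 3: it uses `io.StringIO` in place of real files, two of the sample rows with explicit tab characters, and `lineterminator='\n'` so the output is easy to read (the `csv` writer's default is `'\r\n'`). None of these choices come from the original answer.

```python
import csv
import io
from itertools import repeat

# Two of the sample rows, written with explicit tab separators.
sample = (
    "Row1_Column1\tRow1-Column2\tRow1-Column3\tRow1-Column4\t2\tRow1-Column6\n"
    "Row3_Column1\tRow3-Column2\tRow3-Column3\tRow3-Column4\t1\tRow3-Column6\n"
)

out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')
for row in csv.reader(io.StringIO(sample), delimiter='\t'):
    # Write columns 3 and 4 as many times as column 5 says.
    writer.writerows(repeat(row[2:4], int(row[4])))

result = out.getvalue()
print(result)
# Row1-Column3,Row1-Column4
# Row1-Column3,Row1-Column4
# Row3-Column3,Row3-Column4
```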
