Reading sections from a large text file in Python efficiently

Question

I have a large text file containing several million lines of data. The first column contains position coordinates. I need to create another file from this original data that contains only specified non-contiguous intervals, selected by position coordinate. I have another file containing the coordinates for each interval. For instance, my original file is in a format similar to this:

Position   Data1   Data2   Data3  Data4  
55         a       b       c      d
63         a       b       c      d
68         a       b       c      d  
73         a       b       c      d 
75         a       b       c      d
82         a       b       c      d
86         a       b       c      d

Then, let's say my intervals file looks something like this:

name1   50   72
name2   78   93

I want my new file to look something like this:

Position   Data1   Data2   Data3  Data4  
55         a       b       c      d
63         a       b       c      d
68         a       b       c      d 
82         a       b       c      d
86         a       b       c      d

So far I have created a function that writes the data from the original file that falls within a specific interval to my new file. My code is as follows:

def get_block(beg,end):
   output=open(output_table,'a')
   with open(input_table,'r') as f:
      for line in f:
         line=line.strip("\r\n")
         line=line.split("\t")
         position=int(line[0])
         if int(position)<=beg:
            pass
         elif int(position)>=end:
            break
         else:
            for i in line:
               output.write(("%s\t")%(i))
            output.write("\n")

I then create a list containing the start/stop pairs of my intervals and loop over it, calling the above function for each pair, like this:

#coords=[[start1,stop1],[start2,stop2],[start3,stop3]..etc]
for i in coords:
   start_p=int(i[0]) ; stop_p=int(i[1])
   get_block(start_p,stop_p)

This does what I want, but it gets progressively slower as it moves along my coordinate list, because each call to get_block rereads the file from the beginning until it reaches the specified start coordinate. Is there a more efficient way of accomplishing this? Is there a way to skip to a specific line each time instead of reading over every line?

Recommended answer

I'd just use the built-in csv module to simplify reading the input. To speed things up further, all the coordinate ranges can be read into memory at once, which lets the selection happen in a single pass through the data file.

import csv

# read all coord ranges into memory
# (Python 3: open csv files in text mode with newline='')
with open('ranges', newline='') as ranges:
    range_reader = csv.reader(ranges, delimiter='\t')
    coords = [(int(start), int(stop)) for name, start, stop in range_reader]

# make one pass through input file and extract positions specified
with open('output_table', 'w') as outf, open('input_table', newline='') as inf:
    input_reader = csv.reader(inf, delimiter='\t')
    outf.write('\t'.join(next(input_reader)) + '\n')  # copy header row
    for row in input_reader:
        for start, stop in coords:
            if start <= int(row[0]) <= stop:  # bounds are inclusive
                outf.write('\t'.join(row) + '\n')
                break  # row matched; no need to test the remaining ranges
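
With many intervals, the inner loop over every range for every row can itself become the bottleneck. The following is a minimal sketch of one possible refinement, not part of the original answer: it sorts the ranges by start coordinate and uses the standard-library bisect module to find the single candidate range for each row. It assumes the intervals do not overlap (the question describes them as non-contiguous) and that both files are tab-separated. The function names load_ranges and filter_rows are made up for this illustration, and in-memory io.StringIO objects stand in for the real files.

```python
import bisect
import csv
import io


def load_ranges(range_file):
    """Read name/start/stop rows and return (start, stop) pairs sorted by start."""
    reader = csv.reader(range_file, delimiter='\t')
    return sorted((int(start), int(stop)) for _name, start, stop in reader)


def filter_rows(data_file, out_file, ranges):
    """Copy the header, then every row whose position falls in a range (inclusive).

    Assumes the ranges are sorted by start and do not overlap.
    """
    starts = [start for start, _stop in ranges]
    reader = csv.reader(data_file, delimiter='\t')
    writer = csv.writer(out_file, delimiter='\t', lineterminator='\n')
    writer.writerow(next(reader))  # copy header row
    for row in reader:
        position = int(row[0])
        # Index of the last range whose start is <= position (-1 if there is none).
        i = bisect.bisect_right(starts, position) - 1
        if i >= 0 and position <= ranges[i][1]:
            writer.writerow(row)


# Small in-memory example modelled on the data in the question.
ranges = load_ranges(io.StringIO("name1\t50\t72\nname2\t78\t93\n"))
data = io.StringIO(
    "Position\tData1\n55\ta\n63\ta\n68\ta\n73\ta\n75\ta\n82\ta\n86\ta\n"
)
out = io.StringIO()
filter_rows(data, out, ranges)
print(out.getvalue(), end="")
```

To use this with real files, open them as in the answer above (text mode with newline='') and pass the file objects in place of the StringIO stand-ins. Each row then costs one binary search over the range starts rather than a scan of the whole range list, and the data file is still read only once.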
