Python large files, how to find specific lines with a particular string

Question

I am using Python to process data from very large text files (~52GB, 800 million lines each with 30 columns of data). I am trying to find an efficient way to find specific lines. Luckily the string is always in the first column.

The whole thing works, and memory is not a problem (I'm not loading the file, just opening and closing it as needed), and I run it on a cluster anyway. It's more about speed. The script takes days to run!

The data looks like this:

scaffold126     1       C       0:0:20:0:0:0     0:0:1:0:0:0     0:0:0:0:0:0     
scaffold126     2       C       0:0:10:0:0:0     0:0:1:0:0:0     0:0:0:0:0:0
scaffold5112     2       C       0:0:10:0:0:0     0:0:1:0:0:0     0:0:0:0:0:0
scaffold5112     2       C       0:0:10:0:0:0     0:0:1:0:0:0     0:0:0:0:0:0

I am searching for all the lines that start with a particular string in the first column. I want to process the data and send a summary to an output file. Then I search all the lines for another string, and so on...

I am using something like this:

import sys

for thisScaff in AllScaffs:
    # The whole file is re-read once for every scaffold name.
    with open(sys.argv[2], 'r') as InFile:
        for line in InFile:
            LineList = line.split()
            currentScaff = LineList[0]
            if thisScaff == currentScaff:
                pass  # Then do this stuff...

The main problem seems to be that all 800 million lines have to be looked through to find those that match the current string. Then, once I move to another string, all 800 million have to be looked through again. I have been exploring grep options, but is there another way?

Many thanks!

Recommended answer

My first instinct would be to load your data into a database, making sure to create an index from column 0, and then query as needed.
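The database idea could be sketched with SQLite through the standard-library `sqlite3` module. This is only a minimal illustration under stated assumptions: the table name `rows`, the index name `idx_scaffold`, and the helper names `build_index` and `lines_for` are hypothetical, and each row stores the first column plus the raw line rather than all 30 columns. For a file of the size described, `db_path` would presumably be an on-disk file rather than `":memory:"`.

```python
import sqlite3

def build_index(db_path, lines):
    """Load lines into SQLite, keyed by their first whitespace-separated field."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS rows (scaffold TEXT, line TEXT)")
    # Skip blank lines; keep the raw line so it can be parsed later.
    pairs = ((line.split(None, 1)[0], line) for line in lines if line.strip())
    conn.executemany("INSERT INTO rows VALUES (?, ?)", pairs)
    # Create the index after the bulk insert so lookups by scaffold avoid a full scan.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_scaffold ON rows (scaffold)")
    conn.commit()
    return conn

def lines_for(conn, scaffold):
    """Return all stored lines for one scaffold, in insertion order."""
    cur = conn.execute(
        "SELECT line FROM rows WHERE scaffold = ? ORDER BY rowid", (scaffold,)
    )
    return [row[0] for row in cur]

# Small in-memory demonstration with made-up rows in the question's format.
sample = [
    "scaffold126 1 C 0:0:20:0:0:0\n",
    "scaffold126 2 C 0:0:10:0:0:0\n",
    "scaffold5112 2 C 0:0:10:0:0:0\n",
]
conn = build_index(":memory:", sample)
matches = lines_for(conn, "scaffold126")
```

The load is a one-time cost; after that, each scaffold query uses the index instead of rereading every line.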

For a Python approach, try this:

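import sys

wanted_scaffs = {'scaffold126', 'scaffold5112'}
# One output file per wanted scaffold, all filled in a single pass over the big file.
files = {name: open(name + '.txt', 'w') for name in wanted_scaffs}
with open(sys.argv[2], 'r') as big_file:  # input path taken as in the question
    for line in big_file:
        fields = line.split(None, 1)  # minimal splitting, on any whitespace
        if fields and fields[0] in wanted_scaffs:
            files[fields[0]].write(line)
for f in files.values():
    f.close()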

Then do your summary reports:

for scaff in wanted_scaffs:
    with open(scaff + '.txt', 'r') as f:
        ... # summarize your data
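If the summary can be accumulated line by line, the intermediate files might not be needed at all. The sketch below assumes a deliberately trivial summary (a line count per scaffold), since the question does not say what the real summary is; the function name `count_lines_per_scaffold` is hypothetical.

```python
from collections import defaultdict

def count_lines_per_scaffold(lines, wanted):
    """Count matching lines per scaffold in one pass over the input."""
    counts = defaultdict(int)
    for line in lines:
        fields = line.split(None, 1)  # only the first column is needed
        if fields and fields[0] in wanted:
            counts[fields[0]] += 1
    return dict(counts)

# Made-up rows in the question's format, standing in for the real file object.
sample = [
    "scaffold126 1 C 0:0:20:0:0:0\n",
    "scaffold126 2 C 0:0:10:0:0:0\n",
    "scaffold5112 2 C 0:0:10:0:0:0\n",
    "scaffold999 5 C 0:0:10:0:0:0\n",
]
summary = count_lines_per_scaffold(sample, {"scaffold126", "scaffold5112"})
```

Because `lines` can be any iterable, an open file object can be passed directly, so the file is still read only once.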
