Reading in a file block by block using a specified delimiter in Python

Question

I have an input_file.fa file like this (FASTA format):

> header1 description
data data
data
>header2 description
more data
data
data

I want to read in the file one chunk at a time, so that each chunk contains one header and the corresponding data, e.g. block 1:

> header1 description
data data
data

Of course, I could just read in the whole file like this and split it:

with open("input_file.fa") as f:
    for block in f.read().split(">"):
        pass

But I want to avoid reading the whole file into memory, because the files are often large.

Of course, I can read in the file line by line:

with open("input_file.fa") as f:
    for line in f:
        pass

But ideally what I want is something like this:

with open("input_file.fa", newline=">") as f:
    for block in f:
        pass

But I get an error:

ValueError: illegal newline value: >

I've also tried using the csv module, but with no success.

I did find this post from 3 years ago, which provides a generator-based solution to this issue, but it doesn't seem that compact. Is this really the only/best solution? It would be neat if it were possible to create the generator with a single line rather than a separate function, something like this pseudocode:

with open("input_file.fa") as f:
    blocks = magic_generator_split_by_>
    for block in blocks:
        pass

If this is impossible, then I guess you could consider my question a duplicate of the other post, but if that is so, I hope people can explain to me why the other solution is the only one. Many thanks.

Recommended answer

A general solution here is to write a generator function that yields one group at a time. This way you will be storing only one group at a time in memory.

def get_groups(seq, group_by):
    data = []
    for line in seq:
        # Here the `startswith()` logic can be replaced with other
        # condition(s) depending on the requirement.
        if line.startswith(group_by):
            if data:
                yield data
                data = []
        data.append(line)

    if data:
        yield data

with open('input_file.fa') as f:
    for i, group in enumerate(get_groups(f, ">"), start=1):
        print("Group #{}".format(i))
        print("".join(group))

Output:

Group #1
> header1 description
data data
data

Group #2
>header2 description
more data
data
data
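If a more compact generator is wanted, the same grouping idea can be sketched with itertools.groupby. This is only an illustration, not part of the original answer: the name iter_blocks and the in-memory sample are assumptions made for the example, and it relies on groupby() calling the key function once per line, in order.

```python
import io
import itertools


def iter_blocks(lines, prefix=">"):
    """Yield one block (header line plus its data lines) at a time as a string."""
    block_id = 0

    def block_key(line):
        # Bump the counter on every header line, so all lines of one
        # block share the same key and groupby() keeps them together.
        nonlocal block_id
        if line.startswith(prefix):
            block_id += 1
        return block_id

    for _, group in itertools.groupby(lines, key=block_key):
        yield "".join(group)


# Demo on an in-memory stand-in for input_file.fa (hypothetical sample data).
sample = io.StringIO(
    "> header1 description\ndata data\ndata\n"
    ">header2 description\nmore data\ndata\ndata\n"
)
for number, block in enumerate(iter_blocks(sample), start=1):
    print("Block #{}".format(number))
    print(block)
```

With a real file, iter_blocks(f) can be used directly inside the with open(...) block. Only the lines of the current block are held in memory, just as with get_groups above.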


For FASTA formats in general, I would recommend using the Biopython package.
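As for why newline=">" fails: the newline argument of open() only accepts None, '', '\n', '\r' and '\r\n', so splitting on any other delimiter has to be done in your own code. When the delimiter does not necessarily sit at the start of a line, a chunk-based reader is one option. The following is a rough sketch under that assumption; read_delimited and chunk_size are hypothetical names chosen for the example. Like f.read().split(">"), it strips the delimiter from the yielded pieces, and it also skips empty pieces.

```python
import io


def read_delimited(stream, delimiter=">", chunk_size=64 * 1024):
    """Yield the non-empty pieces of a text stream separated by `delimiter`."""
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last delimiter is complete; the tail may
        # still grow (or hold part of a delimiter), so keep it in the buffer.
        *complete, buffer = buffer.split(delimiter)
        for piece in complete:
            if piece:
                yield piece
    if buffer:
        yield buffer


# Demo on an in-memory stand-in for input_file.fa (hypothetical sample data).
sample = io.StringIO(
    "> header1 description\ndata data\ndata\n"
    ">header2 description\nmore data\ndata\ndata\n"
)
for piece in read_delimited(sample, ">", chunk_size=16):
    print(repr(piece))
```

Memory use is bounded by the largest single block plus one chunk, rather than by the size of the whole file.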
