How to read a file and extract data between multiline patterns?

Question

I have a file from which I need to extract one piece of data, delimited by (possibly) multiline fixed patterns:

some data ... [my opening pattern
is here
and can be multiline] the data 
I want to extract [my ending
pattern which can be
multiline as well] ... more data

These patterns are fixed in the sense that their content is always the same, except that newlines may appear between words.

The solution would be simple if I could be sure that the patterns were predictably formatted, but I cannot.

Is there a way to match such "patterns" to a stream?

There is a nearly duplicate question whose answers point towards buffering the input. The difference in my case is that I know the exact strings in the patterns, except that the words may also be separated by a newline (so there is no need for \w*-style matches).

Answer

Are you looking for this?

>>> import re
>>> data = """
... some data ... [my opening pattern
... is here
... and can be multiline] the data
... I want to extract [my ending
... pattern which can be
... multiline as well] ... more data
... """
>>> re.findall(r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]', data)
['the data \nI want to extract']
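The regex above matches any bracketed text. Since the question says the delimiter strings are known exactly and only the whitespace between words varies, a pattern can also be built from the fixed phrases themselves. The following is a minimal sketch under that assumption; the helper name `flexible` and the constants `OPENING` and `CLOSING` are made up for illustration, with the delimiter text taken from the sample in the question.

```python
import re

def flexible(phrase):
    """Turn a fixed phrase into a regex that accepts any run of whitespace
    (spaces, tabs, newlines) wherever the phrase has a space."""
    return r"\s+".join(re.escape(word) for word in phrase.split())

# Hypothetical delimiters, copied from the sample in the question.
OPENING = "[my opening pattern is here and can be multiline]"
CLOSING = "[my ending pattern which can be multiline as well]"

# re.DOTALL lets '.' cross newlines; the lazy group stops at the first CLOSING.
pattern = re.compile(
    flexible(OPENING) + r"\s*(.*?)\s*" + flexible(CLOSING),
    re.DOTALL,
)

data = """
some data ... [my opening pattern
is here
and can be multiline] the data
I want to extract [my ending
pattern which can be
multiline as well] ... more data
"""

extracted = pattern.findall(data)
print(extracted)
```

Escaping each word with `re.escape` keeps the brackets literal, so only the whitespace between words is allowed to vary.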

UPDATE: To read a large file in chunks, I suggest the following approach:

## The following was modified based on ChrisA's code in
## http://www.gossamer-threads.com/lists/python/python/1242366.
## Titled "How to read from a file to an arbitrary delimiter efficiently?"
import re

class ChunkIter:
    def __init__(self, f, delim):
        """ f: file object
        delim: regex pattern"""        
        self.f = f
        self.delim = re.compile(delim)
        self.buffer = ''
        self.part = '' # the string to return

    def read_to_delim(self):
        """Return characters up to the last delim, or None if at EOF"""

        while "delimiter not found":
            b = self.f.read(256)
            if not b: # if EOF
                self.part = None
                break
            # Continue reading to buffer
            self.buffer += b
            # Try regex split the buffer string    
            parts = self.delim.split(self.buffer)
            # If pattern is found
            if parts[:-1]:
                # Retrieve the string up to the last delim
                self.part = ''.join(parts[:-1])
                # Reset buffer string
                self.buffer = parts[-1]
                break   

        return self.part

if __name__ == '__main__':
    with open('input.txt', 'r') as f:
        # The outer capturing group makes split() keep the delimiter text.
        chunk = ChunkIter(f, r'(\[[^]]*\]\s+(?:[^[]+)\s+\[[^]]+\])')
        while chunk.read_to_delim():
            print(re.findall(r'\[[^]]*\]\s+([^[]+)\s+\[[^]]+\]', chunk.part))

    print('job done.')
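The same buffering idea can also be written as a generator. This is only a sketch, not the code from the linked thread. The function name `iter_matches` and the `chunk_size` parameter are assumptions, and it relies on the pattern ending in a fixed terminator (here the closing `]`), so a match cannot be reported before all of its text has been read. If no match ever occurs, the buffer grows to the size of the file.

```python
import io
import re

def iter_matches(f, pattern, chunk_size=256):
    """Yield group(1) of every match of `pattern` in file object `f`,
    reading `chunk_size` characters at a time."""
    regex = re.compile(pattern)
    buffer = ""
    while True:
        block = f.read(chunk_size)
        if not block:  # EOF: the leftover buffer holds no complete match
            break
        buffer += block
        end = 0
        for m in regex.finditer(buffer):
            yield m.group(1)
            end = m.end()
        # Drop everything up to the last match; the tail may hold
        # the beginning of a match that the next block completes.
        buffer = buffer[end:]

# Usage with an in-memory file and a deliberately small chunk size.
sample = io.StringIO(
    "some data ... [my opening pattern\nis here\nand can be multiline] the data\n"
    "I want to extract [my ending\npattern which can be\nmultiline as well] ... more data\n"
)
PATTERN = r"\[[^\]]*\]\s+([^\[]+)\s+\[[^\]]+\]"
results = list(iter_matches(sample, PATTERN, chunk_size=16))
print(results)
```

A generator lets the caller process matches one at a time with a plain `for` loop and stop early without reading the rest of the file.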
