Python read through file until match, read until next pattern


Question

Python 2.4.3

I need to read through some files (which can be as large as 10 GB). The script should go through the file until a line matches a pattern, then print that line and every line after it until a line matches another pattern. At that point it should resume reading through the file, without printing, until the next match of the first pattern.

For example, the file contains:

---- Alpha ---- Zeta
...(text lines)

---- Bravo ---- Delta
...(text lines)

If matching on ---- Alpha ---- Zeta, it should print ---- Alpha ---- Zeta and every line after it until it encounters ---- Bravo ---- Delta (or any header other than ---- Alpha ---- Zeta). It should then read straight past those lines without printing until it matches ---- Alpha ---- Zeta again.

The following matches what I'm looking for, but it only prints the matching line and not the text that follows it.

Any idea where I'm going wrong?

import re
fh = open('text.txt', 'r')

re1='(-)'   # Any Single Character 1
re2='(-)'   # Any Single Character 2
re3='(-)'   # Any Single Character 3
re4='(-)'   # Any Single Character 4
re5='( )'   # White Space 1
re6='(Alpha)'  # Word 1
re6a='((?:[a-z][a-z]+))'   # Word 1 alternate
re7='( )'   # White Space 2
re8='(-)'   # Any Single Character 5
re9='(-)'   # Any Single Character 6
re10='(-)'  # Any Single Character 7
re11='(-)'  # Any Single Character 8
re12='(\\s+)'  # White Space 3
re13='(Zeta)'  # Word 2
re13a='((?:[a-z][a-z]+))'  # Word 2 alternate


rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10+re11+re12+re13,re.IGNORECASE|re.DOTALL)
rga = re.compile(re1+re2+re3+re4+re5+re6a+re7+re8+re9+re10+re11+re12+re13a,re.IGNORECASE|re.DOTALL)


for line in fh:
    if re.match(rg, line):
        print line
        fh.next()
        while not re.match(rga, line):
            print fh.next()

fh.close()

And my sample text file:

---- Pappa ---- Oscar
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris eleifend imperdiet 
lacus quis imperdiet. Nulla erat neque, laoreet vel fermentum a, dapibus in sem. 
Maecenas elementum nisi nec neque pellentesque ac rutrum urna cursus. Nam non purus 
sit amet dolor fringilla venenatis. Integer augue neque, scelerisque ac dictum at, 
venenatis elementum libero. Etiam nec ante in augue porttitor laoreet. Aenean ultrices
pellentesque erat, id porta nulla vehicula id. Cras eu ante nec diam dapibus hendrerit
in ac diam. Vivamus velit erat, tincidunt id tempus vitae, tempor vel leo. Donec 
aliquam nibh mi, non dignissim justo.

---- Alpha ---- Zeta
Sed molestie tincidunt euismod. Morbi ultrices diam a nibh varius congue. Nulla velit
erat, luctus ac ornare vitae, pharetra quis felis. Sed diam orci, accumsan eget 
commodo eu, posuere sed mi. Phasellus non leo erat. Mauris turpis ipsum, mollis sed 
ismod nec, aliquam non quam. Vestibulum sem eros, euismod ut pharetra sit amet, 
dignissim eget leo.

---- Charley ---- Oscar
Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. 
Aliquam commodo, metus at vulputate hendrerit, dui justo tempor dui, at posuere    
ante vitae lorem. Fusce rutrum nibh a erat condimentum laoreet. Nullam eu hendrerit 
sapien. Suspendisse id lobortis urna. Maecenas ut suscipit nisi. Proin et metus at 
urna euismod sollicitudin eu at mi. Aliquam ac egestas magna. Quisque ac vestibulum 
lectus. Duis ac libero magna, et volutpat odio. Cras mollis tincidunt nibh vel rutrum.
Curabitur fringilla, ante eget scelerisque rhoncus, libero nisl porta leo, ac
vulputate mi erat vitae felis. Praesent auctor fringilla rutrum. Aenean sapien ligula,
imperdiet sodales ullamcorper ut, vulputate at enim.


---- Bravo ---- Delta
Donec cursus tincidunt pellentesque. Maecenas neque nisi, dignissim ac aliquet ac,
vestibulum ut tortor. Pellentesque habitant morbi tristique senectus et netus et
malesuada fames ac turpis egestas. Aenean ullamcorper dapibus accumsan. Aenean eros
tortor, ultrices at adipiscing sed, lobortis nec dolor. Fusce eros ligula, posuere
quis porta nec, rhoncus et leo. Curabitur turpis nunc, accumsan posuere pulvinar eget,
sollicitudin eget ipsum. Sed a nibh ac est porta sollicitudin. Pellentesque ut urna ut 
risus pharetra mollis tincidunt sit amet sapien. Sed semper sollicitudin eros quis 
pellentesque. Curabitur ac metus lorem, ac malesuada ipsum. Nulla turpis erat, congue 
eu gravida nec, egestas id nisi. Praesent tellus ligula, pretium vitae ullamcorper 
vitae, gravida eu ipsum. Cras sed erat ligula.


---- Alpha ---- Zeta
Cras id condimentum lectus. Sed sit amet odio eros, ut mollis sapien. Etiam varius 
tincidunt quam nec mattis. Nunc eu varius magna. Maecenas id ante nisl. Cras sed augue 
ipsum, non mollis velit. Fusce eu urna id justo sagittis laoreet non id urna. Nullam 
venenatis tincidunt gravida. Proin mattis est sit amet dolor malesuada sagittis. 
Curabitur in lacus rhoncus mi posuere ullamcorper. Phasellus eget odio libero, ut 
lacinia orci. Pellentesque iaculis, ligula at varius vulputate, arcu leo dignissim 
massa, non adipiscing lectus magna nec dolor. Quisque in libero nec orci vestibulum 
dapibus. Nulla turpis massa, varius quis gravida eu, bibendum et nisl. Fusce tincidunt 
laoreet elit, sed egestas diam pharetra eget. Maecenas lacus velit, egestas nec tempor 
eget, hendrerit et massa.

+++++++++++++++++++++ Update ++++++++++++++++++++++++++++++++

The following code does work. It matches the header-type row, then prints that line and every line after it until the next header-type row. If that header is not the one being searched for, it skips ahead until the next header that does match.

The only problem is that it's really slow. It takes about a minute to get through 10 million lines.

import re

re1='(-)'   # Any Single Character 1
re2='(-)'   # Any Single Character 2
re3='(-)'   # Any Single Character 3
re4='(-)'   # Any Single Character 4
re5='( )'   # White Space 1
re6='(Alpha)'  # Word 1
re6a='((?:[a-z][a-z]+))'   # Word 1 alternate
re7='( )'   # White Space 2
re8='(-)'   # Any Single Character 5
re9='(-)'   # Any Single Character 6
re10='(-)'  # Any Single Character 7
re11='(-)'  # Any Single Character 8
re12='(\\s+)'  # White Space 3
re13='(Zeta)'  # Word 2
re13a='((?:[a-z][a-z]+))'  # Word 2 alternate


rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8+re9+re10+re11+re12+re13,re.IGNORECASE|re.DOTALL)
rga = re.compile(re1+re2+re3+re4+re5+re6a+re7+re8+re9+re10+re11+re12+re13a,re.IGNORECASE|re.DOTALL)



linestop = 0
fh = open('test.txt', 'r')

for line in fh:
    if linestop == 0:
        if re.match(rg, line):
            print line
            linestop = 1
    else:
        if re.match(rga, line):
            linestop = 0
        else:
            print line

fh.close()
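
One possible way to cut the per-line cost of the loop above is to run a cheap string test before touching the regex engine, and to call `.match` on the compiled pattern directly. The following is a minimal sketch, not the asker's code: the function name `wanted_sections` and the simplified patterns are illustrative, and it assumes every header starts with "----" as in the sample file. Whether it is faster on a given file would need to be measured.

```python
import re

# Assumed header shape, taken from the sample file: "---- Word ---- Word".
ANY_HEADER = re.compile(r'---- [a-z][a-z]+ ----\s+[a-z][a-z]+', re.IGNORECASE)
WANTED_HEADER = re.compile(r'---- Alpha ----\s+Zeta', re.IGNORECASE)


def wanted_sections(lines):
    """Yield each wanted header and the body lines that follow it."""
    printing = False
    for line in lines:
        # Cheap string test first: only lines starting with "----" can be
        # headers, so ordinary body lines never reach the regex engine.
        if line.startswith('----') and ANY_HEADER.match(line):
            # Every header re-evaluates the state, so two wanted
            # sections in a row are both kept.
            printing = WANTED_HEADER.match(line) is not None
        if printing:
            yield line
```

A caller could iterate `wanted_sections(fh)` over an open file and pass each line to `sys.stdout.write`, which avoids the extra blank line that `print line` produces for lines that already end in a newline.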

+++++++++ If I add a grep step first, I think that will speed things up tremendously, i.e. grep out the relevant lines and then run the regex script above.

I got os.system to work well, but I can't see how to pass a regex match via Popen.
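
On the Popen point: when `subprocess.Popen` is given the command as a list, each element reaches the program as a single argument, so a regex can be passed without any shell quoting. Below is a minimal sketch of that idea. The helper name `stream_command_lines` is hypothetical, the grep invocation in the trailing comment is an assumption rather than the asker's actual command, and the generator uses `try`/`finally` around `yield`, which requires a newer interpreter than Python 2.4.

```python
import subprocess


def stream_command_lines(argv):
    """Run a command without a shell and yield its stdout line by line."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            universal_newlines=True)
    try:
        for line in proc.stdout:
            yield line
    finally:
        # Release the pipe and reap the child even if the caller stops early.
        proc.stdout.close()
        proc.wait()


# Hypothetical use with grep; 'PATTERN' and the file name are placeholders:
# for line in stream_command_lines(['grep', '-E', 'PATTERN', 'test.txt']):
#     ...
```

Because the output is consumed as a stream, nothing has to be written to an intermediate file first.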

**** Final Update **********

I'm calling this completed. What I ended up doing was:

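  • Going through the file with os.system (the grep step described above) and writing the results out.
  • Reading that output back in and using the re.match code I have above, printing out only the necessary items.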

The net result was that reading through a 10 million line file (printing out the necessary items) went from about 65 seconds to about 3.5 seconds. I wish I could have figured out how to call grep some way other than os.system, but maybe it's just not well implemented in Python 2.4.

Recommended answer

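You're still matching against line, which doesn't change because you're still in the same iteration of the for loop. The value returned by fh.next() is never assigned to line, so the while condition tests the header that was just matched. Because that header also matches rga, the condition is false right away and the loop body never runs, which is why only the header is printed.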
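
A minimal sketch of one way to apply this to the first script follows: every line that is read is bound back to `line`, so the stop test always sees fresh text. The function name `copy_wanted_sections` and the simplified patterns are illustrative and not taken from the question, and the `next()` builtin with a default is not available in Python 2.4, so this is written for a current interpreter.

```python
import re

# Simplified stand-ins for the question's rg and rga patterns.
WANTED = re.compile(r'---- Alpha ----\s+Zeta', re.IGNORECASE)
ANY_HEADER = re.compile(r'---- [a-z][a-z]+ ----\s+[a-z][a-z]+', re.IGNORECASE)


def copy_wanted_sections(fh, out):
    """Write each wanted header and its body lines from fh to out."""
    lines = iter(fh)
    line = next(lines, None)
    while line is not None:
        if WANTED.match(line):
            out.write(line)
            line = next(lines, None)
            # Rebinding `line` on every pass means the stop test runs
            # against the newly read line, not the original header.
            while line is not None and not ANY_HEADER.match(line):
                out.write(line)
                line = next(lines, None)
            # The header that ended the section is not consumed here;
            # the outer loop tests it next, in case it is wanted too.
        else:
            line = next(lines, None)
```

Calling `copy_wanted_sections(fh, sys.stdout)` on an open file would print only the ---- Alpha ---- Zeta sections.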
