Bottleneck? More efficient regular expression?
Question
Hello,
I've been struggling with a regular expression for parsing XML files, which keeps giving the runtime error "maximum
recursion limit exceeded". Here is the pattern string:
r'<code>(?P<c>.*?)</code>.*?<targetSeq name="(?P<tn>.*?)">.*?<target>(?P<t>.*?)</target>.*?<align>(?P<a>.*?)</align>.*?<template>(?P<temp>.*?)</template>.*?<anotherTag>(?P<at>.*?)</anotherTag>.*?<yetAnotherTag>(?P<yat>.*?)</yetAnotherTag>'
The file format is straightforward. Here is a sample:
<code>1cg2</code>
<chain>a</chain>
<settings>abcde</settings>
<scoreInfo>12345</scoreInfo>
<targetSeq name="1onc">blah
</targetSeq>
<alignment size="335">
<target>WLTFQKKHITNTRDVDCDNIMS</target>
<align> :| ..| : . | . |. . :</align>
<template>QKRDNVLFQAATDEQPAVIKTLEKL</template>
<anotherTag>foobarfoobar</anotherTag>
<yetAnotherTag>barfoobarfoo</yetAnotherTag>
# this group of tags then repeats multiple times in the file
If I search for the pattern up to "</template>" (i.e. nothing from <anotherTag> onwards), it works fine. As soon as I add the
later bits to the pattern, it gives the error.
I heard that non-greedy matching (*?) is inefficient, so I tried replacing every .*? with (?!<target>) etc., which means "if the
next piece of text doesn't match the <target> tag, keep going". But it gives the same error.
So my question is: what is the bottleneck in this pattern? Could someone more experienced with REs give some hints here?
Your help is greatly appreciated!
Tina
-----= Posted via Newsfeeds.Com, Uncensored Usenet News =-----
http://www.newsfeeds.com - The #1 Newsgroup Service in the World!
-----== Over 100,000 Newsgroups - 19 Different Servers! =-----
Recommended answer
Tina Li wrote:
Hello,
[skipped]
Personally I'd rather split it into several separate REs. Something like
match_tag(tag_name) that matches a tag group at the start of the string.
hth,
anton.
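A minimal sketch of what such a helper might look like. anton gives only the name `match_tag(tag_name)`; the extra `text` and `pos` parameters, the return value, and the handling of attributes and leading whitespace are all assumptions made for illustration.

```python
import re

def match_tag(tag_name, text, pos=0):
    """Match <tag_name ...>content</tag_name> starting at `pos`.

    Hypothetical helper: leading whitespace is skipped, attributes are
    ignored, and nested tags of the same name are not handled.
    Returns (content, end_position), or None if the tag is not at `pos`.
    """
    name = re.escape(tag_name)
    pattern = re.compile(
        r"\s*<%s(?:\s[^>]*)?>(.*?)</%s>" % (name, name),
        re.DOTALL,  # let "." cross newlines inside the tag body
    )
    m = pattern.match(text, pos)  # anchored at pos, no scanning ahead
    if m is None:
        return None
    return m.group(1), m.end()
```

Each call only has to cross one tag, so the tags can be consumed one after another by feeding the returned end position back in as `pos`.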
In article <3f********@corp.newsgroups.com>,
"Tina Li" <tina_li23 AT hotmail DOT com> wrote:
I've been struggling with a regular expression for parsing XML files,
which keeps giving the run time error "maximum recursion limit
exceeded".
Why not use a real XML parser? xml.parsers.expat is easy enough to use,
doesn't have problems with recursion limits, and will continue working
when someone generates a valid XML file in a slightly different version
than the one you expect.
--
David Eppstein http://www.ics.uci.edu/~eppstein/
Univ. of California, Irvine, School of Information & Computer Science
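A rough sketch of the expat route, under stated assumptions: the sample in the question is a fragment, so this wraps it in a made-up `<records>` root and adds a closing `</alignment>` to make it well-formed. The function name `parse_records` and the dictionary keys (borrowed from the group names in the original pattern) are illustrative, not from the thread.

```python
import xml.parsers.expat

# Tags whose text is wanted, mapped to the group names of the original pattern.
TEXT_TAGS = {"code": "c", "target": "t", "align": "a", "template": "temp",
             "anotherTag": "at", "yetAnotherTag": "yat"}

SAMPLE = """<records>
<code>1cg2</code>
<targetSeq name="1onc">blah
</targetSeq>
<alignment size="335">
<target>WLTFQKKHITNTRDVDCDNIMS</target>
<align> :| ..| : . | . |. . :</align>
<template>QKRDNVLFQAATDEQPAVIKTLEKL</template>
<anotherTag>foobarfoobar</anotherTag>
<yetAnotherTag>barfoobarfoo</yetAnotherTag>
</alignment>
</records>"""

def parse_records(xml_text):
    records = []   # one dict per repeated group, started at each <code>
    buf = []       # character data of the element currently being read
    active = None  # name of the TEXT_TAGS element that is open, if any

    def start(name, attrs):
        nonlocal active
        if name == "code":
            records.append({})
        if name == "targetSeq" and records:
            records[-1]["tn"] = attrs.get("name")
        if name in TEXT_TAGS and records:
            active = name
            buf.clear()

    def end(name):
        nonlocal active
        if name == active:
            records[-1][TEXT_TAGS[name]] = "".join(buf)
            active = None

    def chars(data):
        # expat may deliver one element's text in several pieces
        if active is not None:
            buf.append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return records
```

Calling `parse_records(SAMPLE)` gives a list with one dictionary per group of tags; tags that are not listed in `TEXT_TAGS` are simply ignored, so extra or reordered elements do not break it.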
Hello,
Thanks for the suggestion. It does solve my problem, but just out of curiosity, I'd like to know what caused the
over-limit as well.
Thanks again!
Tina
"David Eppstein" <ep******@ics.uci.edu> wrote in message news:ep****************************@news.service.uci.edu...
| In article <3f********@corp.newsgroups.com>,
| "Tina Li" <tina_li23 AT hotmail DOT com> wrote:
|
| > I've been struggling with a regular expression for parsing XML files,
| > which keeps giving the run time error "maximum recursion limit
| > exceeded".
|
| Why not use a real XML parser? xml.parsers.expat is easy enough to use,
| doesn't have problems with recursion limits, and will continue working
| when someone generates a valid XML file in a slightly different version
| than the one you expect.
|
| --
| David Eppstein http://www.ics.uci.edu/~eppstein/
| Univ. of California, Irvine, School of Information & Computer Science
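On the cause, which the thread leaves unanswered: one likely explanation (an assumption, not confirmed by anyone above) is that older Python `re` engines matched recursively, so a lazy `.*?` that had to cross a long stretch of text could use up a level of recursion per character. The engine was made non-recursive in Python 2.4. The sketch below tightens Tina's pattern so that each captured group is a negated character class that cannot run past its own tag; the name `find_records` is made up for illustration, and whether this alone would have avoided the error on her Python version is not something the thread establishes.

```python
import re

# Captures use [^<]* / [^"]* instead of .*? so they stop at the tag boundary.
# The gaps between tags still use .*? and need re.DOTALL to cross newlines.
PATTERN = re.compile(
    r'<code>(?P<c>[^<]*)</code>.*?'
    r'<targetSeq name="(?P<tn>[^"]*)">.*?'
    r'<target>(?P<t>[^<]*)</target>.*?'
    r'<align>(?P<a>[^<]*)</align>.*?'
    r'<template>(?P<temp>[^<]*)</template>.*?'
    r'<anotherTag>(?P<at>[^<]*)</anotherTag>.*?'
    r'<yetAnotherTag>(?P<yat>[^<]*)</yetAnotherTag>',
    re.DOTALL,
)

def find_records(text):
    """Return one dict of named groups per repeated group of tags."""
    return [m.groupdict() for m in PATTERN.finditer(text)]
```

Using `finditer` walks through the file one group at a time instead of asking a single match to span the whole input.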