Python regex look-behind requires fixed-width pattern


Problem description


When trying to extract the title of an HTML page, I have always used the following regex:

(?<=<title.*>)([\s\S]*)(?=</title>)

This extracts everything between the tags in a document and ignores the tags themselves. However, when I try to use this regex in Python, it raises the following exception:

Traceback (most recent call last):
  File "test.py", line 21, in <module>
    pattern = re.compile('(?<=<title.*>)([\s\S]*)(?=</title>)')
  File "C:\Python31\lib\re.py", line 205, in compile
    return _compile(pattern, flags)
  File "C:\Python31\lib\re.py", line 273, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Python31\lib\sre_compile.py", line 495, in compile
    code = _code(p, flags)
  File "C:\Python31\lib\sre_compile.py", line 480, in _code
    _compile(code, p.data, flags)
  File "C:\Python31\lib\sre_compile.py", line 115, in _compile
    raise error("look-behind requires fixed-width pattern")
sre_constants.error: look-behind requires fixed-width pattern

The code I am using is:

pattern = re.compile('(?<=<title.*>)([\s\S]*)(?=</title>)')
m = pattern.search(f)

If I make a minimal adjustment, it works:

pattern = re.compile('(?<=<title>)([\s\S]*)(?=</title>)')
m = pattern.search(f)

However, this adjusted pattern does not match title tags that carry attributes (for example, <title lang="en">).

Does anyone know a good workaround for this issue? Any tips are appreciated.
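The difference between the two patterns can be checked directly. The sketch below is an illustration only; the helper name `compiles` is hypothetical and not from the question. It shows that Python's `re` module rejects a look-behind containing `.*`, because that sub-pattern can match strings of varying length, while the literal `<title>` always has the same width and is accepted.

```python
import re

def compiles(pattern):
    """Return True if `pattern` compiles, False if the re module rejects it."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

# Variable-width look-behind: `.*` can match any number of characters.
variable_width = r'(?<=<title.*>)([\s\S]*)(?=</title>)'
# Fixed-width look-behind: the literal `<title>` is always 7 characters long.
fixed_width = r'(?<=<title>)([\s\S]*)(?=</title>)'

print(compiles(variable_width))  # False
print(compiles(fixed_width))     # True
```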

Solution

If you just want to get the title tag, you can split the document on the closing tag instead of using a regex:

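import urllib.request

# Python 3 version of the answer's snippet; the URL is the answer's placeholder.
html = urllib.request.urlopen("http://somewhere").read().decode("utf-8", errors="replace")
for item in html.split("</title>"):
    if "<title>" in item:
        # 7 == len("<title>"), so the slice starts just after the opening tag
        print(item[item.find("<title>") + 7:])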
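The split-based approach looks for the literal `<title>`, so it would also miss a title tag that has attributes. One possible workaround, shown here as a sketch and not as part of the original answer, is to drop the look-behind entirely and use a capturing group: the opening tag is matched normally (with `[^>]*` absorbing any attributes) and only the group is read back. The names `TITLE_RE` and `extract_title` and the sample string are assumptions made for illustration.

```python
import re

# Match the opening tag (with optional attributes) outside the group,
# then capture the title text lazily up to the closing tag.
TITLE_RE = re.compile(r'<title\b[^>]*>(.*?)</title\s*>', re.IGNORECASE | re.DOTALL)

def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    m = TITLE_RE.search(html)
    return m.group(1).strip() if m else None

sample = '<html><head><title lang="en">\n  Example page\n</title></head></html>'
print(extract_title(sample))  # Example page
```

Regular expressions remain fragile on real-world HTML; for anything beyond a quick extraction, an HTML parser such as the standard library's `html.parser` is likely the sturdier choice.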
