Using regex on Beautiful Soup tags
Question
I have recently been using Beautiful Soup 4 and I have been struggling to understand some of its basics (I was quite OK with bs3.x for some reason). So, for example, let's start off by doing something simple like:
data=soup.find_all('h2')
which yields something like:
<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&skuId=210-0023\">more-accurate-data</a></h2>
which is fine. But when I want to run a regex against the above string, using something along the lines of the following (assuming the above is stored in "temp"):
t=str(re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""").search(str(temp)).group(1))
I get:
AttributeError: 'NoneType' object has no attribute 'group'
which I find strange, because when I do something like this in the Python interpreter:
k=r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&skuId=210-0023\">more-accurate-data</a></h2>"""
and then use the above regex, everything works fine. I am wondering why the "tag" type generated by bs4 does not seem to be regex'able. Maybe I am doing something stupid, or maybe something has changed between bs3.x and bs4 that I am not aware of. Any help on this would be appreciated. Thanks.
Recommended answer
You should look at the repr of the string:
>>> a=r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&skuId=210-0023\">more-accurate-data</a></h2>"""
>>> print(repr(a))
'<h2><a href=\\"/accurate-data/210-0023.prd?pageLevel=&skuId=210-0023\\">more-accurate-data</a></h2>'
And the regex works with this representation:
>>> import re
>>> regex = re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""")
>>> regex.match(a)
<_sre.SRE_Match object at 0x20fbf30>
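The problem is that the result from Beautiful Soup is a different string, which you did not notice because you did not print its repr. The raw string literal you typed contains a real backslash before each quote, and the regex requires those backslashes. The string produced from the tag most likely contains no backslashes at all, so search() returns None and calling .group(1) on it raises the AttributeError. When dealing with regexes it's a good idea to check the repr of the strings involved to avoid things like this.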
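To make the difference concrete, here is a minimal sketch using only the standard `re` module. The `from_soup` string is an assumption about what `str(temp)` looks like (the same markup without backslashes; real Beautiful Soup output may differ in details such as escaping `&` as `&amp;`). The helper `extract` and the second pattern `pattern_plain` are hypothetical names introduced only for this illustration.

```python
import re

# Pattern from the question. In a raw string, \\ matches one literal
# backslash, so this pattern requires the text href=\" (backslash included).
pattern_with_backslash = re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""")

# Hypothetical alternative pattern that expects no backslashes.
pattern_plain = re.compile(r'<h2><a href="/accurate(.*?)">')

# Raw string literal as typed in the interpreter: the backslashes are
# real characters in the string.
typed = r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&skuId=210-0023\">more-accurate-data</a></h2>"""

# ASSUMPTION: the shape of str(temp) from Beautiful Soup, with no backslashes.
from_soup = '<h2><a href="/accurate-data/210-0023.prd?pageLevel=&skuId=210-0023">more-accurate-data</a></h2>'


def extract(pattern, text):
    """Return group(1) of the first match, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match is not None else None


# repr shows the backslashes in one string and their absence in the other.
print(repr(typed))
print(repr(from_soup))

print(extract(pattern_with_backslash, typed))      # matches
print(extract(pattern_with_backslash, from_soup))  # None: no backslashes to match
print(extract(pattern_plain, from_soup))           # matches
```

Checking for `None` before calling `.group(1)` avoids the AttributeError from the question and makes a non-matching pattern obvious. If the goal is only the link target, reading the `href` attribute from the tag object is usually simpler than running a regex over the serialized tag.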