Using regex on Beautiful Soup tags

Problem description

I have recently been using Beautiful Soup 4 and have been struggling to understand some of its basics (for some reason I was quite comfortable with bs3.x). For example, let's start with something simple:

data=soup.find_all('h2')

which yields something like:

<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\">more-accurate-data</a></h2>

That is fine. But when I want to apply a regex to the above string, using something along the lines of (assuming the above is stored in "temp"):

t=str(re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""").search(str(temp)).group(1))

I get:

AttributeError: 'NoneType' object has no attribute 'group'

which I find strange, because when I do something like this in the Python interpreter:

k=r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\">more-accurate-data</a></h2>"""

and then use the above regex, everything works fine. I am wondering why the "tags" type generated by bs4 does not seem to be regex-able. Maybe I am doing something stupid, or maybe something has changed between bs3.x and bs4 that I am not aware of. Any help on this would be appreciated. Thanks.

Recommended answer

You should look at the repr of the string:

>>> a=r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\">more-accurate-data</a></h2>"""
>>> print repr(a)
'<h2><a href=\\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\\">more-accurate-data</a></h2>'

And the regex works with this representation:

>>> regex = re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""")
>>> regex.match(a)
<_sre.SRE_Match object at 0x20fbf30>

The problem is that the string you get from Beautiful Soup is different from this one, which you did not notice because you did not print its repr. When working with regexes, it is a good idea to check the repr of the strings involved to avoid problems like this.
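As a minimal sketch of that check, the snippet below compares the raw string from the question with a hypothetical `plain_quotes` string. It assumes that `str(temp)` contains ordinary double quotes with no backslash in front of them; print the repr of your own value to confirm. The `lenient` pattern is not from the original post, it is just one possible variant that makes the backslash optional.

```python
import re

# Raw string from the question: it contains a literal backslash before each quote.
with_backslashes = r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\">more-accurate-data</a></h2>"""

# Assumption (hypothetical value): what str(temp) might look like if the
# markup uses plain quotes with no backslashes.
plain_quotes = '<h2><a href="/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023">more-accurate-data</a></h2>'

# Pattern from the question: \\" requires a literal backslash before the quote.
strict = re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""")

# Illustrative variant (not from the original post): the backslash is optional.
lenient = re.compile(r"""<h2><a href=\\?"/accurate(.*?)\\?">""")

# Printing the repr makes any backslashes in each string visible.
print(repr(with_backslashes))
print(repr(plain_quotes))

# search() returns None when nothing matches, so test the result before
# calling .group(); skipping that test is what raises the AttributeError.
for name, text in [("with_backslashes", with_backslashes), ("plain_quotes", plain_quotes)]:
    for label, pattern in [("strict", strict), ("lenient", lenient)]:
        match = pattern.search(text)
        print(name, label, match.group(1) if match else None)
```

Under this assumption the strict pattern only matches the string that really contains backslashes, which would explain why `search()` returned `None` for the Beautiful Soup result.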
