Using regex on Beautiful Soup tags

Views: 144 | Posted: 2016/8/5 19:18:25 | Tags: python regex python-2.7 beautifulsoup


Question

I have recently been using Beautiful Soup 4 and have been struggling to understand some of its basics (I was quite OK with bs3.x, for some reason). So, for example, let's start by doing something simple like:

data=soup.find_all('h2')

which yields something like:

<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\">more-accurate-data</a></h2>

which is fine. But when I try to run a regex on the above string, using something along the lines of (assuming the above is stored in "temp"):

t=str(re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""").search(str(temp)).group(1))

I get:

AttributeError: 'NoneType' object has no attribute 'group'

which I find strange, because when I do something like this in the Python interpreter:

k=r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\">more-accurate-data</a></h2>"""

and then use the above regex, everything works fine. I am wondering why the "tag" type generated by bs4 does not seem to be regex-able. Maybe I am doing something stupid, or maybe something has changed between bs3.x and bs4 that I am not aware of. Any help on this would be appreciated. Thanks.

Recommended answer

You should look at the repr of the string:

>>> a=r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\">more-accurate-data</a></h2>"""
>>> print repr(a)
'<h2><a href=\\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\\">more-accurate-data</a></h2>'

And the regex works with this representation:

>>> regex = re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""")
>>> regex.match(a)
<_sre.SRE_Match object at 0x20fbf30>

The problem is that the result from Beautiful Soup is different, because you did not print its repr. When dealing with regexes, it's a good idea to check the repr of the strings involved to avoid problems like this.
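To make the repr check concrete, here is a minimal sketch that uses only the standard `re` module. It rests on an assumption not confirmed in the question: that `str(temp)` on the Beautiful Soup tag yields plain markup with no backslashes before the quotes. The `from_soup` string stands in for that value, and the names `typed`, `from_soup` and `plain_pattern` are made up for illustration.

```python
import re

# The asker's pattern. In a raw string, \\ matches one literal backslash,
# so this pattern only matches text that really contains href=\" (with the backslash).
pattern = re.compile(r"""<h2><a href=\\"/accurate(.*?)\\">""")

# Typed as a raw string literal, the backslashes are real characters in the string.
typed = r"""<h2><a href=\"/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023\">more-accurate-data</a></h2>"""

# Assumption: what str(tag) would give, i.e. the same markup without backslashes.
from_soup = """<h2><a href="/accurate-data/210-0023.prd?pageLevel=&amp;skuId=210-0023">more-accurate-data</a></h2>"""

# Comparing the reprs makes the difference visible before any regex is run.
print(repr(typed))
print(repr(from_soup))

typed_match = pattern.search(typed)      # matches: the backslashes are present
soup_match = pattern.search(from_soup)   # None: there is no backslash to match

# A pattern without the escaped backslashes matches the backslash-free string.
plain_pattern = re.compile(r'<h2><a href="/accurate(.*?)">')
plain_match = plain_pattern.search(from_soup)

print(typed_match.group(1))
print(soup_match)
print(plain_match.group(1))
```

If the assumption holds, this would explain the AttributeError: `search` returns None for the backslash-free string, and calling `.group(1)` on None fails.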
