Python regular expression for Beautiful Soup
Question
I am using Beautiful Soup to pull out specific div tags, and it seems I can't use simple string matching.
The page has some tags of the form
<div class="comment form new"...>
which I want to ignore, and some tags of the form
<div class="comment comment-xxxx...">
where the x's represent an integer of arbitrary length, and the ellipsis represents an arbitrary number of other values separated by whitespace (that I'm not concerned about). I can't figure out the correct regex, especially since I've never used Python's re module.
Using
soup.find_all(class_="comment")
finds all the tags starting with the word comment. I've tried using
soup.find_all(class_=re.compile(r'(comment)( )(comment)'))
soup.find_all(class_=re.compile(r'comment comment.*'))
and lots of other variations, but I think I'm missing something obvious here about how regular expressions or match() work. Can anyone help me out?
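Answer
I think I've got it: in Beautiful Soup 4, class is a multi-valued attribute, so each div's class is a list of separate values such as ['comment', 'form', 'new'].
Note that, unlike the equivalent in BS3, it's not this:
['comment form new', 'comment comment-xxxx...']
And that's why your regexps won't match.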
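To make the explanation above concrete, here is a minimal sketch using only the standard re module. It is a simplified model of the behavior described in this answer, not Beautiful Soup's actual implementation. The helper name matches_any_class is hypothetical, the value 'comment-12345' is a stand-in for 'comment-xxxx...', and newer Beautiful Soup releases may apply additional matching rules. It shows why a pattern that spans two class names fails when each value is tested on its own, while a pattern aimed at a single value succeeds.

```python
import re

def matches_any_class(class_attr, pattern):
    """Hypothetical helper: split a class attribute on whitespace and
    test the pattern against each value separately with search()."""
    values = class_attr.split()
    return any(pattern.search(value) for value in values)

# Stand-ins for the two kinds of div in the question.
form_div = "comment form new"
numbered_div = "comment comment-12345 extra"

# Spans two class names, so no single value can satisfy it.
two_names = re.compile(r"comment comment.*")
# Targets one class value only.
one_name = re.compile(r"comment-")

print(matches_any_class(numbered_div, two_names))  # False
print(matches_any_class(numbered_div, one_name))   # True
print(matches_any_class(form_div, one_name))       # False
```

Under this model, the only patterns that can work are those that fit entirely inside one class value, which is what the suggestion that follows relies on.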
But you can match, e.g., this:
>>> soup.find_all('div', class_=re.compile('comment-'))
[<div class="comment comment-xxxx..."></div>]
Note that BS does the equivalent of re.search, not re.match, so you don't need 'comment-.*'. Of course, if you want to match 'comment-12345' but not 'comment-of-another-kind', you'd want, e.g., 'comment-\d+'.
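The difference between search and match, and the effect of \d+, can be checked with the re module alone. The snippet below is only an illustration: the string 'x-comment-12345' is a made-up value chosen to show a match that does not start at the beginning of the string.

```python
import re

pattern = re.compile(r"comment-")

# match() only succeeds at the start of the string;
# search() scans the whole string for the pattern.
print(bool(pattern.match("x-comment-12345")))   # False
print(bool(pattern.search("x-comment-12345")))  # True

# \d+ requires at least one digit right after the hyphen.
numbered = re.compile(r"comment-\d+")
print(bool(numbered.search("comment-12345")))            # True
print(bool(numbered.search("comment-of-another-kind")))  # False
```

Writing the pattern as a raw string (r"...") keeps the backslash in \d from being interpreted by Python before the regex engine sees it.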