Excluding unwanted results of findAll using BeautifulSoup
Question
Using BeautifulSoup, I am trying to scrape the text associated with this HTML hook:
<p class="review_comment">
So, using the following simple code,
content = page.read()
soup = BeautifulSoup(content)
results = soup.find_all("p", "review_comment")
I am happily parsing the text that lives here:
<p class="review_comment">
This place is terrible!</p>
The bad news is that roughly once in every 30 matches, soup.find_all also grabs something I really don't want: a user's old review that they have since updated:
<p class="review_comment">
It's 1999, and I will always love this place…
<a href="#" class="show-archived">Read more »</a></p>
In my attempts to exclude these old duplicate reviews, I have tried a hodgepodge of ideas.
- I've been trying to alter the arguments in my soup.find_all() call to specifically exclude any text that comes before the <a href="#" class="show-archived">Read more »</a> link.
- I've drowned in regular-expression matching limbo with no success.
- I can't seem to take advantage of the class="show-archived" attribute.
Any ideas would be greatly appreciated. Thanks in advance.
Recommended answer
Is this what you are looking for?
for p in soup.find_all("p", "review_comment"):
    # Skip archived reviews, which contain the "Read more" link.
    if p.find(class_='show-archived'):
        continue
    # p is now a wanted p
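For a quick check of the filtering idea above, here is a minimal self-contained sketch. It uses the two sample paragraphs from the question as inline HTML. The html variable, the current_reviews name, and the choice of the built-in "html.parser" are assumptions made for illustration; they are not part of the original question or answer.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question: one current review and one
# archived review that carries the "Read more" link.
html = """
<p class="review_comment">
This place is terrible!</p>
<p class="review_comment">
It's 1999, and I will always love this place…
<a href="#" class="show-archived">Read more »</a></p>
"""

# "html.parser" is the standard-library parser; another parser could be used.
soup = BeautifulSoup(html, "html.parser")

# Keep only paragraphs that do NOT contain a show-archived link.
current_reviews = [
    p.get_text(strip=True)
    for p in soup.find_all("p", "review_comment")
    if p.find("a", class_="show-archived") is None
]

print(current_reviews)
```

The archived paragraph is dropped because p.find() returns the nested link for it and None for the ordinary review, so the check only needs to test for None.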