Excluding unwanted results of findAll using BeautifulSoup


Question

Using BeautifulSoup, I am trying to scrape the text inside paragraphs that open with this HTML tag:

<p class="review_comment">

So, using this simple code,

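from bs4 import BeautifulSoup

content = page.read()  # page: an already-opened HTTP response
soup = BeautifulSoup(content, "html.parser")  # name the parser explicitly
results = soup.find_all("p", "review_comment")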
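
For reference, a string passed as the second positional argument to find_all is matched against the CSS class, so the call above should behave like passing class_="review_comment". A minimal sketch, assuming the html.parser backend and made-up sample markup (the second paragraph and its class "other" are hypothetical):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the first <p> uses the class from the question.
html = (
    '<p class="review_comment">This place is terrible!</p>'
    '<p class="other">Ignored</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Positional form, as in the question: the string is treated as a CSS class.
by_position = soup.find_all("p", "review_comment")

# Keyword form: the same filter spelled out with class_.
by_keyword = soup.find_all("p", class_="review_comment")

texts_position = [p.get_text() for p in by_position]
texts_keyword = [p.get_text() for p in by_keyword]
print(texts_position)
```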

I am happily parsing the text that lives here:

<p class="review_comment">
    This place is terrible!</p>

The bad news is that roughly once in every 30 matches, soup.find_all also grabs something I really don't want: a user's old review that they have since updated:

<p class="review_comment">
    It's 1999, and I will always love this place…  
<a href="#" class="show-archived">Read more &raquo;</a></p>

In my attempts to exclude these old duplicate reviews, I have tried a hodgepodge of ideas.

  • I've been trying to alter the arguments in my soup.find_all() call to specifically exclude any text that comes before the <a href="#" class="show-archived">Read more &raquo;</a>
  • I've been stuck in regular-expression matching limbo, with no success.
  • I can't seem to take advantage of the class="show-archived" attribute.

Any ideas would be greatly appreciated. Thanks in advance.

Recommended answer

Is this what you are looking for?

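for p in soup.find_all("p", "review_comment"):
    # An archived review contains a "Read more" link with class "show-archived".
    if p.find(class_='show-archived'):
        continue
    # p is now a wanted p (a current review, not an archived one)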
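
To see the check in action, here is a minimal self-contained sketch run against sample markup modeled on the two snippets in the question. The html string, the wanted list, and the is_current_review helper are illustrative names, not part of the original code, and html.parser is an assumed parser choice. The second half expresses the same filter as a function passed to find_all, which is one way to do the exclusion inside the find_all call itself.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the snippets in the question.
html = """
<p class="review_comment">
    This place is terrible!</p>
<p class="review_comment">
    It's 1999, and I will always love this place…
<a href="#" class="show-archived">Read more &raquo;</a></p>
"""

soup = BeautifulSoup(html, "html.parser")

# Loop form: skip any review that contains a "show-archived" element.
wanted = []
for p in soup.find_all("p", "review_comment"):
    if p.find(class_="show-archived"):
        continue
    wanted.append(p.get_text(strip=True))


def is_current_review(tag):
    """True for <p class="review_comment"> with no archived-review link inside."""
    return (
        tag.name == "p"
        and "review_comment" in tag.get("class", [])
        and tag.find(class_="show-archived") is None
    )


# Function form: find_all accepts a callable that is given each tag in turn.
also_wanted = [p.get_text(strip=True) for p in soup.find_all(is_current_review)]

print(wanted)
```

Both forms should keep only the first review. The function form is handy when the result list is needed directly rather than built up in a loop.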
