BeautifulSoup: extract text from commented-out HTML

Question

Apologies if this question is similar to others; I wasn't able to make any of the other solutions work. I'm scraping a website using BeautifulSoup and I am trying to get the information from a table field that's commented out:

<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">

                      <span class="views" clicks="1564058">1.56M Clicks</span>

                        <span class="interaction" likes="0"></span>

    </p>-->
</td>

How do I get the 'views' and 'interaction' parts?

Recommended answer

You need to extract the HTML from the comment and parse it again with BeautifulSoup, like this:

from bs4 import BeautifulSoup, Comment
html = """<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">

                      <span class="views" clicks="1564058">1.56M Clicks</span>

                        <span class="interaction" likes="0"></span>

    </p>-->
</td>"""
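soup = BeautifulSoup(html, 'lxml')
# Find the first comment node; `string=` is the current name of the older `text=` argument.
comment = soup.find(string=lambda text: isinstance(text, Comment))
# The comment body is itself HTML, so parse it a second time.
commentsoup = BeautifulSoup(comment, 'lxml')
views = commentsoup.find('span', {'class': 'views'})
interaction = commentsoup.find('span', {'class': 'interaction'})
print(views.get_text(), interaction['likes'])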

Output:

1.56M Clicks 0

If the comment is not the first one on the page, you would need to index it like this:

comment = soup.find_all(string=lambda text: isinstance(text, Comment))[1]

Or find it from the parent element.
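One way to find the comment from its parent element is sketched below. It assumes the `span` with class `release` from the question always sits in the same `td` as the comment. The surrounding table, the extra unrelated comment, and the choice of the built-in `html.parser` are illustrative additions, not part of the original page.

```python
from bs4 import BeautifulSoup, Comment

# Sample markup based on the question; the first <td> and its comment are
# invented here to show that an earlier comment does not get in the way.
html = """<table><tr>
<td><!-- unrelated comment --></td>
<td>
    <span class="release" data-release="1518739200"></span>
    <!--<p class="statistics">
        <span class="views" clicks="1564058">1.56M Clicks</span>
        <span class="interaction" likes="0"></span>
    </p>-->
</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")

# Locate a known sibling of the comment, then step up to its parent <td>.
release = soup.find("span", class_="release")
cell = release.parent

# Search for a comment only inside that cell, so earlier comments are ignored.
comment = cell.find(string=lambda s: isinstance(s, Comment))

# Parse the comment body as HTML and read the two spans.
inner = BeautifulSoup(comment, "html.parser")
views_text = inner.find("span", class_="views").get_text()
likes = inner.find("span", class_="interaction")["likes"]
print(views_text, likes)
```

Scoping the search to the parent avoids relying on the comment's position in the whole document, which is what the index-based approach depends on.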

Updated in response to a comment:

You can use the parent 'tr' element for this. The page you supplied had "shares", not "interaction", so I expect you got a NoneType object, which gave you the error you saw. You could add tests in your code for NoneType objects if you need to.

from bs4 import BeautifulSoup, Comment
import requests
url = "https://imvdb.com/calendar/2018?page=1"
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')

for tr in soup.find_all('tr'):
    # Each row keeps its statistics inside an HTML comment.
    comment = tr.find(string=lambda text: isinstance(text, Comment))
    if comment is None:
        continue  # this row has no comment, so there is nothing to parse
    commentsoup = BeautifulSoup(comment, 'lxml')
    views = commentsoup.find('span', {'class': 'views'})
    shares = commentsoup.find('span', {'class': 'shares'})
    print(views.get_text(), shares['data-shares'])

Output:

3.60K Views 0
1.56M Views 0
220.28K Views 0
6.09M Views 0
133.04K Views 0
163.62M Views 0
30.44K Views 0
2.95M Views 0
2.10M Views 0
83.21K Views 0
5.27K Views 0
...
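The NoneType tests mentioned above could look roughly like the following sketch. The helper name `extract_stats` and the sample rows are hypothetical. The rows only imitate the structure described in the answer, so that the example runs without a network request, and the built-in `html.parser` is used here instead of `lxml`.

```python
from bs4 import BeautifulSoup, Comment


def extract_stats(row):
    """Return (views_text, shares) for a row, or None if anything is missing."""
    comment = row.find(string=lambda s: isinstance(s, Comment))
    if comment is None:
        return None  # the row contains no comment at all
    inner = BeautifulSoup(comment, "html.parser")
    views = inner.find("span", class_="views")
    shares = inner.find("span", class_="shares")
    if views is None or shares is None:
        return None  # the comment lacks one of the expected spans
    return views.get_text(strip=True), shares.get("data-shares")


# Invented sample rows: one without a comment, one complete, one missing "shares".
sample = """<table>
<tr><th>Header row without a comment</th></tr>
<tr><td><!--<p class="statistics">
    <span class="views">3.60K Views</span>
    <span class="shares" data-shares="0"></span>
</p>--></td></tr>
<tr><td><!--<p class="statistics"><span class="views">1.56M Views</span></p>--></td></tr>
</table>"""

soup = BeautifulSoup(sample, "html.parser")
results = [extract_stats(tr) for tr in soup.find_all("tr")]
print(results)
```

Returning None for incomplete rows lets the caller decide whether to skip them or log them, instead of failing with a TypeError partway through the loop.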
