Extracting Text Between HTML Comments with BeautifulSoup

Question

Using Python 3 and BeautifulSoup 4, I would like to extract text from an HTML page that is delineated only by a comment above it. An example:

<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text

I have found various ways to extract a page's text or its comments, but no way to do what I'm looking for. Any help would be greatly appreciated.

Recommended answer

You just need to iterate over all of the comments in the document, check whether each one is one of the comments you are looking for, and then print the text of the node that follows it:

from bs4 import BeautifulSoup, Comment

html = """
<html>
<body>
<p>p tag text</p>
<!--UNIQUE COMMENT-->
I would like to get this text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
</html>
"""
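# 'lxml' requires the lxml package; the built-in 'html.parser' also works here
soup = BeautifulSoup(html, 'lxml')

# Find every comment node, then print the text node that follows each wanted one
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if comment in ['UNIQUE COMMENT', 'SECOND UNIQUE COMMENT']:
        print(comment.next_element.strip())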

This will display the following:

I would like to get this text
I would also like to find this text
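
Note that next_element returns only the single node that follows the comment. If the text between two comments may contain inline tags such as <b>, one possible variation is to walk the comment's siblings until the next comment is reached. The sketch below is not part of the original answer: the helper name text_after_comment is hypothetical, the sample markup with the <b> tag is made up for illustration, and it assumes the comment and its text share the same parent element. It uses the built-in 'html.parser' so that it runs without lxml.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def text_after_comment(soup, marker):
    """Return the text between the comment `marker` and the next comment.

    Hypothetical helper: assumes the wanted text and both comments are
    siblings under the same parent element.
    """
    comment = soup.find(
        string=lambda s: isinstance(s, Comment) and s.strip() == marker
    )
    if comment is None:
        return None

    parts = []
    for sibling in comment.next_siblings:
        if isinstance(sibling, Comment):
            break  # the next comment ends this section
        if isinstance(sibling, NavigableString):
            parts.append(str(sibling))
        else:
            parts.append(sibling.get_text())  # text inside a nested tag

    # Collapse runs of whitespace and trim both ends
    return " ".join("".join(parts).split())


html = """
<body>
<!--UNIQUE COMMENT-->
I would like to get <b>this</b> text
<!--SECOND UNIQUE COMMENT-->
I would also like to find this text
</body>
"""

soup = BeautifulSoup(html, "html.parser")
first = text_after_comment(soup, "UNIQUE COMMENT")
second = text_after_comment(soup, "SECOND UNIQUE COMMENT")
print(first)   # I would like to get this text
print(second)  # I would also like to find this text
```

Comparing s.strip() rather than the raw comment also tolerates comments written with padding, such as <!-- UNIQUE COMMENT -->.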
