How to find all text inside <p> elements in an HTML page using BeautifulSoup

Views: 74 Posted: 2020/9/20 7:02:23 Tags: python unicode html-parsing beautifulsoup

Problem description

I need to extract all the visible text inside paragraph (<p>) elements in an HTML file using BeautifulSoup in Python.
For example,
<p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>
should return:
Many hundreds of named mango cultivars exist.

P.S. Some files contain Unicode characters (Hindi) which need to be extracted.
Any ideas how to do that?

Recommended answer

Here's how you can do it with BeautifulSoup. This will remove any tags not in VALID_TAGS but keep the content of the removed tags.

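from bs4 import BeautifulSoup  # BeautifulSoup 4; the original import targeted the older BeautifulSoup 3 package

VALID_TAGS = ['div', 'p']

# Sample input taken from the question; replace it with the HTML you want to clean.
value = '<div><p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p></div>'

soup = BeautifulSoup(value, 'html.parser')

# Visit every tag, not only <p>; otherwise the check below could never match.
for tag in soup.find_all(True):
    if tag.name not in VALID_TAGS:
        # Drop the tag itself but keep its children in place.
        tag.unwrap()

print(soup.decode())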
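
If only the text of each paragraph is needed, rather than cleaned-up HTML, a shorter route may be to call get_text() on every <p> element. The following is a minimal sketch, assuming BeautifulSoup 4 (the bs4 package) and Python 3 are available; the helper name paragraph_texts and the sample HTML, including the Hindi sentence, are made up for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup


def paragraph_texts(html):
    """Return the visible text of every <p> element, with inner tags stripped.

    Hypothetical helper for illustration; not part of the original answer.
    """
    soup = BeautifulSoup(html, "html.parser")
    # get_text() concatenates all text nodes beneath the tag, so the text
    # inside nested tags such as <a> is kept while the markup is dropped.
    return [p.get_text() for p in soup.find_all("p")]


# Illustrative input: the example from the question plus a Hindi paragraph.
html = (
    '<p>Many hundreds of named mango '
    '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>'
    '<p>आम एक <b>फल</b> है।</p>'
)

for text in paragraph_texts(html):
    print(text)
```

In Python 3 the returned strings are already Unicode, so Hindi text needs no special handling once the input has been decoded correctly. When reading from a file, opening it with an explicit encoding, for example open(path, encoding="utf-8"), is one way to do that, assuming the files really are UTF-8.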
