How to find all text inside <p> elements in an HTML page using BeautifulSoup
Question
I need to extract all the visible text inside paragraph elements in an HTML file using BeautifulSoup in Python.
For example,
<p>Many hundreds of named mango <a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>
should return:
Many hundreds of named mango cultivars exist.
P.S. Some files contain Unicode characters (Hindi) which need to be extracted.
Any ideas how to do that?
Recommended answer
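Here's how you can do it with BeautifulSoup. This unwraps every tag that is not in VALID_TAGS but keeps the content of the removed tags, so the <a> in the example disappears while its text stays inside the <p>.
# Uses the bs4 package (BeautifulSoup 4) on Python 3
from bs4 import BeautifulSoup

VALID_TAGS = ['div', 'p']
# value holds the HTML source as a string
soup = BeautifulSoup(value, 'html.parser')
# Visit every tag in the document, not only <p>, so nested tags are checked
for tag in soup.find_all(True):
    if tag.name not in VALID_TAGS:
        # Drop the tag itself but leave its children in place
        tag.unwrap()
print(soup.decode())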
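If only the plain text of each paragraph is needed, rather than cleaned-up markup, calling get_text() on every <p> may be simpler. The following is a minimal sketch assuming BeautifulSoup 4 (the bs4 package) on Python 3; the helper name paragraph_texts and the second, Hindi sample paragraph are made up for illustration and are not part of the original question.

```python
from bs4 import BeautifulSoup

# Sample markup: the paragraph from the question plus a hypothetical Hindi one
html = (
    '<p>Many hundreds of named mango '
    '<a href="/wiki/Cultivar" title="Cultivar">cultivars</a> exist.</p>'
    '<p>आम एक फल है।</p>'
)


def paragraph_texts(markup):
    """Return the text of every <p> element with nested tags stripped."""
    soup = BeautifulSoup(markup, "html.parser")
    # get_text() joins all text nodes under the tag, including those in <a>
    return [p.get_text() for p in soup.find_all("p")]


texts = paragraph_texts(html)
for text in texts:
    print(text)
```

Python 3 strings are Unicode, so the Hindi text passes through unchanged. When reading from disk, opening the file with an explicit encoding such as open(path, encoding="utf-8") before handing it to BeautifulSoup is probably the safest way to avoid decoding surprises.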