Remove a tag using BeautifulSoup but keep its contents
Question
Currently I have code that does something like this:
    soup = BeautifulSoup(value)
    for tag in soup.findAll(True):
        if tag.name not in VALID_TAGS:
            tag.extract()
    soup.renderContents()
This works, except that I don't want to throw away the contents of the invalid tags. How do I get rid of a tag but keep the contents inside it when calling soup.renderContents()?
Recommended answer
The strategy I used is to replace each invalid tag with its contents: children of type NavigableString are kept as they are, and children that are tags are processed recursively so that their contents also end up as strings. Try this:
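    # Written for BeautifulSoup 3 on Python 2 (note the import, unicode() and the print statement).
    from BeautifulSoup import BeautifulSoup, NavigableString

    def strip_tags(html, invalid_tags):
        soup = BeautifulSoup(html)
        for tag in soup.findAll(True):
            if tag.name in invalid_tags:
                # Build the replacement text from the tag's children.
                s = ""
                for c in tag.contents:
                    if not isinstance(c, NavigableString):
                        # The child is itself a tag: strip invalid tags inside it first.
                        c = strip_tags(unicode(c), invalid_tags)
                    s += unicode(c)
                # Swap the invalid tag for its flattened contents.
                tag.replaceWith(s)
        return soup

    html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
    invalid_tags = ['b', 'i', 'u']
    print strip_tags(html, invalid_tags)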
The result is:
<p>Good, bad, and ugly</p>
I gave this same answer on another question. It seems to come up a lot.
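The answer above targets BeautifulSoup 3 on Python 2. As a rough sketch of the same idea on Python 3, assuming the bs4 package (Beautiful Soup 4) is installed, the tag's unwrap() method replaces a tag with its own children, so the manual recursion is not needed. This is not the original author's code; the function name strip_tags is reused only for comparison, and the "html.parser" backend is a choice made here so that no html or body wrapper is added.

```python
from bs4 import BeautifulSoup


def strip_tags(html, invalid_tags):
    """Remove the listed tags from html but keep whatever is inside them."""
    # "html.parser" is the parser bundled with Python (an assumption of this sketch).
    soup = BeautifulSoup(html, "html.parser")
    # find_all() accepts a list of tag names and returns a list snapshot,
    # so changing the tree inside the loop is safe.
    for tag in soup.find_all(invalid_tags):
        # unwrap() replaces the tag with its children, including nested tags.
        tag.unwrap()
    return soup


html = "<p>Good, <b>bad</b>, and <i>ug<b>l</b><u>y</u></i></p>"
result = str(strip_tags(html, ["b", "i", "u"]))
print(result)  # <p>Good, bad, and ugly</p>
```

Because unwrap() moves child nodes rather than converting them to text, a valid tag nested inside an invalid one stays a real tag in the tree.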