How to remove content in nested tags with BeautifulSoup?
Problem description
How can I remove the content of nested tags with BeautifulSoup? These posts show the reverse, retrieving the content of nested tags: "How to get contents of nested tag using BeautifulSoup" and "BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?"
I have tried .text, but it only removes the tags and keeps the text inside them:
>>> from bs4 import BeautifulSoup as bs
>>> html = "<foo>Something something <bar> blah blah</bar> something else</foo>"
>>> bs(html).find_all('foo')[0]
<foo>Something something <bar> blah blah</bar> something else</foo>
>>> bs(html).find_all('foo')[0].text
u'Something something blah blah something else'
Desired output:
Something something something else
Solution
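You can iterate over the element's children and keep only those that are instances of bs4.element.NavigableString, which are the text nodes sitting directly inside the element: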
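from bs4 import BeautifulSoup as bs
import bs4

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"

def get_only_text(elem):
    # Yield only the text nodes that are direct children of elem,
    # skipping nested tags and everything inside them.
    for item in elem.children:
        if isinstance(item, bs4.element.NavigableString):
            yield item

# print() with a single argument works on both Python 2 and Python 3.
print(''.join(get_only_text(bs(html, 'html.parser').find_all('foo')[0])))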
Output:
Something something something else
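The same filtering can probably be written without a helper generator. The sketch below is not part of the original answer. It assumes a Beautiful Soup version that accepts the string argument to find_all (older releases call it text), and the names direct_text and strip_nested are made up for illustration. The first function collects the direct-child text nodes in one call; the second removes the nested tags from the tree in place and then reads the remaining text.

```python
from bs4 import BeautifulSoup

html = ("<foo>Something something <bar> blah blah</bar> something "
        "<bar2>GONE!</bar2> else</foo>")

def direct_text(elem):
    """Join the text nodes that are direct children of elem (hypothetical helper)."""
    # string=True matches text nodes; recursive=False restricts the
    # search to direct children, so text inside nested tags is skipped.
    return "".join(elem.find_all(string=True, recursive=False))

def strip_nested(elem):
    """Delete every direct child tag in place, then return the remaining text."""
    # find_all(True) matches any tag; decompose() removes the tag and its contents.
    for child in elem.find_all(True, recursive=False):
        child.decompose()
    return elem.get_text()

foo = BeautifulSoup(html, "html.parser").find("foo")
text = direct_text(foo)

# Removing a tag leaves the spaces on both sides of it, so collapse
# runs of whitespace if a single-spaced result is wanted.
normalized = " ".join(text.split())
print(normalized)
```

direct_text leaves the parsed tree untouched, while strip_nested modifies it, so the first is the safer choice if the document is needed again afterwards.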