How to remove content in nested tags with BeautifulSoup?

Problem description

How can I remove the content of nested tags with BeautifulSoup? These posts show the reverse operation, retrieving the content of nested tags: How to get contents of nested tag using BeautifulSoup, and BeautifulSoup: How do I extract all the <li>s from a list of <ul>s that contains some nested <ul>s?

I have tried .text, but it only strips the tags and keeps the text inside them:

>>> from bs4 import BeautifulSoup as bs
>>> html = "<foo>Something something <bar> blah blah</bar> something else</foo>"
>>> bs(html).find_all('foo')[0]
<foo>Something something <bar> blah blah</bar> something else</foo>
>>> bs(html).find_all('foo')[0].text
u'Something something  blah blah something else'

Desired output:

Something something something else

Solution

You can iterate over the element's direct children and keep only those that are bs4.element.NavigableString instances:

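from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"

def get_only_text(elem):
    # Yield only the text nodes that are direct children of elem;
    # child tags (and everything nested inside them) are skipped.
    for item in elem.children:
        if isinstance(item, NavigableString):
            yield item

# Name the parser explicitly to avoid bs4's "no parser specified" warning.
foo = BeautifulSoup(html, "html.parser").find("foo")
print("".join(get_only_text(foo)))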

Output:

Something something  something  else
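
If the nested tags should be deleted from the tree itself rather than skipped while reading, one alternative is to call decompose() on each direct child tag and then read the remaining text. The sketch below is not from the original answer: the helper name remove_nested_tags is hypothetical, and it assumes the html.parser backend. It also collapses the doubled spaces left where the tags used to be, so the result matches the desired output exactly.

```python
from bs4 import BeautifulSoup

html = "<foo>Something something <bar> blah blah</bar> something <bar2>GONE!</bar2> else</foo>"


def remove_nested_tags(elem):
    """Delete every direct child tag of elem in place and return the remaining text.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    # find_all(True, recursive=False) matches every tag that is a direct child.
    for child in elem.find_all(True, recursive=False):
        child.decompose()  # removes the tag and everything inside it
    # Collapse the runs of whitespace left where the tags used to be.
    return " ".join(elem.get_text().split())


foo = BeautifulSoup(html, "html.parser").find("foo")
result = remove_nested_tags(foo)
print(result)
```

Unlike the generator above, this modifies the parsed tree, so use it only when the nested tags are no longer needed afterwards.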
