Using Beautiful Soup to strip HTML tags from a string


Question

Does anyone have sample code showing how to use Python's Beautiful Soup to strip all HTML tags, except for a few, from a string of text?

I want to strip all JavaScript and every HTML tag except:

<a></a>
<b></b>
<i></i>

And things like:

<a onclick=""></a>

Thanks for helping. I couldn't find much on the internet for this purpose.

Recommended answer

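# Written against the Beautiful Soup 3 module. With the newer bs4 package,
# the classes are imported with "from bs4 import BeautifulSoup, Tag" instead.
import BeautifulSoup

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onclick="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)

# Walk every node in the tree and print only the whitelisted tags.
for tag in soup.recursiveChildGenerator():
    if isinstance(tag, BeautifulSoup.Tag) and tag.name in ('a', 'b', 'i'):
        print(tag)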

This yields:

<i>paragraph</i>
<a onclick="">one</a>
<i>paragraph</i>
<b>two</b>
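
The loop above prints the whitelisted tags; it does not return the original string with the other tags removed. A minimal sketch of that stripping step follows, assuming the bs4 package and the built-in "html.parser" backend rather than the Beautiful Soup 3 module used in the answer. The names strip_tags, ALLOWED_TAGS and DROPPED_TAGS are hypothetical, and removing script and style elements together with their contents is an assumption about what "strip all javascript" should mean.

```python
from bs4 import BeautifulSoup

ALLOWED_TAGS = {"a", "b", "i"}      # hypothetical whitelist, taken from the question
DROPPED_TAGS = ["script", "style"]  # assumption: drop these along with their contents


def strip_tags(html):
    """Return html with every tag removed except those in ALLOWED_TAGS."""
    soup = BeautifulSoup(html, "html.parser")

    # First pass: delete script/style elements and everything inside them.
    for tag in soup.find_all(DROPPED_TAGS):
        tag.decompose()

    # Second pass: find_all(True) matches every remaining tag. It returns a
    # list, so the tree can safely be modified while looping over it.
    for tag in soup.find_all(True):
        if tag.name not in ALLOWED_TAGS:
            tag.unwrap()  # remove the tag itself but keep its children

    return str(soup)
```

Calling strip_tags(doc) on the sample document keeps the text of the title and paragraphs and leaves only the a, b and i markup in place. Attributes such as onclick are not touched by this sketch.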

If you just want the text contents, you could change print(tag) to print(tag.string).
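
One caveat worth illustrating: in bs4, tag.string is None when a tag has more than one child, whereas get_text() joins the text of all descendants. The short sketch below assumes bs4 with "html.parser", and the input markup is made up for the example.

```python
from bs4 import BeautifulSoup

# Illustrative input, not taken from the question.
soup = BeautifulSoup("<a>click <b>here</b></a><i>plain</i>", "html.parser")

simple = soup.find("i")  # contains a single text node
nested = soup.find("a")  # contains a text node plus a <b> child

simple_string = simple.string    # the single text child
nested_string = nested.string    # None, because <a> has two children
nested_text = nested.get_text()  # text of all descendants, concatenated
```

For whitelisted tags that may contain other markup, get_text() is probably the safer choice.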

If you want to remove an attribute like onclick="" from the a tag, you could do this:

    # Replacement body for the loop above.
    if isinstance(tag, BeautifulSoup.Tag) and tag.name in ('a', 'b', 'i'):
        if tag.name == 'a':
            del tag['onclick']
        print(tag)
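
The same idea can be sketched for bs4, assuming "html.parser" and made-up input markup. Here the attribute is removed from every whitelisted tag, not only from a, and tag.attrs.pop with a default is used so that tags without an onclick attribute are simply left alone.

```python
from bs4 import BeautifulSoup

# Illustrative input, not taken from the question.
html = '<a href="/next" onclick="track()">one</a> <b onclick="x()">two</b>'
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all(["a", "b", "i"]):
    # tag.attrs is a dict of attributes; pop() with a default is a no-op
    # when the attribute is absent.
    tag.attrs.pop("onclick", None)

cleaned = str(soup)
```

Other attributes, such as href, are preserved. Removing further event handlers would mean repeating the pop for each attribute name.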
