Replace or Remove HTML Tag & Content Python Regex


Question

I want to use a regular expression to remove an HTML opening tag, its closing tag, and the content between them. How can I remove the <head> tag from the following string?

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

So that it looks like this:

my_string = '''
<html>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

Recommended answer

You can remove the head tag from the HTML text with Beautiful Soup, using its decompose() method. Try this Python code:

from bs4 import BeautifulSoup

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

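# Name the parser explicitly; otherwise bs4 warns and picks whichever parser is installed
soup = BeautifulSoup(my_string, 'html.parser')
soup.find('head').decompose()  # find the head tag and remove it, with its contents, from the tree
print(soup)                    # prints the HTML text without the head tag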

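This prints the following (meta is a void element in HTML, so the parser serializes it as <meta/> and the paragraph ends up as its sibling):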

<html>

<meta/>
<p>
        this is a different paragraph tag
        </p>
</html>
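
Note that soup.find('head') returns None when the document has no head tag, so the decompose() call above would raise an AttributeError on such input. A minimal sketch of a more defensive variant follows; the helper name remove_tags is hypothetical (it is not part of Beautiful Soup), and it assumes the built-in 'html.parser' backend:

```python
from bs4 import BeautifulSoup


def remove_tags(html, tag_name):
    """Return html with every matching <tag_name> element and its contents removed.

    remove_tags is an illustrative helper, not a Beautiful Soup API.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns an empty list when nothing matches, so a missing
    # tag is simply a no-op instead of an AttributeError on None.
    for tag in soup.find_all(tag_name):
        tag.decompose()
    return str(soup)


cleaned = remove_tags(
    "<html><head><p>drop me</p></head><div>keep me</div></html>", "head"
)
print(cleaned)
```

Looping over find_all also removes every occurrence of the tag rather than only the first one, which may or may not be what you want for a given document.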

Although the regex approach is not recommended, if the tag you want to remove is not nested you can remove it with the regex you mentioned in your comments, as in the Python code below. In general, avoid using regex to parse nested structures and use a proper parser instead.

import re

my_string = '''
<html>
    <head>
        <p>
        this is a paragraph tag
        </p>
    </head>
    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
'''

print(re.sub(r'(?s)<head>.*?</head>', '', my_string))

This prints the following. Note the (?s) flag, which makes the dot match newlines; it is needed because your HTML is spread across multiple lines:

<html>

    <meta>
        <p>
        this is a different paragraph tag
        </p>
    </meta>
</html>
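
The inline (?s) flag is equivalent to passing re.DOTALL. As a rough sketch of how the same pattern might be generalized, the helper below (strip_tag_regex is a hypothetical name, not something from the original answer) also tolerates attributes and upper-case tag names, and its second call shows why the approach breaks on nested tags:

```python
import re


def strip_tag_regex(html, tag_name):
    """Remove <tag_name>...</tag_name> spans with a non-greedy regex.

    Illustrative helper only; it is not safe for nested tags of the same name.
    """
    name = re.escape(tag_name)
    pattern = re.compile(
        rf"<{name}\b[^>]*>.*?</{name}\s*>",
        flags=re.DOTALL | re.IGNORECASE,  # DOTALL is the same as inline (?s)
    )
    return pattern.sub("", html)


flat = '<html><HEAD lang="en">\n<p>gone</p>\n</HEAD><div>kept</div></html>'
print(strip_tag_regex(flat, "head"))   # <html><div>kept</div></html>

# With nesting, the non-greedy match stops at the first closing tag,
# leaving the tail of the outer element behind.
nested = "<div>outer<div>inner</div>tail</div>"
print(strip_tag_regex(nested, "div"))  # tail</div>
```

The leftover tail</div> in the second case is the kind of damage a real parser avoids, which is why the parser-based approach is the safer default.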
