Remove contents of <style>...</style> tags using html5lib or bleach

Question

I've been using the excellent bleach library for removing bad HTML.

I've got a load of HTML documents which have been pasted in from Microsoft Word, and contain things like:

<STYLE> st1:*{behavior:url(#ieooui) } </STYLE>

Using bleach (with the style tag implicitly disallowed), leaves me with:

st1:*{behavior:url(#ieooui) }

Which isn't helpful. Bleach seems only to have options to:


  • escape tags;

  • remove tags (but not their contents).

I'm looking for a third option - remove the tags and their contents.

Is there any way to use bleach or html5lib to completely remove the style tag and its contents? The documentation for html5lib isn't really a great deal of help.

Recommended answer

It turned out lxml was a better tool for this task:

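try:
    # lxml 5.2+ ships the cleaner as the separate lxml_html_clean package
    from lxml_html_clean import Cleaner
except ImportError:
    from lxml.html.clean import Cleaner

def clean_word_text(text):
    # style=True drops <style>...</style> elements together with their
    # contents. Cleaner's other defaults stay on, so scripts and comments
    # are stripped as well.
    cleaner = Cleaner(style=True)
    return cleaner.clean_html(text)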
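
If adding lxml as a dependency is not an option, the same "remove the tag and its contents" idea can be sketched with only the standard library's html.parser. This is a different technique from the lxml Cleaner above, not something bleach or html5lib provide. The names StyleStripper and strip_style are made up for this illustration. It is not a sanitizer: it only drops <style> elements, and as written it also discards comments and the doctype, so the result should still be passed through bleach afterwards.

```python
from html.parser import HTMLParser


class StyleStripper(HTMLParser):
    """Re-emit HTML, skipping <style> elements and everything inside them."""

    def __init__(self):
        # Keep entity references as-is instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.depth = 0  # > 0 while inside a <style> element

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names, so <STYLE> arrives as "style".
        if tag == "style":
            self.depth += 1
        elif not self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if tag != "style" and not self.depth:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "style":
            self.depth = max(self.depth - 1, 0)
        elif not self.depth:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.depth:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.depth:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if not self.depth:
            self.parts.append(f"&#{name};")


def strip_style(text):
    """Return text with every <style>...</style> block removed."""
    parser = StyleStripper()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)
```

Called on the Word snippet from the question wrapped in two paragraphs, strip_style('<p>Hello</p><STYLE> st1:*{behavior:url(#ieooui) } </STYLE><p>World</p>') leaves only the two paragraphs, with no stray CSS text.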
