How do I perform HTML decoding/encoding using Python/Django?

Question

I have a string that is HTML encoded:

'''&lt;img class=&quot;size-medium wp-image-113&quot;
 style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot;
 src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot;
 alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;'''

I want to change that to:

<img class="size-medium wp-image-113" style="margin-left: 15px;" 
  title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" 
  alt="" width="300" height="194" /> 

I want this to register as HTML so that it is rendered as an image by the browser instead of being displayed as text.

The string is stored like that because I am using a web-scraping tool called BeautifulSoup; it "scans" a web page, gets certain content from it, and returns the string in that format.

I've found how to do this in C# but not in Python. Can someone help me out?

Recommended answer

Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:

def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(force_unicode(html).replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;').replace('"', '&quot;').replace("'", '&#39;'))

To reverse this, the Cheetah function described in Jake's answer should work, but it is missing the single quote. This version includes an updated tuple, with the order of replacement reversed so that &amp; is decoded last; decoding it first would turn an escaped &amp;lt; into &lt; and then, incorrectly, into <:

def html_decode(s):
    """
    Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>.
    """
    # Order matters: '&amp;' must be decoded last.
    html_codes = (
        ("'", '&#39;'),
        ('"', '&quot;'),
        ('>', '&gt;'),
        ('<', '&lt;'),
        ('&', '&amp;'),
    )
    for char, entity in html_codes:
        s = s.replace(entity, char)
    return s

unescaped = html_decode(my_string)
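
To make the ordering point concrete, here is a small sketch that compares the order used above with a naive order that decodes &amp; first. The helper names decode_amp_last and decode_amp_first and the sample string are made up for illustration; they are not part of Django or Cheetah.

```python
# Illustrative only: these helper names are not part of Django or Cheetah.
PAIRS = (
    ("'", "&#39;"),
    ('"', "&quot;"),
    (">", "&gt;"),
    ("<", "&lt;"),
    ("&", "&amp;"),
)

def decode_amp_last(s):
    """Decode in the order used above: '&amp;' is handled last."""
    for char, entity in PAIRS:
        s = s.replace(entity, char)
    return s

def decode_amp_first(s):
    """Hypothetical naive order: '&amp;' is handled first."""
    for char, entity in reversed(PAIRS):
        s = s.replace(entity, char)
    return s

# Text that literally mentions the entity "&lt;", escaped exactly once.
escaped = "use &amp;lt; for a less-than sign"

print(decode_amp_last(escaped))   # use &lt; for a less-than sign
print(decode_amp_first(escaped))  # use < for a less-than sign (decoded twice)
```

The naive order silently decodes the text twice, which is why the tuple lists the ampersand entry last.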

This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library:

# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.0-3.8 (HTMLParser.unescape was removed in Python 3.9):
import html.parser
html_parser = html.parser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.4 and later:
from html import unescape
unescaped = unescape(my_string)
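
As a rough illustration of the standard-library route on a modern Python 3, the sketch below decodes a shortened version of the string from the question and then a few references that a five-entry replacement table would miss. The sample strings are assumptions chosen for the example.

```python
from html import escape, unescape

# A shortened version of the entity-encoded string from the question.
encoded = '&lt;img title=&quot;su1&quot; alt=&quot;&quot; width=&quot;300&quot; /&gt;'

decoded = unescape(encoded)
print(decoded)  # <img title="su1" alt="" width="300" />

# unescape also understands named and numeric character references
# beyond the five that django.utils.html.escape produces.
print(unescape('caf&eacute; &#8364;5 &#x27;quoted&#x27;'))  # café €5 'quoted'

# escape() is the stdlib counterpart; with quote=True (the default) it
# also encodes double and single quotes, so this round trip is lossless.
assert unescape(escape(decoded)) == decoded
```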

As a suggestion: it may make more sense to store the HTML unescaped in your database. It'd be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.

With Django, escaping only occurs during template rendering; so to prevent escaping you just tell the templating engine not to escape your string. To do that, use one of these options in your template:

{{ context_var|safe }}
{% autoescape off %}
    {{ context_var }}
{% endautoescape %}
