How do I perform HTML decoding/encoding using Python/Django?


Question

I have a string that is HTML-encoded:

&lt;img class=&quot;size-medium wp-image-113&quot; 
  style=&quot;margin-left: 15px;&quot; title=&quot;su1&quot; 
  src=&quot;http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg&quot; 
  alt=&quot;&quot; width=&quot;300&quot; height=&quot;194&quot; /&gt;

I want to change that to:

<img class="size-medium wp-image-113" style="margin-left: 15px;" 
  title="su1" src="http://blah.org/wp-content/uploads/2008/10/su1-300x194.jpg" 
  alt="" width="300" height="194" /> 

I want this to register as HTML so that it is rendered as an image by the browser instead of being displayed as text.

I've found how to do this in C# but not in Python. Can someone help me out?

Thanks.

Edit: Someone asked why my strings are stored like that. It's because I am using a web-scraping tool that "scans" a web-page and gets certain content from it. The tool (BeautifulSoup) returns the string in that format.

Solution

Given the Django use case, there are two answers to this. Here is its django.utils.html.escape function, for reference:

def escape(html):
    """Returns the given HTML with ampersands, quotes and carets encoded."""
    return mark_safe(
        force_unicode(html)
        .replace('&', '&amp;')
        .replace('<', '&lt;')
        .replace('>', '&gt;')
        .replace('"', '&quot;')
        .replace("'", '&#39;')
    )

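To reverse this, the Cheetah function described in Jake's answer should work, but it is missing the single quote. This version includes an updated tuple, with the order of replacement reversed so that '&amp;' is decoded last; otherwise a doubly escaped sequence such as '&amp;lt;' would be decoded twice, to '<' instead of '&lt;':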

def html_decode(s):
    """
    Returns the ASCII decoded version of the given HTML string. This does
    NOT remove normal HTML tags like <p>.
    """
    htmlCodes = (
            ("'", '&#39;'),
            ('"', '&quot;'),
            ('>', '&gt;'),
            ('<', '&lt;'),
            ('&', '&amp;')
        )
    for code in htmlCodes:
        s = s.replace(code[1], code[0])
    return s

unescaped = html_decode(my_string)
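
To see why the replacement order matters, here is a minimal sketch comparing two orderings. The names `decode_amp_last` and `decode_amp_first` are hypothetical and exist only for this illustration; the first uses the same order as `html_decode` above.

```python
# Entity table in the same order as html_decode above: '&amp;' comes last.
AMP_LAST = (
    ("'", '&#39;'),
    ('"', '&quot;'),
    ('>', '&gt;'),
    ('<', '&lt;'),
    ('&', '&amp;'),
)

# Hypothetical variant with '&amp;' moved to the front, for comparison only.
AMP_FIRST = (AMP_LAST[-1],) + AMP_LAST[:-1]


def _decode(s, table):
    # Replace each entity with its literal character, in table order.
    for char, entity in table:
        s = s.replace(entity, char)
    return s


def decode_amp_last(s):
    return _decode(s, AMP_LAST)


def decode_amp_first(s):
    return _decode(s, AMP_FIRST)


# Text that literally reads "&lt;" becomes "&amp;lt;" once escaped.
escaped = '&amp;lt;'
print(decode_amp_last(escaped))   # &lt;  (one level of escaping removed)
print(decode_amp_first(escaped))  # <     (decoded twice, which is wrong)
```

Decoding '&amp;' first creates a fresh '&lt;' that the later replacement then consumes, so a single call removes two levels of escaping.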

This, however, is not a general solution; it is only appropriate for strings encoded with django.utils.html.escape. More generally, it is a good idea to stick with the standard library:

# Python 2.x:
import HTMLParser
html_parser = HTMLParser.HTMLParser()
unescaped = html_parser.unescape(my_string)

# Python 3.4+ (HTMLParser.unescape was deprecated in 3.4 and removed in 3.9):
import html
unescaped = html.unescape(my_string)
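
As a rough illustration of the standard-library route, the sketch below assumes Python 3.4 or later, where `html.unescape` is available. The `encoded` string is a shortened version of the one in the question, and the second input is made up to show entities that the hand-written `html_decode` table does not cover.

```python
import html

# Shortened version of the HTML-encoded string from the question.
encoded = (
    '&lt;img class=&quot;size-medium wp-image-113&quot; '
    'alt=&quot;&quot; width=&quot;300&quot; /&gt;'
)
decoded = html.unescape(encoded)
print(decoded)  # <img class="size-medium wp-image-113" alt="" width="300" />

# Named and numeric (decimal or hex) character references are handled too.
extra = html.unescape('caf&eacute; &#233; &#x27;')
print(extra)  # café é '
```

The hex reference `&#x27;` is what Python's own `html.escape` emits for a single quote, so a table that only knows `&#39;` would leave it untouched.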

As a suggestion: it may make more sense to store the HTML unescaped in your database. It'd be worth looking into getting unescaped results back from BeautifulSoup if possible, and avoiding this process altogether.

With Django, escaping only occurs during template rendering, so to prevent it you just tell the templating engine not to escape your string. To do that, use one of these options in your template:

{{ context_var|safe }}

{% autoescape off %}
    {{ context_var }}
{% endautoescape %}
