Unescaping HTML with special characters in Python 2.7.3 / Raspberry Pi

Question

I'm stuck here trying to unescape HTML special characters.

The problematic text is

Rudimental &amp; Emeli Sandé

which should be converted to

Rudimental & Emeli Sandé

The text is downloaded via wget (outside of Python).

To test this, save an ANSI file containing this line and import it.

import HTMLParser

trackentry = open('import.txt', 'r').readlines()
print(trackentry)
track = trackentry[0]
html_parser = HTMLParser.HTMLParser()

track = html_parser.unescape(track)

print(track)

I get this error when a line has é in it.

pi@raspberrypi ~/scripting $ python unparse.py
['Rudimental &amp; Emeli Sand\xe9\n']
Traceback (most recent call last):
  File "unparse.py", line 9, in <module>
    track = html_parser.unescape(track)
  File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)

The same code works fine under Windows; I only have problems on the Raspberry Pi running Python 2.7.3.

Recommended answer

Python cannot decode 'é' ('\xe9') using the ASCII codec because this character is not 7-bit ASCII.

Your problem, in short:

import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
output = parser.unescape(input)

produces

Traceback (most recent call last):
  File "problem.py", line 4, in <module>
    output = parser.unescape(input)
  File "/usr/lib/python2.7/HTMLParser.py", line 475, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)

HTMLParser.unescape() returns a unicode object, so it has to convert your input str to unicode first. Python 2 does this implicitly using the default encoding (ASCII in your case), and the conversion fails because '\xe9' is not a valid ASCII byte. I guess your file is encoded as ISO-8859-1, where '\xe9' is 'é'.
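The decoding failure described above can be reproduced without HTMLParser at all. The following is a minimal sketch, assuming the file really is ISO-8859-1 as guessed; the helper name try_decode is made up for illustration and is not part of any library. It should run unchanged on Python 2.7 and Python 3.

```python
# Bytes as they would be read from the ISO-8859-1 encoded file.
RAW = b'Rudimental &amp; Emeli Sand\xe9'


def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`.

    Hypothetical helper, used only to contrast the two codecs.
    """
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# 0xe9 is outside the 7-bit ASCII range, so the ASCII codec rejects it.
ascii_text = try_decode(RAW, 'ascii')

# ISO-8859-1 maps every byte to a code point; 0xe9 becomes U+00E9 (é).
latin1_text = try_decode(RAW, 'iso-8859-1')

print(repr(ascii_text))
print(repr(latin1_text))
```

The first call returns None because the ASCII codec raises UnicodeDecodeError, which is the same failure Python 2 hits when it implicitly decodes the str inside unescape().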

There are two easy solutions. Either you do the conversion manually:

import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
input = input.decode('iso-8859-1')
output = parser.unescape(input)

or you use codecs.open() instead of open() whenever you are working with files:

import codecs
import HTMLParser
parser = HTMLParser.HTMLParser()
input = codecs.open("import.txt", encoding="iso-8859-1").readline()
output = parser.unescape(input)
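If the same script also has to run on Python 3, where the HTMLParser module no longer exists under that name, the file-based approach could be sketched as below. This is only an illustration under assumptions: the function name read_unescaped_lines is invented here, the default encoding is still the ISO-8859-1 guess from above, and io.open() is used in place of codecs.open() because it is available on both Python 2.7 and Python 3.

```python
import io

try:
    # Python 3.4+: unescape() is a module-level function in html.
    from html import unescape
except ImportError:
    # Python 2: fall back to the HTMLParser method used above.
    import HTMLParser
    unescape = HTMLParser.HTMLParser().unescape


def read_unescaped_lines(path, encoding="iso-8859-1"):
    """Read `path` as text in `encoding` and unescape HTML entities per line.

    Hypothetical helper; the encoding default is an assumption about the file.
    """
    # io.open() decodes while reading, so unescape() only ever sees
    # unicode text and no implicit ASCII decoding takes place.
    with io.open(path, encoding=encoding) as handle:
        return [unescape(line.rstrip("\n")) for line in handle]
```

Calling read_unescaped_lines("import.txt") would then return a list of unicode strings, one per line of the file.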
