Unescaping HTML with special characters in Python 2.7.3 / Raspberry Pi
Question
I'm stuck here trying to unescape HTML special characters.
The text in question is
Rudimental &amp; Emeli Sandé
which should be converted to Rudimental & Emeli Sandé.
The text is downloaded via wget (outside of Python).
To test this, save an ANSI file with this line and import it.
import HTMLParser
trackentry = open('import.txt', 'r').readlines()
print(trackentry)
track = trackentry[0]
html_parser = HTMLParser.HTMLParser()
track = html_parser.unescape(track)
print(track)
I get this error when a line has é in it.
pi@raspberrypi ~/scripting $ python unparse.py
['Rudimental &amp; Emeli Sand\xe9\n']
Traceback (most recent call last):
  File "unparse.py", line 9, in <module>
    track = html_parser.unescape(track)
  File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)
The same code works fine under Windows. I only have problems on the Raspberry Pi running Python 2.7.3.
Recommended answer
Python cannot decode 'é' ('\xe9') using the ASCII codec because this character is not 7-bit ASCII.
Your problem (condensed):
import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
output = parser.unescape(input)
produces
Traceback (most recent call last):
  File "problem.py", line 4, in <module>
    output = parser.unescape(input)
  File "/usr/lib/python2.7/HTMLParser.py", line 475, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)
HTMLParser.unescape() returns a unicode object, so your input str has to be converted to unicode first. Python does that with the default encoding (ASCII in your case) and fails on '\xe9', which is not an ASCII character. I guess your file is encoded as ISO-8859-1, where '\xe9' is 'é'.
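To make the failure concrete, here is a minimal sketch of the same conversion done by hand. It assumes the file held the track name as the ISO-8859-1 bytes in the raw literal below, with the ampersand in its entity form; the variable names are only illustrative. The b'' prefix keeps it runnable on both Python 2.7 and Python 3.

```python
# Bytes as they would be read from the ANSI file (assumed sample data).
raw = b'Rudimental &amp; Emeli Sand\xe9'

# The ASCII codec only covers bytes 0x00-0x7f, so 0xe9 cannot be decoded.
try:
    raw.decode('ascii')
    ascii_ok = True
    ascii_error = ''
except UnicodeDecodeError as exc:
    ascii_ok = False
    ascii_error = str(exc)

# ISO-8859-1 maps every byte to a code point; 0xe9 becomes U+00E9.
text = raw.decode('iso-8859-1')

print(ascii_ok)
print(ascii_error)
print(repr(text))
```

The explicit decode is the step that Python 2 otherwise attempts implicitly with the ASCII codec when a byte string meets a unicode object.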
There are two easy solutions. Either you do the conversion manually:
import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
# Decode the byte string explicitly so no implicit ASCII conversion happens
input = input.decode('iso-8859-1')
output = parser.unescape(input)
or you use codecs.open() instead of open() whenever you are working with files:
import codecs
import HTMLParser
parser = HTMLParser.HTMLParser()
input = codecs.open("import.txt", encoding="iso-8859-1").readline()
output = parser.unescape(input)
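Putting the file-based approach together end to end might look like the sketch below. Several things here are assumptions rather than part of the answer: read_track is a hypothetical helper, the test file is a temporary file created on the fly, io.open() is swapped in for codecs.open() (it accepts the same encoding argument and exists on both Python 2.7 and Python 3), and the import falls back to html.unescape on Python 3.4+, where the HTMLParser module is not available.

```python
import io
import os
import tempfile

try:
    from html import unescape           # Python 3.4+
except ImportError:
    import HTMLParser                   # Python 2.7, as in the question
    unescape = HTMLParser.HTMLParser().unescape


def read_track(path, encoding='iso-8859-1'):
    """Hypothetical helper: decode the first line and unescape its entities."""
    # io.open decodes while reading, so unescape() only ever sees unicode text.
    with io.open(path, encoding=encoding) as handle:
        line = handle.readline()
    return unescape(line.rstrip(u'\r\n'))


# Build a throwaway test file containing the raw ISO-8859-1 bytes.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as handle:
    handle.write(b'Rudimental &amp; Emeli Sand\xe9\n')

track = read_track(path)
os.remove(path)
print(repr(track))
```

The encoding is passed as a parameter because ISO-8859-1 is only a guess about the file; if wget actually saved UTF-8, the same helper can be called with encoding='utf-8'.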