Unescaping HTML with special characters in Python 2.7.3 / Raspberry Pi


Question

I'm stuck trying to unescape HTML special characters.

The problematic text is

Rudimental &amp; Emeli Sandé

which should be converted to Rudimental & Emeli Sandé

The text is downloaded via wget (outside of Python).

To test this, save an ANSI-encoded file containing this line and import it.

import HTMLParser

trackentry = open('import.txt', 'r').readlines()
print(trackentry)
track = trackentry[0]
html_parser = HTMLParser.HTMLParser()

track = html_parser.unescape(track)

print(track)

I get this error when a line has é in it.

pi@raspberrypi ~/scripting $ python unparse.py
['Rudimental &amp; Emeli Sand\xe9\n']
Traceback (most recent call last):
  File "unparse.py", line 9, in <module>
    track = html_parser.unescape(track)
  File "/usr/lib/python2.7/HTMLParser.py", line 472, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)

The same code works fine under Windows. I only have problems on the Raspberry Pi running Python 2.7.3.

Recommended answer

Python cannot decode 'é' ('\xe9') using the ASCII codec because this character is not 7-bit ASCII.

Your problem (condensed):

import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
output = parser.unescape(input)

produces

Traceback (most recent call last):
  File "problem.py", line 4, in <module>
    output = parser.unescape(input)
  File "/usr/lib/python2.7/HTMLParser.py", line 475, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 11: ordinal not in range(128)

HTMLParser.unescape() returns a unicode object, and therefore has to convert your input str. Python 2 does that conversion implicitly with the default encoding (ASCII in your case), and it fails to interpret '\xe9' as an ASCII character (because it isn't one). I guess your file encoding is ISO-8859-1, where '\xe9' is 'é'.
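The failed conversion can be reproduced without HTMLParser at all. The following is only a minimal sketch: the helper name try_decode is hypothetical, and the ISO-8859-1 encoding is the same guess as above, not something confirmed by the question.

```python
# The byte string as it would be read from the downloaded file.
# The b'' prefix keeps this runnable on Python 2 and Python 3 alike.
raw = b'Rudimental &amp; Emeli Sand\xe9'


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding.

    Hypothetical helper, for illustration only.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# ASCII only covers bytes 0x00-0x7f, so 0xe9 cannot be decoded.
ascii_result = try_decode(raw, 'ascii')

# ISO-8859-1 maps every byte to a character; 0xe9 becomes U+00E9 (é).
# Assumption: the file really is ISO-8859-1 encoded.
latin1_result = try_decode(raw, 'iso-8859-1')
```

Here ascii_result ends up as None, which corresponds to the UnicodeDecodeError in the traceback, while latin1_result is a proper unicode string that unescape() can process.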

There are two easy solutions. Either you do the conversion manually:

import HTMLParser
parser = HTMLParser.HTMLParser()
input = 'Rudimental &amp; Emeli Sand\xe9'
input = input.decode('iso-8859-1')
output = parser.unescape(input)

or you use codecs.open() instead of open() whenever you are working with files:

import codecs
import HTMLParser
parser = HTMLParser.HTMLParser()
input = codecs.open("import.txt", encoding="iso-8859-1").readline()
output = parser.unescape(input)
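For trying the file-based approach end to end, here is a self-contained sketch that writes a test file first and then reads it back. Several things in it are assumptions that do not come from the answer: it uses io.open (which takes an encoding argument much like codecs.open), it falls back to html.unescape where the Python 2 HTMLParser module is not available, the helper name unescape_first_line is hypothetical, and the file is assumed to be ISO-8859-1 encoded.

```python
import io
import os
import tempfile

try:
    from html import unescape  # Python 3.4+ (assumption: not part of the answer)
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as used in the answer
    unescape = HTMLParser().unescape


def unescape_first_line(path, encoding='iso-8859-1'):
    """Read the first line of a file as unicode text and unescape HTML entities.

    Hypothetical helper; the default encoding is a guess about the input file.
    """
    with io.open(path, encoding=encoding) as handle:
        line = handle.readline()
    # Drop the trailing line break before unescaping.
    return unescape(line.rstrip(u'\r\n'))


# Create a small ISO-8859-1 test file, standing in for the wget download.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as handle:
    handle.write(b'Rudimental &amp; Emeli Sand\xe9\n')

try:
    track = unescape_first_line(demo_path)
finally:
    os.remove(demo_path)
```

Because the file is decoded while it is read, unescape() only ever sees unicode text and no implicit ASCII conversion takes place.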
