Convert XML/HTML Entities into Unicode String in Python
Problem description
I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?
For example, I get back:
&#x01ce;
which represents an "ǎ" with a tone mark. In hexadecimal, this is the 16-bit code point 01ce. I want to convert the HTML entity into the value u'\u01ce'.
Python has the htmlentitydefs module, but this doesn't include a function to unescape HTML entities.
Python developer Fredrik Lundh (author of elementtree, among other things) has such a function on his website, which works with decimal, hex and named entities:
import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub(r"&#?\w+;", fixup, text)
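The function above is Python 2 code (`unichr` and `htmlentitydefs` do not exist in Python 3). As a minimal sketch for readers on Python 3.4 or later, the standard library's `html.unescape` should handle the same three cases (decimal, hexadecimal, and named references). The wrapper name `unescape_entities` is only an illustrative choice and is not part of the original answer.

```python
import html


def unescape_entities(text):
    """Return text with HTML character references replaced by the characters they name.

    Thin wrapper around the standard-library html.unescape (Python 3.4+).
    The name unescape_entities is hypothetical, chosen for this example.
    """
    return html.unescape(text)


# Hexadecimal, decimal, and named references.
print(unescape_entities("&#x01ce;") == "\u01ce")
print(unescape_entities("&#462;") == "\u01ce")
print(unescape_entities("caf&eacute;") == "caf\xe9")
```

In Python 3 every `str` is already Unicode, so no separate unicode type is returned; the result is an ordinary string.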