Convert XML/HTML Entities into Unicode String in Python


Question

I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?

For example, I get back:

&#x01ce;

which represents an "ǎ" with a tone mark. In hexadecimal, its 16-bit code point is 01ce. I want to convert the HTML entity into the value u'\u01ce'.

Solution

Python has the htmlentitydefs module, but this doesn't include a function to unescape HTML entities.
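To illustrate what the module does offer, here is a minimal sketch of its lookup table. It assumes Python 3, where htmlentitydefs was renamed to html.entities; the variable names are only for illustration.

```python
# Python 3 name for the Python 2 htmlentitydefs module (assumption: Python 3).
from html.entities import name2codepoint

# The module only provides lookup tables, such as entity name -> code point.
# Turning a whole string of entities into text is left to the caller.
code_point = name2codepoint["eacute"]  # integer code point for the name "eacute"
char = chr(code_point)                 # the corresponding one-character string
```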

Python developer Fredrik Lundh (author of elementtree, among other things) has such a function on his website, which works with decimal, hex and named entities:

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    return re.sub(r"&#?\w+;", fixup, text)
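
The function above targets Python 2 (unichr, htmlentitydefs). As a rough sketch, assuming Python 3.4 or later, the same logic can be ported with chr and html.entities. The name unescape_py3 is hypothetical and not part of the original answer. The standard library's html.unescape is imported alongside it, since it likely covers most cases without custom code.

```python
import re
from html import unescape as stdlib_unescape
from html.entities import name2codepoint

# Matches decimal (&#462;), hex (&#x01ce;) and named (&eacute;) references.
_ENTITY_RE = re.compile(r"&#?\w+;")


def unescape_py3(text):
    """Python 3 port of the unescape function above (hypothetical name)."""
    def fixup(m):
        ref = m.group(0)
        if ref[:2] == "&#":
            # Numeric character reference.
            try:
                if ref[:3] == "&#x":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except (ValueError, OverflowError):
                pass
        else:
            # Named entity.
            try:
                return chr(name2codepoint[ref[1:-1]])
            except KeyError:
                pass
        return ref  # leave unrecognized references as they are
    return _ENTITY_RE.sub(fixup, text)


sample = "&#x01ce; &#462; &eacute; &unknown;"
converted = unescape_py3(sample)

# The built-in alternative, shown on the entity from the question.
builtin = stdlib_unescape("&#x01ce;")
```

Like the original, this port leaves unknown names such as &unknown; untouched. For new code, html.unescape is probably the simpler choice, and the hand-written version is mainly useful when the handling of unrecognized references has to be customized.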
