Convert XML/HTML Entities into Unicode String in Python

Views: 276 | Published: 2018/6/13 9:32:43 | Tags: python html entities


Problem description

I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string containing HTML entities and returns a unicode object?

For example, I get back:

&#x01ce;

which represents "ǎ", an "a" with a tone mark. In hexadecimal, this is the 16-bit value 01ce. I want to convert the HTML entity into the value u'\u01ce'
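
To make the desired conversion concrete, here is a minimal sketch of the arithmetic for this one hexadecimal reference. It assumes Python 3, where `chr` replaces Python 2's `unichr`, and the helper name `decode_hex_reference` is hypothetical, not something from the question:

```python
# Hypothetical helper for illustration only; assumes Python 3.
def decode_hex_reference(ref):
    """Turn a hexadecimal character reference such as '&#x01ce;' into its character."""
    # Drop the leading '&#x' and the trailing ';', leaving only the hex digits.
    hex_digits = ref[3:-1]
    # Parse the digits as base 16 and map the code point to a one-character string.
    return chr(int(hex_digits, 16))


char = decode_hex_reference("&#x01ce;")
print(char == "\u01ce")  # the reference and the escape name the same code point
```

This only covers the hexadecimal form; decimal references and named entities need the extra branches discussed in the answer.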

Solution

Python has the htmlentitydefs module, but this doesn't include a function to unescape HTML entities.

Python developer Fredrik Lundh (author of elementtree, among other things) has such a function on his website, which works with decimal, hex and named entities:

import re, htmlentitydefs

##
# Removes HTML or XML character references and entities from a text string.
#
# @param text The HTML (or XML) source text.
# @return The plain text, as a Unicode string, if necessary.

def unescape(text):
    def fixup(m):
        text = m.group(0)
        if text[:2] == "&#":
            # character reference
            try:
                if text[:3] == "&#x":
                    return unichr(int(text[3:-1], 16))
                else:
                    return unichr(int(text[2:-1]))
            except ValueError:
                pass
        else:
            # named entity
            try:
                text = unichr(htmlentitydefs.name2codepoint[text[1:-1]])
            except KeyError:
                pass
        return text # leave as is
    # Raw string so that \w reaches the regex engine unchanged
    return re.sub(r"&#?\w+;", fixup, text)
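
The function above targets Python 2 (`unichr`, `htmlentitydefs`). For readers on Python 3, the following is a rough port, offered as a sketch rather than as the original author's code. It assumes Python 3.4 or later, where `htmlentitydefs` lives at `html.entities`, `chr` replaces `unichr`, and the standard library also provides `html.unescape`. The `sample` string is made up for illustration.

```python
import html
import re
from html.entities import name2codepoint


def unescape(text):
    """Python 3 port of the function above (a sketch, not the original author's code)."""
    def fixup(m):
        ref = m.group(0)
        if ref[:2] == "&#":
            # Numeric character reference: hexadecimal (&#x...;) or decimal (&#...;)
            try:
                if ref[:3] == "&#x":
                    return chr(int(ref[3:-1], 16))
                return chr(int(ref[2:-1]))
            except ValueError:
                pass
        else:
            # Named entity such as &eacute;
            try:
                return chr(name2codepoint[ref[1:-1]])
            except KeyError:
                pass
        return ref  # unknown or malformed reference: leave as is
    return re.sub(r"&#?\w+;", fixup, text)


# Illustrative input: hex, decimal, named, and an unknown entity.
sample = "&#x01ce; &#462; &eacute; &bogus;"
print(unescape(sample))
# The standard library function is usually the simpler choice on Python 3.4+.
print(html.unescape("&#x01ce;"))
```

On Python 3, calling `html.unescape` directly is likely enough for most scraping tasks. The hand-written port is mainly useful for seeing how the three kinds of reference are handled.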
