Replace accented character with html entity


Problem description

I'm trying to automate a series of queries, but I need to replace accented characters with the corresponding HTML entities. It needs to be in Python 3.

Example:

vèlit 
[needs to become] 
v&egrave;lit

The thing is, whenever I try to do a word.replace, it doesn't find it.

This:

if u'è' in sentence:
    print(u'Found è')

Works and finds "è", but doing:

word.replace('è','&egrave;')

Doesn't do anything.

Solution

You can use the str.translate method and the data in Python's html package to convert characters to their equivalent HTML entities.

To do this, str.translate needs a dictionary that maps characters (technically the character's integer representation, or ordinal) to html entities.

html.entities.codepoint2name contains the required data, but the entity names are not wrapped in '&' and ';'. You can use a dict comprehension to create a table with the values you need.

Once the table has been created, call your string's translate method with the table as the argument and the result will be a new string in which any characters with an html entity equivalent will have been converted.

>>> import html.entities
>>> s = 'vèlit'

>>> # Create the translation table
>>> table = {k: '&{};'.format(v) for k, v in html.entities.codepoint2name.items()}

>>> s.translate(table)
'v&egrave;lit'

>>> 'Voilà'.translate(table)
'Voil&agrave;'
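
One thing worth checking before using the table above on text that already contains markup: html.entities.codepoint2name also has entries for '&', '<', '>' and '"', so those characters get converted too. The sketch below shows one way to restrict the table to non-ASCII code points. The names full_table, non_ascii_table and sample are made up for illustration, and whether you want this restriction depends on your input.

```python
import html.entities

# Full table: every code point that has a named entity, including & < > "
full_table = {k: '&{};'.format(v)
              for k, v in html.entities.codepoint2name.items()}

# Restricted table (an assumption about what you want): keep only
# non-ASCII code points so existing markup characters pass through.
non_ascii_table = {k: v for k, v in full_table.items() if k > 127}

# '\u00e8' is the composed form of 'è'
sample = '<b>v\u00e8lit</b> & more'

full_result = sample.translate(full_table)
restricted_result = sample.translate(non_ascii_table)

print(full_result)        # markup characters are escaped as well
print(restricted_result)  # only the accented letter is replaced
```

If the goal is to escape the whole string for safe inclusion in HTML, the full table may be what you want; if the string is already HTML, the restricted table is probably the safer choice.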

Be aware that accented Latin characters may be represented by a combination of Unicode code points: 'è' can be represented by a single code point (LATIN SMALL LETTER E WITH GRAVE) or by two code points (LATIN SMALL LETTER E followed by COMBINING GRAVE ACCENT). In the latter case (known as the decomposed form), the translation will not work as expected, because the table only has an entry for the single composed code point.
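
To make the two representations concrete, here is a small sketch that spells out both forms with escape sequences and asks unicodedata for the name of each code point. The variable names are only for illustration. It also shows why a membership test or a replace on the composed character can fail against decomposed text.

```python
import unicodedata

composed = '\u00e8'      # one code point
decomposed = 'e\u0300'   # base letter followed by a combining accent

composed_names = [unicodedata.name(c) for c in composed]
decomposed_names = [unicodedata.name(c) for c in decomposed]

print(composed_names)
print(decomposed_names)

# The two strings render the same but do not compare equal
same = composed == decomposed

# Searching decomposed text for the composed character finds nothing
found = composed in 've\u0300lit'
```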

To get around this, you can convert the two-codepoint decomposed form to the single codepoint composed form using the normalize function from the unicodedata module in Python's standard library.

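>>> import unicodedata
>>> decomposed = unicodedata.normalize('NFD', s)    # 'e' followed by a combining accent
>>> decomposed
'vèlit'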
>>> decomposed == s
False
>>> len(decomposed)    # decomposed is longer than composed
6
>>> decomposed.translate(table)
'vèlit'
>>> composed = unicodedata.normalize('NFC', decomposed)
>>> composed == s
True
>>> composed.translate(table)
'v&egrave;lit'
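
Putting the two steps together, a helper along these lines could be reused across queries. This is a minimal sketch, not a definitive implementation: the function name accents_to_entities and the module-level _ENTITY_TABLE are hypothetical, and it assumes you want every character that has a named entity in html.entities.codepoint2name converted (including '&', '<', '>' and '"').

```python
import html.entities
import unicodedata

# Map each code point that has a named entity to '&name;'
_ENTITY_TABLE = {k: '&{};'.format(v)
                 for k, v in html.entities.codepoint2name.items()}


def accents_to_entities(text):
    """Return text with characters replaced by their named HTML entities.

    The text is first normalized to NFC so that decomposed sequences
    (base letter + combining accent) become single code points that
    the table can match.
    """
    composed = unicodedata.normalize('NFC', text)
    return composed.translate(_ENTITY_TABLE)


# Works for both the decomposed and the composed form
print(accents_to_entities('ve\u0300lit'))
print(accents_to_entities('Voil\u00e0'))
```

Like str.replace, str.translate returns a new string and leaves the original untouched, so the result has to be assigned or passed on (for example word = accents_to_entities(word)). Characters with no entry in codepoint2name are left unchanged.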
