What is the best way to remove accents (normalize) in a Python unicode string?

Problem description

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the web an elegant way to do this (in Java):

  1. convert the Unicode string to its long normalized form (with a separate character for letters and diacritics)
  2. remove all the characters whose Unicode type is "diacritic".

Do I need to install a library such as pyICU, or is this possible with just the Python standard library? And what about Python 3?

Important note: I would like to avoid code with an explicit mapping from accented characters to their non-accented counterpart.

Solution

How about this:

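import unicodedata

def strip_accents(s):
    # NFD splits each accented letter into a base letter plus combining marks;
    # dropping category 'Mn' (Nonspacing_Mark) then removes the marks.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')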

This works on Greek letters, too:

>>> strip_accents(u"A \u00c0 \u0394 \u038E")
u'A A \u0394 \u03a5'

The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
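For comparison, here is a minimal sketch of the unicodedata.combining variant mentioned above, written for Python 3. The function name strip_accents_combining is my own, not taken from MiniQuark's answer. Note that the two tests overlap heavily but are not guaranteed to agree on every character, since the combining class and the general category are separate Unicode properties.

```python
import unicodedata

def strip_accents_combining(s):
    # Hypothetical name, chosen for this illustration only.
    # Decompose first so that accents become separate combining characters.
    decomposed = unicodedata.normalize('NFD', s)
    # unicodedata.combining() returns 0 for characters that are not
    # combining marks, so keep only those.
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents_combining("A \u00c0 \u0394 \u038e"))
```

In Python 3 every str is already Unicode, so no u prefix is needed on the literal.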

And keep in mind, these manipulations may significantly alter the meaning of the text. Accents, Umlauts etc. are not "decoration".
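To make that warning concrete, here is a small sketch (Python 3, repeating the function so it runs on its own). The sample words are my own choice: German "schön" (beautiful) and "schon" (already) become indistinguishable after stripping. It also shows that a letter without a canonical decomposition, such as "ø", passes through unchanged, so this approach does not turn every letter into plain ASCII.

```python
import unicodedata

def strip_accents(s):
    # Same approach as in the answer: decompose, then drop nonspacing marks.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

# Two different German words collapse into the same string.
print(strip_accents("sch\u00f6n") == "schon")

# U+00F8 (o with stroke) has no decomposition, so nothing is removed.
print(strip_accents("\u00f8") == "\u00f8")
```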
