Maintaining the consistency of strings before and after converting to ASCII


Problem description

I have many Unicode strings, such as carbon copolymers—III\n12- Géotechnique\n, and many more containing a variety of Unicode characters, all stored in a string variable named txtWords.

My goal is to remove all non-ASCII characters while preserving the consistency of the strings. For instance, I want the first sentence to turn into carbon copolymers III or carbon copolymers iii (case does not matter here), the second one into geotechnique\n, and so on.

Currently I am using the following code, but it does not achieve what I expect. It changes carbon copolymers—III to carbon copolymersiii, with the two words run together, which is definitely not what it should be:

import unicodedata, re
txtWords = unicodedata.normalize('NFKD', txtWords.lower()).encode('ascii','ignore')
txtWords = re.sub(r'[^a-z^\n]',r' ',txtWords)

If I apply the regex first, I get something even worse (in terms of what I expect):

import unicodedata, re
txtWords = re.sub(r'[^a-z^\n]',r' ',txtWords)
txtWords = unicodedata.normalize('NFKD', txtWords.lower()).encode('ascii','ignore')

This way, for the string Géotechnique\n I get otechnique!

How can I resolve this issue?

Recommended answer

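Use the \w regular expression to replace non-alphanumerics with spaces before applying the decomposition trick. With the re.U flag, \w also matches accented letters such as é, so they survive the regex and are only reduced to plain ASCII letters afterwards, when NFKD splits off the accent and the ASCII encoding drops it: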

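# coding: utf8
from __future__ import unicode_literals, print_function
import re
import unicodedata as ud

txtWords = 'carbon copolymers—III\n12- Géotechnique\n'
# Replace everything except word characters and newlines with a space.
# With re.U, \w also matches accented letters such as é.
txtWords = re.sub(r'[^\w\n]', ' ', txtWords.lower(), flags=re.U)
# Decompose accented letters (é -> e + combining accent), then drop the non-ASCII parts.
txtWords = ud.normalize('NFKD', txtWords).encode('ascii', 'ignore').decode()
print(txtWords)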

Output (Python 2 and 3):

carbon copolymers iii
12  geotechnique
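
The same two steps can be wrapped in a small function for reuse. The following is only a sketch for Python 3: the helper name to_ascii_words is hypothetical and not part of the original answer, and the sample string is written with escape sequences (\u2014 for the em-dash, \u00e9 for é) purely to avoid source-encoding issues.

```python
import re
import unicodedata


def to_ascii_words(text):
    """Lowercase text, turn punctuation into spaces, and strip accents.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # In Python 3, \w is Unicode-aware for str patterns, so é is kept here
    # while punctuation such as the em-dash is replaced by a space.
    spaced = re.sub(r'[^\w\n]', ' ', text.lower())
    # NFKD splits é into 'e' followed by U+0301 (combining acute accent).
    decomposed = unicodedata.normalize('NFKD', spaced)
    # Encoding to ASCII with 'ignore' drops the combining marks.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


sample = 'carbon copolymers\u2014III\n12- G\u00e9otechnique\n'
result = to_ascii_words(sample)
print(repr(result))
```

Note that \w also keeps digits and underscores, and that the hyphen and the space after 12 each become a space, which is why two spaces appear there. If that is unwanted, a further re.sub(r' +', ' ', result) would collapse runs of spaces; that extra step is an addition here, not something the answer itself does.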
