Regex for accent-insensitive replacement in Python
Problem description
In Python 3, I'd like to be able to use re.sub() in an "accent-insensitive" way, as we can do with the re.I flag for case-insensitive substitution.
It could be something like a re.IGNOREACCENTS flag:
original_text = "¿It's 80°C, I'm drinking a café in a cafe with Chloë."
accent_regex = r'a café'
re.sub(accent_regex, 'X', original_text, flags=re.IGNOREACCENTS)
This would lead to "¿It's 80°C, I'm drinking X in X with Chloë." (note that there's still an accent on "Chloë") instead of "¿It's 80°C, I'm drinking X in a cafe with Chloë.", which is what Python actually produces without such a flag.
I think that such a flag doesn't exist. So what would be the best option to do this? Using re.finditer and unidecode on both original_text and accent_regex, and then replacing by splitting the string? Or replacing every character in accent_regex with a class of its accented variants, for instance r'[cç][aàâ]f[éèêë]'?
Recommended answer
unidecode is often mentioned for removing accents in Python, but it also does more than that: it converts '°' to 'deg', which might not be the desired output.
unicodedata seems to have what is needed to remove accents.
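As a quick illustration of why the standard library may be preferable here, the sketch below strips accents with unicodedata only and leaves the degree sign alone. It assumes the text is ordinary Latin-script text; the sample string is made up for the example.

```python
import unicodedata


def remove_accents(s):
    """Drop nonspacing combining marks (category 'Mn') after NFD decomposition."""
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')


# The degree sign has no canonical decomposition, so it passes through as-is.
print(remove_accents("80°C café with Chloë"))
# 80°C cafe with Chloe
```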
This method should work with any pattern and any text.
You can temporarily remove the accents from both the text and the regex pattern. The match information from re.finditer() (start and end indices) can then be used to modify the original, accented text. This relies on the accent-free text keeping the same character positions as the original, which holds when each accented letter is stored as a single precomposed character.
Note that the matches must be processed in reverse order, so that replacing one match does not shift the indices of the matches still to be processed.
import re
import unicodedata

original_text = "I'm drinking a 80° café in a cafe with Chloë, François Déporte and Francois Deporte."
accented_pattern = r'a café|François Déporte'

def remove_accents(s):
    # Decompose each character (NFD), then drop the combining marks (category 'Mn').
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

print(remove_accents('äöüßéèiìììíàáç'))
# aoußeeiiiiiaac

pattern = re.compile(remove_accents(accented_pattern))

modified_text = original_text
matches = list(pattern.finditer(remove_accents(original_text)))

# Replace from the last match to the first so that earlier indices stay valid.
for match in matches[::-1]:
    modified_text = modified_text[:match.start()] + 'X' + modified_text[match.end():]

print(modified_text)
# I'm drinking a 80° café in X with Chloë, X and X.
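The snippet above assumes that the accent-free text has exactly the same character positions as the original. A possible way to relax that assumption is to record, for every character of the stripped text, which original index it came from. The following is only a sketch: strip_with_index_map and sub_ignoring_accents are hypothetical helper names, not part of re or unicodedata.

```python
import re
import unicodedata


def strip_with_index_map(text):
    """Return (accent-free text, list mapping its indices back to `text`)."""
    stripped = []
    index_map = []
    for i, ch in enumerate(text):
        for c in unicodedata.normalize('NFD', ch):
            if unicodedata.category(c) != 'Mn':
                stripped.append(c)
                index_map.append(i)
    index_map.append(len(text))  # sentinel for matches ending at the very end
    return ''.join(stripped), index_map


def sub_ignoring_accents(accented_pattern, replacement, text):
    """Hypothetical helper: replace accent-insensitive matches in `text`."""
    stripped, index_map = strip_with_index_map(text)
    plain_pattern, _ = strip_with_index_map(accented_pattern)
    pattern = re.compile(plain_pattern)
    # Go from right to left so that earlier indices stay valid.
    for match in reversed(list(pattern.finditer(stripped))):
        start = index_map[match.start()]
        end = index_map[match.end()]
        text = text[:start] + replacement + text[end:]
    return text


print(sub_ignoring_accents(r'a café', 'X', "I'm drinking a 80° café in a cafe with Chloë."))
# I'm drinking a 80° café in X with Chloë.
```

With this mapping, text that stores accents as separate combining characters (for example 'e' followed by U+0301) is handled as well, because the end of a match is mapped to the next base character in the original text.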
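If the pattern is a word or a set of words
You could:
- remove the accents from your pattern words and save them in a set for fast lookup
- look for every word in your text with \w+
- remove the accents from the word:
  - if it matches, replace it with X
  - if it doesn't match, leave the word untouched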
import re
from unidecode import unidecode

original_text = "I'm drinking a café in a cafe with Chloë."

def remove_accents(string):
    return unidecode(string)

accented_words = ['café', 'français']
# A set gives fast membership tests for the accent-free target words.
words_to_remove = set(remove_accents(word) for word in accented_words)

def remove_words(matchobj):
    word = matchobj.group(0)
    if remove_accents(word) in words_to_remove:
        return 'X'
    else:
        return word

print(re.sub(r'\w+', remove_words, original_text))
# I'm drinking a X in a X with Chloë.
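Since unidecode transliterates more than accents, the same word-set idea can also be sketched with the standard library only. This is a variant under assumptions, not the code from the answer: replace_words is a hypothetical helper name, and the input is assumed to use precomposed accented letters so that \w+ sees each word as one token.

```python
import re
import unicodedata


def remove_accents(s):
    """Drop nonspacing combining marks (category 'Mn') after NFD decomposition."""
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')


def replace_words(text, accented_words, replacement='X'):
    """Hypothetical helper: replace whole words, ignoring accents."""
    targets = {remove_accents(word) for word in accented_words}

    def substitute(match):
        word = match.group(0)
        return replacement if remove_accents(word) in targets else word

    return re.sub(r'\w+', substitute, text)


print(replace_words("It's 80°C, I'm drinking a café in a cafe with Chloë.",
                    ['café', 'français']))
# It's 80°C, I'm drinking a X in a X with Chloë.
```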