Regex for accent-insensitive replacement in Python


Question

In Python 3, I'd like to be able to use re.sub() in an "accent-insensitive" way, as we can do with the re.I flag for case-insensitive substitution.

Maybe something like a re.IGNOREACCENTS flag:

original_text = "¿It's 80°C, I'm drinking a café in a cafe with Chloë。"
accent_regex = r'a café'
re.sub(accent_regex, 'X', original_text, flags=re.IGNOREACCENTS)

This would lead to "¿It's 80°C, I'm drinking X in X with Chloë。" (note that there's still an accent on "Chloë") instead of "¿It's 80°C, I'm drinking X in a cafe with Chloë。" in real python.

I think that such a flag doesn't exist. So what would be the best option to do this? Using re.finditer and unidecode on both original_text and accent_regex and then replace by splitting the string? Or modifying all characters in the accent_regex by their accented variants, for instance: r'[cç][aàâ]f[éèêë]'?

Recommended answer

unidecode is often mentioned for removing accents in Python, but it does more than that: it converts '°' to 'deg', which might not be the desired output.

unicodedata seems to have enough functionality to remove accents.
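
As a small illustration of why the standard library is enough here: NFD normalization splits an accented letter into its base letter followed by a combining mark of category 'Mn', while a symbol such as '°' has no canonical decomposition and is left alone. The helper name describe is only for this sketch.

```python
import unicodedata

def describe(ch):
    """List (code point, general category) for each code point of NFD(ch)."""
    return [(hex(ord(c)), unicodedata.category(c))
            for c in unicodedata.normalize('NFD', ch)]

# Accented letters decompose into a letter plus an 'Mn' mark; '°' does not
for ch in ('é', 'ë', '°'):
    print(repr(ch), describe(ch))
```

Dropping every code point whose category is 'Mn' therefore removes the accent from 'é' and 'ë' but keeps '°' as it is.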

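This method should work with any pattern and any text, as long as removing the accents does not change the length of the text (for example, text in which each accented letter is a single precomposed character), because the match indices are reused on the original string.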

You can temporarily remove the accents from both the text and regex pattern. The match information from re.finditer() (start and end indices) can be used to modify the original, accented text.

Note that the matches must be processed in reverse order, so that a replacement does not shift the indices of the matches that are still to be replaced.

import re
import unicodedata

original_text = "I'm drinking a 80° café in a cafe with Chloë, François Déporte and Francois Deporte."

accented_pattern = r'a café|François Déporte'

def remove_accents(s):
    # NFD splits each accented letter into a base letter plus combining marks;
    # dropping the marks (category 'Mn') leaves the unaccented letter.
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

print(remove_accents('äöüßéèiìììíàáç'))
# aoußeeiiiiiaac

# Search with an accent-free pattern in an accent-free copy of the text
pattern = re.compile(remove_accents(accented_pattern))
matches = list(pattern.finditer(remove_accents(original_text)))

# Apply the replacements to the original text, last match first
modified_text = original_text
for match in reversed(matches):
    modified_text = modified_text[:match.start()] + 'X' + modified_text[match.end():]

print(modified_text)
# I'm drinking a 80° café in X with Chloë, X and X.
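
The code above reuses indices from the accent-free copy on the original string, so it relies on both having the same length. If the input may contain decomposed characters (a base letter followed by a separate combining mark), one possible variant is to record, for each character of the stripped text, the index it came from. This is only a sketch: strip_accents_with_map and sub_ignore_accents are hypothetical helper names, not part of re, and the replacement is treated as a literal string.

```python
import re
import unicodedata

def strip_accents_with_map(text):
    """Return (stripped, index_map). index_map[i] is the index in `text` of
    the character that produced stripped[i]; a final entry, len(text), is
    appended so that match.end() can always be mapped back."""
    stripped = []
    index_map = []
    for i, ch in enumerate(text):
        for c in unicodedata.normalize('NFD', ch):
            if unicodedata.category(c) != 'Mn':
                stripped.append(c)
                index_map.append(i)
    index_map.append(len(text))
    return ''.join(stripped), index_map

def sub_ignore_accents(pattern, repl, text):
    """Replace accent-insensitive matches of `pattern` in `text` by `repl`."""
    stripped_text, index_map = strip_accents_with_map(text)
    stripped_pattern, _ = strip_accents_with_map(pattern)
    result = text
    # Work from the last match to the first so earlier indices stay valid
    for m in reversed(list(re.finditer(stripped_pattern, stripped_text))):
        start, end = index_map[m.start()], index_map[m.end()]
        result = result[:start] + repl + result[end:]
    return result

# Decomposed input: 'é' is stored as 'e' followed by U+0301
decomposed = unicodedata.normalize('NFD', "a café in a cafe")
print(sub_ignore_accents(r'a café', 'X', decomposed))
# X in X
```

Because the end of a match is mapped to the index of the next kept character, any combining marks that follow the last matched letter are replaced together with it.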

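If the pattern is a word or a set of words

You could:

  • remove the accents from your pattern words and save them in a set for fast lookup
  • look for every word in your text with \w+
  • remove the accents from the word:
    • if it matches, replace it with X
    • if it doesn't match, leave the word untouched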
    import re
    from unidecode import unidecode
    
    original_text = "I'm drinking a café in a cafe with Chloë."
    
    def remove_accents(string):
        return unidecode(string)
    
    accented_words = ['café', 'français']
    
    words_to_remove = set(remove_accents(word) for word in accented_words)
    
    def remove_words(matchobj):
        word = matchobj.group(0)
        if remove_accents(word) in words_to_remove:
            return 'X'
        else:
            return word
    
    print(re.sub(r'\w+', remove_words, original_text))
    # I'm drinking a X in a X with Chloë.
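
The same word-based idea can also be sketched with only the standard library, swapping unidecode for the unicodedata-based remove_accents so that symbols such as '°' are left alone. The function name replace_words and its repl parameter are assumptions made for this example, and the text is assumed to use precomposed accented letters so that \w+ captures each word whole.

```python
import re
import unicodedata

def remove_accents(s):
    # Decompose, then drop combining marks (category 'Mn')
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

def replace_words(text, accented_words, repl='X'):
    # Accent-free target words, stored in a set for fast lookup
    targets = {remove_accents(w) for w in accented_words}

    def swap(matchobj):
        word = matchobj.group(0)
        # Compare the accent-free form of each word against the targets
        return repl if remove_accents(word) in targets else word

    return re.sub(r'\w+', swap, text)

print(replace_words("I'm drinking a 80° café in a cafe with Chloë.",
                    ['café', 'français']))
# I'm drinking a 80° X in a X with Chloë.
```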
    
