Regex: Match a specific pattern, exclude if match is in a specific context

Question

I am a beginner with regex and would like to know how to solve the following problem. I am currently preprocessing German text. The German alphabet has a few specific characters (ä, ö, ü), but those letters can also be written as digraphs (ae, oe, ue). So I simply used the replace method, which worked fine.

import pandas as pd
df = pd.DataFrame({"text": ["Uebergang", "euer"]})
df["text"] = df["text"].str.replace("ae", "ä")
df["text"] = df["text"].str.replace("Ae", "Ä")
df["text"] = df["text"].str.replace("oe", "ö")
df["text"] = df["text"].str.replace("Oe", "Ö")
df["text"] = df["text"].str.replace("ue", "ü")
df["text"] = df["text"].str.replace("Ue", "Ü")

But there are also specific patterns where the replacement shouldn't take place, as in the word "euer". With some help from this post, I tried to build a working regular expression: Regex Pattern to Match, Excluding when... / Except between

df["text"] = df["text"].str.replace("[AaÄäEe]ue|(ue)", "ü", regex=True)

So if "ue" is preceded by any of the characters in the brackets [AaÄäEe], I would like to exclude that case. Otherwise "ue" should be replaced by "ü". But this doesn't work, so how do you do it? Thanks in advance.

Recommended answer

You can use

import re
import pandas as pd
dct = {'ae' : 'ä', 'Ae' : 'Ä', 'oe' : 'ö', 'Oe' : 'Ö', 'ue' : 'ü', 'Ue' : 'Ü'}
df = pd.DataFrame({"text": ["Uebergang", "euer"]})
# regex=True is needed in pandas 2.0+, where str.replace defaults to regex=False
df['text'].str.replace(r'[AaÄäEe]ue|([aouAOU]e)', lambda x: dct[x.group(1)] if x.group(1) else x.group(), regex=True)
# => 0    Übergang
#    1        euer
#    Name: text, dtype: object

The [AaÄäEe]ue|([aouAOU]e) pattern matches:

  • [AaÄäEe]ue - A, a, Ä, ä, E or e followed by the ue substring
  • | - or
  • ([aouAOU]e) - Group 1: a, o, u, A, O or U and then e
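To see which alternative fires for a given word, the matches can be inspected directly with the standard `re` module. This is only an illustrative sketch; the helper name `describe_matches` is made up for this example and is not part of the original answer.

```python
import re

# Same pattern as in the answer: the first alternative "consumes" the
# excluded context, the second one captures a digraph into Group 1.
PATTERN = re.compile(r'[AaÄäEe]ue|([aouAOU]e)')

def describe_matches(word):
    """Return (full match, Group 1) pairs for every match in `word`.

    Hypothetical helper for illustration only. Group 1 is None when the
    excluded-context alternative matched.
    """
    return [(m.group(), m.group(1)) for m in PATTERN.finditer(word)]

for word in ["Uebergang", "euer"]:
    print(word, describe_matches(word))
```

In "euer" the whole "eue" is matched by the first alternative, so Group 1 stays empty and the "ue" inside it is never seen by the second alternative.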

The lambda x: dct[x.group(1)] if x.group(1) else x.group() lambda expression does the following: if Group 1 matched, dct[x.group(1)] returns the replacement string. Otherwise, the whole match is put back unchanged.
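The same callback logic can be tried without pandas, using `re.sub` with a named function instead of a lambda. This is a minimal sketch under the same pattern and mapping as above; the names `UMLAUTS`, `PATTERN` and `restore_umlauts` are chosen for this example only.

```python
import re

# Digraph-to-umlaut mapping, same content as `dct` in the answer.
UMLAUTS = {'ae': 'ä', 'Ae': 'Ä', 'oe': 'ö', 'Oe': 'Ö', 'ue': 'ü', 'Ue': 'Ü'}
PATTERN = re.compile(r'[AaÄäEe]ue|([aouAOU]e)')

def restore_umlauts(text):
    """Replace digraphs with umlauts, skipping "ue" after A, a, Ä, ä, E or e.

    Hypothetical helper for illustration only.
    """
    def _repl(match):
        digraph = match.group(1)
        # Group 1 matched: look up the umlaut. Otherwise the excluded
        # context matched, so return the match unchanged.
        return UMLAUTS[digraph] if digraph else match.group()
    return PATTERN.sub(_repl, text)

print(restore_umlauts("Uebergang"))  # Übergang
print(restore_umlauts("euer"))       # euer
```

A function like this could also be passed as the replacement callable to `Series.str.replace(..., regex=True)` in place of the lambda.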
