Extracting matches with the original case used in the pattern during a case-insensitive search


Problem description

A regex match returns the text that was matched. What if I want the part of the pattern that matched instead, spelled the way I wrote it in the pattern?

See the below example:

>>> import re
>>> r = re.compile('ERP|Gap', re.I)
>>> string = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'
>>> r.findall(string)
['ERP', 'GAP', 'erp', 'ErP']

But I want the output to look like this: ['ERP', 'Gap', 'ERP', 'ERP']

If I group the raw matches and count them, each spelling gets its own row in the resulting dataframe:

ERP 1
erp 1
ErP 1
GAP 1
gap 1

But what if I want the output to look like

ERP 3
Gap 2

in line with the keywords I am searching for?

MORE CONTEXT

I have a keyword list like this: ['ERP', 'Gap']. I have a string like this: "ERP, erp, ErP, GAP, gap"

I want to count the number of times each keyword appears in the string. With plain pattern matching, I get the following output: [ERP, erp, ErP, GAP, gap].

If I then aggregate and count, I get the following dataframe:

ERP 1
erp 1
ErP 1
GAP 1
gap 1

While I want the output to look like this:

ERP 3
Gap 2

Solution

You can build the pattern dynamically, giving each search word its own named group whose name encodes the word's index, and then map whichever group matched back to the original word:

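import re

words = ["ERP", "Gap"]
# Map each group name (g0, g1, ...) back to the keyword as originally written
words_dict = {f'g{i}': item for i, item in enumerate(words)}

# One named group per keyword; re.escape keeps keywords containing regex metacharacters literal
rx = rf"\b(?:{'|'.join([rf'(?P<g{i}>{re.escape(item)})' for i, item in enumerate(words)])})\b"

text = 'ERP is integral part of GAP, so erp can never be ignored, ErP!'

results = []
for match in re.finditer(rx, text, flags=re.IGNORECASE):
    # Exactly one group has a value per match; look up its original spelling
    results.append([words_dict.get(key) for key, value in match.groupdict().items() if value][0])

print(results)  # => ['ERP', 'Gap', 'ERP', 'ERP']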


The pattern will look like \b(?:(?P<g0>ERP)|(?P<g1>Gap))\b:

  • \b - a word boundary
  • (?: - start of a non-capturing group encapsulating pattern parts:
    • (?P<g0>ERP) - Group "g0": ERP
    • | - or
    • (?P<g1>Gap) - Group "g1": Gap
  • ) - end of the group
  • \b - a word boundary.
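To make the breakdown above concrete, here is a minimal sketch that builds the pattern from the keyword list and prints what `groupdict()` returns for each match. The variable names (`parts`, `pattern`, `hits`) are illustrative, and `re.escape` is added here as a precaution that is not part of the original answer; it makes no difference for these two keywords.

```python
import re

words = ["ERP", "Gap"]

# One named group per keyword: g0 wraps "ERP", g1 wraps "Gap".
parts = [f"(?P<g{i}>{re.escape(w)})" for i, w in enumerate(words)]
pattern = r"\b(?:" + "|".join(parts) + r")\b"
print(pattern)

text = "ERP is integral part of GAP, so erp can never be ignored, ErP!"

# For every match, the group that took part holds the matched text
# and the other group is None.
hits = [m.groupdict() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]
for hit in hits:
    print(hit)
```

The second dictionary printed is the interesting one: `g1` holds 'GAP' as it appears in the text, while the group name `g1` tells you the keyword was written as "Gap" in the pattern.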


Note that taking [0] from [words_dict.get(key) for key,value in match.groupdict().items() if value] is safe in all cases: whenever there is a match, exactly one of the named groups has matched, so the list always contains exactly one item.
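Since the end goal in the question is a count per keyword, the same idea can be taken one step further. The sketch below is a variation, not the code from the answer: because only one named group matches at a time, `match.lastgroup` already holds that group's name, so it can replace the list comprehension, and `collections.Counter` does the aggregation. The helper name `count_keywords` is hypothetical, and `re.escape` is an added precaution.

```python
import re
from collections import Counter

def count_keywords(words, text):
    """Count case-insensitive hits, keyed by each keyword's original spelling."""
    names = {f"g{i}": w for i, w in enumerate(words)}
    parts = [f"(?P<{name}>{re.escape(w)})" for name, w in names.items()]
    pattern = r"\b(?:" + "|".join(parts) + r")\b"
    # lastgroup is the name of the single named group that matched.
    return Counter(
        names[m.lastgroup]
        for m in re.finditer(pattern, text, flags=re.IGNORECASE)
    )

counts = count_keywords(["ERP", "Gap"], "ERP, erp, ErP, GAP, gap")
for word, n in counts.items():
    print(word, n)
```

For the string from the question this gives 3 for ERP and 2 for Gap. If a dataframe is needed, the counter can be turned into one afterwards, for example from `counts.items()`.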
