Get number present after a particular pattern of a matching string in Python
Problem description
I want to get all the numbers that follow a keyword matching one of the strings in a provided list ("my_list" in the code). The number may be digits only (for example '0012--22') or digits with some letters (for example 'RF332'). The keyword and its number are separated by a space or two. A sample input file is provided for reference.
Here is the input file:
$cat input_file
some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
abc
123
456
something blah blah Ref.:
tramite 1234567
Ref.:
some junk Expedien N° 18-00777 # some new content
some text Expedien N°18-0022995 # some garbled content
The script so far is attached below. It currently identifies only one element, {'tramite': '1234567'}:
import re
import glob
import os

my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No', 'Expedien N°', 'Exp.No', 'Expedien']

# open the file as input
with open('garb.txt', 'r') as infile:
    res = dict()
    for line in infile:
        # split on whitespace, dropping an optional colon before it
        elems = re.split(r'(?::)?\s+', line)
        #print(elems)
        if len(elems) >= 2:
            contains = False
            tmp = ''
            for elem in elems:
                if contains:
                    # the element right after a keyword is taken as its number
                    res.update({tmp: elem})
                    print(res)
                    contains = False
                    break
                if elem in my_list:
                    contains = True
                    tmp = elem
    #print(res)
Here is the expected output (sample):
{'Expedien N°': '18-0022995'}
{'Expedien N°': '18-0022995'}
{'Expedien': '1-21-212-16-26'}
{'Reference' : 'RE9833'}
and so on.
Recommended answer
You can use
(?<!\w)(your|escaped|keywords|here)\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)
See the regex demo.
Pattern details
(?<!\w) - left word boundary (unambiguous; the meaning of \b is context dependent: if the next char is a non-word char, \b requires a word char on the left, which is not what users usually expect)
(your|escaped|keywords|here) - Capturing group 1: your list of keywords, which can be built easily with '|'.join(map(re.escape, my_list)) (note that re.escape is necessary to escape special regex metacharacters like ., +, (, [, etc.)
\W* - 0 or more non-word chars (chars other than letters, digits or _)
([A-Z]*\d+(?:-+[A-Z]*\d+)*) - Capturing group 2:
    [A-Z]* - zero or more uppercase ASCII letters
    \d+ - 1 or more digits
    (?:-+[A-Z]*\d+)* - 0 or more repetitions of
        -+ - one or more hyphens
        [A-Z]*\d+ - zero or more uppercase ASCII letters, then 1 or more digits
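The two points in the pattern details that are easiest to get wrong are the escaping of the keywords and the choice of (?<!\w) over \b. A minimal sketch of both follows. The test strings and the '#ref' keyword are made up for illustration and do not come from the question.

```python
import re

my_list = ['Ref.:', 'Exp.No', 'Expedien N°', 'Expedien']

# Without re.escape, the '.' in 'Exp.No' matches any character.
unescaped = '|'.join(my_list)
escaped = '|'.join(map(re.escape, my_list))

loose = bool(re.search(unescaped, 'ExpXNo 12'))   # True: '.' matched 'X'
strict = bool(re.search(escaped, 'ExpXNo 12'))    # False: a literal '.' is required
print(loose, strict)

# (?<!\w) rejects a keyword that is glued to a preceding word char.
rx = re.compile(r'(?<!\w)({})'.format(escaped))
found = rx.findall('PreExpedien 1, Expedien 2')
print(found)

# Hypothetical keyword that starts with a non-word char: \b needs a word
# char on its left here, while (?<!\w) only forbids one.
kw = re.escape('#ref')
with_b = re.findall(r'\b' + kw, 'see #ref 12')
with_lookbehind = re.findall(r'(?<!\w)' + kw, 'see #ref 12')
print(with_b, with_lookbehind)
```

With \b the hypothetical '#ref' keyword is never found after a space, which is the context-dependent behavior the answer warns about.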
See the Python demo:
import re

s = """your_text_here"""
my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No', 'Expedien N°', 'Exp.No', 'Expedien']
# build the keyword alternation from the escaped keywords
rx = r'(?<!\w)({})\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)'.format('|'.join(map(re.escape, my_list)))
print(re.findall(rx, s))
Output:
[('Expedien', '1-21-212-16-26'), ('Reference', 'RE9833'), ('tramite', '1234567'), ('Expedien N°', '18-00777'), ('Expedien N°', '18-0022995')]
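The question asks for one dict per match rather than a list of tuples. A possible way to get that shape from re.findall is sketched below, using a shortened copy of the question's input as the sample text. Sorting the keywords longest first is an extra precaution assumed here, not part of the answer; the names sample, keywords and results are illustrative.

```python
import re

# Shortened copy of the input file from the question.
sample = """some text before Expedien: 1-21-212-16-26 some random text
Reference RE9833 of all sentences.
something blah blah Ref.:
tramite 1234567
some junk Expedien N° 18-00777 # some new content
"""

my_list = ['Ref.:', 'Reference', 'tramite', 'Expediente', 'Expediente No',
           'Expedien N°', 'Exp.No', 'Expedien']

# Assumption: try longer keywords first so a keyword is preferred
# over any shorter keyword that is its prefix.
keywords = sorted(my_list, key=len, reverse=True)
rx = re.compile(r'(?<!\w)({})\W*([A-Z]*\d+(?:-+[A-Z]*\d+)*)'.format(
    '|'.join(map(re.escape, keywords))))

# One single-entry dict per (keyword, number) pair, in order of appearance.
results = [{key: value} for key, value in rx.findall(sample)]
for item in results:
    print(item)
```

A list of single-entry dicts is used instead of one merged dict because a keyword such as 'Expedien N°' can occur several times, and a merged dict would keep only the last value.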