Python regex: find substrings centered on numbers


Problem description

I have a string. I want to cut it into substrings, each consisting of a number-containing word surrounded by (up to) 4 words on either side. If the substrings overlap, they should be combined.

import re

Sampletext = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated."
# My attempt, which does not produce the desired output:
re.findall(r'(\s[*\s]){1,4}\d(\s[*\s]){1,4}', Sampletext)
desired_output = ['the way I know 54 how to take praise', 'to take praise for 65 excellent questions 34 thank you for asking']

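Recommended answer

Overlapping matches: use lookaheads

Wrap the whole pattern in a lookahead with a capturing group, so that matches are allowed to overlap: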

import re

subject = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated."
# The lookahead consumes nothing; group 1 holds the captured window.
for match in re.finditer(r"(?=((?:\b\w+\b ){4}\d+(?: \b\w+\b){4}))", subject):
    print(match.group(1))
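
To see why the lookahead matters, here is a minimal sketch that runs the same window pattern twice, once as an ordinary consuming match and once inside a lookahead. The variable names (`window`, `consuming`, `overlapping`) are illustrative and not part of the original answer.

```python
import re

subject = ("by the way I know 54 how to take praise for 65 "
           "excellent questions 34 thank you for asking appreciated.")
window = r"(?:\b\w+\b ){4}\d+(?: \b\w+\b){4}"

# Consuming match: the scan resumes after each match, so a window that
# starts inside an earlier match is skipped.
consuming = re.findall(window, subject)

# Zero-width lookahead with a capture group: nothing is consumed, so the
# pattern is retried at every position and overlapping windows are kept.
overlapping = re.findall(rf"(?=({window}))", subject)

print(len(consuming), len(overlapping))  # prints: 2 3
```

The consuming version misses the window around 65, because the words "to take praise" were already used up by the window around 54.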

What is a word?

The output depends on how you define a word. In the pattern above a word is \w+, which also matches digits, so a number such as 65 can count as one of the four surrounding words. This produces the following output.

Output (digits allowed in words)

the way I know 54 how to take praise
to take praise for 65 excellent questions 34 thank
for 65 excellent questions 34 thank you for asking

Option 2: no digits in words

import re

subject = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated."
# Words are now letters only, so a number cannot count as a surrounding word.
for match in re.finditer(r"(?=((?:\b[a-z]+\b ){4}\d+(?: \b[a-z]+\b){4}))", subject, re.IGNORECASE):
    print(match.group(1))

Output 2

the way I know 54 how to take praise
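
The drop from three matches to one comes entirely from the word definition. The following small sketch (the token list and the `classified` dictionary are made up for illustration) checks which tokens each definition accepts:

```python
import re

tokens = ["praise", "65", "a1"]
classified = {}
for tok in tokens:
    # \w covers letters, digits and underscore; [a-z] with IGNORECASE
    # covers ASCII letters only.
    as_w = bool(re.fullmatch(r"\w+", tok))
    as_letters = bool(re.fullmatch(r"[a-z]+", tok, re.IGNORECASE))
    classified[tok] = (as_w, as_letters)
    print(f"{tok!r}: \\w+ -> {as_w}, [a-z]+ -> {as_letters}")
```

With option 2, "65" and "34" are no longer valid surrounding words, so the only window with four letter-only words on each side of a number is the one around 54.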

Option 3: extend to four uninterrupted non-digit words

Based on your comments, this option extends to the left and right of the pivot number until four uninterrupted non-digit words have been matched on each side. Commas between words are allowed as separators.

import re

subject = "by the way I know 54 how to take praise for 65 excellent questions 34 thank you for asking appreciated. One Two Three Four 55 Extend 66 a b c d AA BB CC DD 71 HH DD, JJ FF"
# Four letter-only words, one or more numbers close together, then four letter-only words.
for match in re.finditer(r"(?=((?:\b[a-z]+[ ,]+){4}(?:\d+ (?:[a-z]+ ){1,3}?)*?\d+.*?(?:[ ,]+[a-z]+){4}))", subject, re.IGNORECASE):
    print(match.group(1))

Output 3

the way I know 54 how to take praise
to take praise for 65 excellent questions 34 thank you for asking
One Two Three Four 55 Extend 66 a b c d
AA BB CC DD 71 HH DD, JJ FF
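
As an alternative to the option 3 regex, the same merging behaviour can be sketched without regular expressions by working on word indices. This is not the answer's method: the function name `number_windows` and its `width` parameter are hypothetical, and the sketch assumes that words are separated by whitespace and that two numbers belong to the same window when the second lies within `width` words of the first.

```python
def number_windows(text, width=4):
    """Return windows of up to `width` words around digit-containing words."""
    words = text.split()
    # Indices of words that contain at least one digit.
    pivots = [i for i, w in enumerate(words) if any(ch.isdigit() for ch in w)]

    # Group pivots that sit within `width` words of the previous pivot.
    groups = []
    for i in pivots:
        if groups and i - groups[-1][-1] <= width:
            groups[-1].append(i)
        else:
            groups.append([i])

    # Extend each group by `width` words on both sides, clamped to the text.
    result = []
    for g in groups:
        start = max(g[0] - width, 0)
        end = min(g[-1] + width, len(words) - 1)
        result.append(" ".join(words[start:end + 1]))
    return result


sample = ("by the way I know 54 how to take praise for 65 "
          "excellent questions 34 thank you for asking appreciated.")
for window in number_windows(sample):
    print(window)
```

Unlike the regex, this version clamps at the start and end of the text, so it also returns windows with fewer than four words on one side, which is closer to the "up to 4" wording of the question. Punctuation attached to a word is kept as part of that word.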
