Regex: Why do empty strings get included (in a list of tuples) in re.findall()?

Question

According to the pattern match here, the matches are 213.239.250.131 and 014.10.26.06.

Yet when I run the generated Python code and print out the value of re.findall(p, test_str), I get:

[('', '', '213.239.250.131'), ('', '', '014.10.26.06')]

I could hack around the list and its tuples to get the values I'm looking for (the IP addresses), but (i) they might not always be in the same position in the tuples and (ii) I'd rather understand what's going on here so I can either tighten up the regex, or extract only IP addresses using Python's own re functionality.

Why do I get this list of tuples, why the apparent whitespace matches, and how do we ensure that only the IP addresses are returned?

Solution

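When a pattern contains capturing groups, re.findall returns the group contents instead of the whole match, and with more than one group each result is a tuple with one entry per group. A group that did not participate in the match is reported as an empty string (not whitespace). You have 3 capturing groups, so every result is a 3-tuple, and the IP address sits in whichever group actually matched.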
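The behaviour can be reproduced with a much smaller pattern. The sketch below is only an illustration: the pattern, its `foo` and `bar` alternatives, and the sample text are made up for the demonstration and are not the regex from the question.

```python
import re

text = "example.com. [213.239.250.131]"

# Hypothetical pattern: three alternatives, each in its own capturing group.
# Only the third alternative can match in this text.
grouped = re.compile(r'(foo)|(bar)|(\d{1,3}(?:\.\d{1,3}){3})')

# The same alternatives with non-capturing groups.
ungrouped = re.compile(r'(?:foo)|(?:bar)|(?:\d{1,3}(?:\.\d{1,3}){3})')

with_groups = grouped.findall(text)
without_groups = ungrouped.findall(text)

print(with_groups)     # [('', '', '213.239.250.131')]
print(without_groups)  # ['213.239.250.131']
```

The two empty strings in each tuple are the groups for `foo` and `bar`, which took no part in the match.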

On regex101.com, you can see these non-participating groups by turning them on in Options.

You may tighten up your regex by removing capturing groups:

(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}

Or even (?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}(?:\.\d{1,3}){3}.

See a regex demo

And since the regex pattern does not contain capturing groups, re.findall will only return matches, not capturing group contents:

import re
p = re.compile(r'(?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4}|\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
test_str = "from mail.example.com (example.com. [213.239.250.131]) by\n mx.google.com with ESMTPS id xc4si15480310lbb.82.2014.10.26.06.16.58 for\n <alex@example.com> (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256\n bits=128/128); Sun, 26 Oct 2014 06:16:58 -0700 (PDT)"
print(re.findall(p, test_str))

Output of the online Python demo:

['213.239.250.131', '014.10.26.06']
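If the capturing groups have to stay in the pattern for some other reason, one alternative is to iterate with re.finditer and take group(0), which is always the whole match regardless of how many groups exist. This is a sketch under assumptions: the grouped pattern below is a guess at what a capturing version might look like, not the original regex from the question, and the test string is shortened.

```python
import re

# Assumed pattern: the answer's regex with each alternative wrapped in a capturing group.
p = re.compile(r'((?:[a-z0-9]{1,4}:+){3,5}[a-z0-9]{1,4})|(\d{1,3}(?:\.\d{1,3}){3})')

test_str = ("from mail.example.com (example.com. [213.239.250.131]) by\n"
            " mx.google.com with ESMTPS id xc4si15480310lbb.82.2014.10.26.06.16.58")

# group(0) is the full match, so the position of the matching group does not matter.
matches = [m.group(0) for m in p.finditer(test_str)]

print(matches)  # ['213.239.250.131', '014.10.26.06']
```

This avoids picking values out of the tuples by position, which was concern (i) in the question.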
