Extract string with Python re.match

Question

import re
str="x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"

str2=re.match("[a-zA-Z]*//([a-zA-Z]*)",str)
print(str2.group())

current result=> error
expected => wwwqqqzzz

I want to extract the string wwwqqqzzz. How do I do that?

There may be many dots, for example:

"whatever..s#$@.d.:af//wwww.xxx.yn.zsdfsd.asfds.f.ds.fsd.whatever/123.dfiid"

In this case, I basically want the stuff bounded by // and /. How do I achieve that?

Another question:

import re
str="xxx.yyy.xxx:80"

m = re.search(r"([^:]*)", str)
str2=m.group(0)
print(str2)
str2=m.group(1)
print(str2)

It seems that m.group(0) and m.group(1) are the same.

Recommended answer

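match only tries to match at the beginning of the string. Your string does not start with something the pattern can match, so match returns None, and calling group() on None raises the error. Use search instead. The following pattern would then match your requirements: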

m = re.search(r"//([^/]*)", str)
print(m.group(1))

Basically, we look for //, then consume as many non-slash characters as possible. Those non-slash characters are captured in group number 1.
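To make the difference between match and search concrete, here is a minimal Python 3 sketch using the string from the question. The variable names (url, anchored, found, host) are mine, not from the original post, and the None check is an extra precaution I added.

```python
import re

url = "x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"

# re.match only tries the pattern at position 0, so it finds nothing here.
anchored = re.match(r"//([^/]*)", url)
print(anchored)  # None

# re.search scans forward until the pattern matches somewhere in the string.
found = re.search(r"//([^/]*)", url)

# Guard against None so a string without "//" does not raise AttributeError.
host = found.group(1) if found else None
print(host)  # www.qqq.zzz
```

Checking the result for None before calling group() avoids the kind of error shown in the question.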

In fact, there is a slightly more advanced technique that does the same but does not require a capturing group (capturing can add a little overhead). It uses a so-called lookbehind:

m = re.search(r"(?<=//)[^/]*", str)
print(m.group())

Lookarounds are not included in the actual match, so m.group() already returns only the text after //.
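The following Python 3 sketch compares the two patterns on the longer example from the question, and also touches on the follow-up about m.group(0) and m.group(1). group(0) is always the whole match, while group(1) is only what the first pair of parentheses captured. In r"([^:]*)" the parentheses wrap the entire pattern, so the two are identical. The variable names are illustrative only.

```python
import re

url = "whatever..s#$@.d.:af//wwww.xxx.yn.zsdfsd.asfds.f.ds.fsd.whatever/123.dfiid"

captured = re.search(r"//([^/]*)", url)
lookbehind = re.search(r"(?<=//)[^/]*", url)

# With a capturing group, the whole match still contains the leading "//".
print(captured.group(0))    # //wwww.xxx.yn.zsdfsd.asfds.f.ds.fsd.whatever
print(captured.group(1))    # wwww.xxx.yn.zsdfsd.asfds.f.ds.fsd.whatever

# With a lookbehind, "//" is checked but not consumed.
print(lookbehind.group(0))  # wwww.xxx.yn.zsdfsd.asfds.f.ds.fsd.whatever

# When the parentheses enclose the entire pattern, group(0) equals group(1).
whole = re.search(r"([^:]*)", "xxx.yyy.xxx:80")
print(whole.group(0) == whole.group(1))  # True
```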

This (or any other reasonable regex solution) will not remove the dots in the same step. But that can easily be done in a second step:

m = re.search(r"(?<=//)[^/]*", str)
host = m.group()
cleanedHost = host.replace(".", "")

That does not even require regular expressions.

Of course, if you want to remove everything except letters and digits (e.g. to turn www.regular-expressions.info into wwwregularexpressionsinfo), you are better off using the regex version of replace, re.sub:

cleanedHost = re.sub(r"[^a-zA-Z0-9]+", "", host)
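Putting the two steps together, a small helper might look like the sketch below (Python 3). The function name extract_host_token and its alnum_only flag are hypothetical names I made up for illustration, and the second sample URL is invented for the demo. Returning None when no // is found is also my assumption about what the caller would want.

```python
import re

def extract_host_token(text, alnum_only=False):
    """Return the text between '//' and the next '/', with separators removed.

    Hypothetical helper: returns None when the input contains no '//'.
    """
    m = re.search(r"(?<=//)[^/]*", text)
    if m is None:
        return None
    host = m.group()
    if alnum_only:
        # Drop everything that is not an ASCII letter or digit.
        return re.sub(r"[^a-zA-Z0-9]+", "", host)
    # Plain string replacement is enough when only dots must go.
    return host.replace(".", "")

print(extract_host_token("x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"))
# wwwqqqzzz
print(extract_host_token("http://www.regular-expressions.info/lookaround.html",
                         alnum_only=True))
# wwwregularexpressionsinfo
print(extract_host_token("no slashes here"))
# None
```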
