Extract string with Python re.match
Problem description
import re
str="x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"
str2=re.match("[a-zA-Z]*//([a-zA-Z]*)",str)
print(str2.group())
current result=> error
expected => wwwqqqzzz
I want to extract the string wwwqqqzzz. How do I do that?
There may be many dots, for example:
"whatever..s#$@.d.:af//wwww.xxx.yn.zsdfsd.asfds.f.ds.fsd.whatever/123.dfiid"
In this case, I basically want the stuff bounded by // and /. How do I achieve that?
One more question:
import re
str="xxx.yyy.xxx:80"
m = re.search(r"([^:]*)", str)
str2=m.group(0)
print(str2)
str2=m.group(1)
print(str2)
It seems that m.group(0) and m.group(1) are the same.
Recommended answer
re.match only tries to match at the beginning of the string, and your pattern cannot match there. Use re.search instead, which scans the string for the first position where the pattern matches. The following pattern would then meet your requirements:
m = re.search(r"//([^/]*)", str)
print(m.group(1))
Basically, we look for //, then consume as many non-slash characters as possible. Those non-slash characters are captured in group 1.
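As a minimal sketch of the difference between match and search, and of what the group numbers refer to, consider the following. The variable names (url, anchored, found and so on) are illustrative only and do not come from the question. It also touches on the second question: when the parentheses wrap the whole pattern, group 0 (the entire match) and group 1 (the first parenthesized part) are necessarily the same text.

```python
import re

url = "x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi"

# re.match only tries at position 0, where "//" does not occur, so it returns None.
anchored = re.match(r"//([^/]*)", url)

# re.search scans forward and stops at the first "//" it finds.
found = re.search(r"//([^/]*)", url)

whole_match = found.group(0)  # everything the pattern matched, including "//"
host = found.group(1)         # only the part inside the parentheses

# Here the parentheses wrap the entire pattern, so group 0 and group 1 coincide.
port_split = re.search(r"([^:]*)", "xxx.yyy.xxx:80")
same = port_split.group(0) == port_split.group(1)

print(anchored)
print(whole_match)
print(host)
print(same)
```

Because match can return None, calling .group() on its result directly is the likely source of the error in the question.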
There is also a slightly more advanced technique that gives the same result without a capturing group (capturing can add a little overhead). It uses a so-called lookbehind:
m = re.search(r"(?<=//)[^/]*", str)
print(m.group())
Lookarounds are not included in the actual match, hence the desired result.
This (or any other reasonable regex solution) will not remove the dots right away. But that can easily be done in a second step:
m = re.search(r"(?<=//)[^/]*", str)
host = m.group()
cleanedHost = host.replace(".", "")
That step does not even require regular expressions.
Of course, if you want to remove everything except letters and digits (e.g. to turn www.regular-expressions.info into wwwregularexpressionsinfo), then you are better off using the regex version of replace, re.sub:
cleanedHost = re.sub(r"[^a-zA-Z0-9]+", "", host)
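Putting the steps above together, a minimal sketch might look like the following. The helper names extract_host and strip_non_alnum are hypothetical and not part of the original answer, and returning None when no // is present is an assumption about how a missing match should be handled.

```python
import re

def extract_host(text):
    """Hypothetical helper: text between the first '//' and the next '/', or None."""
    m = re.search(r"(?<=//)[^/]*", text)
    # re.search returns None when nothing matches, so guard before calling group().
    return m.group() if m else None

def strip_non_alnum(host):
    """Hypothetical helper: drop every character that is not an ASCII letter or digit."""
    return re.sub(r"[^a-zA-Z0-9]+", "", host)

samples = [
    "x8f8dL:s://www.qqq.zzz/iziv8ds8f8.dafidsao.dsfsi",
    "whatever..s#$@.d.:af//wwww.xxx.yn.zsdfsd.asfds.f.ds.fsd.whatever/123.dfiid",
    "no double slash here",
]

for s in samples:
    host = extract_host(s)
    print(None if host is None else strip_non_alnum(host))
```

For the first sample this yields wwwqqqzzz, which is the result the question asks for.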