Understanding regex pattern used to find string between strings in html

Problem description

I have the following html file:

<!-- <div class="_5ay5"><table class="uiGrid _51mz" cellspacing="0" cellpadding="0"><tbody><tr class="_51mx"><td class="_51m-"><div class="_u3y"><div class="_5asl"><a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" rel="theater">

In order to pull the string of numbers between videos/ and /", I'm using the following method that I found:

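import re

# Read the whole HTML file into a single string
with open('source.html') as f:
    source_file = f.read()

# Lazily capture whatever sits between 'videos/' and the first '/"' after it
result = re.compile('videos/(.*?)/"').search(source_file)
if result:
    print(result.group(1))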

I've tried Googling an explanation for exactly how the (.*?) works in this particular implementation, but I'm still unclear. Could someone explain this to me? Is this what's known as a "non-greedy" match? If yes, what does that mean?

Recommended answer

The ? in this context is a modifier applied to a repetition operator (+, *, or ?). In engines that support it, it makes the repetition lazy (also called non-greedy or reluctant). By default repetition is greedy, which means it matches as much as possible. Most modern Perl-compatible engines therefore offer three types of repetition:

.*  # Match any character zero or more times, as many as possible (greedy)
.*? # Match any character zero or more times, as few as possible (reluctant)
.*+ # Match any character zero or more times and never give characters back (possessive)

More information can be found here: http://www.regular-expressions.info/repeat.html#lazy for reluctant/lazy and here: http://www.regular-expressions.info/possessive.html for possessive (which I'll skip discussing in this answer).

Suppose we have the string aaaa. We can match all of the a's with /(a+)a/. Literally this is

match one or more a's followed by an a

This will match aaaa. The regex is greedy and will match as many a's as possible. The first submatch is aaa.

If we use the regex /(a+?)a/ this is

reluctantly match one or more a's followed by an a
or
match one or more a's until we reach another a

That is, only match what we need. So in this case the match is aa and the first submatch is a. We only need to match one a to satisfy the repetition and then it is followed by an a.
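The two behaviours can be checked directly in Python. This is only a minimal sketch using the standard re module; the variable names are chosen here for illustration and are not part of the question.

```python
import re

text = "aaaa"

# Greedy: a+ first grabs all four a's, then gives one back
# so that the trailing literal a still has something to match.
greedy = re.search(r"(a+)a", text)

# Lazy: a+? takes a single a, and the trailing literal a matches right away.
lazy = re.search(r"(a+?)a", text)

print(greedy.group(0), greedy.group(1))  # aaaa aaa
print(lazy.group(0), lazy.group(1))      # aa a
```

group(0) is the whole match and group(1) is the first submatch, so the output lines up with the two cases described above.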

This comes up a lot when using regex to match within HTML tags, quotes and the like, which is usually reserved for quick and dirty operations. That is to say, using regex to extract from very large and complex HTML strings, or from quoted strings with escape sequences, can cause a lot of problems, but it's perfectly fine for specific use cases. So in your case we have:

/Dev/videos/1610110089242029/

The expression needs to match videos/ followed by zero or more characters followed by /". If there were only one videos URL on the line, and no later /" after it, the greedy version would work just as well as the reluctant one.

But we have

/videos/1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029/"

Without reluctance, the regex will match:

1610110089242029/" ... ajaxify="/Dev/videos/1610110089242029

It tries to match as much as possible, and / and " satisfy . just fine, so the match runs on to the last /" on the line. With reluctance, the match stops at the first /" (strictly speaking, the engine gets there by backtracking, but you can read about that separately). Thus you only get the part of the URL you need.
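To see both results side by side, here is a small sketch that runs the greedy and the reluctant pattern against a shortened copy of the anchor tag from the question. The html string is an abbreviated stand-in for the real file, and the variable names are assumptions made for this example.

```python
import re

# Shortened stand-in for the HTML in the question: two video URLs on one line.
html = ('<a class="_47hq _5asm" href="/Dev/videos/1610110089242029/" '
        'aria-label="Who said it?" ajaxify="/Dev/videos/1610110089242029/" '
        'rel="theater">')

# Greedy: .* runs to the end of the line, then backs up to the LAST /".
greedy = re.search(r'videos/(.*)/"', html)

# Reluctant: .*? grows one character at a time and stops at the FIRST /".
lazy = re.search(r'videos/(.*?)/"', html)

print(lazy.group(1))    # 1610110089242029
print(greedy.group(1))  # spans from the first URL into the second one
```

The greedy capture starts with the first id and ends with the second, swallowing the attributes in between, which is why the pattern in the question uses the reluctant form.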
