Regex to parse links containing specific words


Question

Taking this thread a step further, can someone tell me what the difference is between these two regular expressions? They both seem to accomplish the same thing: pulling a link out of HTML.

Expression 1:

'/(https?://)?(www.)?([a-zA-Z0-9_%]*)\b.[a-z]{2,4}(.[a-z]{2})?((/[a-zA-Z0-9_%])+)?(.[a-z])?/'

Expression 2:

'/<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>/si'

Which one would be better to use? And how could I modify one of these expressions so that it matches only links containing certain words and ignores any link that does not contain them?

Thanks.

Recommended answer

The difference is that expression 1 looks for valid, complete URIs, following the specification. It therefore finds every full URL that appears anywhere in the code. That is not the same as getting all links: it does not match relative URLs, which are used very often, and it matches every URL, not only the ones that are link targets.
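To make the point about full URLs concrete, here is a minimal Python sketch. The pattern `ABSOLUTE_URL` is a simplified stand-in chosen for illustration, not expression 1 itself, and the sample markup is hypothetical. It shows the two effects described above: the relative link target is missed, while a URL that is not a link target (an image source) is picked up.

```python
import re

# Simplified stand-in for a "full URL" pattern (an assumption for this
# sketch, NOT expression 1 from the question).
ABSOLUTE_URL = re.compile(r'https?://[^\s"\'<>]+', re.I)

# Hypothetical sample markup: one relative link, one image, one absolute link.
html = (
    '<a href="/about">About</a> '
    '<img src="http://example.com/logo.png"> '
    '<a href="http://example.com/">Home</a>'
)

urls = ABSOLUTE_URL.findall(html)
print(urls)
# ['http://example.com/logo.png', 'http://example.com/']
```

The relative target /about is absent from the result, and the image URL is present even though it is not a link.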

The second looks for a tags and captures the content of the href attribute, so this one will get you every link. Except for one error* in that expression, it is quite safe to use and works well enough to get every link: it allows for the variations that usually appear, such as whitespace or other attributes.

*The error is that the expression does not look for the closing quote of the href attribute. You should add that, or you might match weird things:

/<a.*?href\s*=\s*["\']([^"\'>]+)["\'][^>]*>.*?<\/a>/si
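As a rough illustration of the "weird things", the sketch below ports both patterns to Python's `re` module, mapping the `/si` modifiers to `re.S | re.I`. The sample markup is hypothetical and deliberately broken: the first href never closes its quote.

```python
import re

FLAGS = re.S | re.I  # Python equivalent of the /si modifiers

# Expression 2 as posted in the question (no closing-quote check).
ORIGINAL = re.compile(r'<a.*?href\s*=\s*["\']([^"\']+)[^>]*>.*?<\/a>', FLAGS)
# Corrected expression from the answer (requires the closing quote).
FIXED = re.compile(r'<a.*?href\s*=\s*["\']([^"\'>]+)["\'][^>]*>.*?<\/a>', FLAGS)

# Hypothetical malformed markup: the first href is missing its closing quote.
html = '<a href="foo>bar</a> <a href="x">y</a>'

original_hrefs = ORIGINAL.findall(html)
fixed_hrefs = FIXED.findall(html)

print(original_hrefs)  # ['foo>bar</a> <a href=']
print(fixed_hrefs)     # ['x']
```

Without the closing-quote check, the capture runs through the tag boundary until it meets the next quote. With the check, the malformed link is skipped and only the well-formed href is returned.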

Edit in response to the comments:

To look for word inside the link URL, use:

/<a.*?href\s*=\s*["\']([^"\'>]*word[^"\'>]*)["\'][^>]*>.*?<\/a>/si
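A possible way to use this pattern from Python, assuming the word is supplied at run time. The helper name `links_with_word_in_url` and the sample markup are hypothetical. `re.escape` is used so that a word containing regex metacharacters is matched literally.

```python
import re

def links_with_word_in_url(html, word):
    """Return the href values of <a> tags whose URL contains `word`."""
    w = re.escape(word)  # treat the word as literal text
    pattern = (
        r'<a.*?href\s*=\s*["\']([^"\'>]*' + w + r'[^"\'>]*)["\'][^>]*>.*?<\/a>'
    )
    # re.S | re.I mirrors the /si modifiers of the original expression.
    return re.findall(pattern, html, re.S | re.I)

# Hypothetical sample markup: the second link has the word only in its text.
html = (
    '<a href="/blog/regex-tips">Tips</a> '
    '<a href="/about">About regex</a> '
    '<a class="x" href=\'http://example.com/regex\'>Home</a>'
)

matches = links_with_word_in_url(html, 'regex')
print(matches)
# ['/blog/regex-tips', 'http://example.com/regex']
```

The /about link is not returned, because the word appears only in its text and not in its URL.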

To look for word inside the link text, use:

/<a.*?href\s*=\s*["\']([^"\'>]+)["\'][^>]*>.*?word.*?<\/a>/si
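One behaviour worth checking before relying on this pattern: because `.` also matches across tags under the `s` modifier, `.*?word` can run past the `</a>` of a link that lacks the word and stop in a later link. The sketch below shows this on hypothetical markup, next to a stricter variant. The variant is an assumption of this example and not part of the answer: it uses a negative lookahead so the link text cannot cross a `</a>`. Both helper names are hypothetical.

```python
import re

FLAGS = re.S | re.I  # Python equivalent of the /si modifiers

def links_with_word_in_text(html, word):
    """Pattern from the answer, with the word inserted literally."""
    w = re.escape(word)
    pattern = r'<a.*?href\s*=\s*["\']([^"\'>]+)["\'][^>]*>.*?' + w + r'.*?<\/a>'
    return re.findall(pattern, html, FLAGS)

def links_with_word_in_text_strict(html, word):
    """Variant (not from the answer): link text may not span a closing </a>."""
    w = re.escape(word)
    inner = r'(?:(?!<\/a>).)*?'  # any characters, but never across </a>
    pattern = (
        r'<a.*?href\s*=\s*["\']([^"\'>]+)["\'][^>]*>'
        + inner + w + inner + r'<\/a>'
    )
    return re.findall(pattern, html, FLAGS)

# Hypothetical sample markup: only the second link's text contains the word.
html = '<a href="/a">foo</a> <a href="/b">some word here</a>'

loose = links_with_word_in_text(html, 'word')
strict = links_with_word_in_text_strict(html, 'word')

print(loose)   # ['/a']
print(strict)  # ['/b']
```

On this input the answer's pattern reports /a, the link before the one that contains the word, while the stricter variant reports /b. If each input string holds a single link, the two behave the same.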
