Regular expression to find URLs not inside a hyperlink


Problem description



    There are many regexes out there to match a URL. However, I'm trying to match URLs that do not appear anywhere within an <a> hyperlink tag (the href attribute, the inner content, etc.). So NONE of the URLs in these should match:

    <a href="http://www.example.com/">something</a>
    <a href="http://www.example.com/">http://www.example2.com</a>
    <a href="http://www.example.com/"><b>something</b>http://www.example.com/<span>test</span></a>
    

    Any URL outside of <a></a> should be matched.

    One approach I tried was to use a negative lookahead to check whether the first anchor tag after the URL is an opening <a> or a closing </a>. If it is a closing </a>, then the URL must be inside a hyperlink. I think the idea was okay, but the negative lookahead regex didn't work (or, more accurately, I didn't write the regex correctly). Any tips are much appreciated.
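For illustration, the lookahead idea described above can be sketched as follows. This is only one possible reading of the approach, written in Python; the names `URL`, `NOT_IN_ANCHOR`, `OUTSIDE_URL_RE` and `find_urls_lookahead` are hypothetical, and the URL pattern is deliberately simplistic. The lookahead scans forward from the end of a candidate URL, skipping text and any tags that are not anchor tags, and rejects the URL if the first anchor tag it reaches is a closing </a>.

```python
import re

# Simplistic URL pattern (an assumption for this sketch, not a full URL grammar).
URL = r"https?://[^\s<>\"']+"

# After the URL: skip text and non-anchor tags. If the first anchor tag
# reached is a closing </a>, the URL sits inside a hyperlink and is rejected.
NOT_IN_ANCHOR = r"(?![^<]*(?:<(?!/?a\b)[^<]*)*</a>)"

OUTSIDE_URL_RE = re.compile(URL + NOT_IN_ANCHOR, re.IGNORECASE)


def find_urls_lookahead(html):
    """Return URLs whose next anchor tag is not a closing </a>."""
    return OUTSIDE_URL_RE.findall(html)


sample = (
    'go to http://out.example/ or '
    '<a href="http://in.example/"><b>something</b>http://in.example/</a>'
)
print(find_urls_lookahead(sample))
```

The sketch assumes well-formed, non-nested anchors. A plain-text URL that happens to be followed by a stray </a> would be wrongly rejected, which is one reason a multi-step approach can be easier to get right.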

    Solution

    You can do it in two steps instead of trying to come up with a single regular expression:

    1. Strip out (replace with nothing) each HTML anchor element: the opening tag, the content and the closing tag.

    2. Match the URL in what remains.

    In Perl it could be:

    my $curLine = $_; # Work on a copy so that $_ stays intact for other uses.
    # Remove every HTML anchor element: "<a ...>", "</a>" and everything in between.
    # s///gis: substitute globally, case-insensitively, with "." also matching newlines.
    $curLine =~ s/<a\b[^>]*>.*?<\/a>//gis;
    if ( $curLine =~ /https?:\/\// )
    {
      print "Matched a URL outside an HTML anchor: $_\n";
    }
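The same two-step idea can also be sketched in Python, returning the URLs themselves instead of only reporting that one exists. The names `ANCHOR_RE`, `URL_RE` and `urls_outside_anchors` are hypothetical, and the URL pattern is a rough assumption, not a complete URL grammar.

```python
import re

# Step 1 pattern: a whole anchor element, from "<a ...>" to the nearest "</a>".
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a\s*>", re.IGNORECASE | re.DOTALL)

# Step 2 pattern: a rough URL matcher (an assumption for this sketch).
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def urls_outside_anchors(html):
    """Strip anchor elements, then collect the URLs that remain."""
    # Replace with a space so text on either side of an anchor is not fused.
    stripped = ANCHOR_RE.sub(" ", html)
    return URL_RE.findall(stripped)


sample = (
    'see http://out.example/ and '
    '<a href="http://www.example.com/"><b>something</b>'
    'http://www.example.com/<span>test</span></a> '
    'then http://after.example/x'
)
print(urls_outside_anchors(sample))
```

Like the Perl version, this assumes anchors are not nested and are properly closed. For arbitrary real-world HTML, a proper HTML parser is generally more robust than regular expressions.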
    
