Using Regex to search for a string that contains "http://" and does not contain "mysite.com"


Question


How can I write a regular expression to search for a string that contains "http://" AND does not contain "mysite.com"?

Solution

WARNING

Attempting to rope regexes into boolean logic that is best accomplished in a proper programming language is a thankless job. While it is possible to write /PAT1/ and not /PAT2/ using complex lookaheads so that it is just one pattern, it is a painful task. You don’t want to do it this way!

You should have explained what you were really doing in the first place — some sort of match operation in a text editor. You didn’t. So you get a general answer that is going to be challenging to adapt to your localized situation.

Quick Answer

(?sx)                 # let dot cross newlines, enable comments & whitespace
\A                    # anchor at the start so both lookaheads see the whole string
(?= .* http://     )  # lookahead assertion for http://
(?! .* mysite\.com )  # lookahead negation  for mysite.com

Using Perl syntax, you could stick that (pre-)compiled pattern into a variable for future use this way:

my $is_valid_rx = qr{
    \A                    # anchor at the start so both lookaheads see the whole string
    (?= .* http://     )  # lookahead assertion for http://
    (?! .* mysite\.com )  # lookahead negation  for mysite.com
}sx;                      # /s to cross newlines, /x for comments & whitespace

# then later on…
if ($some_string =~ $is_valid_rx) { 
     # your string has an http blah and lacks a mysite blah
}
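
Whether the pattern is anchored matters more than it looks. As a minimal sketch (the sample strings and the names $loose_rx and $strict_rx are made up for illustration), the following compares an unanchored version of the pattern with one anchored by \A. Without the anchor, the regex engine is free to retry from a later offset, so a string in which mysite.com appears before the http:// part can slip through: starting one character in, the text no longer contains a complete "mysite.com".

```perl
use strict;
use warnings;

# Unanchored: the engine may retry the lookaheads from any later offset.
my $loose_rx = qr{
    (?= .* http://     )
    (?! .* mysite\.com )
}sx;

# Anchored with \A: both lookaheads are evaluated once, from the very start.
my $strict_rx = qr{
    \A
    (?= .* http://     )
    (?! .* mysite\.com )
}sx;

# Hypothetical sample strings, invented for this sketch.
my @samples = (
    "see http://example.org/ for details",
    "http://mysite.com/index.html",
    "mysite.com mirrors http://example.org/",
    "no link here",
);

for my $s (@samples) {
    printf "%-40s loose=%-3s strict=%s\n",
        $s,
        ($s =~ $loose_rx  ? "yes" : "no"),
        ($s =~ $strict_rx ? "yes" : "no");
}
```

Only the third sample is treated differently: the unanchored pattern accepts it, the anchored one rejects it.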

However, if your goal is to pull out all such links, that isn’t going to help you, because those lookaheads do not tell you where in the string your link occurs.

In that case, it’s a lot easier to write something to pull out the links and then filter out your unwanted cases after that, using two separate regexes instead of trying to make one regex do everything.

 @all_links = ($some_string =~ m{ https?://\S+ }xg);
 @good_links = grep !/mysite\.com/, @all_links;
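
The grep above rejects any link whose text contains mysite.com anywhere, including in a path or query string, and it would also reject a different domain such as notmysite.com. If that is too broad, one option is to compare only the host part. The following is a rough sketch under stated assumptions: host_of and is_mysite are hypothetical helper names, the host pattern is deliberately simplified (it skips optional userinfo and stops at a port, path, query, or fragment, and does not handle bracketed IPv6 hosts), and the sample links are invented.

```perl
use strict;
use warnings;

# Return the lowercased host of an http(s) URL, or undef if none is found.
# Simplified on purpose: not a full URL parser.
sub host_of {
    my ($url) = @_;
    return $url =~ m{\Ahttps?://(?:[^/?#\s\@]*\@)?([^/?#:\s]+)}i ? lc $1 : undef;
}

# True when the URL's host is mysite.com itself or one of its subdomains.
sub is_mysite {
    my ($url) = @_;
    my $host = host_of($url);
    return defined $host && $host =~ /(?:\A|\.)mysite\.com\z/ ? 1 : 0;
}

# Hypothetical links, invented for this sketch.
my @all_links = (
    "http://example.org/?ref=mysite.com",
    "http://www.mysite.com:8080/page",
    "https://notmysite.com/",
);

my @good_links = grep { !is_mysite($_) } @all_links;
print "$_\n" for @good_links;
```

Here only the www.mysite.com link is dropped; the other two survive because their hosts are different, even though the string "mysite.com" appears in them.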

Note that no attempt is made to match only links that contain valid URL characters, or to strip the accidental trailing punctuation that so often follows a link in plain text.
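
If trailing punctuation turns out to be a problem, a small cleanup pass between the extraction and the filter is one way to handle it. This is only a sketch: extract_good_links is a hypothetical name, the sample text is invented, and the set of characters to trim is a guess at what usually belongs to the surrounding sentence. It will wrongly shorten the occasional URL that really does end in one of those characters, such as a closing parenthesis.

```perl
use strict;
use warnings;

sub extract_good_links {
    my ($text) = @_;

    # Same loose extraction as above: everything up to the next whitespace.
    my @links = $text =~ m{ https?://\S+ }xg;

    # Trim punctuation that most likely belongs to the sentence, not the URL.
    s/[.,;:!?)\]'"]+\z// for @links;

    # Then filter out the unwanted site.
    return grep { !/mysite\.com/ } @links;
}

# Hypothetical plain-text input, invented for this sketch.
my $text = 'See http://example.org/page, then http://mysite.com/home. '
         . 'Docs live at (https://perl.org/docs.html).';

print "$_\n" for extract_good_links($text);
```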

And now, for a real answer

Note also that if you’re parsing HTML with this, the approach outlined above is just a quick-and-dirty, fast-and-loose, shoot-from-the-hip kind of link extraction. It’s easy to construct valid input that turns up a lot of false positives, and not altogether hard to construct input that produces false negatives, too.
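
To make that concrete, here is a small sketch of both failure modes on a made-up HTML fragment. The helper name regex_links and the fragment itself are assumptions for illustration, and the pattern is slightly tighter than the \S+ version above (it stops at quotes and angle brackets) so that the matches are easier to read.

```perl
use strict;
use warnings;

# Naive extraction: any http(s) URL-looking text, wherever it appears.
sub regex_links {
    my ($html) = @_;
    return $html =~ m{https?://[^\s"<>]+}g;
}

# Hypothetical HTML fragment, invented for this sketch.
my $html = <<'HTML';
<!-- <a href="http://old.example.com/">retired link</a> -->
<a href="/docs.html">Docs</a>
<p>Write it as http://example.com/path in plain text.</p>
<a href="http://example.org/">Example</a>
HTML

my @found = regex_links($html);
print "$_\n" for @found;
```

The regex reports three URLs. Two of them are false positives as far as link extraction goes: one sits inside an HTML comment and the other is ordinary paragraph text rather than a link. Meanwhile the relative link /docs.html, which is a real link, is a false negative because it never mentions http://.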

Here, in contrast, is a full program that dumps out all the <a ...> and <img ...> link addresses found at the URLs given as its arguments, and actually does so correctly because it uses a real parser.

#!/usr/bin/env perl
#
# fetchlinks - fetch all <a> and <img> links from listed URL args
# Tom Christiansen <tchrist@perl.com>
# Wed Mar 14 08:03:53 MDT 2012
#
use strict;
use warnings;

use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

die "usage: $0 url ...\n" unless @ARGV;

for my $arg (@ARGV) {
    my @links = fetch_wanted_links($arg => qw<a img>);
    for my $link (@links) {
        print "$arg => " if @ARGV > 1;
        print "$link\n";
    }
}

exit;

sub fetch_wanted_links {
    my($url, @wanted) = @_;

    my %wanted;
    @wanted{@wanted} = (1) x @wanted;

    my $agent = LWP::UserAgent->new;

    # Set up a callback that collects links of the wanted variety
    my @hits = ();

    # Make the parser.  Unfortunately, we don't know the base yet
    # (it might be different from $url)
    my $parser = HTML::LinkExtor->new(sub {
       my($tag, %attr) = @_;
       return if %wanted and not $wanted{$tag};
       push @hits, values %attr;
    });

    # Request document and parse it as it arrives
    my $response = $agent->request(
           HTTP::Request->new(GET => $url),
           sub { $parser->parse( $_[0] ) },
    );

    # Expand all collected URLs to absolute ones
    my $base = $response->base;
    @hits = map { $_ = url($_, $base)->abs } @hits;
    return @hits;
}

If you run it on a URL like this, it gives this accounting of all the anchor and image links:

$ perl fetchlinks http://www.perl.org/
http://www.perl.org/
http://st.pimg.net/perlweb/images/camel_head.v25e738a.png
http://www.perl.org/
http://www.perl.org/learn.html
http://www.perl.org/docs.html
http://www.perl.org/cpan.html
http://www.perl.org/community.html
http://www.perl.org/contribute.html
http://www.perl.org/about.html
http://www.perl.org/get.html
http://www.perl.org/get.html
http://www.perl.org/get.html
http://www.perl.org/about.html
http://www.perl.org/learn.html
http://st.pimg.net/perlweb/images/icons/learn.v0e1f83c.png
http://www.perl.org/learn.html
http://www.perl.org/community.html
http://st.pimg.net/perlweb/images/icons/community.v03bf8ce.png
http://www.perl.org/community.html
http://www.perl.org/docs.html
http://st.pimg.net/perlweb/images/icons/docs.v2622a01.png
http://www.perl.org/docs.html
http://www.perl.org/contribute.html
http://st.pimg.net/perlweb/images/icons/cog.v08b9acc.png
http://www.perl.org/contribute.html
http://www.perl.org/dev.html
http://www.perl.org/contribute.html
http://www.perl.org/cpan.html
http://st.pimg.net/perlweb/images/icons/cpan.vdc5be93.png
http://www.perl.org/cpan.html
http://www.perl.org/events.html
http://st.pimg.net/perlweb/images/icons/cal.v705acef.png
http://www.perl.org/events.html
http://www.perl6.org/
http://st.pimg.net/perlweb/images/icons/perl6.v8ff6c63.png
http://www.perl6.org/
http://www.perl.org/dev.html
http://www.perlfoundation.org/
http://st.pimg.net/perlweb/images/icons/onion.vee5cb98.png
http://www.perlfoundation.org/
http://www.cpan.org/
http://search.cpan.org/~jtang/Net-Stomp-0.45/
http://search.cpan.org/~vaxman/Array-APX-0.3/
http://search.cpan.org/~salva/Net-SFTP-Foreign-1.71/
http://search.cpan.org/~grandpa/Win32-MSI-HighLevel-1.0008/
http://search.cpan.org/~teejay/Catalyst-TraitFor-Component-ConfigPerSite-0.06/
http://search.cpan.org/~jwieland/WebService-Embedly-0.04/
http://search.cpan.org/~mariab/WWW-TMDB-API0.04/
http://search.cpan.org/~teejay/SOAP-Data-Builder-1/
http://search.cpan.org/~dylan/WWW-Google-Translate-0.03/
http://search.cpan.org/~jtbraun/Parse-RecDescent-1.967_008/
http://www.perl.org/get.html
http://www.perl.org/learn.html
http://www.perl.org/docs.html
http://www.perl.org/community.html
http://www.perl.org/events.html
http://www.perl.org/siteinfo.html#sponsors
http://www.yellowbot.com/
http://st.pimg.net/perlweb/images/friends/yellowbot.vcc29f5b.gif
http://www.perl.org/
http://blogs.perl.org/
http://jobs.perl.org/
http://learn.perl.org/
http://dev.perl.org/
http://creativecommons.org/licenses/by-nc-nd/3.0/us/
http://i.creativecommons.org/l/by-nc-nd/3.0/us/80x15.png
http://www.perl.org/siteinfo.html

For any work more serious than running a quick grep over a file to eyeball general results, you need to use a proper parser to do this sort of thing.
