How do I extract links from HTML with a Perl regex?
Question
I have a huge HTML file containing many things I don't need, but it also contains URLs in the following format:
<a href="http://www.retailmenot.com/" class=l
I'm trying to extract the URLs. I tried this, to no avail:
open(FILE,"<","HTML.htm") or die "$!";
my @str = <FILE>;
my @matches = grep { m/a href="(.+?") class=l/ } @str
Any idea how to match this?
Recommended answer
Use HTML::SimpleLinkExtor, HTML::LinkExtor, or one of the other link extracting Perl modules. You don't need a regex at all.
Here's a short example. You don't have to subclass. You just have to tell %HTML::Tagset::linkElements which attributes to collect:
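#!perl
use strict;
use warnings;
use HTML::LinkExtor;

# Report the class attribute along with href for every <a> tag
$HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];

my $p = HTML::LinkExtor->new;
$p->parse( do { local $/; <> } );   # slurp the input and parse it
$p->eof;                            # tell the parser the document is complete

# Each link is an array ref: [ $tag, attribute => value, ... ]
my @links = grep {
    my( $tag, %hash ) = @$_;
    no warnings 'uninitialized';
    $hash{class} eq 'foo';
} $p->links;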
If you need to collect URLs for any other tags, you make similar adjustments.
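As an illustration of such an adjustment, here is a minimal sketch that collects image sources instead of anchor links. The choice of the img tag, the class name thumb, and the sample markup are assumptions made up for this example; they are not part of the original question.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Assumption for illustration: we want <img> sources, filtered by class.
$HTML::Tagset::linkElements{'img'} = [ qw( src class ) ];

# Hypothetical sample input.
my $html = <<'HTML';
<img src="/images/logo.png" class="thumb">
<img src="/images/banner.png" class="wide">
<a href="/about.html">About</a>
HTML

my $p = HTML::LinkExtor->new;
$p->parse( $html );
$p->eof;

# Keep only <img> entries whose class is 'thumb', then pull out src.
my @thumbs =
    map  { my( $tag, %attr ) = @$_; $attr{src} }
    grep {
        my( $tag, %attr ) = @$_;
        no warnings 'uninitialized';
        $tag eq 'img' and $attr{class} eq 'thumb';
    } $p->links;

print "$_\n" for @thumbs;
```

The tag check matters because the parser still reports other link-carrying tags, such as the a element here, alongside the images.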
If you'd rather have a callback routine, that's not so hard either. You can watch the links as the parser runs into them:
use strict;
use warnings;
use HTML::LinkExtor;

$HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];

my @links;
# Called once per tag with the tag name and its attribute => value pairs
my $callback = sub {
    my( $tag, %hash ) = @_;
    no warnings 'uninitialized';
    push @links, $hash{href} if $hash{class} eq 'foo';
};

my $p = HTML::LinkExtor->new( $callback );
$p->parse( do { local $/; <DATA> } );   # HTML comes from the __DATA__ section
$p->eof;
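To tie this back to the markup in the question, the callback approach might look like the sketch below when the class of interest is l rather than foo. The inline sample document is hypothetical (only the first anchor comes from the question), and the input is parsed from a string here so the script runs on its own.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Ask the parser to report the class attribute alongside href for <a> tags.
$HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];

# Hypothetical sample input, modelled on the markup in the question.
my $html = <<'HTML';
<p><a href="http://www.retailmenot.com/" class=l>RetailMeNot</a></p>
<p><a href="http://www.example.com/" class=other>Other</a></p>
<p><a href="http://www.example.org/">No class</a></p>
HTML

my @links;
my $p = HTML::LinkExtor->new( sub {
    my( $tag, %attr ) = @_;
    return unless $tag eq 'a';
    no warnings 'uninitialized';
    push @links, $attr{href} if $attr{class} eq 'l';
} );
$p->parse( $html );
$p->eof;

print "$_\n" for @links;
```

For a file on disk such as HTML.htm, calling parse_file with the file name in place of parse should work the same way, since HTML::LinkExtor inherits that method from HTML::Parser.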