How do I extract links from HTML with a Perl regex?

Question

I have a HUGE html which has many things I don't need, but inside it has URLs that are provided in the following format:

<a href="http://www.retailmenot.com/" class=l

I'm trying to extract the URLs... I tried, to no avail:

open(FILE,"<","HTML.htm") or die "$!";
my @str = <FILE>;

my @matches = grep { m/a href="(.+?") class=l/ } @str;

Any idea on how to match this?

Recommended answer

Use HTML::SimpleLinkExtor, HTML::LinkExtor, or one of the other link extracting Perl modules. You don't need a regex at all.

Here's a short example. You don't have to subclass. You just have to tell %HTML::Tagset::linkElements which attributes to collect:

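#!perl
use strict;
use warnings;
use HTML::LinkExtor;

# Report class alongside href for <a> tags so links can be filtered by class.
$HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];

my $p = HTML::LinkExtor->new;
$p->parse( do { local $/; <> } );   # slurp the file(s) named on the command line, or STDIN
$p->eof;                            # signal end of document to the parser

# Each element of $p->links is an array ref: [ $tag, attribute => value, ... ].
my @links = grep { 
    my( $tag, %hash ) = @$_;
    no warnings 'uninitialized';    # many links have no class attribute
    $hash{class} eq 'foo';          # 'foo' is a placeholder; the question's class is 'l'
    } $p->links;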

If you need to collect URLs for any other tags, you make similar adjustments.
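
As an illustration of such an adjustment, here is a minimal sketch for img tags. It assumes the URL of interest is in src and that filtering on class is still wanted; the sample HTML and the class name foo are hypothetical. Note that assigning to the img entry replaces the default attribute list for that tag rather than adding to it.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Assumption: for <img> the URL is in src; class is added so it can be filtered on.
# This overrides the default list of link attributes for <img>.
$HTML::Tagset::linkElements{'img'} = [ qw( src class ) ];

# Hypothetical sample input.
my $html = <<'HTML';
<img src="/img/logo.png" class="foo">
<img src="/img/spacer.gif">
HTML

my $p = HTML::LinkExtor->new;
$p->parse($html);
$p->eof;

# Keep only the src of <img> tags whose class is 'foo'.
my @srcs = map {
    my ( $tag, %attr ) = @$_;
    no warnings 'uninitialized';
    $tag eq 'img' && $attr{class} eq 'foo' ? $attr{src} : ();
} $p->links;

print "$_\n" for @srcs;
```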

If you'd rather have a callback routine, that's not so hard either. You can watch the links as the parser runs into them:

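use strict;
use warnings;
use HTML::LinkExtor;

# Report class alongside href for <a> tags so the callback can filter by class.
$HTML::Tagset::linkElements{'a'} = [ qw( href class ) ];

my @links;
# Called once per tag that has link attributes: ( $tag, attribute => value, ... ).
my $callback = sub {
    my( $tag, %hash ) = @_;
    no warnings 'uninitialized';    # many links have no class attribute
    push @links, $hash{href} if $hash{class} eq 'foo';
    };

my $p = HTML::LinkExtor->new( $callback );
$p->parse( do { local $/; <DATA> } );   # reads the HTML from this script's __DATA__ section
$p->eof;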
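
As for the attempt in the question: grep only filters @str, so it returns the whole matching lines rather than the captured text, and the capture group as written also includes the closing quote. If a regex is still wanted for this one known format, the following is a rough sketch using only core Perl. It assumes href always comes first, is double-quoted, and is followed directly by class=l, which is exactly the kind of fragility the modules above avoid. The sample lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample lines standing in for the contents of HTML.htm.
my @str = (
    qq{<td><a href="http://www.retailmenot.com/" class=l>one</a></td>\n},
    qq{<a href="http://example.com/a" class=l>a</a> <a href="http://example.com/b" class=l>b</a>\n},
    qq{<a href="http://example.org/other" class=x>ignored</a>\n},
);

# map with a /g match in list context returns the captured groups,
# where grep would return the whole lines. The closing quote sits
# outside the parentheses, and \b keeps class=l from matching class=large.
my @matches = map { /<a href="([^"]+)" class=l\b/g } @str;

print "$_\n" for @matches;
```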
