Parse HTML Page For Links With Regex Using Perl

Question


Possible Duplicate:
How can I remove external links from HTML using Perl?

Alright, I'm working on a job for a client right now who just switched his language choice to Perl. I'm not the best at Perl, but I've done stuff like this before with it, albeit a while ago.

There are lots of links like this:

<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');" class="bnone">Death Becomes Her
        (1992)</a>

I want to match the path "/en/subtitles/3586224/death-becomes-her-en" and put those into an array or list (not sure which one's better in Perl). I've been searching the Perl docs, as well as looking at regex tutorials, and most if not all seem geared towards using =~ to match stuff rather than capture matches.

Thanks,

Cody

Solution

Use a proper HTML parser to parse HTML. See this example included with HTML::Parser.

Or, consider the following simple example:

#!/usr/bin/perl

use strict; use warnings;

use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

my @hrefs;

while ( my $anchor = $parser->get_tag('a') ) {
    if ( my $href = $anchor->get_attr('href') ) {
        push @hrefs, $href if $href =~ m!/en/subtitles/!;
    }
}

print "$_\n" for @hrefs;

__DATA__
<a href="/en/subtitles/3586224/death-becomes-her-en" title="subtitlesDeath 
Becomes Her" onclick="reLink('/en/subtitles/3586224/death-becomes-her-en');" 
class="bnone">Death Becomes Her
                (1992)</a>

Output:

/en/subtitles/3586224/death-becomes-her-en
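
On the narrower point in the question about capturing matches rather than just testing for them: a match with the /g flag in list context returns its captured groups, so the result can be assigned directly to an array. The following is only a minimal sketch of that mechanism using core Perl. The sample markup in $html and the variable name @paths are assumptions made for illustration, and the pattern assumes double-quoted href attributes. For real pages the parser approach above is still the more robust choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input, modelled on the link in the question.
my $html = <<'HTML';
<a href="/en/subtitles/3586224/death-becomes-her-en" class="bnone">Death Becomes Her
        (1992)</a>
<a href="/en/about">About</a>
HTML

# In list context, a /g match returns every captured group,
# so the captures can be assigned straight to an array.
my @paths = $html =~ m{href="(/en/subtitles/[^"]+)"}g;

print "$_\n" for @paths;
```

With the sample input above, only the first link matches, so @paths ends up holding the single path /en/subtitles/3586224/death-becomes-her-en.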
