How can I remove external links from HTML using Perl?

Question

I am trying to remove external links from an HTML document while keeping the anchors, but I'm not having much luck. The following regex

$html =~ s/<a href=".+?\.htm">(.+?)<\/a>/$1/sig;

will match from the beginning of an anchor tag to the end of an external link tag, e.g.

<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->

so I end up with nothing instead of

<a HREF="#FN1" name="01">1</a>
some other html

It just so happens that all anchors have their href attribute in uppercase, so I know I can do a case-sensitive match, but I don't want to rely on that always being the case in the future.

Is there something I can change so it only matches the one a tag?

Solution

Echoing Chris Lutz' comment, I hope the following shows that it is really straightforward to use a parser (especially if you want to be able to deal with input you have not yet seen such as <a class="external" href="...">) rather than putting together fragile solutions using s///.

If you are going to take the s/// route, at least be honest about it: depend on the anchors' href attributes being all upper case instead of putting up an illusion of flexibility.
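For illustration only, here is a minimal sketch of what that quick and dirty s/// route might look like. The helper name strip_lowercase_href_links is made up for this example, and the whole thing assumes that anchors are always written as <a HREF="#..."> while links to other pages use a lowercase href as the first and only attribute. It drops /i so the upper-case anchors are never touched, and uses [^"]+ instead of .+? so the match cannot run past the closing quote into a later tag.

```perl
#!/usr/bin/perl

use strict; use warnings;

# Hypothetical helper (not from the original post): unwraps links whose
# opening tag is exactly <a href="...">, keeping only the link text.
# ASSUMPTION: anchors use upper-case HREF, so a case-sensitive match skips them.
sub strip_lowercase_href_links {
    my ($html) = @_;
    # [^"]+ stops at the closing quote, so the match stays inside one tag.
    # /s lets (.+?) span a line break before </a>; there is deliberately no /i.
    $html =~ s{<a href="[^"]+">(.+?)</a>}{$1}sg;
    return $html;
}

my $sample = <<'END_HTML';
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
END_HTML

print strip_lowercase_href_links($sample);
```

This still misses input such as <a class="external" href="...">, which is exactly the kind of case the parser handles without extra work.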

Edit: By popular demand ;-), here is the version using HTML::TokeParser::Simple. See the edit history for the version using just HTML::TokeParser.

#!/usr/bin/perl

use strict; use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new(\*DATA);

while ( my $token = $parser->get_token ) {
    if ($token->is_start_tag('a')) {
        my $href = $token->get_attr('href');
        if (defined $href and $href !~ /^#/) {
            print $parser->get_trimmed_text('/a');
            $parser->get_token; # discard </a>
            next;
        }
    }
    print $token->as_is;
}

__DATA__
<a HREF="#FN1" name="01">1</a>
some other html
<a href="155.htm">No. 155
</a> <!-- end tag not necessarily on the same line -->
<a class="external" href="http://example.com">An example you
might not have considered</a>

<p>Maybe you did not consider <a
href="test.html">click here >>></a>
either</p>

Output:

C:\Temp> hjk
<a HREF="#FN1" name="01">1</a>
some other html
No. 155 <!-- end tag not necessarily on the same line -->
An example you might not have considered

<p>Maybe you did not consider click here >>>
either</p>

NB: The regex-based solution you checked as "correct" breaks if the linked files have the .html extension rather than .htm. Given that, I find your concern about not relying on the upper-case HREF attributes unwarranted. If you really want quick and dirty, you should not bother with anything else: rely on the all-caps HREF and be done with it. If, however, you want to ensure that your code works with a much larger variety of documents and for much longer, you should use a proper parser.
