How can I remove external links from HTML using Perl?
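Problem description

I am trying to remove external links from an HTML document while keeping the anchors, but I'm not having much luck. The following regex

    $html =~ s/<a href=".+?\.htm">(.+?)<\/a>/$1/sig;

will match from the beginning of an anchor tag to the end of an external link tag, e.g.

    <a HREF="#FN1" name="01">1</a>
    some other html
    <a href="155.htm">No. 155
    </a> <!-- end tag not necessarily on the same line -->

so I end up with nothing instead of

    <a HREF="#FN1" name="01">1</a>
    some other html

It just so happens that all anchors have their href attribute in uppercase, so I know I could do a case-sensitive match, but I don't want to rely on that always being the case in the future.

Is there something I can change so it only matches the one a tag?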
Answer

Echoing Chris Lutz' comment, I hope the following shows that it is really straightforward to use a parser (especially if you want to be able to deal with input you have not yet seen, such as <a class="external" href="...">) rather than putting together fragile solutions using s///.

If you are going to take the s/// route, at least be honest and do depend on the href attributes being all upper case, instead of putting up an illusion of flexibility.

Edit: By popular demand ;-), here is the version using HTML::TokeParser::Simple. See the edit history for the version using just HTML::TokeParser.
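    #!/usr/bin/perl
    use strict; use warnings;
    use HTML::TokeParser::Simple;

    # Read the sample HTML from the __DATA__ section below
    my $parser = HTML::TokeParser::Simple->new(\*DATA);

    while ( my $token = $parser->get_token ) {
        if ($token->is_start_tag('a')) {
            my $href = $token->get_attr('href');
            # A link is treated as external unless its href starts with '#'
            if (defined $href and $href !~ /^#/) {
                # Print only the link text, dropping the surrounding tags
                print $parser->get_trimmed_text('/a');
                $parser->get_token; # discard </a>
                next;
            }
        }
        # Everything else is passed through unchanged
        print $token->as_is;
    }

    __DATA__
    <a HREF="#FN1" name="01">1</a>
    some other html
    <a href="155.htm">No. 155
    </a> <!-- end tag not necessarily on the same line -->
    <a class="external" href="http://example.com">An example you
    might not have considered</a>
    <p>Maybe you did not consider <a
    href="test.html">click here >>></a>
    either</p>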
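Output:

    C:\Temp> hjk
    <a HREF="#FN1" name="01">1</a>
    some other html
    No. 155 <!-- end tag not necessarily on the same line -->
    An example you might not have considered
    <p>Maybe you did not consider click here >>>
    either</p>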
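When the HTML is already held in a string, as with $html in the question, the same loop can be wrapped in a function. The following is only a sketch under assumptions: the name strip_external_links is hypothetical and not part of the original answer, and it relies on the string => ... form of the HTML::TokeParser::Simple constructor.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper (not from the original answer): returns a copy of
# $html in which every <a> whose href does not start with '#' is replaced
# by its trimmed text. Anchors and all other markup pass through as-is.
sub strip_external_links {
    my ($html) = @_;
    my $parser = HTML::TokeParser::Simple->new(string => $html);
    my $out    = '';

    while ( my $token = $parser->get_token ) {
        if ( $token->is_start_tag('a') ) {
            my $href = $token->get_attr('href');
            if ( defined $href and $href !~ /^#/ ) {
                $out .= $parser->get_trimmed_text('/a');
                $parser->get_token;    # discard </a>
                next;
            }
        }
        $out .= $token->as_is;
    }
    return $out;
}

# Sample input taken from the question
my $html = qq{<a HREF="#FN1" name="01">1</a>\nsome other html\n}
         . qq{<a href="155.htm">No. 155\n</a>\n};

print strip_external_links($html);
```

Building the result in a string rather than printing directly keeps the helper usable as a drop-in replacement for the substitution in the question, e.g. $html = strip_external_links($html);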
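NB: The regex-based solution you checked as "correct" breaks if the linked files have the .html extension rather than .htm. Given that, I find your concern about not relying on the upper case HREF attributes unwarranted. If you really want quick and dirty, you should not bother with anything else: rely on the all caps HREF and be done with it. If, however, you want to ensure that your code works with a much larger variety of documents and for much longer, you should use a proper parser.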
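To make the two failure modes concrete, here is a small sketch that runs the substitution from the question against two inputs. The helper name regex_strip and the variable names are made up for illustration; only the regex itself comes from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: applies the substitution from the question to a copy
# of the input and returns the result.
sub regex_strip {
    my ($html) = @_;
    $html =~ s/<a href=".+?\.htm">(.+?)<\/a>/$1/sig;
    return $html;
}

# Case 1: an anchor followed by an external link. With /i and /s the match
# starts at the anchor's <a HREF=" and runs on to the link's .htm">.
my $anchor_then_link = qq{<a HREF="#FN1" name="01">1</a>\nsome other html\n}
                     . qq{<a href="155.htm">No. 155\n</a>\n};

# Case 2: a link to a .html file, which the pattern .htm"> never matches.
my $html_extension = qq{<a href="test.html">click here</a>\n};

print "The anchor was swallowed by the match\n"
    if regex_strip($anchor_then_link) !~ /FN1/;
print "The .html link was left in place\n"
    if regex_strip($html_extension) =~ /<a /;
```

Both messages should be printed: the first input loses its anchor, and the second keeps its external link.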