Find Favicons in HTML using Perl


Question

I'm trying to look for favicons (and variants) for a given URL using Perl (I'd like to avoid using an external service such as Google's favicon finder). There's a CPAN module, WWW::Favicon, but it hasn't been updated in over a decade, a decade in which now-important variants such as "apple-touch-icon" have come to replace the venerable "ico" file.

I thought I found the solution in WWW::Mechanize, since it can list all of the links in a given URL, including <link> header tags. However, I cannot seem to find a clean way to use the "find_link" method to search for the rel attribute.

For example, I tried using 'rel' as the search term, hoping maybe it was in there despite not being mentioned in the documentation, but it doesn't work. This code returns an error about an invalid "link-finding parameter."

my $results = $mech->find_link( 'rel' => "apple-touch-icon" );
use Data::Dumper;
say STDERR Dumper $results;

I also tried using other link-finding parameters, but none of them seem to be suited to searching out a rel attribute.

The only way I could figure out how to do it is by iterating through all links and looking for a rel attribute like this:

my $results = $mech->find_all_links();

foreach my $result (@{ $results }) {
    # attrs() returns a hash reference of the tag's attributes
    my $attrs = $result->attrs();
    my $rel   = $attrs->{'rel'} // '';    # not every link has a rel

    if ($rel =~ /^apple-touch-icon/) {
        say STDERR "I found it: " . $result->url();
    }

    # Add tests for other types of icons here.
    # E.g. "mask-icon" and "shortcut icon."
}

That works, but it seems messy. Is there a better way?

Recommended answer

Here's how I'd do it with Mojo::DOM. Once you fetch an HTML page, use dom to do all the parsing. From that, use a CSS selector to find the interesting nodes:

link[rel*=icon i][href]

This CSS selector looks for link tags that have both the rel and href attributes. Additionally, I require that the value in rel contain (*=) "icon", case insensitively (the i). If you want to assume that all nodes will have the href, just leave off [href].
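To see what the selector keeps and what it drops, here is a minimal sketch that runs it against an inline HTML string rather than a fetched page. The markup and file names are made up for illustration, and the example assumes a Mojolicious version recent enough to support the case-insensitivity flag.

```perl
use v5.10;
use Mojo::DOM;

# Hypothetical page head, invented for this illustration.
my $html = <<'HTML';
<html><head>
<link rel="stylesheet" href="/site.css">
<link rel="Shortcut Icon" href="/favicon.ico">
<link rel="apple-touch-icon" href="/touch.png">
<link rel="icon">
</head><body></body></html>
HTML

# Same selector as above, with the attribute value quoted.
my $hrefs = Mojo::DOM->new($html)
    ->find('link[rel*="icon" i][href]')
    ->map( attr => 'href' )
    ->to_array;

# The stylesheet link is skipped because its rel lacks "icon";
# the last link is skipped because it has no href.
say for @$hrefs;
```

Only the "Shortcut Icon" and "apple-touch-icon" links should survive: the first thanks to the i flag, the second because *= matches a substring.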

Once I have the list of links, I extract just the value in href and turn that list into an array reference (although I could do the rest with Mojo::Collection methods):

use v5.10;

use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new->max_redirects(3);

my $results = $ua->get( shift )
    ->result
    ->dom
    ->find( 'link[rel*=icon i][href]' )
    ->map( attr => 'href' )
    ->to_array
    ;

say join "\n", @$results;

This works well so far:

$ perl mojo.pl https://www.perl.org
https://cdn.perl.org/perlweb/favicon.ico

$ perl mojo.pl https://www.microsoft.com
https://c.s-microsoft.com/favicon.ico?v2

$ perl mojo.pl https://leanpub.com/mojo_web_clients
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-57x57-b83f183ad6b00aa74d8e692126c7017e.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-60x60-6dc1c10b7145a2f1156af5b798565268.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-72x72-5037b667b6f7a8d5ba8c4ffb4a62ec2d.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-76x76-57860ca8a817754d2861e8d0ef943b23.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-114x114-27f9c42684f2a77945643b35b28df6e3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-120x120-3819f03d1bad1584719af0212396a6fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-144x144-a79479b4595dc7ca2f3e6f5b962d16fd.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/apple-touch-icon-152x152-aafe015ef1c22234133158a89b29daf5.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-16x16-c1207cd2f3a20fd50de0e585b4b307a3.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-32x32-e9b1d6ef3d96ed8918c54316cdea011f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-96x96-842fcd3e7786576fc20d38bbf94837fc.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-128x128-e97066b91cc21b104c63bc7530ff819f.png
https://d3g6anj9jkury9.cloudfront.net/assets/favicons/favicon-196x196-b8cab44cf725c4fa0aafdbd237cdc4ed.png
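The earlier aside about doing the rest with Mojo::Collection methods could look something like the following sketch. It stays inside the collection instead of converting to an array reference, using uniq to drop repeated hrefs and join to build the output. The inline HTML is a made-up stand-in for a fetched page.

```perl
use v5.10;
use Mojo::DOM;

# Hypothetical markup with a duplicated icon link.
my $html = join '',
    '<link rel="icon" href="/a.png">',
    '<link rel="icon" href="/a.png">',
    '<link rel="mask-icon" href="/m.svg">';

# map, uniq, and join are all Mojo::Collection methods.
my $hrefs = Mojo::DOM->new($html)
    ->find('link[rel*="icon"][href]')
    ->map( attr => 'href' )
    ->uniq;

say $hrefs->join("\n");
```

Whether to stop at to_array or keep chaining is mostly a matter of taste; the collection form is handy when you want further filtering or de-duplication before printing.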

Now, the problem comes if you find more interesting cases that you can't easily write a selector for. Suppose not all of the rel values have "icon" in them. You can get a little more fancy by specifying multiple selectors separated by commas so you don't have to use the experimental case insensitivity flag:

link[rel*=icon][href], link[rel*=ICON][href]

Or different values in rel:

link[rel="shortcut icon"][href], link[rel="apple-touch-icon-precomposed"][href]

Line up as many of those as you like.
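As a rough sketch of the comma-separated form in use, the snippet below builds the selector from a list of pieces and applies it to an inline HTML string. The markup and file names are invented for illustration; note that these selectors use exact, case-sensitive matches on rel.

```perl
use v5.10;
use Mojo::DOM;

# Hypothetical markup, made up for this example.
my $html = <<'HTML';
<link rel="shortcut icon" href="/favicon.ico">
<link rel="stylesheet" href="/site.css">
<link rel="apple-touch-icon-precomposed" href="/touch-precomposed.png">
HTML

# Each piece is a complete selector; a comma means "match any of these".
my $selector = join ', ',
    'link[rel="shortcut icon"][href]',
    'link[rel="apple-touch-icon-precomposed"][href]';

my $found = Mojo::DOM->new($html)
    ->find($selector)
    ->map( attr => 'href' )
    ->to_array;

say for @$found;
```

Keeping the pieces in a list makes it easy to add another rel value later without editing one long string.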

But, you could also filter your results without the selectors. Use Mojo::Collection's grep to pick out the nodes that you want:

my %Interesting = ...;
my $results = $ua->get( shift )
    ->result
    ->dom
    ->find( '...' )
    ->grep( sub { exists $Interesting{ $_->attr('rel') } } )
    ->map( attr => 'href' )
    ->to_array
    ;
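A filled-in version of that idea might look like the sketch below. The contents of %interesting and the 'link[rel][href]' selector are assumptions chosen for illustration, since the snippet above leaves both elided; the inline HTML is made up as well. The lookup lowercases rel so that "Shortcut Icon" and "shortcut icon" are treated alike.

```perl
use v5.10;
use Mojo::DOM;

# Hypothetical set of rel values worth keeping (an assumption, not
# the list the original answer had in mind).
my %interesting = map { $_ => 1 }
    ( 'icon', 'shortcut icon', 'apple-touch-icon', 'mask-icon' );

# Hypothetical markup, made up for this example.
my $html = <<'HTML';
<link rel="stylesheet" href="/site.css">
<link rel="Shortcut Icon" href="/favicon.ico">
<link rel="apple-touch-icon" href="/touch.png">
<link rel="preload" href="/font.woff2">
HTML

# Select broadly, then let grep decide; $_ is the current element.
my $picked = Mojo::DOM->new($html)
    ->find('link[rel][href]')
    ->grep( sub { exists $interesting{ lc $_->attr('rel') } } )
    ->map( attr => 'href' )
    ->to_array;

say for @$picked;
```

Requiring [rel] in the selector means attr('rel') is always defined inside grep, which avoids warnings about an undefined hash key.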

I have a lot more examples of Mojo::DOM in Mojo Web Clients, and I think I'll go add this example now.
