Find links containing bold text using WWW::Mechanize

Problem description

Suppose the content of an HTML page is:

<a href="abc.com"><b>ABC</b>industry</a>
<a href="google.com">ABC Search</a>
<a href="abc.com">Movies with<b>ABC</b></a>

I want to extract only the links that contain bold text. How can I do this using WWW::Mechanize?

Expected output:

ABC industry
Movies with ABC

I tried:

my @arr = $m->links();
foreach (@arr) { print $_->text; }

but this returns every link on the page.

Solution

Without using extra modules that can parse the contents of the page, it's going to be difficult to achieve your goal with WWW::Mechanize. However, there are other modules that will allow you to achieve this very easily.

Here is an example using Mojo::DOM, which lets you select elements with CSS selectors. The Mojolicious distribution also contains Mojo::UserAgent, so you could migrate your code to Mojo fairly easily if you are not too tied to WWW::Mechanize.

use feature 'say';
use Mojo::DOM;

# $html is the content of the page
my $dom = Mojo::DOM->new($html);

# extract all <b> elements that are under <a> elements (at any depth beneath the <a>)
# and get the <a> ancestors of those elements
# creates a Mojo::Collection object
my $collection = $dom->find('a b')->map(sub{ return $_->ancestors('a') } )->flatten;

$collection->each( sub {
    say "LINK: " . $_->all_text;
} );

# Use a sub to perform an action on each of the retrieved <a> elements:
$dom->find('a b')->each( sub {
    $_->ancestors('a')->each( sub {
        say "All in one: " . $_->all_text
    } )
} );

Here's a demonstration with a sample list of links:

<html>
<ul><li><a href="abc.com"><b>ABC</b> industry</a></li>
<li><a href="google.com">ABC Search</a></li>
<li>Here is <a href="#">a link 
    <span>with a span 
        <b>and a "b" tag</b> 
          even though
    </span> "b" tags are deprecated.</a> Yay!</li>
<li><a href="abc.com">Movies with <b>ABC</b></a></li></ul></html>

Output:

LINK: ABC industry
LINK: a link with a span and a "b" tag even though "b" tags are deprecated.
LINK: Movies with ABC
All in one: ABC industry
All in one: a link with a span and a "b" tag even though "b" tags are deprecated.
All in one: Movies with ABC
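
If you would rather keep WWW::Mechanize for fetching, one option is to hand the page source to Mojo::DOM yourself. The following is a minimal sketch rather than part of the original answer: the helper name bold_link_texts is made up for illustration, and it assumes the HTML passed in is already decoded text. Instead of walking up from each <b>, it filters the <a> elements directly, so a link with several <b> elements is reported only once. It also collapses whitespace explicitly, because whether all_text normalises whitespace appears to depend on the Mojolicious version.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical helper (not from the original answer): returns the text of
# every <a> element that has at least one <b> descendant.
sub bold_link_texts {
    my ($html) = @_;

    my $texts = Mojo::DOM->new($html)
        ->find('a')                            # every link on the page
        ->grep(sub { defined $_->at('b') })    # keep links containing a <b>
        ->map(sub {
            my $text = $_->all_text;
            $text =~ s/\s+/ /g;                # collapse runs of whitespace
            $text =~ s/^ | $//g;               # trim both ends
            $text;
        });
    return @{ $texts->to_array };
}

my $html = <<'HTML';
<ul><li><a href="abc.com"><b>ABC</b> industry</a></li>
<li><a href="google.com">ABC Search</a></li>
<li><a href="abc.com">Movies with <b>ABC</b></a></li></ul>
HTML

# Prints "ABC industry" and "Movies with ABC", one per line.
say for bold_link_texts($html);

# With WWW::Mechanize, the same helper would be fed the fetched page
# ($url is a placeholder):
#   my $mech = WWW::Mechanize->new;
#   $mech->get($url);
#   say for bold_link_texts($mech->content);
```

The design choice here is to keep WWW::Mechanize for navigation (forms, cookies, following links) and use Mojo::DOM only for the query that Mechanize's link objects cannot express.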

If you use Mojo::UserAgent instead of WWW::Mechanize, your search can be even easier. Mojo::UserAgent can get a page (just like WWW::Mechanize), and the DOM of the returned page is available through $ua->get($url)->res->dom. You can then chain the query above onto it, giving the following:

use feature 'say';
use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new();
# get the page and find the links with a <b> element in them:
$ua->get('http://my-url-here.com')
   ->res->dom('a b')->each( sub { $_->ancestors('a')->each( sub { say $_->all_text } ) } );

# example using this page:
# print the contents of divs with class 'spacer' that contain a link with a div in it:
$ua->get('http://stackoverflow.com/questions/26353298/find-links-containing-bold-text-using-wwwmechanize')
->res->dom('a div')->each( sub { 
    $_->ancestors('div.spacer')->each( sub {
        say $_->all_text
    } )
} );

Output:

1 How to use WWW::Mechanize to submit a form which isn't there in HTML?
0 How to process a simple loop in Perl's WWW::Mechanize?
0 Perl WWW::Mechanize cookie problem
1 Getting error in accessing a link using WWW::Mechanize
0 How to use output from WWW::Mechanize?
-2 Use WWW::Mechanize to login in webpage without form login but javascript using perl
3 Perl WWW::Mechanize Web Spider. How to find all links
0 Howto use WWW::Mechanize to access pages split by drop-down list
0 What is the best way to extract unique URLs and related link text via perl mechanize?
0 Perl WWW::Mechanize doesn't print results when reading input data from a data file
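
The chained calls above assume the request succeeds. A slightly more defensive variant might look like the sketch below. It is an illustration, not part of the original answer: the helper name bold_links_at is hypothetical, and it assumes a Mojolicious version that provides the transaction's result method, which throws on connection errors. It returns each matching link's href together with its text.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::UserAgent;

# Hypothetical helper: fetch $url and return [href, text] pairs for every
# <a> element that contains a <b> element.
sub bold_links_at {
    my ($ua, $url) = @_;

    my $res = $ua->get($url)->result;    # dies on connection errors
    die 'Request failed: ' . $res->code . ' ' . $res->message . "\n"
        unless $res->is_success;

    my $links = $res->dom->find('a')
        ->grep(sub { defined $_->at('b') })    # keep links containing a <b>
        ->map(sub {
            (my $text = $_->all_text) =~ s/\s+/ /g;    # collapse whitespace
            $text =~ s/^ | $//g;                       # trim both ends
            [ $_->attr('href'), $text ];
        });
    return @{ $links->to_array };
}

# Usage: pass a URL on the command line, e.g. perl bold_links.pl <url>
if (my $url = shift @ARGV) {
    my $ua = Mojo::UserAgent->new;
    say join "\t", $_->[0] // '', $_->[1] for bold_links_at($ua, $url);
}
```

Checking is_success before touching the DOM means an error page is never mistaken for real results.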

There are lots of examples in the Mojolicious documentation in case this isn't immediately comprehensible!

For a helpful 8-minute introductory video on Mojo::DOM and Mojo::UserAgent, check out Mojocast Episode 5.
