How to Parse this HTML with Web::Scraper?
Problem description
I am trying to use Web::Scraper to parse the following HTML:
<div>
<p><strong>TITLE1</strong>
<br>
DESCRIPTION1
</p>
<p><strong>TITLE2</strong>
<br>
DESCRIPTION2
</p>
<p><strong>TITLE3</strong>
<br>
DESCRIPTION3
</p>
</div>
into
'test' => [
{
'name' => 'TITLE1',
'desc' => 'DESCRIPTION1 '
},
{
'name' => 'TITLE2',
'desc' => 'DESCRIPTION2 '
},
{
'name' => 'TITLE3',
'desc' => 'DESCRIPTION3 '
}
]
I have the following code, but I am not having much luck. Using 'TEXT' when processing 'p' returns both the description and the text inside "strong", for example:
'test' => [
{
'name' => 'TITLE1',
'desc' => 'TITLE1 DESCRIPTION1 '
}
]
Plus, it only returns the first item.
Here is my code.
use strict;
use Web::Scraper;
use Data::Dumper;
my $html = q[<div>
<p><strong>TITLE1</strong>
<br>
DESCRIPTION1
</p>
<p><strong>TITLE2</strong>
<br>
DESCRIPTION2
</p>
<p><strong>TITLE3</strong>
<br>
DESCRIPTION3
</p>
</div>
];
my $test = scraper {
    process 'div', 'test[]' => scraper {
        process 'p strong', 'name' => 'TEXT';
        process 'p', 'desc' => 'TEXT';
    };
};
my $res = $test->scrape(\$html);
print Dumper($res);
Thanks.
Recommended answer
There are two things in your code that need changing.
To get only the DESCRIPTION text, use XPath. //p/text() gives you the text nodes directly under any p, so the text inside the strong element is not included.
To make every p block show up in the array, and not only the first one, run the first instruction on div p. That way it grabs each p inside the div, rather than the single div itself.
my $test = scraper {
    process 'div p', 'test[]' => scraper {
        process 'p strong', 'name' => 'TEXT';
        process '//p/text()', 'desc' => ['TEXT', sub { s/^\s+|\s+$//g } ];
    };
};
Output (using Data::Printer):
\ {
test [
[0] {
desc "DESCRIPTION1",
name "TITLE1"
},
[1] {
desc "DESCRIPTION2",
name "TITLE2"
},
[2] {
desc "DESCRIPTION3",
name "TITLE3"
}
]
}
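For convenience, here is a minimal sketch that puts the HTML from the question together with the two changes described above into a single script. It assumes Web::Scraper is installed from CPAN. It uses Data::Dumper (as in the question) rather than Data::Printer, so the layout of the dump will differ from the output shown above. The `use warnings` line and the comments are additions for illustration and are not part of the original answer.

```perl
use strict;
use warnings;
use Web::Scraper;
use Data::Dumper;

# Same HTML as in the question.
my $html = q[<div>
<p><strong>TITLE1</strong>
<br>
DESCRIPTION1
</p>
<p><strong>TITLE2</strong>
<br>
DESCRIPTION2
</p>
<p><strong>TITLE3</strong>
<br>
DESCRIPTION3
</p>
</div>
];

my $test = scraper {
    # Change 2: match each <p> inside the <div>, so every block
    # becomes its own entry in the 'test' array.
    process 'div p', 'test[]' => scraper {
        process 'p strong', 'name' => 'TEXT';
        # Change 1: select only the text nodes directly under <p>,
        # then trim leading and trailing whitespace in $_.
        process '//p/text()', 'desc' => ['TEXT', sub { s/^\s+|\s+$//g }];
    };
};

my $res = $test->scrape(\$html);
print Dumper($res);
```

The filter subroutine edits `$_` in place, which is the same idiom the answer uses. To compare the two behaviors, swap the `desc` line back to `process 'p', 'desc' => 'TEXT';` and the title text should reappear in each description, as reported in the question.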