How to Parse a webpage


Question

I am attempting to extract the following from the EnviroCanada weather page.

For each hour, I want to get the following fields:

Time | Thigh | Tlow | Humidity

7:00 | 23 | 22.9 | 30

Extracted HTML Page:

<tr>
         <td headers="header1" class="text-center vertical-center"> 7:00 </td>
        <td headers="header2" class="media vertical-center"><span class="pull-left"><img class="media-object" height="35" width="35" src="/weathericons/small/02.png" /></span><div class="visible-xs visible-sm">
            <br />
            <br />
          </div>
          <div class="media-body">
            <p>Partly Cloudy</p>
          </div>
        </td>
        <td headers="header3m" class=" metricData text-center vertical-center">23
                                            (22.9)
                                        </td>
        <td headers="header3i" class=" imperialData hidden text-center vertical-center">73
                                            (73.2)
                                        </td>
        <td headers="header4m" class="metricData text-center vertical-center">
          <abbr title="West-Northwest">WNW</abbr> 8</td>
        <td headers="header4i" class="imperialData hidden text-center vertical-center">
          <abbr title="West-Northwest">WNW</abbr> 5</td>
        <td headers="header6" class="metricData text-center vertical-center">30</td>
        <td headers="header6" class="imperialData hidden text-center vertical-center">87</td>
        <td headers="header7" class="text-center vertical-center">83</td>
        <td headers="header8" class="metricData text-center vertical-center">20</td>
        <td headers="header8" class="imperialData hidden text-center vertical-center">68</td>
        <td headers="header9m" class="metricData text-center vertical-center">100.7</td>
        <td headers="header9i" class="imperialData hidden text-center vertical-center">29.7</td>
        <td headers="header10" class="metricData text-center vertical-center">24</td>
        <td headers="header10" class="imperialData hidden text-center vertical-center">15</td>
      </tr>

Code so far:

use strict;
use warnings;
use LWP::Simple;
use HTML::TokeParser;


my $url = "http://weather.gc.ca/past_conditions/index_e.html?station=yyz";
my $page = get($url) ||
    die "Could not load URL\n";

my $parser = HTML::TokeParser->new(\$page) ||
    die "Parse error\n";

$parser->get_tag("td") foreach ();
$parser->get_tag("");
my $time = $parser->get_text();

??
my $thigh = $parser->get_text();

???
my $tlow = $parser->get_text();

???
my $humid = $parser->get_text();

I'm completely lost here.

Solution

Once you fetch the page with LWP::Simple, you can pick a specific tool depending on what needs to be done with it, instead of using a general parser.

In this case you have a table on your hands, and I'd recommend HTML::TableExtract. With it you can cleanly retrieve table elements in a number of ways and then process them. It can work with multiple tables, make use of headers, set up parsing preferences, and more. Normally you don't even have to look at the actual HTML. The module is a subclass of HTML::Parser. In my experience it's been a very good tool.


Here is some basic code, for this particular page and task.

use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;

my $url = "http://weather.gc.ca/past_conditions/index_e.html?station=yyz";
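# LWP::Simple's get() returns undef on failure and gives no error detail,
# so $! is not meaningful here.
my $page = get($url) or die "Can't load $url\n";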

my $headers = [ 'Time', 'Temperature', 'Humidex' ];

my $tec = HTML::TableExtract->new(headers => $headers);
$tec->parse($page);

my $fmt = "%6s | %6s | %6s | %8s\n";    
printf($fmt, 'Time', 'T-high', 'T-low', 'Humidex');    

my ($time, $temp_hi, $temp_low, $hum);

foreach my $rrow ($tec->rows) {
    # Skip rows without expected data. Clean up leading/trailing spaces.
    next if $rrow->[0] !~ /^\s*\d?\d:\d\d/;
    my @row = map { s|^\s*||; s|\s*$||; $_ } @$rrow;
    # Process as needed
    ($time, $hum) = @row[0,2];
    ($temp_hi, $temp_low) = $row[1] =~ /(\d+) .* \( (\d+\.\d+) \)/xs;
    printf($fmt, $time, $temp_hi, $temp_low, $hum);
}

The first few rows of output

  Time | T-high |  T-low |  Humidex
 16:00 |     29 |   29.2 |       37
 15:00 |     27 |   27.2 |       37
 14:00 |     26 |   25.6 |       33
...

Comments.

The headers attribute passed to new makes the module extract only the columns under those headings. The loop variable $rrow is a reference to an array holding the elements of one row. Each element is the raw text of a cell.
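To see the headers behaviour in isolation, here is a minimal sketch that runs HTML::TableExtract on an inline table rather than the live page. The HTML below is a hypothetical, simplified stand-in for the real markup, and the "Conditions" column is only there to show that unlisted columns are left out.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical, simplified stand-in for the real page (an assumption).
my $html = <<'HTML';
<table>
  <tr><th>Time</th><th>Conditions</th><th>Temperature</th><th>Humidex</th></tr>
  <tr><td> 7:00 </td><td>Partly Cloudy</td><td>23 (22.9)</td><td>30</td></tr>
  <tr><td> 6:00 </td><td>Clear</td><td>21 (21.4)</td><td>27</td></tr>
</table>
HTML

# Ask only for the columns under these three headings.
my $te = HTML::TableExtract->new(headers => [ 'Time', 'Temperature', 'Humidex' ]);
$te->parse($html);

# Each row comes back as an array reference holding the raw cell text.
my @rows = $te->rows;
for my $rrow (@rows) {
    print join(' | ', map { defined $_ ? $_ : '' } @$rrow), "\n";
}
```

Testing against a small fixed string like this makes it easier to check the column selection before pointing the script at the real URL.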

The first line of the loop skips rows that don't have the expected format: an optional digit \d? followed by another digit, then : and then two digits. This matches the time in the first column, such as 3:00 or 03:00.
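The row filter can be tried on its own with a few sample first-column values. The strings below are made up for illustration and are not taken from the page.

```perl
use strict;
use warnings;

# Hypothetical first-column values: three times, a header, a date line, an empty cell.
my @first_cells = (' 7:00 ', '16:00', '03:00', 'Time', '22 June 2016', '');

# Keep only the values that start with an hour:minute time, allowing leading spaces.
my @kept = grep { /^\s*\d?\d:\d\d/ } @first_cells;

print "kept: '$_'\n" for @kept;
# prints:
# kept: ' 7:00 '
# kept: '16:00'
# kept: '03:00'
```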

The arrayref $rrow is copied into an array @row for clarity, with leading and trailing whitespace removed from each element. The sought elements in particular columns, @row[0,2], are used as they come. The one in $row[1] is parsed by a regex, which captures an integer (\d+) and then a decimal number (\d+\.\d+) inside parentheses, with possible intervening text (.*) between them. In list context the match returns these captures, which are assigned to the other two variables.
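The trimming and the capture step can be checked on a single cell. This is a small sketch with a hypothetical cell string shaped like the Temperature cell in the question. The second regex is an assumption added here, not part of the answer's code: the original pattern does not match negative temperatures, so the variant allows a leading minus sign.

```perl
use strict;
use warnings;

# Hypothetical raw cell text, shaped like the Temperature cell in the question.
my $cell = "23\n      (22.9)\n   ";

# Copy, then trim leading and trailing whitespace; $cell itself is unchanged.
(my $trimmed = $cell) =~ s/^\s+|\s+$//g;

# In list context a successful match returns its captures.
my ($rounded, $precise) = $trimmed =~ /(\d+) .* \( (\d+\.\d+) \)/xs;
print "$rounded | $precise\n";    # prints "23 | 22.9"

# Variant (an assumption): also accept negative values such as "-5 (-4.6)".
my ($neg_rounded, $neg_precise) =
    "-5 (-4.6)" =~ /(-?\d+) .*? \( \s* (-?\d+(?:\.\d+)?) \s* \)/xs;
print "$neg_rounded | $neg_precise\n";    # prints "-5 | -4.6"
```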

See the module's documentation and, if needed, the tutorials on references (perlreftut) and on regular expressions (perlretut). Another useful page is the Data Structures Cookbook (perldsc). For other introductions see Tutorials. They typically have links to more specific docs.
