How can I extract the contents of a specific table from HTML source using Perl?
Question
I have to parse 5000 files, all of which look nearly identical.
I would like to use HTML::TokeParser::Simple and DBI to do the parsing and store the results.
I have little experience with HTML::TokeParser::Simple, and this task is over my head. Note: I have also looked at these ideas, which seem to be an appropriate way as well. At the moment, though, I am having trouble with the XPath expressions: I tried to determine the expressions that need to be filled in in the Perl program.
This is what I have right now:
use strict;
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
#use real file name here
open(my $fh, "<", "file.html") or die $!;
$tree->parse_file($fh);
my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($type) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress_two) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($telephone) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($fax) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($internet) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($officer) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($employees) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($offices) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($worker) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($country) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($the_council)= $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
print $name->as_text;
print $type->as_text;
print $adress->as_text;
print $adress_two->as_text;
print $telephone->as_text;
print $fax->as_text;
print $internet->as_text;
print $officer->as_text;
print $employees->as_text;
print $offices->as_text;
print $worker->as_text;
print $country->as_text;
print $the_council->as_text;
Is this all right? Note: I want to store this in a database.
BTW: See one of the example sites:
In the grey-shaded block you can see the information I need: 17 lines in total. Note: I have 5000 different HTML files, all structured in exactly the same way!
That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI.
Can I use the code above, or do I have to change it?
I would love to hear from you! That would be great!
Recommended answer
Use some HTML::TableExtract magic:
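#!/usr/bin/perl
use strict; use warnings;
use HTML::TableExtract;
use YAML;

# Select the table by its HTML attributes.
my $te = HTML::TableExtract->new( attribs => {
    border => 0,
    bgcolor => '#EFEFEF',
    leftmargin => 15,
    topmargin => 5,
});
$te->parse_file('kultus-bw.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
    cleanup(@$row);
    print "@$row\n";
}

# Tidy each cell in place (the elements of @_ alias the caller's values).
sub cleanup {
    for ( @_ ) {
        s/\s+//;          # remove the first run of whitespace (unanchored, so some labels in the output lose their inner space)
        s/[\xa0 ]+\z//;   # strip trailing spaces and non-breaking spaces
        s/\s+/ /g;        # collapse remaining whitespace runs to one space
    }
}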
Output:
Schul-/Behördenname: Abendgymnasium Ostwürttemberg
Schulart: Privatschule (04313488)
Hausadressse: Friedrichstr.70, 73430 Aalen
Postfachadresse: Keine Angabe
Telefon: 07361/680040
Fax: 07361/680040
E-Mail: Keine Angabe
Internet: www.abendgymnasium-ostwuerttemberg.de
ÜbergeordneteDienststelle: Regierungspräsidium Stuttgart Abteilung 7 Schule und Bildung
Schulleitung: Keine Angabe
Stellv.Schulleitung: Keine Angabe
AnzahlSchüler: 259
AnzahlKlassen: 8
AnzahlLehrer: Keine Angabe
Kreis: Ostalbkreis
Schulträger: <Verband/Verein> (Verband/Verein)
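Each row printed above is a label followed by a value, so the rows can be folded into a hash before they are stored anywhere. The following is only a minimal sketch: it assumes every row is a two-cell array reference and that each label ends with a colon, and `rows_to_record` is a hypothetical helper name, not part of HTML::TableExtract.

```perl
use strict;
use warnings;

# Hypothetical helper: turn [label, value] rows into a label => value hash.
# Assumes each row is a two-element array ref, as the output above suggests.
sub rows_to_record {
    my @rows = @_;
    my %record;
    for my $row (@rows) {
        my ($label, $value) = @{$row}[0, 1];
        next unless defined $label;
        $label =~ s/:\s*\z//;                          # drop the trailing colon
        $record{$label} = defined $value ? $value : '';
    }
    return \%record;
}

# Sample rows copied from the output above.
my $record = rows_to_record(
    [ 'Telefon:', '07361/680040' ],
    [ 'Kreis:',   'Ostalbkreis'  ],
);
print "$_ => $record->{$_}\n" for sort keys %$record;
```

With a real page, the argument list would be `$table->rows` after `cleanup` has run on each row.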
Of course, I saved a local copy of the page before running the script.
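The question also asks about storing the results for all 5000 files, which the script above does not cover. One way this might look is to wrap the same extraction in a loop and write the rows out with DBI. This is a sketch under assumptions that are not in the original: SQLite through DBD::SQLite, a hypothetical database file `schools.db`, a hypothetical table `school_fields`, and locally saved pages in a hypothetical `pages/` directory.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use HTML::TableExtract;

# Assumption: SQLite via DBD::SQLite; the file and table names are made up.
my $dbh = DBI->connect('dbi:SQLite:dbname=schools.db', '', '',
    { RaiseError => 1, AutoCommit => 0 });

$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS school_fields (
    file  TEXT NOT NULL,
    label TEXT NOT NULL,
    value TEXT
)
SQL

my $insert = $dbh->prepare(
    'INSERT INTO school_fields (file, label, value) VALUES (?, ?, ?)');

# Hypothetical location of the locally saved pages.
for my $file (glob 'pages/*.html') {
    # A fresh parser per file keeps tables from different pages apart.
    my $te = HTML::TableExtract->new( attribs => {
        border => 0, bgcolor => '#EFEFEF', leftmargin => 15, topmargin => 5,
    });
    $te->parse_file($file);

    my ($table) = $te->tables;
    unless ($table) {
        warn "No matching table in $file\n";
        next;
    }

    for my $row ($table->rows) {
        # Take the first two cells; treat missing cells as empty strings.
        my ($label, $value) = map { defined $_ ? $_ : '' } @{$row}[0, 1];
        s/\A[\s\xa0]+|[\s\xa0]+\z//g for $label, $value;   # trim both ends
        $insert->execute($file, $label, $value);
    }
}

$dbh->commit;
$dbh->disconnect;
```

Storing one row per label keeps the schema independent of the exact set of fields; a table with one column per field would work as well if all 5000 pages really share the same labels.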