How can I extract data from HTML tables in Perl?


Question

Possible duplicate:
Can you provide an example of parsing HTML with your favorite parser?
How can I extract content from HTML files using Perl?


I'm trying to use regular expressions in Perl to parse a table with the following structure. The first line is as follows:

<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>

Here I wish to take out "Time Played", "Artist", "Title", and "Label", and print them to an output file.

Any help would be greatly appreciated!

Ok sorry... I've tried many regular expressions such as:

$lines =~ / (<td>) /
       OR
$lines =~ / <td>(.*)< /
       OR
$lines =~ / >(.*)< /

My current program looks like so:

#!perl -w

open INPUT_FILE, "<", "FIRST_LINE_OF_OUTPUT.txt" or die $!;

open OUTPUT_FILE, ">>", "PLAYLIST_TABLE.txt" or die $!;

my $lines = join '', <INPUT_FILE>;

print "Hello 2\n";

if ($lines =~ / (\S.*\S) /) {
    print "this is 1: \n";
    print $1;
    if ($lines =~ / <td>(.*)< / ) {
        print "this is the 2nd 1: \n";
        print $1;
        print "the word was: $1.\n";
        $Time = $1;
        print $Time;
        print OUTPUT_FILE $Time;
    } else {
        print "2ND IF FAILED\n";
    }
} else {
    print "THIS FAILED\n";
}

close(INPUT_FILE);
close(OUTPUT_FILE);

Solution

Do NOT use regexps to parse HTML. There are a very large number of CPAN modules which do this for you much more effectively.
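
As one possible illustration, here is a minimal sketch using HTML::TreeBuilder (from the HTML-Tree distribution on CPAN). The choice of module is mine, not something the question specifies, and the sample row is wrapped in `<table>` tags as an assumption, since the question shows only the `<tr>` line. The helper name `extract_cells` is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;    # CPAN module (HTML-Tree distribution), not core Perl

# Sample input: the row from the question, wrapped in <table> tags
# (an assumption; the question shows only the <tr> line).
my $html = <<'END_HTML';
<table>
<tr class="Highlight"><td>Time Played</a></td><td></td><td>Artist</td><td width="1%"></td><td>Title</td><td>Label</td></tr>
</table>
END_HTML

# Hypothetical helper: return the text of every non-empty <td> cell,
# in document order.
sub extract_cells {
    my ($markup) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($markup);
    my @cells = grep { length } map { $_->as_text } $tree->look_down(_tag => 'td');
    $tree->delete;    # free the parse tree explicitly
    return @cells;
}

# Print the cell texts separated by tabs.
print join("\t", extract_cells($html)), "\n";
```

The parser copes with the stray `</a>` and the empty cells, which is where the regex attempts tend to break down. To write to PLAYLIST_TABLE.txt instead of the screen, print to a filehandle opened for appending, as in the original program.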
