HTML sorting with Perl regex


Problem description

I have an HTML file containing an HTML table with links to scientific papers, their authors, and their year of publication. The table is sorted from oldest to newest. I need to re-sort it by parsing the file and writing a new file in which the rows run from newest to oldest.

Here is a small Perl script that should do the job, but it produces only partially sorted results:

local $/=undef;
open(FILE, "pubTable.html")  or die "Couldn't open file: $!";
binmode FILE;
my $html = <FILE>; 
open (OUTFILE, ">>sorted.html") || die "Can't open output file.\n";
map{print OUTFILE "<tr>$_->[0]</tr>"} 
sort{$b->[1] <=> $a->[1]} 
map{[$_, m|, +(\d{4}).*</a>|]}
$html =~ m|<tr>(.*?)</tr>|gs;
close (FILE);  
close (OUTFILE);

And here is my input file: link

and what I get as an output: link

From the output you can see that the order is mostly right, but some rows are misplaced: for example, 1993 shows up after 1992 instead of near the beginning of the list.

Solution

There was a problem with the regex in the map because of the following lines in the html.

<a href="http://www.icp.uni-stuttgart.de/~hilfer/publikationen/pdfo/">,{UCLA}-Report 982051,Los Angeles,,1989,</a></td>   </tr>

and

<a href="http://www.icp.uni-stuttgart.de/~hilfer/publikationen/pdfo/">Phys.Rev.Lett., <b> 60</b>, 1514, 1988</a></td>   </tr>
<a href="http://www.icp.uni-stuttgart.de/~hilfer/publikationen/pdfo/">Phys. Rev. B, <b> 45</b>, 7115, 1992</a></td>   </tr>
<a href="http://www.icp.uni-stuttgart.de/~hilfer/publikationen/pdfo/">J.Chem.Phys., <b> 96</b>, 2269, 1992</a></td>   </tr>

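In the 1989 line the year is followed by a comma and has no whitespace in front of it. The original pattern requires at least one space after the comma (, +), so it does not match that row at all and the row ends up with no sort key. Because of that, the script threw a lot of warnings, and since an undefined key counts as 0 in the numeric comparison, that line was always put at the bottom.

In the other three lines a different four-digit number comes before the year. The pattern matches the first comma and space followed by four digits (\d{4}), and the trailing .* then swallows everything up to </a>, including the real year. So the sort used those other numbers (1514, 7115, 2269) as keys, and they got mixed in with the years.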
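
To see which key the original pattern picks for each of the rows quoted above, a small check like the following can help. It is only a sketch: old_capture is a hypothetical helper name, and the strings are just the tail of each row as shown in this answer, not the full input file.

```perl
use strict;
use warnings;

# Tail of the four problem rows quoted above (not the full input file).
my @rows = (
    ',{UCLA}-Report 982051,Los Angeles,,1989,</a></td>   </tr>',
    'Phys.Rev.Lett., <b> 60</b>, 1514, 1988</a></td>   </tr>',
    'Phys. Rev. B, <b> 45</b>, 7115, 1992</a></td>   </tr>',
    'J.Chem.Phys., <b> 96</b>, 2269, 1992</a></td>   </tr>',
);

# Hypothetical helper: returns what the original pattern captures,
# or undef when the pattern does not match at all.
sub old_capture {
    my ($row) = @_;
    my ($key) = $row =~ m|, +(\d{4}).*</a>|;
    return $key;
}

# Print the captured sort key next to each row.
for my $row (@rows) {
    my $key = old_capture($row);
    printf "%-6s <- %s\n", defined $key ? $key : 'undef', $row;
}
```

With these four strings the first row yields no key at all, and the other three yield the number that precedes the year rather than the year itself.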

You need to adjust the regex accordingly to fix those issues.

Before:

map{[$_, m|, +(\d{4}).*</a>|]}

After:

map{[$_, m|, *(\d{4}),?</a>|]}
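
As a rough sketch of how the adjusted pattern fits into the whole pipeline, the version below works on an in-memory string instead of pubTable.html. The helper name sort_rows_newest_first, the "#" hrefs, and the three sample rows are assumptions made for illustration; falling back to 0 for rows without a year is also an added choice, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the table HTML as a string and returns
# the rows re-sorted newest first.
sub sort_rows_newest_first {
    my ($html) = @_;

    # Grab the content of every <tr>...</tr> (/s lets . span newlines).
    my @rows = $html =~ m|<tr>(.*?)</tr>|gs;

    # Pair each row with its year. Rows without a year get 0 so they
    # sort last without "uninitialized value" warnings (an assumption,
    # not something the original script does).
    my @keyed = map {
        my ($year) = m|, *(\d{4}),?</a>|;
        [ $_, defined $year ? $year : 0 ]
    } @rows;

    # Sort numerically, descending, and wrap each row in <tr> again.
    return map { "<tr>$_->[0]</tr>" }
           sort { $b->[1] <=> $a->[1] } @keyed;
}

# Illustrative input: three rows modelled on the lines quoted above.
my $html = join "\n",
    '<tr><td><a href="#">Phys.Rev.Lett., <b> 60</b>, 1514, 1988</a></td>   </tr>',
    '<tr><td><a href="#">,{UCLA}-Report 982051,Los Angeles,,1989,</a></td>   </tr>',
    '<tr><td><a href="#">Phys. Rev. B, <b> 45</b>, 7115, 1992</a></td>   </tr>';

print "$_\n" for sort_rows_newest_first($html);
```

When writing the result to a file, note that the original script opens sorted.html with ">>", which appends; opening it with ">" instead would avoid piling up output from earlier runs.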
