Grep and Extract Data in Perl

Question

I have HTML content stored in a variable. How do I extract data that is found between a set of common tags in the page? For example, I am interested in the data (represented by DATA) kept between a set of tags that appear one line after the other:

...
<td class="jumlah">*DATA_1*</td>
<td class="ud"><a href="">*DATA_2*</a></td>
...

I would then like to store the mapping DATA_2 => DATA_1 in a hash.

Solution

Since it's HTML, you probably want the XPath module made for working with HTML, HTML::TreeBuilder::XPath.

First you'll need to parse your string using the HTML::TreeBuilder methods. Assuming your webpage's content is in a variable named $content, do it like this:

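use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;   # the XPath subclass provides findnodes/findvalue
$tree->parse_content($content);             # parse the string held in $content, not a file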

Now you can use XPath expressions to get iterators over the nodes you care about. This first expression gets all td nodes that are in a tr in a table in the body in the html element:

my $tdNodes = $tree->findnodes('/html/body/table/tr/td');

Finally you can just iterate over all the nodes in a loop to find what you want:

foreach my $node ($tdNodes->get_nodelist) {
  my $data = $node->findvalue('.'); # the content of the node
  print "$data\n";
}

See the HTML::TreeBuilder documentation for more on its methods and the NodeSet documentation for how to use the NodeSet result object. w3schools also has a passable XPath tutorial.

With all this, you should be able to do pretty robust HTML parsing to grab out any element you want. You can even specify classes, ids, and more in your XPath queries to be really specific about which nodes you want. In my opinion, parsing HTML using this modified XPath library is a lot faster and more maintainable than dealing with a bunch of one-off regexes.
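To tie this back to the question, here is a minimal sketch of building the DATA_2 => DATA_1 hash with class-specific XPath queries. It assumes that both td cells sit in the same tr, which the question's snippet suggests but does not confirm. The helper name extract_pairs and the sample values (10, alpha, 20, beta) are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical helper (not part of any library): returns a hash ref
# mapping the link text of each "ud" cell to the text of the "jumlah"
# cell in the same row.
sub extract_pairs {
    my ($content) = @_;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_content($content);

    my %map;
    # Assumption: each pair of cells lives in one <tr>.
    for my $row ($tree->findnodes('//tr[td[@class="jumlah"]]')) {
        my $value = $row->findvalue('td[@class="jumlah"]');
        my $key   = $row->findvalue('td[@class="ud"]/a');
        next unless length $key;    # skip rows without a link
        $map{$key} = $value;
    }

    $tree->delete;    # free the parse tree when done
    return \%map;
}

# Illustrative input; the real $content would hold the downloaded page.
my $content = <<'HTML';
<table>
  <tr>
    <td class="jumlah">10</td>
    <td class="ud"><a href="">alpha</a></td>
  </tr>
  <tr>
    <td class="jumlah">20</td>
    <td class="ud"><a href="">beta</a></td>
  </tr>
</table>
HTML

my $pairs = extract_pairs($content);
print "$_ => $pairs->{$_}\n" for sort keys %$pairs;
```

Selecting the row first and then using relative paths keeps each key paired with the value from the same row, which is harder to guarantee when two flat lists of td nodes are matched up by position.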
