Grep and Extract Data in Perl
Problem description
I have HTML content stored in a variable. How do I extract data that is found between a set of common tags in the page? For example, I am interested in the data (represented by *DATA_1* and *DATA_2*) kept between a set of tags that appear one line after the other:
...
<td class="jumlah">*DATA_1*</td>
<td class="ud"><a href="">*DATA_2*</a></td>
...
I would then like to store the mapping DATA_2 => DATA_1 in a hash.
Since it's HTML, you probably want the XPath module made for working with HTML, HTML::TreeBuilder::XPath.
First you'll need to parse your string using the HTML::TreeBuilder methods. Assuming your webpage's content is in a variable named $content, do it like this:
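use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);   # parse the string held in $content
$tree->eof;               # signal the end of input to the parser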
Now you can use XPath expressions to get iterators over the nodes you care about. This first expression gets all td nodes that are in a tr in a table in the body in the html element:
my $tdNodes = $tree->findnodes('/html/body/table/tr/td');
Finally you can just iterate over all the nodes in a loop to find what you want:
foreach my $node ($tdNodes->get_nodelist) {
    my $data = $node->findvalue('.');   # the content of the node
    print "$data\n";
}
See the HTML::TreeBuilder documentation for more on its methods and the NodeSet documentation for how to use the NodeSet result object. w3schools also has a passable XPath tutorial.
With all this, you should be able to do pretty robust HTML parsing to grab out any element you want. You can even specify classes, ids, and more in your XPath queries to be really specific about which nodes you want. In my opinion, parsing HTML using this modified XPath library is a lot faster and more maintainable than dealing with a bunch of one-off regexes.
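To tie this back to the original question, here is a minimal sketch of building the DATA_2 => DATA_1 hash with class-qualified XPath queries. It assumes that each pair of jumlah and ud cells sits in the same tr, which the question does not confirm, and the sample $content and the %map name are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample input; the real $content would hold the fetched page.
my $content = <<'HTML';
<html><body><table>
<tr><td class="jumlah">10</td><td class="ud"><a href="">alpha</a></td></tr>
<tr><td class="jumlah">20</td><td class="ud"><a href="">beta</a></td></tr>
</table></body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);
$tree->eof;

my %map;
# Assumption: both cells of a pair live in the same row, so select only
# rows that contain a "jumlah" cell and a "ud" cell.
for my $row ($tree->findnodes('//tr[td[@class="jumlah"] and td[@class="ud"]]')) {
    my $key   = $row->findvalue('./td[@class="ud"]/a');    # DATA_2
    my $value = $row->findvalue('./td[@class="jumlah"]');  # DATA_1
    $map{$key} = $value;
}

$tree->delete;   # release the parse tree once the values are copied out

print "$_ => $map{$_}\n" for sort keys %map;
```

If the two cells are not in the same row on the real page, the row-based query would need to be adjusted to match the actual layout.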