Perl XPath statement with a conditional - is that possible?


Question

This question has been rephrased. I am using the CPAN Perl modules WWW::Mechanize to navigate a website, HTML::TreeBuilder::XPath to capture the content, and xacobeo to test my XPath code on the HTML/XML. The goal is to call this Perl script from a PHP-based website and upload the scraped contents into a database. Therefore, if content is "missing", it still needs to be accounted for.

Below is a tested, reduced sample depicting my challenge. Note:

  1. This page is filled dynamically and contains various ITEMS output for different stores; a different number of Products* will exist for each store, and each product listing may or may not have an itemized table underneath it.
  2. The captured data has to be in arrays, and the association of any itemized list (if it exists) with its Product listing has to be maintained.

The example XML below changes per store (as described above), but for brevity I show only one "type" of output. I realize that all the data could be captured into one array and then deciphered with regexes before uploading it into a database. I am seeking a better knowledge of XPath to help streamline this (and future) solution(s).

<!DOCTYPE XHTML>
<table id="8jd9c_ITEMS">
<tr><th style="color:red">The Products we have in stock!</th></tr>

<tr><td><span id="Product_NUTS">We have nuts!</span></td></tr>
<tr><td>
    <!--Table may or may not exist  -->
           <table>                                  
      <tr><td style="color:blue;text-indent:10px">Almonds</td></tr>
      <tr><td style="color:blue;text-indent:10px">Cashews</td></tr>
      <tr></tr>
    </table>
</td></tr>

<tr><td><span id="Product_VEGGIES">We have veggies!</span></td></tr>
<tr><td>
    <!--Table may or may not exist -->
    <table>
      <tr><td style="color:blue;text-indent:10px">Carrots</td></tr>
      <tr><td style="color:blue;text-indent:10px">Celery</td></tr>
      <tr></tr>
    </table>
</td></tr>

<tr><td><span id="Product_ALCOHOL">We have booze!</span></td></tr>
    <!--In this case, the table does not exist -->
</table>

The XPath statement:

'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/span/text()'

will find:

We have nuts!
We have veggies!
We have booze!

And an XPath statement of:

'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/table/tr/td/text()'

will find:

Almonds
Cashews
Carrots
Celery

The two XPath statements can be combined:

'//table[contains(@id, "ITEMS")]/tr[position() >1]/td/span/text() | //table[contains(@id, "ITEMS")]/tr[position() >1]/td/table/tr/td/text()'

to find:

We have nuts!
Almonds
Cashews
We have veggies!
Carrots
Celery
We have booze!

Again, the above array can be deciphered (in the real code) for its product-to-list association using regexes. But can the array be built using XPath in a manner that keeps that association?

For example (pseudo-speak, this does not work):

'//table[contains(@id, "ITEMS")]/tr[position()>1]/td/span/text() | 
if exists('//table[contains(@id, "ITEMS")]/tr[position() >1]/table)) 
then ("NoTable") else ("TableRef") | 
Save this result into @TableRef ('//table[contains(@id, "ITEMS")]/tr[position() >1]/table/tr/td/text()')'

It is not possible to build multi-dimensional arrays (in the traditional sense) in Perl; see perldoc perlref. But hopefully a solution similar to the above could create something like:

@ITEMS[0] => We have nuts!
@ITEMS[1] => nutsREF     <-- say, the last word of the span value + REF
@ITEMS[2] => We have veggies!
@ITEMS[3] => veggiesREF  <-- say, the last word of the span value + REF
@ITEMS[4] => We have booze!
@ITEMS[5] => NoTable     <-- value accounts for the missing info

@nutsREF[0] => Almonds
@nutsREF[1] => Cashews

@veggiesREF[0] => Carrots
@veggiesREF[1] => Celery 

In the real code the Products are known, so my @veggiesREF and my @nutsREF can be defined in anticipation of the XPath output.

I realize the XPath if/then/else functionality is in XPath 2.0. I am on an Ubuntu system and working locally, but I am still not clear on whether my apache2 server is using it or version 1.0. How do I check that?

Finally, if you can show how to call a Perl script from a PHP form submit AND how to pass a Perl array back to the calling PHP function, then that would go a long way to getting the bounty. :)

Thanks!

FINAL

The comments immediately below this post were directed at an initial post that was too vague. The subsequent re-post (and bounty) was answered by ikegami with a very creative approach that solved the pseudo problem, but it proved difficult for me to grasp and reuse in my real application, which entails multiple uses on various HTML pages. Around the 18th comment in our dialog I finally discovered the meaning and use of his ($cat), a Perl syntax I had not found documented. For new readers, understanding that syntax makes it possible to understand (and reformat) his intelligent solution to the problem. His post certainly meets the basic requirements sought in the OP, but it does not use HTML::TreeBuilder::XPath to do it.

jpalecek uses HTML::TreeBuilder::XPath but does not place the captured data into arrays for passing back to a PHP function and uploading into a database.

I have learned from both responders and hope this post helps others who are new to Perl, like myself. Any final contributions would be greatly appreciated.

Recommended answer

If I were to guess, your question is: "How do I get the following from the provided input?"

my $categorized_items = {
   'We have nuts!'    => [ 'Almonds', 'Cashews' ],
   'We have veggies!' => [ 'Carrots', 'Celery' ],
   'We have booze!'   => [ ],
};

If so, here's how I'd do it:

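use strict;
use warnings;

use Data::Dumper qw( Dumper );
use XML::LibXML  qw( );

# Parse the document that follows __DATA__ and start from its root element.
my $root = XML::LibXML->load_xml(IO=>\*DATA)->documentElement;

my %cat_items;
# Visit each row of the ITEMS table that holds a product heading (a <span>).
for my $cat_tr ($root->findnodes('//table[contains(@id, "ITEMS")]/tr[td/span]')) {
   # List assignment: $cat receives the first (here, only) span's text.
   my ($cat) = map $_->textContent(),
      $cat_tr->findnodes('td/span');

   # The itemized table, if any, sits in the row right after the heading row.
   # When it is absent, findnodes returns nothing and @items stays empty.
   my @items = map $_->textContent(),
      $cat_tr->findnodes('following-sibling::tr[position()=1]/td/table/tr/td');

   $cat_items{$cat} = \@items;
}

print(Dumper(\%cat_items));

__DATA__
...xml...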
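The flat @ITEMS layout sketched in the question, with a "NoTable" placeholder, can be approximated along the same lines. What follows is only a sketch under assumptions: the markup is a trimmed copy of the sample, the names @items and %lists are made up for illustration, and the "last word + REF" naming rule is taken from the question's wish list. libxml2 (which XML::LibXML wraps) implements XPath 1.0, so there is no if/then/else inside the expression itself; the existence test is done with the node's exists() method and the branching happens in Perl.

```perl
use strict;
use warnings;
use XML::LibXML;

# Trimmed copy of the sample markup from the question (assumed well-formed).
my $xml = <<'XML';
<table id="8jd9c_ITEMS">
<tr><th>The Products we have in stock!</th></tr>
<tr><td><span id="Product_NUTS">We have nuts!</span></td></tr>
<tr><td><table>
  <tr><td>Almonds</td></tr>
  <tr><td>Cashews</td></tr>
</table></td></tr>
<tr><td><span id="Product_ALCOHOL">We have booze!</span></td></tr>
</table>
XML

my $root = XML::LibXML->load_xml(string => $xml)->documentElement;

my @items;   # flat list: heading, then a reference name or "NoTable"
my %lists;   # reference name => array ref of itemized entries

# XPath for the itemized cells, relative to a heading row.
my $cells = 'following-sibling::tr[1]/td/table/tr/td';

for my $cat_tr ($root->findnodes('//table[contains(@id, "ITEMS")]/tr[td/span]')) {
    my $heading = $cat_tr->findvalue('td/span');

    if ($cat_tr->exists('following-sibling::tr[1]/td/table')) {
        # Hypothetical naming rule: last word of the heading plus "REF".
        my ($word) = $heading =~ /(\w+)\W*\z/;
        my $ref_name = "${word}REF";
        $lists{$ref_name} = [ map { $_->textContent } $cat_tr->findnodes($cells) ];
        push @items, $heading, $ref_name;
    }
    else {
        # No itemized table follows this heading.
        push @items, $heading, 'NoTable';
    }
}

print "$_\n" for @items;
```

Keeping the lists in a hash keyed by the reference name avoids having to predeclare one array per product.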

PS - What you have there isn't valid HTML.

  1. A TABLE element can't be placed directly inside a TR element. A TD element is missing.
  2. A TR element can't be empty. It must have at least one TH or TD element.
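On the question's last point, passing the result back to PHP, one common route is to have the Perl script print JSON on standard output. The sketch below assumes that approach; it is not something either responder proposed. It uses the core JSON::PP module, and the %cat_items hash here is a hard-coded stand-in for the one built by the loop in the answer.

```perl
use strict;
use warnings;
use JSON::PP ();

# Stand-in for the %cat_items hash built by the answer's loop.
my %cat_items = (
    'We have nuts!'    => [ 'Almonds', 'Cashews' ],
    'We have veggies!' => [ 'Carrots', 'Celery' ],
    'We have booze!'   => [],
);

# canonical() sorts hash keys so the output is stable between runs.
my $json = JSON::PP->new->canonical->encode(\%cat_items);

# The calling process reads this line from the script's standard output.
print "$json\n";
```

On the PHP side, the form handler could run the script with shell_exec() and turn the captured output into an associative array with json_decode($output, true); the script path and the way the form's parameters are passed to it are left open here. An empty list still arrives as an empty array, so a product without an itemized table remains accounted for.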
