xpath: extract data from a node using xpath


Question

I want to extract only the sales rank (which in this case is 5)

Amazon Best Sellers Rank: #5 in Books (See Top 100 in Books)

from the web page: http://www.amazon.com/Mockingjay-Hunger-Games-Book-3/dp/0439023513/ref=tmm_hrd_title_0

So far I have gotten down to this, which selects "Amazon Best Sellers Rank:":

//li[@id='SalesRank']/b/text()

I am using PHP DOMDocument and DOMXPath.

Recommended answer

You can use pure XPath:

substring-before(normalize-space(/html/body//ul/li[@id="SalesRank"]/b[1]/following-sibling::text()[1])," ")
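
To see how the expression behaves without fetching the live page, here is a minimal sketch run against an inline fragment. The markup in $html is an assumption that imitates the list item quoted in the question; the real page may differ. The variable names are illustrative. Since the question asks for only the number, the sketch also wraps the result in substring-after() to drop the leading "#".

```php
<?php
// Hypothetical markup imitating the SalesRank list item; the real page may differ.
$html = <<<'HTML'
<html><body><ul>
<li id="SalesRank"><b>Amazon Best Sellers Rank:</b> #5 in Books (<a href="#">See Top 100 in Books</a>)</li>
</ul></body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Text node right after the <b> label, whitespace-normalized: "#5 in Books ("
$node = 'normalize-space(//li[@id="SalesRank"]/b[1]/following-sibling::text()[1])';

// First space-delimited token, still carrying the "#"
$rankWithHash = $xpath->evaluate("substring-before($node, ' ')");

// Same token with everything up to and including "#" removed
$rankOnly = $xpath->evaluate("substring-after(substring-before($node, ' '), '#')");

var_dump($rankWithHash); // string(2) "#5"
var_dump($rankOnly);     // string(1) "5"
```

DOMXPath::evaluate() returns a plain PHP string here because the outermost XPath function yields a string rather than a node set.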

However, if your input is a bit messy you might get more reliable results by using XPath to grab the parent node's text, and then using a regex on the text to get the specific thing you want.
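
As a rough illustration of why the regex route can be more forgiving, the sketch below feeds a deliberately messy fragment (an unclosed li, a line break inside the label, and a thousands separator in the rank) through the same idea. The fragment and the comma-tolerant pattern are assumptions made for this example, not something taken from the actual page.

```php
<?php
// Hypothetical, deliberately messy fragment: unclosed <li>, stray whitespace,
// and a thousands separator in the rank.
$html = '<html><body><ul><li id="SalesRank"><b>Amazon Best Sellers
   Rank:</b>
      #1,234 in Books (<a href="#">See Top 100 in Books</a>)</ul></body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// string() flattens the whole <li>, child elements included, into one text value.
$text = $xpath->evaluate('string(//li[@id="SalesRank"])');

$rank = null;
if (preg_match('/Best\s+Sellers\s+Rank:\s*#([\d,]+)/ui', $text, $m)) {
    $rank = (int) str_replace(',', '', $m[1]);   // "1,234" becomes 1234
}

var_dump($rank); // int(1234)
```

Because the pattern matches on the label text rather than on the position of a particular text node, it keeps working if extra inline elements or whitespace appear between the label and the number.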

Demonstration of both methods using PHP with DOMDocument and DOMXPath:

// Method 1: XPath only
$xp_salesrank = 'substring-before(normalize-space(/html/body//li[@id="SalesRank"]/b[1]/following-sibling::text()[1])," ")';

// Method 2: XPath and Regex
$regex_ranktext = 'string(/html/body//li[@id="SalesRank"])';
$regex_salesrank = '/Best\s+Sellers\s+Rank:\s*(#\d+)\s+/ui';

// Test URLs
$urls = array(
    'http://rads.stackoverflow.com/amzn/click/0439023513',
    'http://www.amazon.com/Mockingjay-Final-Hunger-Games-ebook/dp/B003XF1XOQ/ref=tmm_kin_title_0?ie=UTF8&m=AG56TWVU5XWC2',
);

// Results
$ranks = array();
$ranks_regex = array();

foreach ($urls as $url) {
    $d = new DOMDocument();
    $d->loadHTMLFile($url);
    $xp = new DOMXPath($d);

    // Method 1: use pure xpath
    $ranks[] = $xp->evaluate($xp_salesrank);

    // Method 2: use xpath to get a section of text, then regex for more specific item
    // This method is probably more forgiving of bad HTML.
    $rank_regex = '';
    $ranktext = $xp->evaluate($regex_ranktext);
    if ($ranktext) {
        if (preg_match($regex_salesrank, $ranktext, $matches)) {
            $rank_regex = $matches[1];
        }
    }
    $ranks_regex[] = $rank_regex;

}

assert($ranks===$ranks_regex); // Both methods should be the same.
var_dump($ranks);
var_dump($ranks_regex);

The output I get is:

array(2) {
  [0]=>
  string(2) "#4"
  [1]=>
  string(2) "#3"
}
array(2) {
  [0]=>
  string(2) "#4"
  [1]=>
  string(2) "#3"
}
