使用XPath提取节点值 [英] Extracting node values using XPath

查看:88
本文介绍了使用XPath提取节点值的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

amazon.com的一部分中,我要从中提取每个项目的数据(仅节点值,而不是链接).

我要查找的值在内部,并且<span class="narrowValue">

<ul data-typeid="n" id="ref_1000">
    <li style="margin-left: -18px">
        <a href="/s/ref=sr_ex_n_0?rh=i%3Aaps%2Ck%3Ahow+to+grow+tomatoes&amp;sort=salesrank&amp;keywords=how+to+grow+tomatoes&amp;ie=UTF8&amp;qid=1327603358">
            <span class="expand">Any Department</span>
        </a>
    </li>
    <li style="margin-left: 8px">
        <strong>Books</strong>
    </li>
    <li style="margin-left: 6px">
        <a href="/s/ref=sr_nr_n_0?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A48&amp;bbn=1000&amp;sort=salesrank&amp;keywords=how+to+grow+tomatoes&amp;ie=UTF8&amp;qid=1327603358&amp;rnid=1000">
            <span class="refinementLink">Crafts, Hobbies & Home</span><span class="narrowValue">(19)</span>
        </a>
    </li>
    <li style="margin-left: 6px">
       <a href="/s/ref=sr_nr_n_1?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A10&amp;bbn=1000&amp;sort=salesrank&amp;keywords=how+to+grow+tomatoes&amp;ie=UTF8&amp;qid=1327603358&amp;rnid=1000">
            <span class="refinementLink">Health, Fitness & Dieting</span><span class="narrowValue">(3)</span>
        </a>
    </li>
    <li style="margin-left: 6px">
        <a href="/s/ref=sr_nr_n_2?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A6&amp;bbn=1000&amp;sort=salesrank&amp;keywords=how+to+grow+tomatoes&amp;ie=UTF8&amp;qid=1327603358&amp;rnid=1000">
            <span class="refinementLink">Cookbooks, Food & Wine</span><span class="narrowValue">(2)</span>
        </a>
    </li>
</ul>

我该如何使用XPath?

该代码来自链接

以下表达式应该起作用:

//*[@id='ref_1000']/li/a/span[@class='narrowValue']

为了获得更好的性能,您可以提供一条直接指向该表达式开头的路径,但是提供的表达式更灵活(假设您可能需要跨多个页面使用它).

还请记住,您的HTML解析器可能会生成与Firebug(我在此进行测试)生成的结果树不同的结果树.这是一个更加灵活的解决方案:

//*[@id='ref_1000']//span[@class='narrowValue']

灵活性会带来潜在的性能(和准确性)成本,但通常是处理标签汤的唯一选择.

There is a section of amazon.com from which I want to extract the data (node value only, not the link) for each item.

The value I'm looking for is inside and <span class="narrowValue">

<ul data-typeid="n" id="ref_1000">
    <li style="margin-left: -18px">
        <a href="/s/ref=sr_ex_n_0?rh=i%3Aaps%2Ck%3Ahow+to+grow+tomatoes&amp;sort=salesrank&amp;keywords=how+to+grow+tomatoes&amp;ie=UTF8&amp;qid=1327603358">
            <span class="expand">Any Department</span>
        </a>
    </li>
    <li style="margin-left: 8px">
        <strong>Books</strong>
    </li>
    <li style="margin-left: 6px">
        <a href="/s/ref=sr_nr_n_0?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A48&amp;bbn=1000&amp;sort=salesrank&amp;keywords=how+to+grow+tomatoes&amp;ie=UTF8&amp;qid=1327603358&amp;rnid=1000">
            <span class="refinementLink">Crafts, Hobbies & Home</span><span class="narrowValue">(19)</span>
        </a>
    </li>
    <li style="margin-left: 6px">
       <a href="/s/ref=sr_nr_n_1?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A10&amp;bbn=1000&amp;sort=salesrank&amp;keywords=how+to+grow+tomatoes&amp;ie=UTF8&amp;qid=1327603358&amp;rnid=1000">
            <span class="refinementLink">Health, Fitness & Dieting</span><span class="narrowValue">(3)</span>
        </a>
    </li>
    <li style="margin-left: 6px">
        <a href="/s/ref=sr_nr_n_2?rh=k%3Ahow+to+grow+tomatoes%2Cn%3A283155%2Cp_n_feature_browse-bin%3A618073011%2Cn%3A%211000%2Cn%3A6&amp;bbn=1000&amp;sort=salesrank&amp;keywords=how+to+grow+tomatoes&amp;ie=UTF8&amp;qid=1327603358&amp;rnid=1000">
            <span class="refinementLink">Cookbooks, Food & Wine</span><span class="narrowValue">(2)</span>
        </a>
    </li>
</ul>

How could I do this with XPath?

the code is from the link amazon kindle search

currently i am trying

$rank=array();

$words = $xpath->query('//ul[@id="ref_1000"]/li/a/span[@class="refinementLink"]');
foreach ($words as $word) {

        $rank[]=(trim($word->nodeValue));


 }
 var_dump($rank);

解决方案

The following expression should work:

//*[@id='ref_1000']/li/a/span[@class='narrowValue']

For better performance you could provide a direct path to the start of this expression, but the one provided is more flexible (given that you probably need this to work across multiple pages).

Keep in mind, also, that your HTML parser might generate a different result tree than the one produced by Firebug (where I tested). Here's an even more flexible solution:

//*[@id='ref_1000']//span[@class='narrowValue']

Flexibility comes with potential performance (and accuracy) costs, but it's often the only choice when dealing with tag soup.

这篇关于使用XPath提取节点值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆