Too long xpath with DOMXpath query/evaluate return nothing


Problem description


I am using PHP to retrieve content for a given URL and XPATH. I use DOMDocument / DOMXPath (with query or evaluate).

For short XPaths I obtain the correct result, but for longer XPaths it does not work. (The XPaths seem to be correct: I obtained them with XPather, a Firefox plugin, and re-tested them with YQL.)

Do you have any advice on this curious problem?

Example of code:

$doc = new DOMDocument();
$myXMLString = file_get_contents('http://stackoverflow.com/questions/4097230/too-long-xpath-with-domxpath-query-evaluate-return-nothing');
@$doc->loadHTML($myXMLString); //@ to suppress warnings
                               //(useful for markup with unclosed tags)
$xpath = new DOMXPath($doc);

$fullPath ="/html/body/small/path"; //works
//$fullPath = "/html/body/full/path/with/lot/of/markup";//does not work
$entries = $xpath->query($fullPath);
//or ->evaluate($fullPath) (same behaviour)
//$entries is a DOMNodeList (empty for a long path query,
//                           correct for a short path query)

I tested with attribute restrictions, but that does not seem to change anything (short XPaths work, longer ones no longer do).

Example, for this current page:

$fullPath = "/html
              /body
               /div[4]
                /div[@id='content']
                 /div[@id='question-header']
                  /h1
                   /a";//works (retrieve the question title)
$fullPath = "/html
              /body
               /div[4]
                /div[@id='content']
                 /div[@id='mainbar']
                  /div[@id='question']
                   /table
                    /tbody
                     /tr[2]
                      /td[2]
                       /div[@id='comments-4097230']
                        /table
                         /tbody
                          /tr[@id='comment-4408626']
                           /td[2]
                            /div
                             /a"; //doesn't work
                                  //(should retrieve 'gaby' from comment)


Edit:

I tested with the SimpleXML library and I get exactly the same behavior (good result for a short query, nothing for a long query).


Edit 2:

I also shortened the longest XPath by deleting some of the leading elements, and then it works. BTW I really do not understand why a full, correct XPath does not work.

Solution

Let's go through this step by step:

Step 1: replicating the error.

After verifying that the XPath will indeed not return a result, I wrote a little script to see how deep the XPath will go before it breaks:

$xp = new DOMXPath($doc); // $doc is the DOMDocument loaded in the question
$xpath = '';              // grows by one path segment per iteration
foreach (explode('/', $fullPath) as $segment) {
    $xpath .= trim($segment);
    echo '-------------------------------------------', PHP_EOL,
         'Trying: ', $xpath, PHP_EOL,
         '-------------------------------------------', PHP_EOL;
    // string() yields the text of the first matching node, or '' if nothing matches
    echo $xp->evaluate("string($xpath)"), PHP_EOL;
    $xpath .= '/';
}

The last path it returns a result for is:

/html/body/div[4]/div[@id='content']/div[@id='mainbar']/div[@id='question']/table
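The same probing idea can be wrapped in a small function that reports the deepest prefix of a path that still matches something. This is only a sketch: the function name `deepestMatchingPrefix` and the inline markup are made up for illustration, and the naive split on `/` assumes no predicate in the path contains a slash.

```php
<?php
// Hypothetical helper: returns the longest leading part of $fullPath
// that still selects at least one node, or '' if even the first step fails.
function deepestMatchingPrefix(DOMXPath $xp, string $fullPath): string
{
    $matched = '';
    $current = '';
    // Naive split: assumes no '/' occurs inside a predicate.
    foreach (explode('/', trim($fullPath, '/')) as $segment) {
        $current .= '/' . trim($segment);
        $nodes = $xp->query($current);
        if ($nodes === false || $nodes->length === 0) {
            break; // this step is where the path stops matching
        }
        $matched = $current;
    }
    return $matched;
}

// Stand-in markup (not the real page): a table without <tbody>.
$html = '<html><body><div id="question"><table>'
      . '<tr><td><a>Gaby</a></td></tr>'
      . '</table></div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);
$deepest = deepestMatchingPrefix(
    $xp,
    "/html/body/div[@id='question']/table/tbody/tr/td/a"
);
echo $deepest, PHP_EOL;
```

The step right after the returned prefix is the one to compare against the served markup.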


Step 2: checking the markup

So I checked the markup returned by DOMDocument::saveHTML() to see what it looks like and there was no <tbody> (reformatted for readability):

<div id="question">
    <div class="everyonelovesstackoverflow" id="adzerk1"></div>
    <table>
        <tr><td class="votecell">

I then checked the source of this very page to see whether DOM was throwing the <tbody> away or whether it really does not exist. It wasn't there. Apparently, Firebug inserts it, which would explain why you got the result with XPather (but not why you got it with YQL).

Step 3: verification and conclusion

I removed the <tbody> from the XPath and reran the script. No problems. Returns "Gaby".

While I suspected a bug in Firebug first, Alejandro commented that this would happen in IE's DeveloperTools, too. I then suspected the element was added by JavaScript but could not verify that. After some more research Alejandro pointed me to "Why does firebug add <tbody> to <table>?". It is actually neither Firebug nor JavaScript, but the browsers themselves.

So to modify my conclusion:

Don't trust the markup you see rendered in the browser, because it may have been modified by the browser or other technologies. DOM will only download what is served directly. If you run into similar issues again, you now know how to approach them.
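To make the difference concrete, here is a minimal sketch using the classic DOMDocument (libxml) parser. The inline markup is a made-up stand-in for the served page, not the real Stack Overflow source. It compares a path as a browser inspector would report it with a path that matches the markup as served.

```php
<?php
// Made-up stand-in for the served markup: the table has no <tbody>.
$html = '<html><body><div id="question"><table>'
      . '<tr><td><a>Gaby</a></td></tr>'
      . '</table></div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);

// Path as copied from a browser tool, including the implied <tbody>.
$browserPath = "/html/body/div[@id='question']/table/tbody/tr/td/a";
// Path that follows the markup as it was actually served.
$servedPath  = "/html/body/div[@id='question']/table/tr/td/a";

// Does the parsed document contain a <tbody> at all?
$hasTbody      = strpos($doc->saveHTML(), '<tbody') !== false;
$browserResult = $xp->evaluate("string($browserPath)");
$servedResult  = $xp->evaluate("string($servedPath)");

var_dump($hasTbody, $browserResult, $servedResult);
```

Only the second path finds the link text, because the parser keeps the table exactly as served.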


Some additional sidenotes

Unless you need to modify the markup before feeding it to DOM, you do not have to use file_get_contents() to load the content. You can use DOM's loadHTMLFile():

$dom->loadHTMLFile('http://www.example.com/foo.htm');

Also, the proper way to suppress errors is to tell libxml to use its internal error handler. But instead of handling the errors, you simply clear them. This will only affect errors relating to libxml, e.g. parsing errors (as opposed to all PHP errors):

libxml_use_internal_errors(TRUE);
libxml_clear_errors();
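If you would rather look at the parse errors than discard them unseen, they can be read with libxml_get_errors() before clearing. A minimal sketch, assuming a deliberately sloppy, made-up HTML string; how many errors libxml reports for it (if any) depends on the libxml version.

```php
<?php
// Made-up, sloppy markup: <p> and <b> are never closed.
$html = '<html><body><p>Hello <b>world</body></html>';

// Buffer libxml errors internally instead of emitting PHP warnings;
// the call returns the previous setting so it can be restored later.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$loaded = $doc->loadHTML($html);

// Array of LibXMLError objects collected during parsing (may be empty).
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ')', PHP_EOL;
}

// Empty the buffer so old errors do not leak into later parses.
libxml_clear_errors();
$remaining = libxml_get_errors();

// Restore whatever error handling mode was active before.
libxml_use_internal_errors($previous);
```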

Finally, XPath queries can be done in relation to a context node. So while the long XPath is efficient in terms of lookup time, you could simply use getElementById() to get the deepest known node and then use a short XPath against it.

In other words:

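libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.example.com/foo.htm');
libxml_clear_errors();
$xp = new DOMXPath($dom); // XPath evaluator bound to the loaded document
echo $xp->evaluate(
    'string(td[2]/div/a)', // relative path, resolved against the context node
    $dom->getElementById('comment-4408626'));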

will return "Gaby" as well.
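For a version that runs without network access, the same context-node lookup can be tried on an inline string. This is only a sketch: the markup below is invented to mirror the comment row, and the fallback to an attribute query is an extra safety net that the answer itself does not use.

```php
<?php
// Invented markup mirroring the comment row from the question.
$html = '<html><body><table>'
      . '<tr id="comment-4408626"><td>1</td>'
      . '<td><div><a>Gaby</a></div></td></tr>'
      . '</table></body></html>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);

// Deepest node we can address directly by its id.
$context = $dom->getElementById('comment-4408626');
if ($context === null) {
    // Fallback (an addition for robustness): locate the node by attribute.
    $context = $xp->query("//*[@id='comment-4408626']")->item(0);
}

// Relative XPath, evaluated from the context node instead of the root.
$name = $xp->evaluate('string(td[2]/div/a)', $context);
echo $name, PHP_EOL;
```

Because the relative path starts at the row itself, it does not depend on whether a <tbody> sits between the table and the row.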
