XPath selects in HTMLAgilityPack don't work as expected

Problem description

I'm writing a simple screen-scraping program in C#. I need to select all inputs inside a single form with the id "aspnetForm" (there are two forms on the page, and I don't want the inputs from the other one). The inputs in this form are nested inside various tables and divs, or sit directly at the form's first child level.

So I wrote a really simple XPath query:

//form[@id='aspnetForm']//input

It works as expected in every browser I tested (Chrome, IE, Firefox): it returns what I want.

But in HTMLAgilityPack it doesn't work at all: SelectNodes always returns null.

These queries, which I wrote for testing, work fine but don't return what I want. The first selects only the inputs that are direct children of my form, and the second just returns the form itself:

//form[@id='aspnetForm']/input
//form[@id='aspnetForm']

Yes, I know that I can enumerate the nodes from the last query, or run another SelectNodes on its result, but I don't really want to do that. I want to use the same query as in browsers.

Is XPath currently broken in HTMLAgilityPack? Are there any alternative XPath implementations for C#?

UPDATE: Test code:

using HtmlAgilityPack;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace HtmlAGPTests
{
    [TestClass]
    public class XPathTests
    {
        private const string html =
                "<form id=\"aspnetForm\">" +
                "<input name=\"first\" value=\"first\" />" +
                "<div>" +
                    "<input name=\"second\" value=\"second\" />" +
                "</div>" +
                "</form>";

        private static HtmlNode GetHtmlDocumentNode()
        {
            var document = new HtmlDocument();
            document.LoadHtml(html);
            return document.DocumentNode;
        }

        [TestMethod]
        public void TwoLevelXpathTest()     // fails: inputNodes is actually null
        {
            var query = "//form[@id='aspnetForm']//input";  // the query I want to use
            var documentNode = GetHtmlDocumentNode();

            var inputNodes = documentNode.SelectNodes(query);

            Assert.IsTrue(inputNodes.Count == 2);
        }

        [TestMethod]
        public void TwoSingleLevelXpathsTest()     // works
        {
            var formQuery = "//form[@id='aspnetForm']";
            var inputQuery = "//input";
            var documentNode = GetHtmlDocumentNode();

            var formNode = documentNode.SelectSingleNode(formQuery);
            var inputNodes = formNode.SelectNodes(inputQuery);

            Assert.IsTrue(inputNodes.Count == 2);
        }

        [TestMethod]
        public void SingleLevelXpathTest()     // works
        {
            var query = "//form[@id='aspnetForm']";
            var documentNode = GetHtmlDocumentNode();

            var formNode = documentNode.SelectSingleNode(query);

            Assert.IsNotNull(formNode);
        }

    }
}

Solution

The unexpected behavior in your test occurs because the HTML contains a <form> element. Here is a related discussion:

Ariman: "I've found that after parsing, any [<form>] node does not have any child nodes. All nodes that should be inside the form [...] are created as its siblings rather than children."

VikciaR: "In the HTML specification the form tag can overlap, so HtmlAgilityPack handles this node a little differently..."

[CodePlex discussion: No child nodes for FORM objects]

As VikciaR suggests there, try modifying your test code initialization like this:

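private static HtmlNode GetHtmlDocumentNode()
{
    // Execute this line once, before any document is parsed:
    // ElementsFlags is static and is consulted while the HTML is parsed.
    HtmlNode.ElementsFlags.Remove("form");

    var document = new HtmlDocument();
    document.LoadHtml(html);

    return document.DocumentNode;
}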
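
As a rough end-to-end check, here is a minimal sketch of the question's two-level query with the form flag removed before parsing. The class FormFlagDemo and the helper CountMatches are hypothetical names introduced only for this illustration. The helper treats a null result from SelectNodes as zero matches, since HtmlAgilityPack returns null rather than an empty collection when nothing matches.

```csharp
using System;
using HtmlAgilityPack;

public static class FormFlagDemo
{
    // Same markup as in the question's test class.
    private const string Html =
        "<form id=\"aspnetForm\">" +
        "<input name=\"first\" value=\"first\" />" +
        "<div><input name=\"second\" value=\"second\" /></div>" +
        "</form>";

    // Hypothetical helper: parses the sample markup and counts the nodes
    // matched by the query, mapping a null result to zero.
    public static int CountMatches(string xpath)
    {
        var document = new HtmlDocument();
        document.LoadHtml(Html);

        var nodes = document.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // Assumption: the flag must be removed before LoadHtml runs,
        // because it affects how <form> is parsed.
        HtmlNode.ElementsFlags.Remove("form");

        Console.WriteLine(CountMatches("//form[@id='aspnetForm']//input"));
    }
}
```

With the flag removed, the form should keep both inputs as descendants, which is the count the question's TwoLevelXpathTest() asserts.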

Side note: the inputQuery value in the test method TwoSingleLevelXpathsTest() should be .//input. The leading dot (.) makes the query relative to the current node. Without it, the query searches from the document root and ignores the preceding formQuery: you could change formQuery to anything that doesn't return null, and inputQuery would still return the same result.
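
To make the difference between //input and .//input visible, here is a small sketch that uses a page with two forms. The markup, the class RelativeXPathDemo, and the helper Count are assumptions made up for this illustration; they are not part of the original test code.

```csharp
using System;
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Hypothetical markup: two forms, so an absolute query and a relative
    // query evaluated on the first form give different counts.
    private const string Html =
        "<form id=\"aspnetForm\">" +
        "<div><input name=\"a\" /></div>" +
        "<input name=\"b\" />" +
        "</form>" +
        "<form id=\"otherForm\"><input name=\"c\" /></form>";

    // Hypothetical helper: counts matches, mapping a null result to zero.
    public static int Count(HtmlNode context, string xpath)
    {
        var nodes = context.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    // Parses the sample markup and returns the node for aspnetForm.
    public static HtmlNode LoadForm()
    {
        // Removed before parsing so that the forms keep their children.
        HtmlNode.ElementsFlags.Remove("form");

        var document = new HtmlDocument();
        document.LoadHtml(Html);
        return document.DocumentNode.SelectSingleNode("//form[@id='aspnetForm']");
    }

    public static void Main()
    {
        var form = LoadForm();

        // Absolute path: evaluated from the document root, so the context
        // node is ignored and inputs from both forms are candidates.
        Console.WriteLine("//input  -> " + Count(form, "//input"));

        // Relative path: only descendants of the context node are candidates.
        Console.WriteLine(".//input -> " + Count(form, ".//input"));
    }
}
```

Under these assumptions, the absolute query should report all three inputs on the page, while the relative query should report only the two inside aspnetForm.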
