Get available XPaths and their element names using HtmlAgilityPack

Question

I'm using a function to get all the available XPath expressions from an HTML document, using the HtmlAgilityPack library.

The problem is that I get expressions in this format:

/html[1]/body[1]/div[1]/div[1]/div[1]/div[1]/h4[1]/a[1]

I would like to improve it so that it also includes the names of the nodes/elements, like this:

/html/body/div[@class='infolinks']/div[@class='music']/div[@class='item']/div[@class='release']/h4[1]/a[@title]

But I don't know how to properly get their names with HtmlAgilityPack.

How can I do that?

Note: I'm not an XPath expert, so apologies if the syntax of the XPaths is wrong or I have misunderstood something.

The source code of the webpage I'm testing with:

<div class="infolinks"><input type="hidden" name="IL_IN_TAG" value="1"/></div><div id="main">

    <div class="music">

        <h2 class="boxtitle">New releases \ <small>
            <a href="/newalbums" title="New releases mp3 downloads" rel="bookmark">see all</a></small>
        </h2>

        <div class="item">

            <div class="thumb">
                <a href="http://www.mp3crank.com/curt-smith/deceptively-heavy-121861" rel="bookmark" lang="en" title="Curt Smith - Deceptively Heavy album downloads"><img width="100" height="100" alt="Mp3 downloads Curt Smith - Deceptively Heavy" title="Free mp3 downloads Curt Smith - Deceptively Heavy" src="http://www.mp3crank.com/cover-album/Curt-Smith-Deceptively-Heavy-400x400.jpg"/></a>
            </div>

            <div class="release">
                <h3>Curt Smith</h3>
                <h4>
                    <a href="http://www.mp3crank.com/curt-smith/deceptively-heavy-121861" title="Mp3 downloads Curt Smith - Deceptively Heavy">Deceptively Heavy</a>
                </h4>
                <script src="/ads/button.js"></script>
            </div>

            <div class="release-year">
                <p>Year</p>
                <span>2013</span>
            </div>

            <div class="genre">
                <p>Genre</p>
                <a href="http://www.mp3crank.com/genre/indie" rel="tag">Indie</a><a href="http://www.mp3crank.com/genre/pop" rel="tag">Pop</a>
            </div>

        </div>

        <div class="item">

            <div class="thumb">
                <a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" rel="bookmark" lang="en" title="Wolf Eyes - Lower Demos album downloads"><img width="100" height="100" alt="Mp3 downloads Wolf Eyes - Lower Demos" title="Free mp3 downloads Wolf Eyes - Lower Demos" src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg" /></a>
            </div>

            <div class="release">
                <h3>Wolf Eyes</h3>
                <h4>
                    <a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" title="Mp3 downloads Wolf Eyes - Lower Demos">Lower Demos</a>
                </h4>
                <script src="/ads/button.js"></script>
            </div>

            <div class="release-year">
                <p>Year</p>
                <span>2013</span>
            </div>

            <div class="genre">
                <p>Genre</p>
                <a href="http://www.mp3crank.com/genre/rock" rel="tag">Rock</a>
            </div>

        </div>

    </div>

</div>

The function to get XPaths:

Public Function GetXPaths(ByVal Document As HtmlAgilityPack.HtmlDocument) As List(Of String)

    Dim XPathList As New List(Of String)
    Dim XPath As String = String.Empty

    For Each Child As HtmlAgilityPack.HtmlNode In Document.DocumentNode.ChildNodes

        If Child.NodeType = HtmlAgilityPack.HtmlNodeType.Element Then
            GetXPaths(Child, XPathList, XPath)
        End If

    Next Child

    Return XPathList

End Function

Private Sub GetXPaths(ByVal Node As HtmlAgilityPack.HtmlNode,
                      ByRef XPathList As List(Of String),
                      Optional ByVal XPath As String = Nothing)

    XPath = Node.XPath

    If Not XPathList.Contains(XPath) Then
        XPathList.Add(XPath)
    End If

    For Each Child As HtmlAgilityPack.HtmlNode In Node.ChildNodes

        If Child.NodeType = HtmlAgilityPack.HtmlNodeType.Element Then
            GetXPaths(Child, XPathList, XPath)
        End If

    Next ' child

End Sub


These are the XPaths I use to retrieve some values. I would like the function above to produce more or less the same kind of fully qualified XPath representation.

Title = node.SelectSingleNode(".//div[@class='release']/h4/a[@title]").GetAttributeValue("title", "Unknown Title")
Cover = node.SelectSingleNode(".//div[@class='thumb']/a/img[@src]").GetAttributeValue("src", String.Empty)
Year = node.SelectSingleNode(".//div[@class='release-year']/span").InnerText
Genres = (From genre In node.SelectNodes(".//div[@class='genre']/a") Select genre.InnerText).ToArray
URL = node.SelectSingleNode(".//div[@class='release']/h4/a[@href]").GetAttributeValue("href", "Unknown URL")


Recommended answer

This appends a class attribute filter to the XPath if the corresponding element has a class attribute:

Private Sub GetHtmlXPaths(ByVal Node As HtmlAgilityPack.HtmlNode,
                          ByRef XPathList As List(Of String),
                          Optional ByVal XPath As String = Nothing)

    ' Append only the last step of this node's XPath (e.g. "/div[1]")
    ' to the path accumulated from its ancestors.
    XPath &= Node.XPath.Substring(Node.XPath.LastIndexOf("/"c))

    Const ClassNameFilter As String = "[@class='{0}']"
    Dim ClassName As String = Node.GetAttributeValue("class", String.Empty)

    ' Add the class filter only when the element has a class attribute.
    If Not String.IsNullOrEmpty(ClassName) Then
        XPath &= String.Format(ClassNameFilter, ClassName)
    End If

    If Not XPathList.Contains(XPath) Then
        XPathList.Add(XPath)
    End If

    ' Recurse into child elements, passing the path built so far.
    For Each Child As HtmlAgilityPack.HtmlNode In Node.ChildNodes

        If Child.NodeType = HtmlAgilityPack.HtmlNodeType.Element Then
            GetHtmlXPaths(Child, XPathList, XPath)
        End If

    Next Child

End Sub
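
The Sub above keeps the positional index and appends the class filter after it, so a step comes out as /div[1][@class='music'] rather than the /div[@class='music'] form asked for in the question. It also needs a public caller, like the one in the question. The following is only a sketch of one way to cover both points, not part of the original answer: the module name XPathListing, the string-based GetHtmlXPaths overload, and the helper CollectXPaths are hypothetical names introduced here. It assumes that dropping the index whenever a class attribute is present is acceptable, which means one listed XPath may match several elements (for example both div[@class='item'] blocks).

```vbnet
Imports System.Collections.Generic
Imports HtmlAgilityPack

Public Module XPathListing

    ' Hypothetical entry point (an assumption, not from the original answer):
    ' parses an HTML string and collects one XPath per distinct element path.
    Public Function GetHtmlXPaths(ByVal Html As String) As List(Of String)

        Dim Document As New HtmlDocument()
        Document.LoadHtml(Html)

        Dim XPathList As New List(Of String)

        For Each Child As HtmlNode In Document.DocumentNode.ChildNodes
            If Child.NodeType = HtmlNodeType.Element Then
                CollectXPaths(Child, XPathList, String.Empty)
            End If
        Next Child

        Return XPathList

    End Function

    ' Hypothetical variant of the recursive step: when the element has a
    ' class attribute, the positional index is replaced by the class filter.
    Private Sub CollectXPaths(ByVal Node As HtmlNode,
                              ByVal XPathList As List(Of String),
                              ByVal ParentXPath As String)

        ' Last step of the node's own XPath, e.g. "/div[1]".
        Dim LastStep As String = Node.XPath.Substring(Node.XPath.LastIndexOf("/"c))
        Dim ClassName As String = Node.GetAttributeValue("class", String.Empty)

        If Not String.IsNullOrEmpty(ClassName) Then
            ' "/div[1]" becomes "/div[@class='...']".
            LastStep = "/" & Node.Name & String.Format("[@class='{0}']", ClassName)
        End If

        Dim XPath As String = ParentXPath & LastStep

        ' Paths without an index can repeat, so duplicates are skipped.
        If Not XPathList.Contains(XPath) Then
            XPathList.Add(XPath)
        End If

        For Each Child As HtmlNode In Node.ChildNodes
            If Child.NodeType = HtmlNodeType.Element Then
                CollectXPaths(Child, XPathList, XPath)
            End If
        Next Child

    End Sub

End Module
```

Calling XPathListing.GetHtmlXPaths(htmlSource) on the page from the question should then list steps such as div[@class='release'] in place of div[1]. Elements without a class attribute, such as h4 and a, keep their positional index. A class value that itself contains a single quote would produce an invalid expression, so this sketch assumes class names without quotes.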
