Retrieve attributes and span using HTMLAgilityPack library


Problem description


In this piece of HTML code:

<div class="item">

    <div class="thumb">
        <a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" rel="bookmark" lang="en" title="Wolf Eyes - Lower Demos album downloads">
        <img width="100" height="100" alt="Mp3 downloads Wolf Eyes - Lower Demos" title="Free mp3 downloads Wolf Eyes - Lower Demos" src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg" /></a>
    </div>

    <div class="release">
        <h3>Wolf Eyes</h3>
        <h4>
        <a href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" title="Wolf Eyes - Lower Demos">Lower Demos</a>
        </h4>
        <script src="/ads/button.js"></script>
    </div>

    <div class="release-year">
        <p>Year</p>
        <span>2013</span>
    </div>

    <div class="genre">
        <p>Genre</p>
        <a href="http://www.mp3crank.com/genre/rock" rel="tag">Rock</a>
        <a href="http://www.mp3crank.com/genre/pop" rel="tag">Pop</a>
    </div>

</div>

I know how to parse it in other ways, but I would like to retrieve this information using the HTMLAgilityPack library:

Title : Wolf Eyes - Lower Demos
Cover : http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg
Year  : 2013
Genres: Rock, Pop
URL   : http://www.mp3crank.com/wolf-eyes/lower-demos-121866

These values come from the following parts of the HTML:

Title : title="Wolf Eyes - Lower Demos"
Cover : src="http://www.mp3crank.com/cover-album/Wolf-Eyes-–-Lower-Demos.jpg"
Year  : <span>2013</span>
Genre1: <a href="http://www.mp3crank.com/genre/rock" rel="tag">Rock</a>
Genre2: <a href="http://www.mp3crank.com/genre/pop" rel="tag">Pop</a>
URL   : href="http://www.mp3crank.com/wolf-eyes/lower-demos-121866" 

This is what I'm trying, but I always get an "object reference not set" exception when trying to select a single node. Sorry, I'm very new to HTML. I've tried to follow the steps in this question: "HtmlAgilityPack basic how to get title and link?"

Public Class Form1

    Private htmldoc As HtmlAgilityPack.HtmlDocument = New HtmlAgilityPack.HtmlDocument
    Private htmlnodes As HtmlAgilityPack.HtmlNodeCollection = Nothing

    Private Title As String = String.Empty
    Private Cover As String = String.Empty
    Private Genres As String() = {String.Empty}
    Private Year As Integer = -0
    Private URL as String = String.Empty

    Private Sub Test() Handles MyBase.Shown

        ' Load the html document.
        htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))

        ' Select the (10 items) nodes.
        htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")

        ' Loop through the nodes.
        For Each node As HtmlAgilityPack.HtmlNode In htmlnodes

            Title = node.SelectSingleNode("//div[@class='release']").Attributes("title").Value
            Cover = node.SelectSingleNode("//div[@class='thumb']").Attributes("src").Value
            Year = CInt(node.SelectSingleNode("//div[@class='release-year']").Attributes("span").Value)
            Genres = ¿select multiple nodes?
            URL = node.SelectSingleNode("//div[@class='release']").Attributes("href").Value

        Next

    End Sub

End Class

Solution

Your mistake here is trying to read an attribute of a child node from the parent node you've found.

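When you call node.SelectSingleNode("//div[@class='release']") you get a matching div returned, but calling .Attributes on it returns only the attributes of the div tag itself, not those of any inner HTML elements. The div has no title, src or href attribute, so Attributes("title") returns Nothing, and reading .Value from it raises the "object reference not set" exception.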
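
To make the point above concrete, here is a minimal sketch. It assumes the HtmlAgilityPack package is referenced; the HTML fragment, the example.com URL and the module name are made up for illustration and only mimic the shape of the "release" div in the question.

```vbnet
Imports HtmlAgilityPack

Module AttributeLookupDemo
    Sub Main()
        Dim doc As New HtmlDocument()
        ' Made-up fragment shaped like the "release" div from the question.
        doc.LoadHtml("<div class='release'><h4>" &
                     "<a href='http://example.com/album' title='Some Album'>Some Album</a>" &
                     "</h4></div>")

        Dim releaseDiv As HtmlNode = doc.DocumentNode.SelectSingleNode("//div[@class='release']")

        ' The div carries only a class attribute, so looking up "title" on it yields Nothing.
        Dim titleOnDiv As HtmlAttribute = releaseDiv.Attributes("title")
        Console.WriteLine("title found on div: {0}", titleOnDiv IsNot Nothing)

        ' The title attribute belongs to the nested <a> element, so select that element first.
        Dim link As HtmlNode = releaseDiv.SelectSingleNode(".//a")
        Console.WriteLine("title found on a  : {0}", link.Attributes("title").Value)
    End Sub
End Module
```

Calling .Value on titleOnDiv in this sketch would reproduce the exception from the question, which is why the attribute has to be read from the inner element.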

It's possible to write XPATH queries that select the sub-node directly, e.g. //div[@class='release']/h4/a (see http://www.w3schools.com/xpath/xpath_syntax.asp for more information on XPATH). Although the examples there are for XML, most of the principles also apply to an HTML document.

Another approach is to use further XPATH calls on the node you've found. I've amended your code to make it work using this approach:

' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))

' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")

' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes

    Dim releaseNode = node.SelectSingleNode(".//div[@class='release']")
    ' Assumes the node was found and that it contains an <a> tag.
    Title = releaseNode.SelectSingleNode(".//a").Attributes("title").Value
    URL = releaseNode.SelectSingleNode(".//a").Attributes("href").Value

    Dim thumbNode = node.SelectSingleNode(".//div[@class='thumb']")
    Cover = thumbNode.SelectSingleNode(".//img").Attributes("src").Value

    Dim releaseYearNode = node.SelectSingleNode(".//div[@class='release-year']")
    Year = CInt(releaseYearNode.SelectSingleNode(".//span").InnerText)

    Dim genreNode = node.SelectSingleNode(".//div[@class='genre']")
    Dim genreLinks = genreNode.SelectNodes(".//a")
    Genres = (From n In genreLinks Select n.InnerText).ToArray()

    Console.WriteLine("Title : {0}", Title)
    Console.WriteLine("Cover : {0}", Cover)
    Console.WriteLine("Year  : {0}", Year)
    Console.WriteLine("Genres: {0}", String.Join(", ", Genres))
    Console.WriteLine("URL   : {0}", URL)

Next

Note that this code assumes the document is well formed and that every node, element and attribute it looks for exists. SelectSingleNode returns Nothing when no node matches, so in practice you will want to add error checking, e.g. If someNode Is Nothing Then ....
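
One possible shape for that error checking is sketched below. AttributeOrDefault is a hypothetical helper written for this example, not part of HtmlAgilityPack, and the HTML fragment and fallback strings are made up. The sketch relies on SelectSingleNode and SelectNodes returning Nothing when nothing matches (the library's default behaviour) and on HtmlNode.GetAttributeValue, which takes a default value for a missing attribute.

```vbnet
Imports HtmlAgilityPack

Module SafeExtraction
    ' Hypothetical helper: returns the named attribute of the first node matching
    ' xpath under parent, or fallback when the node or the attribute is missing.
    Function AttributeOrDefault(parent As HtmlNode, xpath As String,
                                attributeName As String, fallback As String) As String
        Dim found As HtmlNode = parent.SelectSingleNode(xpath)
        If found Is Nothing Then Return fallback
        Return found.GetAttributeValue(attributeName, fallback)
    End Function

    Sub Main()
        Dim doc As New HtmlDocument()
        ' Made-up item that deliberately lacks the "thumb" and "release-year" divs.
        doc.LoadHtml("<div class='item'><div class='release'><h4>" &
                     "<a href='http://example.com/album' title='Some Album'>Some Album</a>" &
                     "</h4></div></div>")

        Dim items As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='item']")
        ' Guard against a page with no items before looping.
        If items Is Nothing Then Return

        For Each item As HtmlNode In items
            Dim title = AttributeOrDefault(item, ".//div[@class='release']//a", "title", "(no title)")
            Dim cover = AttributeOrDefault(item, ".//div[@class='thumb']//img", "src", "(no cover)")

            ' Integer.TryParse avoids the exception CInt throws on non-numeric text.
            Dim releaseYear As Integer = 0
            Dim yearNode = item.SelectSingleNode(".//div[@class='release-year']/span")
            If yearNode IsNot Nothing Then
                Integer.TryParse(yearNode.InnerText.Trim(), releaseYear)
            End If

            Console.WriteLine("{0} | {1} | {2}", title, cover, releaseYear)
        Next
    End Sub
End Module
```

Whether a missing field should fall back to a placeholder, as here, or cause the whole item to be skipped depends on how the extracted data is used afterwards.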

Edit: I've amended the code above slightly so that each .SelectSingleNode call uses the ".//" prefix. This makes it work when there are several "item" nodes; without the leading dot, the query selects the first match in the whole document rather than the first match under the current node.
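
The difference between the two prefixes can be seen with a small sketch. The two-item fragment and the module name are made up for illustration, and HtmlAgilityPack is assumed to be referenced.

```vbnet
Imports HtmlAgilityPack

Module XPathScopeDemo
    Sub Main()
        Dim doc As New HtmlDocument()
        ' Two made-up items, each containing one span.
        doc.LoadHtml("<div class='item'><span>first</span></div>" &
                     "<div class='item'><span>second</span></div>")

        For Each item As HtmlNode In doc.DocumentNode.SelectNodes("//div[@class='item']")
            ' "//span" is evaluated from the document root, whichever node it is called on.
            Dim fromRoot As String = item.SelectSingleNode("//span").InnerText

            ' ".//span" is evaluated relative to the current item.
            Dim fromItem As String = item.SelectSingleNode(".//span").InnerText

            Console.WriteLine("//span -> {0}, .//span -> {1}", fromRoot, fromItem)
        Next
    End Sub
End Module
```

On the second pass through the loop the two values should differ: the root-based query still returns the span from the first item, while the relative query returns the span of the item being processed.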

If you want a shorter XPATH solution, here is the same code using that approach:

' Load the html document.
htmldoc.LoadHtml(IO.File.ReadAllText("C:\source.html"))

' Select the (10 items) nodes.
htmlnodes = htmldoc.DocumentNode.SelectNodes("//div[@class='item']")

' Loop through the nodes.
For Each node As HtmlAgilityPack.HtmlNode In htmlnodes

    Title = node.SelectSingleNode(".//div[@class='release']/h4/a[@title]").Attributes("title").Value
    URL = node.SelectSingleNode(".//div[@class='release']/h4/a[@href]").Attributes("href").Value

    Cover = node.SelectSingleNode(".//div[@class='thumb']/a/img[@src]").Attributes("src").Value

    Year = CInt(node.SelectSingleNode(".//div[@class='release-year']/span").InnerText)

    Dim genreLinks = node.SelectNodes(".//div[@class='genre']/a")
    Genres = (From n In genreLinks Select n.InnerText).ToArray()

    Console.WriteLine("Title : {0}", Title)
    Console.WriteLine("Cover : {0}", Cover)
    Console.WriteLine("Year  : {0}", Year)
    Console.WriteLine("Genres: {0}", String.Join(", ", Genres))
    Console.WriteLine("URL   : {0}", URL)
    Console.WriteLine()

Next
