Parsing dl with HtmlAgilityPack

Problem description

This is the sample HTML I am trying to parse with Html Agility Pack in ASP.NET (C#):

<div class="content-div">
    <dl>
        <dt>
            <b><a href="1.html" title="1">1</a></b>
        </dt>
        <dd> First Entry</dd>
        <dt>
            <b><a href="2.html" title="2">2</a></b>
        </dt>
        <dd> Second Entry</dd>
        <dt>
            <b><a href="3.html" title="3">3</a></b>
        </dt>
        <dd> Third Entry</dd>
    </dl>
</div>

The values I want are:


  • Hyperlink -> 1.html
  • Anchor text -> 1
  • Inner text of dd -> First Entry

(These examples come from the first entry, but I want the same three values for every entry in the list.)

This is the code I am currently using:

var webGet = new HtmlWeb();
var document = webGet.Load(url2);
var parsedValues =
   from info in document.DocumentNode.SelectNodes("//div[@class='content-div']")
   from content in info.SelectNodes("dl//dd")
   from link in info.SelectNodes("dl//dt/b/a")
       .Where(x => x.Attributes.Contains("href"))
   select new 
   {
       Text = content.InnerText,
       Url = link.Attributes["href"].Value,
       AnchorText = link.InnerText,
   };

GridView1.DataSource = parsedValues;
GridView1.DataBind();

The problem is that the link and anchor text values come out correctly, but the inner text does not: the first entry's text is repeated once for every link, then the second entry's text is repeated the same way, and so on. In case that explanation is unclear, here is the output I get with this code:

First Entry     1.html  1
First Entry     2.html  2
First Entry     3.html  3
Second Entry    1.html  1
Second Entry    2.html  2
Second Entry    3.html  3
Third Entry     1.html  1
Third Entry     2.html  2
Third Entry     3.html  3

whereas what I am trying to get is:

First Entry      1.html     1
Second Entry     2.html     2
Third Entry      3.html     3

I am pretty new to HAP and have very little knowledge of XPath, so I am sure I am doing something wrong here, but I could not make it work even after spending hours on it. Any help would be much appreciated.

Recommended answer
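
Some context on the output shown in the question: consecutive from clauses in a LINQ query behave like nested loops, so two independent from clauses (one over the dd nodes, one over the links) yield every combination of the two sequences. The sketch below reproduces that effect with plain arrays instead of HTML nodes; the names CrossJoinDemo, texts and urls are illustrative only and do not come from the question.

```csharp
using System;
using System.Linq;

class CrossJoinDemo
{
    static void Main()
    {
        // Stand-ins for the dd texts and the link hrefs from the sample HTML.
        var texts = new[] { "First Entry", "Second Entry", "Third Entry" };
        var urls = new[] { "1.html", "2.html", "3.html" };

        // Two independent from clauses: every text is combined with every url.
        var rows =
            from text in texts
            from url in urls
            select text + "\t" + url;

        // 3 texts x 3 urls gives 9 rows instead of the 3 that were wanted.
        Console.WriteLine(rows.Count());

        foreach (var row in rows)
            Console.WriteLine(row);
    }
}
```

Both solutions below avoid this by deriving each dd from its own dt (or by pairing the two lists position by position) rather than iterating the two sequences independently.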

Solution 1

I have defined a function that, given a dt node, returns the next dd node after it:

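private static HtmlNode GetNextDDSibling(HtmlNode dtElement)
{
    // Start at the node right after the dt; whitespace between tags
    // shows up as text nodes, so keep walking until a dd element is found.
    var currentNode = dtElement.NextSibling;

    while (currentNode != null)
    {
        if (currentNode.NodeType == HtmlNodeType.Element && currentNode.Name == "dd")
            return currentNode;

        currentNode = currentNode.NextSibling;
    }

    // No dd follows this dt.
    return null;
}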

Now the LINQ code can be transformed to:

var parsedValues =
    from info in document.DocumentNode.SelectNodes("//div[@class='content-div']")
    from dtElement in info.SelectNodes("dl/dt")
    let link = dtElement.SelectSingleNode("b/a[@href]")
    let ddElement = GetNextDDSibling(dtElement)
    where link != null && ddElement != null
    select new
    {
        Text = ddElement.InnerHtml,
        Url = link.GetAttributeValue("href", ""),
        AnchorText = link.InnerText
    };

Solution 2

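Without the extra function, pairing the dt and dd nodes by position (this assumes every dt is followed by exactly one dd):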

var infoNode = 
        document.DocumentNode.SelectSingleNode("//div[@class='content-div']");

var dts = infoNode.SelectNodes("dl/dt");
var dds = infoNode.SelectNodes("dl/dd");

var parsedValues = dts.Zip(dds,
    (dt, dd) => new
    {
        Text = dd.InnerHtml,
        Url = dt.SelectSingleNode("b/a[@href]").GetAttributeValue("href", ""),
        AnchorText = dt.SelectSingleNode("b/a[@href]").InnerText
    });
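
To try the Zip approach without fetching a page, the sample markup can be loaded from a string. The following is a minimal sketch rather than a definitive implementation: it assumes the HtmlAgilityPack NuGet package is referenced, uses a shortened inline copy of the question's HTML, and swaps InnerHtml for InnerText.Trim() because the question asks for the inner text. The class name ZipDemo and the null checks are additions for illustration.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

class ZipDemo
{
    static void Main()
    {
        // Shortened inline copy of the sample markup from the question.
        const string html = @"
<div class=""content-div"">
    <dl>
        <dt><b><a href=""1.html"" title=""1"">1</a></b></dt>
        <dd> First Entry</dd>
        <dt><b><a href=""2.html"" title=""2"">2</a></b></dt>
        <dd> Second Entry</dd>
    </dl>
</div>";

        var document = new HtmlDocument();
        document.LoadHtml(html);

        var infoNode = document.DocumentNode.SelectSingleNode("//div[@class='content-div']");
        var dts = infoNode?.SelectNodes("dl/dt");
        var dds = infoNode?.SelectNodes("dl/dd");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (dts == null || dds == null)
        {
            Console.WriteLine("No entries found.");
            return;
        }

        // Zip pairs the n-th dt with the n-th dd and stops at the shorter list.
        var rows = dts.Zip(dds, (dt, dd) =>
        {
            var link = dt.SelectSingleNode("b/a[@href]");
            return new
            {
                Text = dd.InnerText.Trim(),
                Url = link?.GetAttributeValue("href", ""),
                AnchorText = link?.InnerText
            };
        });

        foreach (var row in rows)
            Console.WriteLine($"{row.Text}\t{row.Url}\t{row.AnchorText}");
    }
}
```

Because Zip matches purely by position, a dt without a dd (or the reverse) would shift every later pair; when the markup may be irregular, the sibling-walking function from Solution 1 is the safer choice.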
