HtmlAgilityPack NextSibling.InnerText value is blank


Problem description

I am scraping some data using HtmlAgilityPack.

The HTML looks like this:

<div id="id-here">
  <dl>
    <dt> Field Name </dt>
    <dd> Value for above field name </dd>
    <dt> Field Name </dt>
    <dd> Value for above field name </dd>
    <dt> Field Name </dt>
    <dd> Value for above field name </dd>
  </dl>
</div>

The problem is that there is not always a fixed number of fields, so I can't reliably access each of them with a path like:

//*[@id="id-here"]/dl[1]/dd[1]

because dd[1] may be a name on one page and a telephone number on another, where the user did not fill in a name and the field is hidden.

So I grab all the dt and dd nodes like so:

//*[@id="id-here"]/dl[1]/dt | //*[@id="id-here"]/dl[1]/dd

Now I check each node to see whether it matches the field I want, and take the NextSibling value like so:

    foreach (HtmlNode node in details)
    {
        if (node.InnerText.Contains("Tel:")) telephone = node.NextSibling.InnerText;
        if (node.InnerText.Contains("Email:")) email = node.NextSibling.InnerText;
    }

This works fine for the telephone, but for some reason, when the "Email:" node comes up, both NextSibling.InnerHtml and NextSibling.InnerText are blank, although the next sibling definitely has the data. If I go to that node in details and inspect it, its InnerHtml is the entire formatted link and its InnerText is the email address.

Is NextSibling.InnerText not working because the a tag is making it a child or something? I have looked in the debugger and just can't find the information I need under NextSibling.

I am sure the answer is ridiculously simple; I just can't figure it out. Can anyone put me out of my misery?

Recommended answer

This happens because, if node is a dt element that is separated from its corresponding dd element by some whitespace, then node.NextSibling is an all-whitespace text node (the space between the </dt> and the <dd>). If you look at it in the debugger, you will see that node.NextSibling's NodeType is HtmlNodeType.Text and not HtmlNodeType.Element.
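One way to see this, and to work around it without XPath, is to walk NextSibling until an element node turns up. The following is only a minimal sketch: the class name SiblingDemo, the helper NextElementSibling, and the sample markup are made up for illustration and are not part of HtmlAgilityPack or of the original question.

```csharp
using System;
using HtmlAgilityPack;

static class SiblingDemo
{
    // Hypothetical helper: skip text and comment nodes until an element is reached.
    internal static HtmlNode NextElementSibling(HtmlNode node)
    {
        var sibling = node.NextSibling;
        while (sibling != null && sibling.NodeType != HtmlNodeType.Element)
        {
            sibling = sibling.NextSibling;
        }
        return sibling; // null when no element follows
    }

    static void Main()
    {
        // Assumed sample markup: dt and dd are separated by a newline and indentation.
        var doc = new HtmlDocument();
        doc.LoadHtml("<dl>\n  <dt>Email:</dt>\n  <dd><a href=\"mailto:a@example.com\">a@example.com</a></dd>\n</dl>");

        var dt = doc.DocumentNode.SelectSingleNode("//dt");

        // Per the explanation above, the immediate sibling is the whitespace text node.
        Console.WriteLine(dt.NextSibling.NodeType);

        // Skipping non-element siblings reaches the dd that holds the link.
        var dd = NextElementSibling(dt);
        Console.WriteLine(dd == null ? "" : dd.InnerText);
    }
}
```

Note that this loop accepts any following element, not only a dd, so it is less strict than the XPath approach.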

I suggest creating a convenience method to get the text of a dt node's corresponding dd:

internal static string GetMatchingDdValue(HtmlNode dtNode)
{
    var found = dtNode.SelectSingleNode("following-sibling::*[1][self::dd]");
    return found == null ? "" : found.InnerText;
}

Then you can use it like this:

if (node.InnerText.Contains("Tel:")) { telephone = GetMatchingDdValue(node); }
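Since the number of fields varies from page to page, it may be simpler to collect every label/value pair once and look fields up by name afterwards. This is a rough sketch built on the helper above; FieldReader and ReadFields are hypothetical names, and it assumes labels such as "Tel:" match exactly once trimmed, which the original question (using Contains) does not guarantee.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class FieldReader
{
    // Same helper as in the answer above.
    internal static string GetMatchingDdValue(HtmlNode dtNode)
    {
        var found = dtNode.SelectSingleNode("following-sibling::*[1][self::dd]");
        return found == null ? "" : found.InnerText;
    }

    // Hypothetical helper: map each dt label to the text of its dd.
    internal static Dictionary<string, string> ReadFields(HtmlDocument doc)
    {
        var fields = new Dictionary<string, string>();

        // Only the dt nodes are selected; each dd is reached through its dt.
        var dtNodes = doc.DocumentNode.SelectNodes("//*[@id='id-here']/dl[1]/dt");
        if (dtNodes == null) return fields; // SelectNodes returns null when nothing matches

        foreach (HtmlNode dt in dtNodes)
        {
            string label = HtmlEntity.DeEntitize(dt.InnerText).Trim();
            string value = HtmlEntity.DeEntitize(GetMatchingDdValue(dt)).Trim();
            fields[label] = value; // a repeated label overwrites the earlier value
        }
        return fields;
    }
}
```

A caller could then use fields.TryGetValue("Tel:", out telephone) and treat a missing key as a hidden field.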


Here's a breakdown of the somewhat tricky XPath used in the method above:

(a) following-sibling::*

^ Select all elements that share the same parent as the current node and occur after it.

(b) following-sibling::*[1]

^ Select the first node in set (a), if there is one.

(c) following-sibling::*[1][self::dd] 

^ Select the nodes in set (b) that are elements named "dd".

SelectSingleNode() returns the first node in set (c), which always contains either one node or none.

You could most likely get by with just following-sibling::dd or following-sibling::*, but the path above contains safeguards. For example, suppose that for some reason you had the following markup and your current node was the Tel: element:

<dl>
  <dt>Tel:</dt>
  <dt>Address:</dt>
  <dd>50 Fake St.</dd>
</dl>

following-sibling::dd would give you the result "50 Fake St.", while following-sibling::* would give you the result "Address:". In contrast, following-sibling::*[1][self::dd] would select an empty node set in this case, so the method would correctly produce an empty string as the result.
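The comparison can be reproduced with a small console program. This is a sketch only: the class name XPathComparison is invented for illustration, and the markup is the example from above written as a string literal.

```csharp
using System;
using HtmlAgilityPack;

static class XPathComparison
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<dl>\n  <dt>Tel:</dt>\n  <dt>Address:</dt>\n  <dd>50 Fake St.</dd>\n</dl>");

        // The first dt in the list is the "Tel:" element.
        var tel = doc.DocumentNode.SelectSingleNode("//dt[1]");

        var paths = new[]
        {
            "following-sibling::dd",
            "following-sibling::*",
            "following-sibling::*[1][self::dd]"
        };

        foreach (var path in paths)
        {
            // SelectSingleNode returns null when the path matches nothing.
            var found = tel.SelectSingleNode(path);
            Console.WriteLine(path + " -> " + (found == null ? "(no match)" : found.InnerText));
        }
    }
}
```

Going by the explanation above, the first two paths should pick up the address value and the "Address:" label respectively, while the third should report no match.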
