HtmlAgilityPack NextSibling.InnerText value is blank
Question
I am scraping some data using HtmlAgilityPack.
The HTML looks like this:
<div id="id-here">
<dl>
<dt> Field Name </dt>
<dd> Value for above field name </dd>
<dt> Field Name </dt>
<dd> Value for above field name </dd>
<dt> Field Name </dt>
<dd> Value for above field name </dd>
</dl>
</div>
The problem is that there is not always a fixed number of fields, so I can't reliably access each of them with a positional path like:
//*[@id="id-here"]/dl[1]/dd[1]
since dd[1] may be a name on one page and a telephone number on another, where the user did not fill in a name and the field is therefore hidden.
So I grab all the dt and dd nodes like so:
//*[@id="id-here"]/dl[1]/dt | //*[@id="id-here"]/dl[1]/dd
Now I check each node to see whether it matches the field I want and take the NextSibling value like so:
foreach (HtmlNode node in details)
{
    if (node.InnerText.Contains("Tel:")) telephone = node.NextSibling.InnerText;
    if (node.InnerText.Contains("Email:")) email = node.NextSibling.InnerText;
}
This works fine for the telephone, but for some reason, when the "Email:" node comes up, both NextSibling.InnerHtml and NextSibling.InnerText are blank, although the next sibling definitely has the data. If I go to that node in details and inspect it directly, its InnerHtml is the entire formatted link and its InnerText is the email address.
Is NextSibling.InnerText not working because the A tag is making it a child or something? I have looked in the debugger and just can't find the information I need under NextSibling.
I am sure the answer is ridiculously simple; I just can't figure it out. Can anyone put me out of my misery?
Recommended answer
This happens because, if node is a dt element that is separated from its corresponding dd element by whitespace, then node.NextSibling is an all-whitespace text node (the space between the </dt> and the <dd>), not the dd element. If you look at it in the debugger, you will see that the NodeType of node.NextSibling is HtmlNodeType.Text and not HtmlNodeType.Element.
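To make the whitespace issue concrete, here is a minimal sketch, assuming the HtmlAgilityPack package is referenced. The sample markup, the example.com address, and the NextElementSibling helper are hypothetical names introduced only for illustration; the helper simply walks NextSibling until it reaches an element node.

```csharp
using System;
using HtmlAgilityPack;

public static class SiblingDemo
{
    // Hypothetical helper: skip text and comment nodes until an element is found.
    // Returns null when no element sibling follows.
    public static HtmlNode NextElementSibling(HtmlNode node)
    {
        HtmlNode sibling = node.NextSibling;
        while (sibling != null && sibling.NodeType != HtmlNodeType.Element)
        {
            sibling = sibling.NextSibling;
        }
        return sibling;
    }

    public static void Main()
    {
        // Assumed sample markup: a newline separates </dt> from <dd>.
        var doc = new HtmlDocument();
        doc.LoadHtml("<dl>\n<dt>Email:</dt>\n<dd><a href=\"mailto:a@example.com\">a@example.com</a></dd>\n</dl>");

        HtmlNode dt = doc.DocumentNode.SelectSingleNode("//dt");

        // The immediate sibling is the whitespace between the tags, so its text is blank.
        Console.WriteLine(dt.NextSibling.NodeType);

        // Walking forward to the next element reaches the dd that holds the link.
        HtmlNode dd = NextElementSibling(dt);
        Console.WriteLine(dd == null ? "" : dd.InnerText);
    }
}
```

This only illustrates the cause; a loop like this takes whatever element comes next, even if it is not a dd.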
I suggest creating a convenience method that returns the text of the dd corresponding to a dt node:
internal static string GetMatchingDdValue(HtmlNode dtNode)
{
    var found = dtNode.SelectSingleNode("following-sibling::*[1][self::dd]");
    return found == null ? "" : found.InnerText;
}
Then you can use it like this:
if (node.InnerText.Contains("Tel:")) { telephone = GetMatchingDdValue(node); }
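Since the helper only makes sense for dt nodes, one option is to select just the dt elements and collect every label/value pair in a single pass. The following is a rough sketch under that assumption, not the asker's actual code: the DetailsReader class, the ReadFields method, and the containerId parameter are hypothetical names, and it assumes labels are unique within the list.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class DetailsReader
{
    // Same helper as in the answer: the dd must be the very next element sibling.
    internal static string GetMatchingDdValue(HtmlNode dtNode)
    {
        var found = dtNode.SelectSingleNode("following-sibling::*[1][self::dd]");
        return found == null ? "" : found.InnerText;
    }

    // Hypothetical method: maps each trimmed dt label to its trimmed dd text.
    public static Dictionary<string, string> ReadFields(HtmlDocument doc, string containerId)
    {
        var fields = new Dictionary<string, string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var dtNodes = doc.DocumentNode.SelectNodes("//*[@id='" + containerId + "']/dl[1]/dt");
        if (dtNodes == null)
        {
            return fields;
        }

        foreach (HtmlNode dt in dtNodes)
        {
            // A repeated label overwrites the earlier value.
            fields[dt.InnerText.Trim()] = GetMatchingDdValue(dt).Trim();
        }
        return fields;
    }
}
```

With a dictionary like this, a missing field shows up as a missing key rather than shifting the positions of the remaining fields.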
Here's a breakdown of the somewhat tricky XPath used in the method above:
(a) following-sibling::*
^ Selects all elements that share the same parent as the current node and occur after it.
(b) following-sibling::*[1]
^ Selects the first node in set (a), if there is one.
(c) following-sibling::*[1][self::dd]
^ Selects the nodes in set (b) that are elements named "dd".
SelectSingleNode() returns the first node in set (c), which always contains either one node or none.
You could most likely get by with just following-sibling::dd or following-sibling::*, but the path above contains safeguards. For example, suppose that for some reason you had the following markup and your current node was the Tel: element:
<dl>
<dt>Tel:</dt>
<dt>Address:</dt>
<dd>50 Fake St.</dd>
</dl>
With SelectSingleNode(), following-sibling::dd would give you the result "50 Fake St.", while following-sibling::* would give you the result "Address:". In contrast, following-sibling::*[1][self::dd] selects an empty node set in this case, so the method correctly produces an empty string.
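The comparison can be tried out with a short sketch like the one below, assuming HtmlAgilityPack is referenced. The XPathComparison class and the TextOrEmpty helper are hypothetical names; the helper mirrors the null check in GetMatchingDdValue but takes the XPath as a parameter.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathComparison
{
    // Hypothetical helper: text of the first node matched from the context node, or "".
    public static string TextOrEmpty(HtmlNode context, string xpath)
    {
        HtmlNode found = context.SelectSingleNode(xpath);
        return found == null ? "" : found.InnerText;
    }

    public static void Main()
    {
        // The markup from the example: "Tel:" has no dd of its own.
        var doc = new HtmlDocument();
        doc.LoadHtml("<dl>\n<dt>Tel:</dt>\n<dt>Address:</dt>\n<dd>50 Fake St.</dd>\n</dl>");

        // SelectSingleNode returns the first dt in document order, i.e. "Tel:".
        HtmlNode tel = doc.DocumentNode.SelectSingleNode("//dt");

        // Unguarded: picks up the dd that belongs to "Address:".
        Console.WriteLine("[" + TextOrEmpty(tel, "following-sibling::dd") + "]");

        // Unguarded: picks up the next element, which is another dt.
        Console.WriteLine("[" + TextOrEmpty(tel, "following-sibling::*") + "]");

        // Guarded: the next element is not a dd, so nothing is selected.
        Console.WriteLine("[" + TextOrEmpty(tel, "following-sibling::*[1][self::dd]") + "]");
    }
}
```

The brackets in the output are only there to make the empty result visible.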