InnerText = InnerHtml - How to extract readable text with HtmlAgilityPack
Problem description
I need to extract text from some very badly written HTML.
I'm trying to do this using VB.NET and HtmlAgilityPack.
For the tag that I need to parse, InnerText and InnerHtml are identical; both return:
Name:<!--b>=</b--> Albert E<!--span-->instein s<!--i>Y</i-->ection: 3 room: -
While debugging, I can read it using the "Html viewer", which shows:
Name: Albert Einstein section: 3 room: -
How can I get this into a string variable?
EDIT:
I use this code to get the node:
Dim ElePs As HtmlNodeCollection = _
    mWPage.DocumentNode.SelectNodes("//div[@id='div_main']//p")
For Each EleP As HtmlNode In ElePs
    'Here I need to get EleP.InnerText "normalized"
Next
Solution
If you look closely, this mess is actually just HTML comments, which should be ignored. Selecting only the non-empty text nodes and concatenating them with string.Join is enough:
C#
var text = string.Join("", htmlDoc.DocumentNode
    .SelectNodes("//text()[normalize-space()]")
    .Select(t => t.InnerText));
VB.NET
Dim text = String.Join("", From t In htmlDoc.DocumentNode.SelectNodes("//text()[normalize-space()]")
                           Select t.InnerText)
The HTML is valid and there is nothing wrong with it; it's just written by someone without a soul.
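For reference, here is a minimal self-contained C# sketch of the same idea, run against the markup from the question. The class name ReadableText and the method FromHtml are hypothetical names chosen for illustration, and the HTML string is just the sample wrapped in a p tag. It assumes the HtmlAgilityPack NuGet package is referenced. The null check is there because SelectNodes returns null rather than an empty collection when nothing matches.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class ReadableText
{
    // Hypothetical helper: concatenates all non-blank text nodes of an HTML
    // fragment, so that comment nodes (<!-- ... -->) are left out.
    public static string FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // text() matches text nodes only; [normalize-space()] drops the
        // ones that contain nothing but whitespace.
        var nodes = doc.DocumentNode.SelectNodes("//text()[normalize-space()]");
        if (nodes == null)
        {
            return string.Empty; // no text nodes matched
        }

        return string.Join("", nodes.Select(t => t.InnerText)).Trim();
    }

    public static void Main()
    {
        // Sample markup taken from the question, wrapped in a <p> element.
        const string html =
            "<p>Name:<!--b>=</b--> Albert E<!--span-->instein s<!--i>Y</i-->ection: 3 room: -</p>";

        Console.WriteLine(ReadableText.FromHtml(html));
    }
}
```

Joining with an empty separator matters here: the comments split words such as "Einstein" in the middle, so inserting a space between the pieces would break them apart.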
Based on your update, this should do:
Dim ElePs As HtmlNodeCollection = mWPage.DocumentNode.SelectNodes("//div[@id='div_main']//p")
For Each EleP As HtmlNode In ElePs
    'Here I need to get EleP.InnerText "normalized"
    Dim text = String.Join("", From t In EleP.SelectNodes(".//text()[normalize-space()]")
                               Select t.InnerText).Trim()
Next
Note the leading .// in the XPath: it restricts the search to the descendants of the current node, unlike //, which always starts from the top node of the document.
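The difference between the two prefixes can be seen in a small C# sketch. The two-paragraph document and the names XPathScopeDemo and JoinText are made up for illustration, and the HtmlAgilityPack NuGet package is assumed to be referenced. For each paragraph, the relative query should yield only that paragraph's text, while the absolute query is expected to return the text of both paragraphs every time.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class XPathScopeDemo
{
    // Hypothetical helper: runs an XPath query from the given context node
    // and concatenates the InnerText of whatever it returns.
    public static string JoinText(HtmlNode context, string xpath)
    {
        var nodes = context.SelectNodes(xpath);
        return nodes == null
            ? string.Empty
            : string.Join("", nodes.Select(t => t.InnerText)).Trim();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Made-up sample with two paragraphs, each split by a comment.
        doc.LoadHtml(
            "<div id='div_main'>" +
            "<p>first<!--x--> one</p>" +
            "<p>second<!--y--> one</p>" +
            "</div>");

        foreach (var p in doc.DocumentNode.SelectNodes("//div[@id='div_main']//p"))
        {
            // ".//" is relative: only text nodes below this <p>.
            Console.WriteLine("relative: " + JoinText(p, ".//text()[normalize-space()]"));

            // "//" is absolute: text nodes anywhere in the document,
            // even though the query is issued on the <p> node.
            Console.WriteLine("absolute: " + JoinText(p, "//text()[normalize-space()]"));
        }
    }
}
```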