InnerText = InnerHtml - How to extract readable text with HtmlAgilityPack

Question




I need to extract text from some very messy HTML.

I'm trying to do this using VB.NET and HtmlAgilityPack.

The tag that I need to parse has InnerText equal to InnerHtml, and both contain:

Name:<!--b>&#61;</b--> Albert E<!--span-->instein  s<!--i>&#89;</i-->ection: 3 room: -

While debugging I can read it using the "HTML viewer", which shows:

Name: Albert Einstein section: 3 room: -

How can I get this into a string variable?

EDIT:

I use this code to get the node:

Dim ElePs As HtmlNodeCollection = _
    mWPage.DocumentNode.SelectNodes("//div[@id='div_main']//p")
For Each EleP As HtmlNode In ElePs
    'Here I need to get EleP.InnerText "normalized"
Next

Solution

If you look closely, this mess is just HTML comments, which should be ignored. Selecting only the text nodes and joining them with string.Join is enough:

C#

var text = string.Join("", htmlDoc.DocumentNode
    .SelectNodes("//text()[normalize-space()]")
    .Select(t => t.InnerText));

VB.NET

Dim text = String.Join("", From t In htmlDoc.DocumentNode.SelectNodes("//text()[normalize-space()]")
                           Select t.InnerText)

The HTML is valid and there is nothing wrong with it; it was just written by someone without a soul.

Based on your update, this should do:
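One caveat with the snippets above: HtmlAgilityPack's SelectNodes returns null, not an empty collection, when nothing matches, so a node that contains only comments would throw in the LINQ call. The following is a minimal sketch of a guarded version, not part of the original answer. The class name ReadableText and its From method are hypothetical; decoding entities with HtmlEntity.DeEntitize is an optional extra step assumed here for text that still contains sequences such as &amp;amp;.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for illustration; it is not part of HtmlAgilityPack.
public static class ReadableText
{
    public static string From(HtmlNode node)
    {
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var textNodes = node.SelectNodes(".//text()[normalize-space()]");
        if (textNodes == null)
            return string.Empty;

        // text() does not match comment nodes, so only real text is joined.
        var joined = string.Concat(textNodes.Select(t => t.InnerText));

        // Decode any entities left in the raw text, then trim the ends.
        return HtmlEntity.DeEntitize(joined).Trim();
    }
}
```

Calling ReadableText.From(EleP) inside the loop would then replace a direct read of EleP.InnerText.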

Dim ElePs As HtmlNodeCollection = mWPage.DocumentNode.SelectNodes("//div[@id='div_main']//p")
For Each EleP As HtmlNode In ElePs
    'Here I need to get EleP.InnerText "normalized"
    Dim text = String.Join("", From t In EleP.SelectNodes(".//text()[normalize-space()]")
                               Select t.InnerText).Trim()
Next

Note the leading .// in the XPath: it restricts the search to descendants of the current node, whereas // always starts from the top of the document, regardless of which node SelectNodes is called on.
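The difference between the two prefixes can be checked with a small experiment. This is a sketch only: XPathScopeDemo and JoinText are hypothetical names introduced for illustration, and the two-paragraph document in the usage note is an assumed sample.

```csharp
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper for comparing relative and absolute XPath expressions.
public static class XPathScopeDemo
{
    // Joins the InnerText of the nodes that `xpath` selects when evaluated on `node`.
    public static string JoinText(HtmlNode node, string xpath)
    {
        var nodes = node.SelectNodes(xpath);
        // SelectNodes returns null when nothing matches.
        return nodes == null
            ? string.Empty
            : string.Concat(nodes.Select(t => t.InnerText));
    }
}
```

With a document such as `<div id='div_main'><p>first</p><p>second</p></div>`, calling JoinText on the first p node with ".//text()[normalize-space()]" should yield only that paragraph's text, while "//text()[normalize-space()]" on the same node should yield the text of the whole document.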
