Stripping all html tags with Html Agility Pack


Problem description

I have a html string like this:

<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>

I wish to strip all html tags so that the resulting string becomes:

foo bar baz

From another post here at SO I've come up with this function (which uses the Html Agility Pack):

  Public Shared Function stripTags(ByVal html As String) As String
    Dim plain As String = String.Empty
    Dim htmldoc As New HtmlAgilityPack.HtmlDocument

    htmldoc.LoadHtml(html)
    Dim invalidNodes As HtmlAgilityPack.HtmlNodeCollection = htmldoc.DocumentNode.SelectNodes("//html|//body|//p|//a")

    If Not htmldoc Is Nothing Then
      For Each node In invalidNodes
        node.ParentNode.RemoveChild(node, True)
      Next
    End If

    Return htmldoc.DocumentNode.WriteContentTo
  End Function

Unfortunately this does not return what I expect; instead it gives:

bazbarfoo

Where am I going wrong, and is this the best approach?

Regards and happy coding!

UPDATE: based on the answer below, I came up with this function, which might be useful to others:

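  Public Shared Function stripTags(ByVal html As String) As String
    Dim htmldoc As New HtmlAgilityPack.HtmlDocument
    ' Turn paragraph ends and <br/> tags into line breaks before parsing.
    ' New String(...) expects a Char, so concatenate NewLine twice instead.
    Dim twoNewLines As String = Environment.NewLine & Environment.NewLine
    htmldoc.LoadHtml(html.Replace("</p>", "</p>" & twoNewLines).Replace("<br/>", Environment.NewLine))
    Return htmldoc.DocumentNode.InnerText
  End Function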

Solution

Why not just return htmldoc.DocumentNode.InnerText instead of removing all the non-text nodes? It should give you what you want.
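A minimal sketch of that suggestion, for illustration only. The module name `HtmlText` and the function name `StripTags` are made up for this example, and the call to `HtmlEntity.DeEntitize` is an optional extra step not mentioned in the answer; it is there on the assumption that you also want entities such as `&amp;` turned back into plain characters.

```vbnet
Imports HtmlAgilityPack

Public Module HtmlText
    ' Hypothetical helper: returns the text content of an HTML string.
    Public Function StripTags(ByVal html As String) As String
        If String.IsNullOrEmpty(html) Then Return String.Empty

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' InnerText concatenates the text nodes in document order,
        ' so no nodes need to be removed by hand.
        ' DeEntitize (optional, an addition here) decodes HTML entities.
        Return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText)
    End Function
End Module
```

Called with the HTML string from the question, this should give `foo bar baz`, with the words in their original order.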
