Running into an issue trying to extract the text from a snippet of HTML

Question

I am using the HTML Agility Pack to convert

 <font size="1">This is a test</font>

into

 This is a test

using this code:

 HtmlDocument doc = new HtmlDocument();
 doc.LoadHtml(html);
 string stripped = doc.DocumentNode.InnerText;

But I ran into an issue where I have this:

 <font size="1">This is a test &amp; this is a joke</font>

and the code above converted it to

This is a test &amp; this is a joke

but I wanted it to produce:

This is a test & this is a joke

Does the HTML Agility Pack support what I am trying to do? Why doesn't the HTML Agility Pack do this by default, or am I doing something wrong?

Recommended answer

You can run HttpUtility.HtmlDecode() on the output.
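
As a minimal sketch of that suggestion, the snippet below applies HttpUtility.HtmlDecode() to the InnerText from the question's code. The Program class and the hard-coded html string are only illustrative, and the sketch assumes a project that references both HtmlAgilityPack and System.Web.

```csharp
using System;
using System.Web;        // HttpUtility
using HtmlAgilityPack;   // HtmlDocument

class Program
{
    static void Main()
    {
        // Sample input taken from the question.
        string html = "<font size=\"1\">This is a test &amp; this is a joke</font>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText drops the <font> tag but leaves entities such as &amp; encoded,
        // so decode them in a second step.
        string stripped = HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);

        Console.WriteLine(stripped);
    }
}
```

Given the behavior described in the question, this should print the text with a plain `&` instead of `&amp;`.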

However, note that InnerText will include HTML tags that may be contained inside the outermost tag. If you want to remove all tags, you will have to walk the document tree and retrieve the text bit by bit.
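
One way to walk the tree could look like the sketch below. TextExtractor and ExtractText are hypothetical names, and skipping text inside script and style elements is an assumption about what "all the text" should mean here, not something the answer specifies. It collects every text node, decodes each one, and concatenates the results.

```csharp
using System.Linq;
using System.Web;        // HttpUtility
using HtmlAgilityPack;   // HtmlDocument, HtmlNodeType

static class TextExtractor
{
    // Hypothetical helper: returns the decoded text of all text nodes in the HTML.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var parts = doc.DocumentNode
            .Descendants()
            // Keep only text nodes; element and comment nodes are skipped.
            .Where(n => n.NodeType == HtmlNodeType.Text)
            // Assumption: script and style contents are not wanted as text.
            .Where(n => n.ParentNode.Name != "script" && n.ParentNode.Name != "style")
            // Decode entities such as &amp; in each piece of text.
            .Select(n => HttpUtility.HtmlDecode(n.InnerText));

        return string.Concat(parts);
    }
}
```

Calling `TextExtractor.ExtractText(html)` with the question's snippet would be used in place of reading `doc.DocumentNode.InnerText` directly.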
