Running into an issue trying to extract the text from a snippet of HTML
Question
I am using the HTML Agility Pack to convert
<font size="1">This is a test</font>
to
This is a test
using this code:
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
string stripped = doc.DocumentNode.InnerText;
But I ran into an issue where I have this:
<font size="1">This is a test &amp; this is a joke</font>
and the code above converted this to
This is a test &amp; this is a joke
but I wanted it to convert it to:
This is a test & this is a joke
Does the HTML Agility Pack support what I am trying to do? Why doesn't the HTML Agility Pack do this by default, or am I doing something wrong?
Recommended answer
You can run HttpUtility.HtmlDecode() on the output.
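A minimal sketch of that suggestion, assuming the HtmlAgilityPack NuGet package is referenced and System.Web.HttpUtility is available in the target framework. The class and method names (TextExtractor, StripTags) are hypothetical and chosen only for illustration.

```csharp
using System.Web;       // HttpUtility (on .NET Framework this needs a reference to System.Web)
using HtmlAgilityPack;  // assumption: the HtmlAgilityPack NuGet package is installed

public static class TextExtractor
{
    // Hypothetical helper: strips the markup, then decodes entities such as &amp;.
    public static string StripTags(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // InnerText leaves entities encoded, so "&amp;" is still "&amp;" at this point.
        string stripped = doc.DocumentNode.InnerText;

        // HtmlDecode turns "&amp;" into "&", "&lt;" into "<", and so on.
        return HttpUtility.HtmlDecode(stripped);
    }
}
```

Called with the snippet from the question, TextExtractor.StripTags("<font size=\"1\">This is a test &amp; this is a joke</font>") should return "This is a test & this is a joke".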
However, note that InnerText will include HTML tags that may be contained inside the outermost tag. If you want to remove all tags, you will have to walk the document tree and retrieve all the text bit by bit.
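One way to walk the tree, sketched under the same assumptions (HtmlAgilityPack referenced, System.Web.HttpUtility available). The names TextWalker and CollectText are hypothetical. The sketch visits every node, keeps only the text nodes, and decodes the result once at the end.

```csharp
using System.Text;
using System.Web;       // HttpUtility
using HtmlAgilityPack;  // assumption: the HtmlAgilityPack NuGet package is installed

public static class TextWalker
{
    // Hypothetical helper: concatenates the content of every text node in the document.
    public static string CollectText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sb = new StringBuilder();
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            // Element and comment nodes are skipped; only raw text is appended.
            if (node.NodeType == HtmlNodeType.Text)
            {
                sb.Append(node.InnerText);
            }
        }

        // Entities are still encoded in the text nodes, so decode once at the end.
        return HttpUtility.HtmlDecode(sb.ToString());
    }
}
```

For example, TextWalker.CollectText("<p>Hello <b>world</b> &amp; more</p>") should return "Hello world & more". Text inside script or style elements is also stored as text nodes, so a real implementation may want to skip nodes whose parent is one of those elements.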