How to remove all tags and get the plain text?
Question
I have to store user-entered text in my database together with its HTML and CSS formatting.
The scenario: the user copies text from MS Word into a RadEditor, and I store that text in the database with its formatting. When I later retrieve the data for a report or a label, some tags appear wrapped around the text.
I use a regular expression to remove all the formatting, but it is unreliable: it succeeds sometimes, not every time.
private static Regex oClearHtmlScript = new Regex(@"<(.|\n)*?>", RegexOptions.Compiled);

public static string RemoveAllHTMLTags(string sHtml)
{
    sHtml = sHtml.Replace("&nbsp;", string.Empty);
    sHtml = sHtml.Replace("&gt;", ">");
    sHtml = sHtml.Replace("&lt;", "<");
    sHtml = sHtml.Replace("&amp;", "&");
    if (string.IsNullOrEmpty(sHtml))
        return string.Empty;
    return oClearHtmlScript.Replace(sHtml, string.Empty);
}
How can I remove all the formatting using HtmlAgilityPack, or any other dependable approach, so that the resulting text is guaranteed to be plain?
Note:
The data type of this field in the database is LVARCHAR (http://publib.boulder.ibm.com/infocenter/idshelp/v10/index.jsp?topic=/com.ibm.esqlc.doc/esqlc93.htm).
Recommended answer
The Html Agility Pack makes working with HTML easy.
using HtmlAgilityPack;

HtmlDocument mainDoc = new HtmlDocument();
string htmlString = "<html><body><h1>Test</h1> more text</body></html>";
mainDoc.LoadHtml(htmlString);
// InnerText concatenates the text nodes and leaves the markup out.
string cleanText = mainDoc.DocumentNode.InnerText;
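Text pasted from MS Word usually carries <style> blocks, conditional comments, and entities such as &nbsp;, and InnerText on its own may leave some of that in the result. Below is a minimal sketch of one way to handle this, assuming the HtmlAgilityPack NuGet package is referenced. The class name HtmlTextExtractor and the method name ToPlainText are hypothetical, chosen only for illustration. It removes script, style, and comment nodes first, then decodes entities with HtmlEntity.DeEntitize.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper; the class and method names are illustrative only.
public static class HtmlTextExtractor
{
    public static string ToPlainText(string html)
    {
        // Guard first so that null input never reaches the parser.
        if (string.IsNullOrEmpty(html))
            return string.Empty;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var unwanted = doc.DocumentNode.SelectNodes("//script|//style|//comment()");
        if (unwanted != null)
        {
            foreach (var node in unwanted)
                node.Remove();
        }

        // InnerText keeps entities such as &amp; encoded, so decode them afterwards.
        string text = HtmlEntity.DeEntitize(doc.DocumentNode.InnerText);
        return text.Trim();
    }
}
```

Calling HtmlTextExtractor.ToPlainText(storedHtml) before binding the value to a label or a report field would then replace the regex-based RemoveAllHTMLTags. Decoding entities after the tags are gone, rather than before, avoids turning an escaped &lt; in the user's text into something that looks like a tag.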