How to set encoding when using HtmlAgilityPack.HtmlDocument.LoadHtml
Problem description
I already have the source of an HTML page, so I am using:
string html_page_source="some page source crawled before";
HtmlDocument hdMyDoc = new HtmlDocument();
hdMyDoc.LoadHtml(html_page_source);
However, I see characters that have not been decoded, such as:
içerisinde
göründüğünden çok
.
.
So how can I make HtmlDocument decode these automatically?
How can I set a default encoding to solve this problem?
And would the method below be good practice?
hdMyDoc.LoadHtml(HttpUtility.HtmlDecode(html_page_source));
C#, .NET 4.5 (latest), WPF application
Recommended answer
The Html Agility Pack is equipped with a utility class called HtmlEntity. It has a static method with the following signature:
/// <summary>
/// Replace known entities by characters.
/// </summary>
/// <param name="text">The source text.</param>
/// <returns>The result text.</returns>
public static string DeEntitize(string text)
It supports well-known entities (like &nbsp;) as well as encoded characters such as &#039;.
Once you've extracted a string from the document, use this method to convert the HTML-encoded entities back to text characters.
Don't HTML-decode the source before loading the document; doing so completely changes the meaning of the markup. For example, an encoded &lt; in the page text would become a literal < and be parsed as the start of a tag.
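The order of operations described above (load the raw markup first, decode only the text you extract) might look roughly like the following. This is a minimal sketch, not a definitive implementation: the class name EntityDecodingExample, the helper ExtractDecodedText, the sample markup, and the XPath expression are all made up for illustration; only HtmlDocument.LoadHtml, SelectSingleNode, InnerText, and HtmlEntity.DeEntitize come from the Html Agility Pack.

```csharp
using System;
using HtmlAgilityPack;

public static class EntityDecodingExample
{
    // Hypothetical helper: parses the markup exactly as received, then
    // decodes entities only in the text of the node selected by the XPath.
    public static string ExtractDecodedText(string html, string xpath)
    {
        var doc = new HtmlDocument();

        // Load the source as-is; it is NOT passed through HtmlDecode first.
        doc.LoadHtml(html);

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return null; // nothing matched the XPath
        }

        // Convert entities such as &#231; back to text characters.
        return HtmlEntity.DeEntitize(node.InnerText);
    }

    public static void Main()
    {
        // Assumed sample input containing a numeric character reference.
        string html = "<html><body><p>i&#231;erisinde</p></body></html>";
        Console.WriteLine(ExtractDecodedText(html, "//p"));
    }
}
```

Decoding per extracted string keeps the parser working on the original tags, so encoded angle brackets or ampersands in the page text cannot be mistaken for markup.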