How to set encoding when using HtmlAgilityPack.HtmlDocument.LoadHtml


Question

I already have the source of an HTML page, so I am using:

string html_page_source="some page source crawled before";
HtmlDocument hdMyDoc = new HtmlDocument();
hdMyDoc.LoadHtml(html_page_source);





However, I see characters that have not been decoded, such as:

  
içerisinde 
göründüğünden çok
.
.





So how can I make HtmlDocument decode these automatically?

How can I set a default encoding to solve this problem?

And would the method below be good practice?

hdMyDoc.LoadHtml(HttpUtility.HtmlDecode(html_page_source));





C#, .NET 4.5 (latest), WPF application

Recommended answer



The Html Agility Pack is equipped with a utility class called HtmlEntity. It has a static method with the following signature:

/// <summary>
/// Replace known entities by characters.
/// </summary>
/// <param name="text">The source text.</param>
/// <returns>The result text.</returns>
public static string DeEntitize(string text)



It supports well-known named entities (like &nbsp;) as well as numeric character references such as &#039;.



Once you've extracted a string from the document, use this method to convert the HTML-encoded entities back to text characters.

Don't HTML-decode the source before loading the document. Decoding first turns escaped text such as &lt; into real markup characters, so the parser sees different tags and you completely change the meaning of the markup.
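
As a rough illustration of the order described above (load the raw source first, then decode only the text you extract), here is a minimal sketch. The class name EntityDemo, the helper ExtractDecodedText, the sample markup, and the XPath "//p" are all made up for this example; only HtmlDocument.LoadHtml, SelectSingleNode, InnerText, and HtmlEntity.DeEntitize come from the Html Agility Pack itself.

```csharp
using System;
using HtmlAgilityPack;

public static class EntityDemo
{
    // Hypothetical helper (not part of the library): loads the markup as-is,
    // selects the first node matching the XPath, and decodes entities only
    // in the extracted text. Returns null when nothing matches.
    public static string ExtractDecodedText(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // the source is NOT decoded before parsing

        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return null;
        }

        // Decode entities after extraction, as the answer recommends.
        return HtmlEntity.DeEntitize(node.InnerText);
    }

    public static void Main()
    {
        // Assumed sample input with numeric character references.
        string html = "<p>i&#231;erisinde g&#246;r&#252;nd&#252;&#287;&#252;nden &#231;ok</p>";
        Console.WriteLine(ExtractDecodedText(html, "//p"));
    }
}
```

Because decoding happens after parsing, an escaped sequence such as &lt;b&gt; inside the paragraph stays text rather than becoming a tag, which is the problem with calling HttpUtility.HtmlDecode on the whole page first.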

