Characters in string changed after downloading HTML from the internet
Question
Using the following code, I can download the HTML of a file from the internet:
WebClient wc = new WebClient();
// ....
string downloadedFile = wc.DownloadString("http://www.myurl.com/");
However, sometimes the file contains "interesting" characters that come out changed: é becomes Ã©, ← becomes â†, and フシギダネ becomes ãƒ•ã‚·ã‚®ãƒ€ãƒ.
I think it may have something to do with different Unicode types or something, as each character gets changed into two new ones, perhaps each character being split in half, but I have very little knowledge in this area. What do you think is wrong?
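The symptom described above is consistent with UTF-8 bytes being decoded with a single-byte encoding: é is two bytes in UTF-8 (0xC3 0xA9), and a single-byte decoder turns each byte into its own character. The following is a minimal sketch of that effect, not code from the question; the class name `MojibakeDemo` and its methods are hypothetical. It uses ISO-8859-1 (Latin-1) because it is available on every .NET runtime, whereas the output quoted in the question looks like Windows-1252, which differs from Latin-1 only for bytes 0x80 to 0x9F.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class MojibakeDemo
{
    // Latin-1 maps every byte 0x00-0xFF to exactly one character,
    // so decoding with it never fails and never drops a byte.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("ISO-8859-1");

    // Simulates the bug: text that was sent as UTF-8 is decoded as Latin-1.
    public static string Garble(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Latin1.GetString(utf8Bytes);
    }

    // Reverses it: recover the original bytes, then decode them as UTF-8.
    public static string Repair(string garbled)
    {
        byte[] originalBytes = Latin1.GetBytes(garbled);
        return Encoding.UTF8.GetString(originalBytes);
    }
}
```

Calling `MojibakeDemo.Garble("\u00e9")` (that is, é) returns the two characters Ã©, and `MojibakeDemo.Repair` on that result returns é again. The round trip only works because Latin-1 is lossless for bytes; in general it is better to decode the downloaded bytes with the right encoding in the first place.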
Recommended answer
Here's a wrapped download class which supports gzip and checks the encoding header and meta tags in order to decode the page correctly.
Instantiate the class and call GetPage().
using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public class HttpDownloader
{
    private readonly string _referer;
    private readonly string _userAgent;

    public Encoding Encoding { get; set; }
    public WebHeaderCollection Headers { get; set; }
    public Uri Url { get; set; }

    public HttpDownloader(string url, string referer, string userAgent)
    {
        // Fallback encoding, used when neither the header nor a meta tag names one.
        Encoding = Encoding.GetEncoding("ISO-8859-1");
        Url = new Uri(url); // verify the uri
        _userAgent = userAgent;
        _referer = referer;
    }

    public string GetPage()
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
        if (!string.IsNullOrEmpty(_referer))
            request.Referer = _referer;
        if (!string.IsNullOrEmpty(_userAgent))
            request.UserAgent = _userAgent;

        request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");

        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        {
            Headers = response.Headers;
            Url = response.ResponseUri; // may differ from the requested URL after redirects
            return ProcessContent(response);
        }
    }

    private string ProcessContent(HttpWebResponse response)
    {
        SetEncodingFromHeader(response);

        // Undo transport compression before looking at the bytes.
        Stream s = response.GetResponseStream();
        if (response.ContentEncoding.ToLower().Contains("gzip"))
            s = new GZipStream(s, CompressionMode.Decompress);
        else if (response.ContentEncoding.ToLower().Contains("deflate"))
            s = new DeflateStream(s, CompressionMode.Decompress);

        // Buffer the whole body so it can be decoded a second time if needed.
        MemoryStream memStream = new MemoryStream();
        int bytesRead;
        byte[] buffer = new byte[0x1000];
        for (bytesRead = s.Read(buffer, 0, buffer.Length); bytesRead > 0; bytesRead = s.Read(buffer, 0, buffer.Length))
        {
            memStream.Write(buffer, 0, bytesRead);
        }
        s.Close();

        string html;
        memStream.Position = 0;
        using (StreamReader r = new StreamReader(memStream, Encoding))
        {
            html = r.ReadToEnd().Trim();
            html = CheckMetaCharSetAndReEncode(memStream, html);
        }
        return html;
    }

    private void SetEncodingFromHeader(HttpWebResponse response)
    {
        string charset = null;
        if (string.IsNullOrEmpty(response.CharacterSet))
        {
            // Fall back to parsing the raw Content-Type header.
            Match m = Regex.Match(response.ContentType, @";\s*charset\s*=\s*(?<charset>.*)", RegexOptions.IgnoreCase);
            if (m.Success)
            {
                charset = m.Groups["charset"].Value.Trim(new[] { '\'', '"' });
            }
        }
        else
        {
            charset = response.CharacterSet;
        }

        if (!string.IsNullOrEmpty(charset))
        {
            try
            {
                Encoding = Encoding.GetEncoding(charset);
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the current encoding.
            }
        }
    }

    private string CheckMetaCharSetAndReEncode(Stream memStream, string html)
    {
        // The optional quote also matches the HTML5 form <meta charset="utf-8">.
        Match m = new Regex(@"<meta\s+.*?charset\s*=\s*[""']?(?<charset>[A-Za-z0-9_-]+)", RegexOptions.Singleline | RegexOptions.IgnoreCase).Match(html);
        if (m.Success)
        {
            string charset = m.Groups["charset"].Value.ToLowerInvariant();
            if ((charset == "unicode") || (charset == "utf-16"))
            {
                charset = "utf-8";
            }

            try
            {
                Encoding metaEncoding = Encoding.GetEncoding(charset);
                if (!metaEncoding.Equals(Encoding))
                {
                    // The page declares a different encoding: decode the same bytes again.
                    memStream.Position = 0L;
                    StreamReader recodeReader = new StreamReader(memStream, metaEncoding);
                    html = recodeReader.ReadToEnd().Trim();
                    recodeReader.Close();
                }
            }
            catch (ArgumentException)
            {
                // Unknown charset name in the meta tag: keep the first decoding.
            }
        }
        return html;
    }
}
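The core idea of the class (decode once with the header or fallback encoding, look for a charset declaration in the markup, then decode the same bytes again if it differs) can be tried without any network access. The following is a simplified sketch under that assumption, not a replacement for the class above; `CharsetSniffer` and its method names are hypothetical, and it skips gzip handling and trimming.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class CharsetSniffer
{
    // Same fallback as the class above: Latin-1 keeps ASCII markup readable.
    private static readonly Encoding Fallback = Encoding.GetEncoding("ISO-8859-1");

    // Returns the encoding named in a Content-Type value, or null if none is usable.
    public static Encoding FromContentType(string contentType)
    {
        Match m = Regex.Match(contentType ?? "",
            @";\s*charset\s*=\s*[""']?(?<charset>[A-Za-z0-9_-]+)", RegexOptions.IgnoreCase);
        return m.Success ? TryGetEncoding(m.Groups["charset"].Value) : null;
    }

    // Returns the encoding named in a <meta> tag, or null if none is usable.
    public static Encoding FromMetaTag(string html)
    {
        Match m = Regex.Match(html,
            @"<meta\s+.*?charset\s*=\s*[""']?(?<charset>[A-Za-z0-9_-]+)",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        return m.Success ? TryGetEncoding(m.Groups["charset"].Value) : null;
    }

    // Two-pass decode: header (or fallback) first, then the meta tag if it disagrees.
    public static string Decode(byte[] body, string contentType)
    {
        Encoding first = FromContentType(contentType) ?? Fallback;
        string html = first.GetString(body);

        Encoding declared = FromMetaTag(html);
        if (declared != null && !declared.Equals(first))
        {
            html = declared.GetString(body);
        }
        return html;
    }

    private static Encoding TryGetEncoding(string name)
    {
        try { return Encoding.GetEncoding(name); }
        catch (ArgumentException) { return null; } // unknown charset name
    }
}
```

For example, passing the UTF-8 bytes of a page that starts with `<meta charset="utf-8">` and a Content-Type of `text/html` yields correctly decoded text, while the same bytes without any declaration come back garbled by the Latin-1 fallback. If the page is known to be UTF-8, a lighter option for the code in the question is to set `wc.Encoding = Encoding.UTF8;` before calling `DownloadString`.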