WebClient.DownloadString() returns string with peculiar characters

Question

I have an issue with some content that I am downloading from the web for a screen-scraping tool that I am building.

In the code below, the string returned by WebClient.DownloadString contains some odd characters in the downloaded source of a few (not all) web sites.

I recently added the HTTP headers shown below. Previously the same code was called without the headers, with the same result. I have not tried variations on the 'Accept-Charset' header; I don't know much about text encoding beyond the basics.

The characters, or character sequences, that I refer to are:

"ï»¿"

""

These characters are not seen when you use "view source" in a web browser. What could be causing this and how can I rectify the problem?

string urlData = String.Empty;
WebClient wc = new WebClient();

// Add headers to impersonate a web browser. Some web sites 
// will not respond correctly without these headers
wc.Headers.Add("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12");
wc.Headers.Add("Accept", "*/*");
wc.Headers.Add("Accept-Language", "en-gb,en;q=0.5");
wc.Headers.Add("Accept-Charset", "ISO-8859-1,utf-8;q=0.7,*;q=0.7");

urlData = wc.DownloadString(uri);

Recommended answer

ï»¿ is the windows-1252 representation of the octets EF BB BF. That's the UTF-8 byte-order mark, which implies that your remote web page is encoded in UTF-8 but you're reading it as if it were windows-1252. According to the docs, WebClient.DownloadString uses WebClient.Encoding as its encoding when it converts the remote resource into a string. Set it to System.Text.Encoding.UTF8 and things should theoretically work.
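A minimal sketch of the suggestion above, assuming the remote page really is UTF-8. The names EncodingSketch, Sample, DecodeAsLatin1, DecodeAsUtf8 and DownloadUtf8 are made up for illustration and do not come from the question. ISO-8859-1 is used as a stand-in for windows-1252 because the two agree on the bytes EF BB BF and it should be available without registering extra code pages.

```csharp
using System;
using System.Net;
using System.Text;

public static class EncodingSketch
{
    // The UTF-8 byte-order mark (EF BB BF) followed by the ASCII text "<html>".
    public static readonly byte[] Sample =
        { 0xEF, 0xBB, 0xBF, 0x3C, 0x68, 0x74, 0x6D, 0x6C, 0x3E };

    // Decoding with a single-byte encoding turns each BOM byte into its own
    // character, which is how the stray three-character prefix appears.
    public static string DecodeAsLatin1(byte[] data)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(data);
    }

    // Decoding as UTF-8 turns the BOM into the single character U+FEFF,
    // which is trimmed here so only the page text remains.
    public static string DecodeAsUtf8(byte[] data)
    {
        return Encoding.UTF8.GetString(data).TrimStart('\uFEFF');
    }

    // Hypothetical helper: the download from the question, with the
    // encoding set explicitly before calling DownloadString.
    public static string DownloadUtf8(Uri uri)
    {
        using (var wc = new WebClient())
        {
            wc.Encoding = Encoding.UTF8; // assumption: the page is UTF-8
            return wc.DownloadString(uri);
        }
    }
}
```

Calling EncodingSketch.DecodeAsLatin1(EncodingSketch.Sample) reproduces the odd prefix, while DecodeAsUtf8 returns the clean text. For pages that are not UTF-8, a fixed encoding would be wrong, so reading the charset from the response's Content-Type header may be a safer choice.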
