HttpWebRequest: Receiving response with the right encoding


Problem description


I'm currently downloading an HTML page, using the following code:

Try
    Dim req As System.Net.HttpWebRequest = DirectCast(WebRequest.Create(URL), HttpWebRequest)
    req.Method = "GET"
    Dim resp As Net.HttpWebResponse = DirectCast(req.GetResponse(), Net.HttpWebResponse)
    Dim stIn As IO.StreamReader = New IO.StreamReader(resp.GetResponseStream())
    Dim strResponse As String = stIn.ReadToEnd

    ''Clean up
    stIn.Close()
    stIn.Dispose()
    resp.Close()

    Return strResponse

Catch ex As Exception
    Return ""
End Try

This works well for most pages, but for some (e.g. www.gap.com), the response comes back incorrectly decoded.
On gap.com, for example, I get "’" as "?".
Not to mention what happens if I try to load google.cn...

What am I missing here to get .NET to decode this correctly?

My worst fear is that I'll actually have to read the meta tag inside the HTML that specifies the encoding, and then re-read (re-encode?) the whole stream.

Any pointers will be greatly appreciated.


UPDATE:

Thanks to John Saunders' reply, I'm a bit closer. The HttpWebResponse.ContentEncoding property always seems to come in empty. However, HttpWebResponse.CharacterSet seems useful, and with this code I'm getting closer:

Dim resp As Net.HttpWebResponse = DirectCast(req.GetResponse(), Net.HttpWebResponse)
Dim respEncoding As Encoding = Encoding.GetEncoding(resp.CharacterSet)
Dim stIn As IO.StreamReader = New IO.StreamReader(resp.GetResponseStream(), respEncoding)

Now Google.cn comes in perfectly, with all the Chinese characters.
However, Gap.com is still coming in wrong.

For Gap.com, HttpWebResponse.CharacterSet is ISO-8859-1, the Encoding I'm getting through GetEncoding is {System.Text.Latin1Encoding}, which says "ISO-8859-1" in its body name, AND the Content-Type META tag in the HTML specifies "charset=ISO-8859-1".

Am I still doing something wrong?
Or is Gap doing something wrong?

Solution

Gap's site is wrong. The specific problem is that their page claims an encoding of Latin1 (ISO-8859-1), while the page uses character #146, which is not a valid printable character in ISO-8859-1.

That character is, however, valid in the Windows CP-1252 encoding (which is a superset of ISO 8859-1). In CP-1252, character code #146 is used for the right-quote character. You'll see this as an apostrophe in "You’ll find Petites and small sizes" in today's text on the Gap.com home page.
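
To make the difference concrete, here is a minimal sketch that decodes the single byte 146 with both encodings and prints the resulting code point. It assumes the .NET Framework, where code page 1252 is available out of the box (on .NET Core and later it would have to be registered through CodePagesEncodingProvider first). The module name Byte146Demo is made up for this illustration.

```vb
Imports System
Imports System.Text

Module Byte146Demo
    Sub Main()
        ' The single byte that Gap's page contains (decimal 146, hex 92).
        Dim raw As Byte() = {146}

        ' Decode it the way the page claims it should be decoded...
        Dim latin1 As String = Encoding.GetEncoding("ISO-8859-1").GetString(raw)
        ' ...and the way it was most likely written.
        ' Assumption: code page 1252 is available (true on the .NET Framework).
        Dim cp1252 As String = Encoding.GetEncoding(1252).GetString(raw)

        ' Print the Unicode code point each decoder produced.
        Console.WriteLine("ISO-8859-1: U+{0:X4}", AscW(latin1(0))) ' U+0092, a control character
        Console.WriteLine("CP-1252:    U+{0:X4}", AscW(cp1252(0))) ' U+2019, right single quotation mark
    End Sub
End Module
```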

You can read http://en.wikipedia.org/wiki/Windows-1252 for more details. Turns out this kind of thing is a common problem on web pages where the content was originally saved in the CP-1252 encoding (e.g. copy/pasted from Word).
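
One possible workaround on the client side, sketched below, is to do what browsers generally do and decode anything labeled ISO-8859-1 as CP-1252. Since CP-1252 is a superset, correctly labeled Latin1 pages should decode the same way. This is only a sketch under assumptions, not part of the original answer: PickEncoding and DownloadHtml are hypothetical helper names, the UTF-8 fallback for a missing or unknown charset is an arbitrary choice, and code page 1252 is assumed to be available as on the .NET Framework.

```vb
Imports System
Imports System.IO
Imports System.Net
Imports System.Text

Module EncodingSketch
    ' Hypothetical helper: choose a decoder from the charset the server reported.
    Function PickEncoding(charset As String) As Encoding
        If String.IsNullOrEmpty(charset) Then
            Return Encoding.UTF8 ' Assumption: fall back to UTF-8 when nothing is declared.
        End If

        ' Treat a Latin1 label as CP-1252, which covers bytes such as #146.
        If charset.Trim().Equals("ISO-8859-1", StringComparison.OrdinalIgnoreCase) Then
            Return Encoding.GetEncoding(1252)
        End If

        Try
            Return Encoding.GetEncoding(charset)
        Catch ex As ArgumentException
            ' Unknown charset name: same assumed fallback as above.
            Return Encoding.UTF8
        End Try
    End Function

    ' Hypothetical helper: download a page and decode it with the chosen encoding.
    Function DownloadHtml(url As String) As String
        Dim req As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        req.Method = "GET"

        ' Using blocks close the response and the reader even if reading fails.
        Using resp As HttpWebResponse = DirectCast(req.GetResponse(), HttpWebResponse)
            Using reader As New StreamReader(resp.GetResponseStream(), PickEncoding(resp.CharacterSet))
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function
End Module
```

A call such as DownloadHtml("http://www.gap.com") would then be expected to return the apostrophe as "’" rather than "?", provided the server still reports ISO-8859-1.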

Moral of the story here: always store internationalized text as Unicode in your database, and always emit HTML as UTF-8 on your web server!
