One specific site whose HTTP response (Hebrew) characters do not come properly encoded


Question

The following has been amusing me for a while now.

First of all, I have been scraping sites for a couple of months, Hebrew sites among them, and have had no problem whatsoever receiving Hebrew characters from the HTTP server.

For some reason that I am very curious to sort out, the following site is an exception. I can't get the characters properly decoded. I tried emulating the working requests I make via Fiddler, but to no avail. My C# request headers look exactly the same, but the characters are still unreadable.

What I do not understand is why I have always been able to retrieve Hebrew characters from other sites, while from this one specifically I cannot. What setting is causing this?

Try the following sample.

    HttpClient httpClient = new HttpClient();
    httpClient.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0");
    //httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html;q=0.9");
    //httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.5");
    //httpClient.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Encoding", "gzip, deflate");

    var getTask = httpClient.GetStringAsync("http://winedepot.co.il/Default.asp?Page=Sale");

    //doing it like this for the sake of the example
    var contents = getTask.Result;

    //add a breakpoint at the following line to check the contents of "contents"
    Console.WriteLine();

As mentioned, such code works for any other Israeli site I try, for instance the Ynet news site.

Update: While "debugging" with Fiddler, I found that the response for the Ynet site (one which works) includes the header

Content-Type: text/html; charset=UTF-8

while this header is absent from the response from winedepot.co.il.

I tried adding it, but it still made no difference.

    var getTask = httpClient.GetAsync("http://www.winedepot.co.il");

    var response = getTask.Result;

    var contentObj = response.Content;
    contentObj.Headers.Remove("Content-Type");
    contentObj.Headers.Add("Content-Type", "text/html; charset=UTF-8");

    var readTask = response.Content.ReadAsStringAsync();
    var contents = readTask.Result;
    Console.WriteLine();


Recommended answer

The problem you're encountering is that the web server is lying about its content type, or rather, not being specific enough.

The first site responds with this header:

Content-Type: text/html; charset=UTF-8

The second one responds with this header:

Content-Type: text/html

This means that in the second case, your client will have to make assumptions about what encoding the text is actually in. To learn more about text encodings, please read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

And the built-in HTTP clients for .NET don't really do a great job at this, which is understandable, because it is a Hard Problem. Read the linked article for the trouble a web browser will have to go through in order to guess the encoding, and then try to understand why you don't want this logic in a programmable web client.

Now the sites do provide you with a <meta http-equiv="Content-Type" content="actual encoding here" /> tag, which is a nasty workaround for not having to properly configure a web server. When a browser encounters such a tag, it will have to restart parsing the document with the specified content-type, and then hope it is correct.

The steps roughly are, assuming an HTML payload:


  1. Perform the web request and keep the response document in a binary buffer.
  2. Inspect the content-type header, if present. If it is absent or doesn't provide a charset, make an assumption about the encoding.
  3. Read the response by decoding the buffer, and parse the resulting HTML.
  4. When encountering a <meta http-equiv="Content-Type" /> tag, discard all decoded text and start again, interpreting the binary buffer as text in the specified encoding.
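
The steps above can be sketched roughly as follows. This is a minimal illustration, not a browser-grade detector: `HtmlCharsetSniffer` is a hypothetical helper (not part of .NET), the 2048-byte prefix and the UTF-8 last resort are assumptions, and it assumes .NET 5 or later, where `CodePagesEncodingProvider` has to be registered before `windows-1255` can be resolved (on .NET Framework the code page is available without it). Instead of decoding everything and then starting over, it peeks at the start of the buffer for a charset declaration and decodes once.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any .NET library.
public static class HtmlCharsetSniffer
{
    // Matches charset=... in a Content-Type header value or inside a <meta> tag.
    private static readonly Regex CharsetPattern =
        new Regex(@"charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)", RegexOptions.IgnoreCase);

    static HtmlCharsetSniffer()
    {
        // Assumption: .NET 5+; makes legacy code pages such as windows-1255 available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Step 1 (buffering the response as bytes) is left to the caller.
    public static string Decode(byte[] body, string contentTypeHeader)
    {
        // Step 2: trust the HTTP header when it names a charset.
        string charset = ExtractCharset(contentTypeHeader);

        if (charset == null)
        {
            // Steps 3-4: read the first bytes as ISO-8859-1 (each byte maps to one
            // char, so the ASCII markup survives) and look for a declared charset.
            int prefixLength = Math.Min(body.Length, 2048);
            string prefix = Encoding.GetEncoding("ISO-8859-1").GetString(body, 0, prefixLength);
            charset = ExtractCharset(prefix);
        }

        // Assumed last resort when nothing usable was declared: UTF-8.
        Encoding encoding = Resolve(charset) ?? Encoding.UTF8;
        return encoding.GetString(body);
    }

    public static string ExtractCharset(string text)
    {
        if (string.IsNullOrEmpty(text)) return null;
        Match match = CharsetPattern.Match(text);
        return match.Success ? match.Groups[1].Value : null;
    }

    private static Encoding Resolve(string charset)
    {
        if (charset == null) return null;
        try { return Encoding.GetEncoding(charset); }
        catch (ArgumentException) { return null; } // unknown or unsupported name
    }
}
```

With `HttpClient`, the buffer could come from `response.Content.ReadAsByteArrayAsync()` and the header value from `response.Content.Headers.ContentType?.ToString()`.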

The C# HTTP clients stop at step 2, and rightfully so. They are HTTP clients, not HTML-displaying browsers. They don't care that your payload is HTML, JSON, XML, or any other textual format.

When no charset is given in the content-type response header, the .NET HTTP clients fall back to a default encoding: HttpWebResponse.CharacterSet reports ISO-8859-1, and HttpClient's ReadAsStringAsync falls back to UTF-8. Neither can display the characters from the Windows-1255 (Hebrew) character set that the page is actually encoded in, because the same byte values stand for different characters there.
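
To make the mismatch concrete, here is a small sketch of what happens to the bytes. `MojibakeDemo` and its method names are hypothetical, and the code assumes .NET 5 or later (hence the `CodePagesEncodingProvider` registration). The same three bytes read as Hebrew under Windows-1255 and as accented Latin letters under ISO-8859-1.

```csharp
using System;
using System.Text;

// Hypothetical demo class; the names are illustrative only.
public static class MojibakeDemo
{
    static MojibakeDemo()
    {
        // Assumption: .NET 5+; required there before windows-1255 can be resolved.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // The bytes a server would send for the given text in the given encoding.
    public static byte[] Encode(string text, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetBytes(text);
    }

    // What a client sees when it decodes those bytes with the given encoding.
    public static string Decode(byte[] bytes, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }

    // Repairs text that was wrongly decoded as ISO-8859-1. That encoding maps
    // every byte to exactly one char, so the original bytes can be recovered.
    public static string Repair(string misdecoded, string actualEncodingName)
    {
        byte[] original = Encoding.GetEncoding("ISO-8859-1").GetBytes(misdecoded);
        return Encoding.GetEncoding(actualEncodingName).GetString(original);
    }
}
```

For example, `Encode("\u05D9\u05D9\u05DF", "windows-1255")` (the Hebrew word for wine) yields the bytes E9 E9 EF, which `Decode(..., "ISO-8859-1")` turns into "ééï". `Repair` only works for text that went through ISO-8859-1; a wrong UTF-8 decode replaces invalid bytes and cannot be undone this way.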

Some C# implementations that try to do encoding detection from the meta HTML element are provided in Encoding trouble with HttpWebResponse. I cannot vouch for their correctness, so you'll have to try it at your own risk. I do know that the currently highest-voted answer actually re-issues the request when it encounters the meta tag, which is quite silly, because there is no guarantee that the second response will be the same as the first, and it's just a waste of bandwidth.

You can also assume that you know the encoding used by a certain site or page, and then force that encoding:

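using (Stream resStream = response.GetResponseStream())
// YourFixedEncoding is a placeholder for the encoding you assume, e.g. Encoding.GetEncoding(1255).
using (StreamReader reader = new StreamReader(resStream, YourFixedEncoding))
{
    string content = reader.ReadToEnd();
}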

Or, for HttpClient:

using (var client = new HttpClient())
{
    var response = await client.GetAsync(url);
    // The stream is read from the response content, not from the client.
    var responseStream = await response.Content.ReadAsStreamAsync();
    // 1255 is the Windows-1255 (Hebrew) code page.
    using (var fixedEncodingReader = new StreamReader(responseStream, Encoding.GetEncoding(1255)))
    {
        string responseString = fixedEncodingReader.ReadToEnd();
    }
}

But assuming an encoding for a particular response, URL, or site is entirely unsafe. It is in no way guaranteed that the assumption will be correct every time.
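
One way to soften that risk, sketched here as a heuristic rather than a guarantee, is to try a strict UTF-8 decode first and use the assumed encoding only when the bytes are not valid UTF-8. `FallbackDecoder` is a hypothetical name, and the code assumes .NET 5 or later for the provider registration. Legacy single-byte Hebrew text almost never forms valid UTF-8 sequences, but pure ASCII passes either way and a false match is still possible.

```csharp
using System;
using System.Text;

// Hypothetical helper; a heuristic, not a reliable encoding detector.
public static class FallbackDecoder
{
    static FallbackDecoder()
    {
        // Assumption: .NET 5+; makes code pages such as 1255 available.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    public static string Decode(byte[] body, Encoding assumedEncoding)
    {
        // throwOnInvalidBytes: true makes invalid sequences raise an exception
        // instead of being silently replaced with U+FFFD.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            return strictUtf8.GetString(body);
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8, so fall back to the encoding assumed for this site.
            return assumedEncoding.GetString(body);
        }
    }
}
```

A call such as `FallbackDecoder.Decode(bytes, Encoding.GetEncoding(1255))` then keeps working if the site ever switches to UTF-8.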
