Kanji characters from WebClient HTML differ from the actual Kanji on the website


Problem description

I'm trying to get a portion of text from a website called Kanji-A-Day.com, but I have a problem.

I'm trying to get the daily kanji from the website, and I was able to narrow the HTML down to the part I want, but the characters I get seem to be different.

[Image: what it looks like]

[Image: what it should look like]

What's even stranger is that I produced the result in the second image by copying and pasting directly from the site, so it's not a font problem.

Here's the code I use to get the character:

public void UpdateDailyKanji() // Called at the initialization of a new main form
{
    string kanji;
    using (WebClient client = new WebClient()) // Grab the string 
        kanji = client.DownloadString("http://www.kanji-a-day.com/level4/index.php"); 
    // Trim the HTML to just the Kanji
    kanji = kanji.Remove(0, kanji.IndexOf(@"<div class=""glyph"">") + 19);
    kanji = kanji.Remove(kanji.IndexOf("</div>")-2);
    kanji = kanji.Trim();
    Text_DailyKanji.Text = kanji; // Set the Kanji
}

Does anyone know what's going on here? I'm guessing it's some Unicode thing, but I don't know much about it.

Thanks in advance.

Recommended answer

The page you're trying to download as a string is encoded as EUC-JP (charset=EUC-JP), also known as Japanese (EUC), code page 51932. This is clearly set in the page headers.
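To see the effect in isolation, here is a minimal sketch that encodes a kanji as EUC-JP and then decodes the same bytes with a chosen encoding. The class and method names are made up for illustration. The `Encoding.RegisterProvider` call assumes .NET Core / .NET 5+, where EUC-JP is not available until the code pages provider is registered; on .NET Framework that call should not be needed.

```csharp
using System;
using System.Text;

public static class EncodingMismatchDemo
{
    static EncodingMismatchDemo()
    {
        // Assumption: .NET Core / .NET 5+, where legacy code pages such as
        // EUC-JP must be registered before Encoding.GetEncoding can find them.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Encodes the text as EUC-JP (as the server would), then decodes the raw
    // bytes with the given decoder (as the client would).
    public static string RoundTrip(string text, Encoding decoder)
    {
        byte[] raw = Encoding.GetEncoding("euc-jp").GetBytes(text);
        return decoder.GetString(raw);
    }
}
```

Calling `EncodingMismatchDemo.RoundTrip("日", Encoding.GetEncoding("iso-8859-1"))` gives two Latin characters instead of the kanji, while passing `Encoding.GetEncoding("euc-jp")` as the decoder returns the original character.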

Why is the string returned by WebClient.DownloadString decoded using the wrong encoding?

The MSDN Docs state this:

This method retrieves the specified resource. After it downloads the resource, the method uses the encoding specified in the Encoding property to convert the resource to a String.

Thus, you have to know beforehand which encoding the page uses and specify it by setting the WebClient.Encoding property.
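When the charset is known in advance, as it is for this page, setting the property before the download could look like the sketch below. `KanjiPageDownloader` and `DownloadAs` are hypothetical names, and the provider registration again assumes .NET Core / .NET 5+.

```csharp
using System;
using System.Net;
using System.Text;

public static class KanjiPageDownloader
{
    static KanjiPageDownloader()
    {
        // Assumption: .NET Core / .NET 5+; makes EUC-JP and other legacy
        // code pages available to Encoding.GetEncoding.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Downloads the resource and decodes it with the named charset
    // instead of the WebClient default.
    public static string DownloadAs(Uri address, string charsetName)
    {
        using (var client = new WebClient())
        {
            client.Encoding = Encoding.GetEncoding(charsetName);
            return client.DownloadString(address);
        }
    }
}
```

For the page in the question, the call would be `KanjiPageDownloader.DownloadAs(new Uri("http://www.kanji-a-day.com/level4/index.php"), "euc-jp")`. The drawback is that the charset is hard-coded, so the result breaks again if the site ever changes its encoding.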

To verify this, check the .NET Reference Source for the WebClient.DownloadString method:

try {
    WebRequest request;
    byte [] data = DownloadDataInternal(address, out request);
    string stringData = GetStringUsingEncoding(request, data);
    if(Logging.On)Logging.Exit(Logging.Web, this, "DownloadString", stringData);
    return stringData;
} finally {
    CompleteWebClientState();
}

The encoding is determined from the Request settings, not the Response ones.
As a result, the downloaded bytes are decoded using the default code page instead of the page's EUC-JP, which produces the wrong characters.
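One way to bring the response into the picture is to subclass WebClient and copy the response charset into the Encoding property before the bytes are decoded. This is only a sketch: `CharsetAwareWebClient` is a hypothetical name, it covers the synchronous path only, and it relies on WebClient reading Encoding after `GetWebResponse` returns, which matches the reference source quoted above but is an implementation detail. The provider registration assumes .NET Core / .NET 5+.

```csharp
using System;
using System.Net;
using System.Text;

public class CharsetAwareWebClient : WebClient
{
    static CharsetAwareWebClient()
    {
        // Assumption: .NET Core / .NET 5+; registers legacy code pages such as EUC-JP.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    protected override WebResponse GetWebResponse(WebRequest request)
    {
        WebResponse response = base.GetWebResponse(request);

        // Only HTTP responses expose a charset; other responses keep the current Encoding.
        var http = response as HttpWebResponse;
        if (http != null && !string.IsNullOrEmpty(http.CharacterSet))
        {
            try
            {
                Encoding = Encoding.GetEncoding(http.CharacterSet);
            }
            catch (ArgumentException)
            {
                // Unknown charset name: leave the current Encoding untouched.
            }
        }
        return response;
    }
}
```

With this class, `new CharsetAwareWebClient().DownloadString(uri)` would be used in place of a plain WebClient call.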

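What you can do now is one of the following:

  • Download the page twice, using the first download to check whether the WebClient encoding and the HTML page encoding mismatch.
  • Re-encode the string using the correct encoding set in the underlying WebResponse.
  • Don't use WebClient; use HttpClient or WebRequest directly. Or, if you like this tool, create a custom WebClient class that handles the WebRequest/WebResponse in a more direct way.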
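The HttpClient option from the list could be sketched as follows: download the raw bytes and decode them with the charset taken from the response's Content-Type header. `CharsetAwareFetcher`, `Decode`, and `GetPageAsync` are hypothetical names, and the provider registration assumes .NET Core / .NET 5+.

```csharp
using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class CharsetAwareFetcher
{
    private static readonly HttpClient Http = new HttpClient();

    static CharsetAwareFetcher()
    {
        // Assumption: .NET Core / .NET 5+; registers legacy code pages such as EUC-JP.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes the body with the named charset; uses the fallback when the
    // charset is missing or not recognized.
    public static string Decode(byte[] body, string charset, Encoding fallback)
    {
        Encoding encoding = fallback;
        if (!string.IsNullOrWhiteSpace(charset))
        {
            try
            {
                // Some servers quote the value, e.g. charset="euc-jp".
                encoding = Encoding.GetEncoding(charset.Trim().Trim('"'));
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the fallback.
            }
        }
        return encoding.GetString(body);
    }

    // Fetches the page as bytes and decodes it using the response charset.
    public static async Task<string> GetPageAsync(Uri address, Encoding fallback)
    {
        using (HttpResponseMessage response = await Http.GetAsync(address))
        {
            response.EnsureSuccessStatusCode();
            byte[] body = await response.Content.ReadAsByteArrayAsync();
            string charset = response.Content.Headers.ContentType?.CharSet;
            return Decode(body, charset, fallback);
        }
    }
}
```

Keeping `Decode` separate from the network call makes the charset handling easy to check without a server. If the server declares the charset only in an HTML meta tag and not in the header, this sketch falls back to the supplied encoding.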

This is a method to perform the re-encoding task:
The string returned by WebClient is converted back to a byte array and passed to a MemoryStream, then decoded again using a StreamReader with the Encoding retrieved from the Content-Type: charset Response Header.


Update: the method now uses Reflection to get the page Encoding from the underlying HttpWebResponse. This should avoid errors in parsing the original CharacterSet as defined by the remote response.

using System;
using System.IO;
using System.Net;
using System.Reflection;
using System.Text;

public string WebClient_DownLoadString(Uri uri)
{
    using (var client = new WebClient())
    {
        // If Windows 7 - Windows Server 2008 R2
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;

        client.CachePolicy = new System.Net.Cache.RequestCachePolicy(System.Net.Cache.RequestCacheLevel.BypassCache);
        client.Headers.Add(HttpRequestHeader.Accept, "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        client.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
        client.Headers.Add(HttpRequestHeader.KeepAlive, "keep-alive");

        // Decoded with client.Encoding, which may not match the page's charset
        string result = client.DownloadString(uri);

        // m_WebResponse is a private field of WebClient, read here via Reflection.
        // It is an implementation detail, so the name may differ between runtimes.
        var flags = BindingFlags.Instance | BindingFlags.NonPublic;
        using (var response = (HttpWebResponse)client.GetType().GetField("m_WebResponse", flags).GetValue(client))
        {
            // Encoding declared by the server in Content-Type: charset
            var pageEncoding = Encoding.GetEncoding(response.CharacterSet);
            // Recover the original bytes from the wrongly decoded string
            byte[] bytes = client.Encoding.GetBytes(result);
            using (var ms = new MemoryStream(bytes, 0, bytes.Length))
            using (var reader = new StreamReader(ms, pageEncoding))
            {
                return reader.ReadToEnd();
            }
        }
    }
}

Now your code should get the Japanese characters in their correct form.

Uri uri = new Uri("http://www.kanji-a-day.com/level4/index.php", UriKind.Absolute);
string kanji = WebClient_DownLoadString(uri);

kanji = kanji.Remove(0, kanji.IndexOf(@"<div class=""glyph"">") + 19);
kanji = kanji.Remove(kanji.IndexOf("</div>")-2);
kanji = kanji.Trim();

Text_DailyKanji.Text = kanji;
