Kanji characters from WebClient HTML different from actual Kanji in website

Problem description

So, I'm trying to get a portion of text from a website called Kanji-A-Day.com, but I have a problem.

You see, I'm trying to get the daily kanji from the website, and I was able to narrow the HTML down to what I want, but it seems the characters are different..?

What it looks like

What it should look like

What's even more strange is that I produced the results for the second image by copying and pasting directly from the site, so it's not a font problem.

Here's the code I use for getting the character:

public void UpdateDailyKanji() // Called at the initialization of a new main form
{
    string kanji;
    using (WebClient client = new WebClient()) // Grab the string 
        kanji = client.DownloadString("http://www.kanji-a-day.com/level4/index.php"); 
    // Trim the HTML to just the Kanji
    kanji = kanji.Remove(0, kanji.IndexOf(@"<div class=""glyph"">") + 19);
    kanji = kanji.Remove(kanji.IndexOf("</div>")-2);
    kanji = kanji.Trim();
    Text_DailyKanji.Text = kanji; // Set the Kanji
}

Does anyone know what's going on here? I'm guessing it's some Unicode thing but I don't know much about it.

Thanks.

Recommended answer

The page you're trying to download as a string is encoded using charset=EUC-JP, also known as Japanese (EUC) (CodePage 51932). This is clearly set in the page headers.

Why is the string returned by WebClient.DownloadString decoded using the wrong encoding?

The MSDN documentation states:

This method retrieves the specified resource. After it downloads the resource, the method uses the encoding specified in the Encoding property to convert the resource to a String.

Thus, you have to know beforehand what encoding will be used and specify it, setting the WebClient.Encoding property.
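
Setting the property up front can be sketched as follows. This is a minimal sketch, assuming the page really is served as EUC-JP, as it was here. The `EucJpDownloader` class and its method names are hypothetical. It also assumes a .NET Core / .NET 5+ runtime, where the EUC-JP code page is only available after `CodePagesEncodingProvider` is registered; on .NET Framework the code page is built in and that line can be dropped.

```csharp
using System;
using System.Net;
using System.Text;

public static class EucJpDownloader // hypothetical helper name
{
    // Creates a WebClient whose DownloadString decodes the response as EUC-JP.
    public static WebClient CreateClient()
    {
        // Assumption: .NET Core / .NET 5+. On .NET Framework, EUC-JP is
        // built in and this registration is not needed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        return new WebClient { Encoding = Encoding.GetEncoding("euc-jp") };
    }

    // Downloads the page and returns it decoded with the encoding set above.
    public static string Download(string url)
    {
        using (WebClient client = CreateClient())
            return client.DownloadString(url);
    }
}
```

With this in place, `EucJpDownloader.Download("http://www.kanji-a-day.com/level4/index.php")` would replace the `DownloadString` call in the question. The drawback is that the encoding is hard-coded, so it breaks silently if the site ever changes its charset.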

To verify this, check the .NET Reference Source for the WebClient.DownloadString method:

try {
    WebRequest request;
    byte[] data = DownloadDataInternal(address, out request);
    string stringData = GetStringUsingEncoding(request, data);
    if (Logging.On) Logging.Exit(Logging.Web, this, "DownloadString", stringData);
    return stringData;
} finally {
    CompleteWebClientState();
}

The encoding is determined from the Request settings, not from the Response.
As a result, the downloaded bytes are decoded using the WebClient.Encoding property, which, unless you set it, is the default CodePage.
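
The effect of that mismatch can be reproduced without any network access. The sketch below is illustrative only: `MojibakeDemo` and its methods are hypothetical names, ISO-8859-1 stands in for whatever default code page a given machine uses, and the provider registration assumes .NET Core / .NET 5+ (it is not needed on .NET Framework).

```csharp
using System;
using System.Text;

public static class MojibakeDemo // hypothetical name
{
    // Encodes text the way an EUC-JP server would send it over the wire.
    public static byte[] AsEucJpBytes(string text)
    {
        // Assumption: .NET Core / .NET 5+; not required on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("euc-jp").GetBytes(text);
    }

    // Decodes raw bytes with the named charset, right or wrong.
    public static string Decode(byte[] wire, string charset)
    {
        return Encoding.GetEncoding(charset).GetString(wire);
    }

    public static void Main()
    {
        byte[] wire = AsEucJpBytes("漢");

        // Same bytes, two decoders: only the matching one restores the kanji.
        Console.WriteLine(Decode(wire, "euc-jp"));
        Console.WriteLine(Decode(wire, "iso-8859-1")); // stand-in for a wrong default
    }
}
```

The bytes themselves are never damaged in transit; only the choice of decoder differs. This is also why copying the character from the browser gives the right result: the browser honours the charset declared by the page.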

What you can do now is:
- Download the page twice, the first time to check whether the WebClient encoding and the Html page encoding don't match.
- Re-encode the string with the correct encoding.

This is a method to perform the latter task:
The string returned by WebClient is converted to a Byte Array and passed to a MemoryStream, then re-encoded using a StreamReader with the Encoding retrieved from the Content-Type: charset Response Header.


The method now uses Reflection to get the page Encoding from the underlying HttpWebResponse. This should avoid errors in parsing the original CharacterSet as defined by the remote response.

using System.IO;
using System.Net;
using System.Reflection;
using System.Text;

public string WebClient_DownLoadString(Uri URI)
{
    using (var client = new WebClient())
    {
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12;

        client.CachePolicy = new System.Net.Cache.RequestCachePolicy(System.Net.Cache.RequestCacheLevel.BypassCache);
        client.Headers.Add(HttpRequestHeader.Accept, "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        client.Headers.Add(HttpRequestHeader.AcceptLanguage, "en-US,en;q=0.8");
        client.Headers.Add(HttpRequestHeader.KeepAlive, "keep-alive");

        // Decoded with client.Encoding, which may not match the page's charset
        string result = client.DownloadString(URI);

        // Read the private m_WebResponse field to reach the underlying response
        var flags = BindingFlags.Instance | BindingFlags.NonPublic;
        using (var response = (HttpWebResponse)client.GetType().GetField("m_WebResponse", flags).GetValue(client))
        {
            // Charset declared in the Content-Type response header
            var pageEncoding = Encoding.GetEncoding(response.CharacterSet);
            // Turn the string back into bytes, then decode them with the page encoding
            byte[] bytes = client.Encoding.GetBytes(result);
            using (var ms = new MemoryStream(bytes, 0, bytes.Length))
            using (var reader = new StreamReader(ms, pageEncoding))
            {
                return reader.ReadToEnd();
            }
        }
    }
}

Now your code should get the Japanese characters in their correct form.

Uri uri = new Uri("http://www.kanji-a-day.com/level4/index.php", UriKind.Absolute);
string kanji = WebClient_DownLoadString(uri);

kanji = kanji.Remove(0, kanji.IndexOf("<div class=\"glyph\">") + 19);
kanji = kanji.Remove(kanji.IndexOf("</div>")-2);
kanji = kanji.Trim();

Text_DailyKanji.Text = kanji;
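
As a possible alternative to the reflection-based method (not part of the original answer), the raw bytes can be fetched with `DownloadData` and decoded once, using the charset from the public `ResponseHeaders` property. This avoids both the private-field lookup and the string-to-bytes round trip, which can lose data if the first decoding is not reversible. The sketch assumes the server declares the charset in its Content-Type header and falls back to UTF-8 otherwise; `CharsetAwareDownloader` and its members are hypothetical names, and the provider registration assumes .NET Core / .NET 5+.

```csharp
using System;
using System.Net;
using System.Net.Mime;
using System.Text;

public static class CharsetAwareDownloader // hypothetical helper name
{
    // Returns the encoding named in a Content-Type header value, or the
    // fallback when the header is missing, malformed, or names an unknown charset.
    public static Encoding EncodingFromContentType(string contentType, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(contentType)) return fallback;
        try
        {
            string charset = new ContentType(contentType).CharSet;
            return string.IsNullOrEmpty(charset) ? fallback : Encoding.GetEncoding(charset);
        }
        catch (FormatException) { return fallback; }   // malformed header value
        catch (ArgumentException) { return fallback; } // unknown charset name
    }

    // Downloads the raw bytes and decodes them once, with the response's charset.
    public static string DownloadString(Uri uri)
    {
        // Assumption: .NET Core / .NET 5+; not required on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        using (var client = new WebClient())
        {
            byte[] raw = client.DownloadData(uri);
            string contentType = client.ResponseHeaders?[HttpResponseHeader.ContentType];
            return EncodingFromContentType(contentType, Encoding.UTF8).GetString(raw);
        }
    }
}
```

Usage would mirror the snippet above: `string kanji = CharsetAwareDownloader.DownloadString(uri);`. If a site declares its charset only in an HTML meta tag, this sketch would fall back to UTF-8 and the meta tag would have to be parsed separately.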
