Encoding issue with string stored in database


Problem description


I have an encoding problem. I have text in my MongoDB that is wrongly encoded. The source file of the texts in my db is encoded in ISO-8859-1. Now, when viewing the text in my db, some characters are broken (they show up as '�').

Currently, when retrieving the text from the db, I tried the following code.

var t = Collection.FindOne(Query.EQ("id", "2014121500892"));
string message = t["b203"].AsString;
Console.WriteLine(ChangeEncoding(message));

First attempt:

static string ChangeEncoding(string message)
{

    System.Text.Encoding srcEnc = System.Text.Encoding.GetEncoding("ISO-8859-1");
    System.Text.Encoding destEnc = System.Text.Encoding.GetEncoding("UTF-8");
    byte[] bData = srcEnc.GetBytes(message);
    byte[] bResult = System.Text.Encoding.Convert(srcEnc, destEnc, bData);
    return destEnc.GetString(bResult);
}

Second attempt:

static string ChangeEncoding(string message)
{
    File.WriteAllText("text.txt", message, Encoding.GetEncoding("ISO-8859-1"));
    return File.ReadAllText("text.txt");
}

Sample text in db:

Box aus Pappe f�r A8-Lernk�rtchen

Desired result:

I want to be able to print it in console as:

Box aus Pappe für A8-Lernkärtchen

Solution

Short version

Your data is lost, and there is no general way to recover the original strings.

Longer version

What presumably happened when the data was stored: the strings were encoded as ISO-8859-1 but were stored as if they were Unicode UTF-8. Here's an example:

string orig = "Lernkärtchen";
byte[] iso88591Bytes = Encoding.GetEncoding("ISO-8859-1").GetBytes(orig);
// { 76, 101, 114, 110, 107, 228, 114, 116, 99, 104, 101, 110 }
//  'L', 'e', 'r', 'n', 'k', 'ä', 'r', 't', 'c', 'h', 'e', 'n'

When this data was passed (somehow...) to the database, which only works with Unicode strings, the bytes were decoded as if they were UTF-8:

string storedValue = Encoding.UTF8.GetString(iso88591Bytes);
byte[] dbData = Encoding.UTF8.GetBytes(storedValue);
// { 76, 101, 114, 110, 107, 239, 191, 189, 114, 116, 99, 104, 101, 110 }
//  'L', 'e', 'r', 'n', 'k',      '�',     'r', 't', 'c', 'h', 'e', 'n'

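The problem is that the byte 228 (11100100 in binary) is not valid UTF-8 in this position: a byte of the form 1110xxxx starts a three-byte sequence, so it must be followed by two continuation bytes of the form 10xxxxxx (values 128 to 191). Here it is followed by 114 ('r') instead. For details, see UTF-8 on Wikipedia, chapter "Description".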
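To see that rule in action, here is a minimal sketch that asks .NET to decode the ISO-8859-1 bytes strictly instead of silently substituting a replacement character. `IsValidUtf8` is a hypothetical helper name introduced only for this illustration; it relies on the `UTF8Encoding` constructor overload that throws on invalid bytes.

```csharp
using System;
using System.Text;

// ISO-8859-1 bytes of "Lernkärtchen"; 228 (0xE4) is 'ä'.
byte[] latin1Bytes = { 76, 101, 114, 110, 107, 228, 114, 116, 99, 104, 101, 110 };

// The same word with 'ä' encoded as proper UTF-8 (195, 164).
byte[] utf8Bytes = { 76, 101, 114, 110, 107, 195, 164, 114, 116, 99, 104, 101, 110 };

Console.WriteLine(IsValidUtf8(latin1Bytes)); // False: 228 is not followed by continuation bytes
Console.WriteLine(IsValidUtf8(utf8Bytes));   // True

// Hypothetical helper: returns true if the bytes decode cleanly as UTF-8.
static bool IsValidUtf8(byte[] bytes)
{
    // Second argument (throwOnInvalidBytes) = true: throw instead of emitting U+FFFD.
    var strictUtf8 = new UTF8Encoding(false, true);
    try
    {
        strictUtf8.GetString(bytes);
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}
```

A strict decoder like this is mainly useful at import time, where failing loudly is preferable to storing damaged text.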

So the byte that used to represent the character 'ä' cannot be decoded into a valid Unicode character and is replaced by U+FFFD, which is stored as the bytes 239, 191 and 189. In binary these are 11101111, 10111111 and 10111101, which encode the code point 1111111111111101 (0xFFFD), the character '�' you see in your output.
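The bit arithmetic can be checked by hand. The sketch below strips the UTF-8 marker bits from the three bytes and joins the remaining payload bits; `DecodeThreeByteSequence` is a hypothetical helper written for this illustration and does no validation of its input.

```csharp
using System;
using System.Text;

byte[] replacementBytes = { 239, 191, 189 }; // 0xEF 0xBF 0xBD

int codePoint = DecodeThreeByteSequence(replacementBytes[0], replacementBytes[1], replacementBytes[2]);
Console.WriteLine($"U+{codePoint:X4}"); // U+FFFD

// Cross-check against the framework's own decoder.
string decoded = Encoding.UTF8.GetString(replacementBytes);
Console.WriteLine(decoded.Length == 1 && decoded[0] == '\uFFFD'); // True

// Hypothetical helper: 1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyyyyzzzzzz
static int DecodeThreeByteSequence(byte lead, byte cont1, byte cont2)
{
    return ((lead & 0x0F) << 12)   // 4 payload bits from the lead byte
         | ((cont1 & 0x3F) << 6)   // 6 payload bits from the first continuation byte
         | (cont2 & 0x3F);         // 6 payload bits from the second continuation byte
}
```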

This character is used for exactly that purpose. The Wikipedia page on Unicode special characters says:

U+FFFD � replacement character used to replace an unknown or unrepresentable character

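Try to revert that change? Good luck: every invalid byte was replaced by the same U+FFFD, so the stored string no longer tells you whether the original character was 'ä', 'ü' or something else.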
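A small sketch of why the damage cannot be undone, assuming the import path decoded ISO-8859-1 bytes as UTF-8 as described above. `SimulateBadImport` is a hypothetical name for that step. Two different words end up as the same stored string, and re-encoding the stored string (the approach of the first attempt in the question) does not bring the original back.

```csharp
using System;
using System.Text;

Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

// "für" and "fär", written with escapes so the source file's own encoding does not matter.
string first = SimulateBadImport("f\u00FCr", latin1);
string second = SimulateBadImport("f\u00E4r", latin1);

// Both collapse to 'f', U+FFFD, 'r': the original byte value is gone.
Console.WriteLine(first == second);

// Re-encoding the stored string cannot restore a byte that is no longer there.
byte[] roundTrip = latin1.GetBytes(first);
string repaired = Encoding.UTF8.GetString(Encoding.Convert(latin1, Encoding.UTF8, roundTrip));
Console.WriteLine(repaired == "f\u00FCr");

// Hypothetical stand-in for the faulty import: ISO-8859-1 bytes decoded as UTF-8.
static string SimulateBadImport(string original, Encoding sourceEncoding)
{
    byte[] sourceBytes = sourceEncoding.GetBytes(original);
    return Encoding.UTF8.GetString(sourceBytes);
}
```

Guessing the missing letter from context or a dictionary may work for individual words, but that is a heuristic, not a decoding step.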

Btw, Unicode and UTF-8 are awesome ♥, never use anything else ☠!
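If the original ISO-8859-1 source file is still available, re-importing it with the correct encoding avoids the problem altogether. The following is only a sketch under that assumption: the temporary file stands in for the real source file, and the step that writes the string to MongoDB is left out.

```csharp
using System;
using System.IO;
using System.Text;

Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");
string expected = "Box aus Pappe f\u00FCr A8-Lernk\u00E4rtchen";

// Hypothetical stand-in for the real source file: a temp file holding ISO-8859-1 bytes.
string path = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
File.WriteAllBytes(path, latin1.GetBytes(expected));

// Without an explicit encoding, File.ReadAllText assumes UTF-8 and damages the umlauts.
string wrong = File.ReadAllText(path);

// Naming the source encoding yields a correct .NET string, which can then be stored as UTF-8.
string right = File.ReadAllText(path, latin1);

Console.WriteLine(wrong.IndexOf('\uFFFD') >= 0); // True
Console.WriteLine(right == expected);            // True

File.Delete(path);
```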
