Trouble with utf8 characters; what I see is not what I stored
Problem description
I tried to use utf8 and ran into trouble.
I have tried so many things; here are the results I have gotten:
- ???? instead of Asian characters. Even for European text, I got Se?or for Señor.
- Strange gibberish (Mojibake?) such as Señor or æ–°æµªæ–°é—» for 新浪新闻.
- Black diamonds, such as Se�or.
- Finally, I got into a situation where the data was lost, or at least truncated: Se for Señor.
- Even when I got text to look right, it did not sort correctly.
What am I doing wrong? How can I fix the code? Can I recover the data, if so, how?
Solution
This problem plagues the participants of this forum, and many others. You have listed the 5 main cases of CHARACTER SET troubles.
Best Practice
Going forward, it is best to use CHARACTER SET utf8mb4 and COLLATION utf8mb4_unicode_520_ci. (There is a newer version of the unicode collation in the pipeline.)
utf8mb4 is a superset of utf8 in that it handles 4-byte utf8 codes, which are needed by Emoji and some of Chinese.
Outside of MySQL, "UTF-8" refers to all size encodings, hence effectively the same as MySQL's utf8mb4, not utf8. I will try to use those spellings and capitalizations to distinguish inside versus outside MySQL in the following.
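To make the 3-byte versus 4-byte distinction concrete, here is a small Python sketch. Python is used only for illustration, and the helper name utf8_len is made up for this example. It counts how many bytes UTF-8 needs for a few sample characters; anything that needs 4 bytes fits only in utf8mb4, not in MySQL's utf8.

```python
def utf8_len(ch: str) -> int:
    """Return how many bytes the UTF-8 encoding of one character needs."""
    return len(ch.encode("utf-8"))


# Sample characters, written as escapes so the source file encoding does not matter.
samples = {
    "plain e": "e",
    "n with tilde": "\u00f1",
    "Chinese xin": "\u65b0",
    "alien emoji": "\U0001F47D",
}

for name, ch in samples.items():
    n = utf8_len(ch)
    # MySQL's utf8 stores at most 3 bytes per character; utf8mb4 stores up to 4.
    fits = "utf8 or utf8mb4" if n <= 3 else "utf8mb4 only"
    print(f"{name}: {n} byte(s) -> {fits}")
```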
Overview of what you should do
- Have your editor, etc. set to UTF-8.
- HTML forms should start like <form accept-charset="UTF-8">.
- Have your bytes encoded as UTF-8.
- Establish UTF-8 as the encoding being used in the client.
- Have the column/table declared CHARACTER SET utf8mb4. (Check with SHOW CREATE TABLE.)
- Put <meta charset=UTF-8> at the beginning of HTML.
How to support UTF-8 completely in a web application (Formerly called "utf8 all the way through")
More details for computer languages (and its following sections)
Test the data
Viewing the data with a tool or with SELECT cannot be trusted. Too many such clients, especially browsers, try to compensate for incorrect encodings, and show you correct text even if the database is mangled. So, pick a table and column that has some non-English text and do
SELECT col, HEX(col) FROM tbl WHERE ...
The HEX for correctly stored UTF-8 will be
- For a blank space (in any language): 20
- For English: 4x, 5x, 6x, or 7x
- For most of Western Europe, accented letters should be Cxyy
- Cyrillic, Hebrew, and Farsi/Arabic: Dxyy
- Most of Asia: Exyyzz
- Emoji and some of Chinese: F0yyzzww
- More details
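The hex patterns listed above can be reproduced outside the database. The following Python sketch (the function names utf8_hex and describe_lead_byte are hypothetical helpers, not part of any library) prints the hex a correctly stored UTF-8 string should show, and maps a leading byte to the categories in the list. It only approximates what HEX(col) returns, assuming the column really holds UTF-8 bytes.

```python
def utf8_hex(text: str) -> str:
    """Hex of the UTF-8 bytes, uppercase, comparable to MySQL's HEX(col) output."""
    return text.encode("utf-8").hex().upper()


def describe_lead_byte(byte: int) -> str:
    """Classify the first byte of a UTF-8 sequence."""
    if byte == 0x20:
        return "space"
    if byte < 0x80:
        return "ASCII (English letters fall in 4x-7x)"
    if 0xC0 <= byte <= 0xDF:
        return "2-byte character (Cxyy or Dxyy)"
    if 0xE0 <= byte <= 0xEF:
        return "3-byte character (Exyyzz)"
    if 0xF0 <= byte <= 0xF7:
        return "4-byte character (F0yyzzww), needs utf8mb4"
    return "continuation byte or invalid lead byte"


for sample in [" ", "Se\u00f1or", "\u65b0", "\U0001F47D"]:
    raw = sample.encode("utf-8")
    print(utf8_hex(sample), "->", describe_lead_byte(raw[0]))
```

If the hex of a stored value does not follow these shapes (for example, a lone F1 where C3B1 is expected), the bytes in the table are not UTF-8.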
Specific causes and fixes of the problems seen
Truncated text (Se for Señor):
- The bytes to be stored are not encoded as utf8mb4. Fix this.
- Also, check that the connection during reading is UTF-8.
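A rough way to see why the text stops at Se: the latin1 byte for ñ (F1) is not valid UTF-8 in that position, so everything from that byte onward can be lost. The Python sketch below models this by keeping only the valid UTF-8 prefix. This is an assumption about the mechanism chosen to match the symptom, not MySQL's actual code, and valid_utf8_prefix is a hypothetical helper.

```python
def valid_utf8_prefix(raw: bytes) -> str:
    """Decode as UTF-8, keeping only the text before the first invalid byte."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that is not valid UTF-8.
        return raw[:err.start].decode("utf-8")


# The client sends latin1 bytes while claiming they are UTF-8.
latin1_bytes = "Se\u00f1or".encode("latin-1")   # b'Se\xf1or'
print(valid_utf8_prefix(latin1_bytes))          # Se
```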
Black Diamonds with question marks (Se�or for Señor); one of these cases exists:
Case 1 (original bytes were not UTF-8):
- The bytes to be stored are not encoded as utf8. Fix this.
- The connection (or SET NAMES) for the INSERT and the SELECT was not utf8/utf8mb4. Fix this.
- Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Case 2 (original bytes were UTF-8):
- The connection (or SET NAMES) for the SELECT was not utf8/utf8mb4. Fix this.
- Also, check that the column in the database is CHARACTER SET utf8 (or utf8mb4).
Black diamonds occur only when the browser is set to <meta charset=UTF-8>.
Question Marks (regular ones, not black diamonds) (Se?or for Señor):
- The bytes to be stored are not encoded as utf8/utf8mb4. Fix this.
- The column in the database is CHARACTER SET utf8 (or utf8mb4). Fix this.
- Also, check that the connection during reading is UTF-8.
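Both symptoms can be imitated in a few lines of Python. This is only a sketch: Python's errors="replace" handler stands in for whatever substitution the browser or the charset conversion performs, and the function names are invented for this example. A black diamond appears when non-UTF-8 bytes are displayed as UTF-8; a plain question mark appears when a character is converted into a character set that cannot represent it.

```python
def show_black_diamond(text: str) -> str:
    """Encode as latin1, then wrongly decode as UTF-8, substituting U+FFFD."""
    return text.encode("latin-1").decode("utf-8", errors="replace")


def show_question_marks(text: str, target: str = "latin-1") -> str:
    """Convert to a charset that lacks some characters, substituting '?'."""
    return text.encode(target, errors="replace").decode(target)


senor = "Se\u00f1or"
chinese = "\u65b0\u6d6a\u65b0\u95fb"   # the four Chinese characters from the question

print(show_black_diamond(senor))            # U+FFFD replaces the bad byte
print(show_question_marks(senor, "ascii"))  # Se?or
print(show_question_marks(chinese))         # ????
```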
Mojibake (Señor for Señor): (This discussion also applies to Double Encoding, which is not necessarily visible.)
- The bytes to be stored need to be UTF-8-encoded. Fix this.
- The connection when INSERTing and SELECTing text needs to specify utf8 or utf8mb4. Fix this.
- The column needs to be declared CHARACTER SET utf8 (or utf8mb4). Fix this.
- HTML should start with <meta charset=UTF-8>.
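Mojibake is what UTF-8 bytes look like when they are interpreted one byte per character. A minimal Python sketch, assuming the wrong interpretation is cp1252 (the Windows flavor of latin1); make_mojibake and repair_mojibake are hypothetical helper names:

```python
def make_mojibake(text: str, wrong: str = "cp1252") -> str:
    """Encode correctly as UTF-8, then misread the bytes as a 1-byte charset."""
    return text.encode("utf-8").decode(wrong)


def repair_mojibake(garbled: str, wrong: str = "cp1252") -> str:
    """Reverse the misreading: recover the original bytes and decode as UTF-8."""
    return garbled.encode(wrong).decode("utf-8")


senor = "Se\u00f1or"
garbled = make_mojibake(senor)
print(garbled)                    # the n-tilde becomes two characters
print(repair_mojibake(garbled))   # back to the original text
```

The round trip only works while the garbled text still carries every original byte. cp1252 leaves a few byte values undefined, so some inputs raise an error instead of producing gibberish; treat this as a diagnostic aid, not a general repair tool.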
If the data looks correct, but won't sort correctly, then either you have picked the wrong collation, or there is no collation that suits your need, or you have Double Encoding.
Double Encoding can be confirmed by doing the SELECT .. HEX .. described above.
é should come back C3A9, but instead shows C383C2A9
The Emoji
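The C3A9 versus C383C2A9 pair can be derived by hand: each UTF-8 byte is misread as a separate cp1252 character and then encoded as UTF-8 a second time. The Python sketch below reproduces that, under the assumption that cp1252 is the intermediate character set; the helper names are made up, and the alien emoji (U+1F47D) is my own sample input, not taken from the text above.

```python
def utf8_hex(text: str) -> str:
    """Hex of the correctly encoded UTF-8 bytes."""
    return text.encode("utf-8").hex().upper()


def double_encoded_hex(text: str, wrong: str = "cp1252") -> str:
    """Hex after the UTF-8 bytes were misread as `wrong` and encoded again."""
    once = text.encode("utf-8")
    twice = once.decode(wrong).encode("utf-8")
    return twice.hex().upper()


for ch in ["\u00e9", "\U0001F47D"]:   # e with acute accent, alien emoji
    print(utf8_hex(ch), "->", double_encoded_hex(ch))
```

Comparing HEX(col) against both forms tells you whether a column holds clean or double-encoded data, even when the text looks fine on screen.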