MySQL匹配具有ASCII版本的Unicode字符 [英] MySQL matching unicode characters with ascii version
问题描述
我正在运行MySQL 5.1.50,并具有一个如下所示的表:
I'm running MySQL 5.1.50 and have a table that looks like this:
organizations | CREATE TABLE `organizations` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` text CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL,
`url` text CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`phone` varchar(20) CHARACTER SET utf8 COLLATE utf8_unicode_ci DEFAULT NULL,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `id` (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=25837 DEFAULT CHARSET=utf8 |
我遇到的问题是MySQL正在将Unicode字符与ascii版本匹配.例如,当我搜索包含'é'的单词时,它将匹配具有'e'的单词,反之亦然:
The problem I'm having is that MySQL is matching unicode characters with ascii versions. For example when I search for a word with that contains an 'é', it will match the same word that has an 'e' instead, and vice versa:
mysql> SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)
mysql> SELECT id, name FROM `organizations` WHERE `name` = 'Universite de Montreal';
+-------+-------------------------+
| id | name |
+-------+-------------------------+
| 16973 | Université de Montreal |
+-------+-------------------------+
1 row in set (0.01 sec)
我从PHP和命令行控制台都得到了这些结果.如何从SELECT查询中获得准确的匹配?
I get these results both from PHP and the command line console. How can I get accurate matches from my SELECT queries?
谢谢!
推荐答案
您将name
列指定为text CHARACTER SET utf8 COLLATE utf8_unicode_ci
,这告诉MySQL考虑使用 e 和é等同于匹配和排序.排序规则和utf8_general_ci
都使很多事情等效.
You specified the name
column as text CHARACTER SET utf8 COLLATE utf8_unicode_ci
which tells MySQL to consider e and é as equivalent in matching and sorting. That collation and utf8_general_ci
both make a lot of things equivalent.
http://www.collation-charts.org/是很好的资源如何阅读图表,这非常简单.
http://www.collation-charts.org/ is a great resource once you learn how to read the charts, which is pretty easy.
如果要将 e 和é等视为不同,则必须选择其他排序规则.要找出服务器上的归类(假设您仅限于UTF-8编码):
If you want e and é etc. to be considered different then you must choose a different collation. To find out what collations are on your server (assuming you're limited to UTF-8 encoding):
mysql> show collation like 'utf8%';
然后选择使用归类图作为参考.
And choose using the collation charts as a reference.
另一个特殊的排序规则是utf8_bin
,其中没有对等项,它是二进制匹配项.
One more special collation is utf8_bin
in which there are no equivalencies, it's a binary match.
我知道,唯一不特定于语言的MySQL Unicode归类是utf8_unicode_ci
,utf8_general_ci
和utf8_bin
.他们很奇怪.排序规则的真正目的是使计算机像人们期望的那样进行匹配和排序.匈牙利语和土耳其语字典的条目根据不同的规则排序.指定排序规则后,您可以根据此类本地规则进行排序和匹配.
The only MySQL Unicode collations I'm aware of that are not language specific are utf8_unicode_ci
, utf8_general_ci
and utf8_bin
. They are rather weird. The real purpose of a collation is to make the computer match and sort as a person from somewhere would expect. Hungarian and Turkish dictionaries have their entries ordered according to different rules. Specifying a collation allows you to sort and match according to such local rules.
例如,丹麦人似乎认为e和é等效,但冰岛人却不这样做:
For example, it seems Danes consider e and é equivalent but Icelanders don't:
mysql> select _utf8'e' collate utf8_danish_ci
-> = _utf8'é' collate utf8_danish_ci as equal;
+-------+
| equal |
+-------+
| 1 |
+-------+
mysql> select _utf8'e' collate utf8_icelandic_ci
-> = _utf8'é' collate utf8_icelandic_ci as equal;
+-------+
| equal |
+-------+
| 0 |
+-------+
另一个方便的技巧是用一串您感兴趣的字符(在脚本中更容易)填充一列表,然后MySQL可以告诉您等效项:
Another handy trick is to fill a one column table with a bunch of characters you're interested in (it's easier from a script) and then MySQL can tell you the equivalencies:
mysql> create table t (c char(1) character set utf8);
mysql> insert into t values ('a'), ('ä'), ('á');
mysql> select group_concat(c) from t group by c collate utf8_icelandic_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a |
| á |
| ä |
+-----------------+
mysql> select group_concat(c) from t group by c collate utf8_danish_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a,á |
| ä |
+-----------------+
mysql> select group_concat(c) from t group by c collate utf8_general_ci;
+-----------------+
| group_concat(c) |
+-----------------+
| a,ä,á |
+-----------------+
这篇关于MySQL匹配具有ASCII版本的Unicode字符的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!