PostgreSQL 9.1 using collate in select statements

Question

I have a table in a PostgreSQL 9.1 database ("en_US.UTF-8"):

CREATE TABLE branch_language
(
    id serial NOT NULL,
    name_language character varying(128) NOT NULL,
    branch_id integer NOT NULL,
    language_id integer NOT NULL,
    ....
)

The attribute name_language contains names in various languages. The language is specified by the foreign key language_id.

I created some indexes:

/* us english */
CREATE INDEX idx_branch_language_2
    ON branch_language
    USING btree
    (name_language COLLATE pg_catalog."en_US" );

/* catalan */
CREATE INDEX idx_branch_language_5
    ON branch_language
    USING btree
    (name_language COLLATE pg_catalog."ca_ES" );

/* portuguese */
CREATE INDEX idx_branch_language_6
    ON branch_language
    USING btree
    (name_language COLLATE pg_catalog."pt_PT" );

Now when I do a select I am not getting the results I am expecting.

select name_language from branch_language
where language_id=42 -- id of catalan language
order by name_language collate "ca_ES"; -- use ca_ES collation

This generates a list of names but not in the order I expected:

Aficions i Joguines
Agència de viatges
Aliments i Subministraments
Aparells elèctrics i il·luminació
Art i Antiguitats
Articles de la llar
Bars i Restaurants
...
Tabac
Àudio, Vídeo, CD i DVD
Òptica

I expected the last two entries to appear in different positions in the list.

Creating the indexes works. I don't think they are really necessary unless you want to optimize for performance.

The select statement, however, seems to ignore the collate "ca_ES" part.

This problem also exists when I select other collations. I have tried "es_ES" and "pt_PT" but the results are similar.

Recommended answer

I can't find a flaw in your design. I have tried.

I revisited this question. Consider this test case on sqlfiddle. It seems to work just fine. I even created the locale ca_ES.utf8 in my local test server (PostgreSQL 9.1.6 on Debian Squeeze) and added the locale to my DB cluster:

CREATE COLLATION "ca_ES" (LOCALE = 'ca_ES.utf8');

I get the same results as can be seen in the sqlfiddle above.

Note that collation names are identifiers and need to be double-quoted to preserve mixed-case spelling like "ca_ES". Maybe there has been some confusion with other locales in your system? Check your available collations:

SELECT * FROM pg_collation;

Generally, collation rules are derived from system locales. Read about the details in the manual here. If you still get incorrect results, I would try to update your system and regenerate the locale for "ca_ES". In Debian (and related Linux distributions) this can be done with:

dpkg-reconfigure locales

NFC

I have one other idea: unnormalized Unicode strings.

Could it be that your 'Àudio' is in fact '̀ ' || 'Audio'? That would be this character:

SELECT U&'\0300A';
SELECT ascii(U&'\0300A');
SELECT chr(768);

Read more about the combining grave accent on Wikipedia.
You have to SET standard_conforming_strings = TRUE to use Unicode strings like the one in the first line.

Note that some browsers cannot display unnormalized Unicode characters correctly and many fonts have no proper glyph for the special characters, so you may see nothing here or gibberish. But UNICODE allows for that nonsense. Test to see what you got:

SELECT octet_length('̀A');  -- returns 3 (!)
SELECT octet_length('À');  -- returns 2
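
The same check can be reproduced outside the database. The following is only a minimal sketch in Python (not part of the original answer); the helper names utf8_octets and code_points are made up for illustration, and it assumes the database encoding is UTF8 so that byte counts are comparable to octet_length():

```python
# Precomposed "À" is a single code point, U+00C0.
precomposed = "\u00c0"
# The string from the query above: combining grave accent U+0300 followed by "A".
combining_first = "\u0300A"


def utf8_octets(text: str) -> int:
    """Byte length of the UTF-8 encoding (comparable to octet_length() in a UTF8 database)."""
    return len(text.encode("utf-8"))


def code_points(text: str) -> list[str]:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(utf8_octets(combining_first), code_points(combining_first))  # 3 ['U+0300', 'U+0041']
print(utf8_octets(precomposed), code_points(precomposed))          # 2 ['U+00C0']
```

Listing the code points makes the difference visible even when the browser or font renders both strings identically.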

If that's what your database has contracted, you need to get rid of it or suffer the consequences. The cure is to normalize your strings to NFC. Perl has superior UNICODE-foo skills, and you can use its libraries in a plperlu function to do this inside PostgreSQL. I have done that to save me from madness.
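
To illustrate what NFC normalization does to such a string, here is a minimal sketch in Python using the standard unicodedata module. It is not the plperlu function mentioned above, the helper name to_nfc is hypothetical, and it assumes the accent is stored in the canonical decomposed order (base letter first, then the combining mark):

```python
import unicodedata


def to_nfc(text: str) -> str:
    """Return the NFC (canonical composition) form of a string."""
    return unicodedata.normalize("NFC", text)


# Assumption for this sketch: "A" followed by combining grave accent U+0300.
decomposed = "A\u0300udio"
# The precomposed spelling, with "À" as the single code point U+00C0.
precomposed = "\u00c0udio"

normalized = to_nfc(decomposed)
print(len(decomposed), len(normalized))  # 6 5
print(normalized == precomposed)         # True
```

NFC only merges a combining mark with the base letter that precedes it. If the mark really comes before the letter, as in U&'\0300A', NFC leaves the string unchanged and the stray mark has to be cleaned up separately.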

Read this article by David Wheeler about Unicode normalization in PostgreSQL.

Read about Unicode normalization forms on unicode.org.
