Case-insensitive UTF-8 string collation for SQLite (C/C++)


Question

I am looking for a way to compare and sort UTF-8 strings case-insensitively in C++, for use in a custom collation function in SQLite.


  1. The method should ideally be locale-independent. However, I won't be holding my breath: as far as I know, collation is very language-dependent, so anything that works for languages other than English will do, even if it means switching locales.
  2. Options include the standard C or C++ library, or a small (suitable for an embedded system) and non-GPL (suitable for a proprietary system) third-party library.

What I have so far:

  1. strcoll with C locales and std::collate/std::collate_byname are case-sensitive. (Are there case-insensitive versions of these?)
  2. I tried to use the POSIX strcasecmp, but its behaviour seems to be undefined for locales other than "POSIX".


And, indeed, the result of strcasecmp does not change between locales on Linux with glibc:

#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <strings.h>  // POSIX header that declares strcasecmp

static const char *s1 = "Äaa";
static const char *s2 = "äaa";

// Print both comparisons under the currently active locale.
static void compare() {
    printf("strcasecmp('%s', '%s') == %d\n", s1, s2, strcasecmp(s1, s2));
    printf("strcoll('%s', '%s') == %d\n", s1, s2, strcoll(s1, s2));
}

// Switch locale outside assert(), so the call is not compiled out under NDEBUG.
static void use_locale(const char *name) {
    if (!setlocale(LC_ALL, name)) {
        fprintf(stderr, "locale %s is not available\n", name);
        exit(EXIT_FAILURE);
    }
}

int main() {
    compare();  // default "C" locale
    use_locale("en_AU.UTF-8");
    compare();
    use_locale("fi_FI.UTF-8");
    compare();
}

This prints:

strcasecmp('Äaa', 'äaa') == -32
strcoll('Äaa', 'äaa') == -32
strcasecmp('Äaa', 'äaa') == -32
strcoll('Äaa', 'äaa') == 7
strcasecmp('Äaa', 'äaa') == -32
strcoll('Äaa', 'äaa') == 7


P.S. Yes, I am aware of ICU, but we can't use it on the embedded platform because of its enormous size.

Recommended answer

What you really want is logically impossible. There is no locale-independent, case-insensitive way of sorting strings. The simple counter-example: is "i" different from "I" once case is ignored? The naive answer is no, but in Turkish these strings are unequal. There, "i" uppercases to "İ" (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE), and "I" lowercases to the dotless "ı" (U+0131).
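To make the counter-example concrete, here is a minimal sketch of a case fold that takes the language rules as a parameter. The names `Rules`, `fold`, `equal_ignore_case` and `check_examples` are hypothetical and only for illustration. The I/İ/ı mappings follow the Turkic special cases in Unicode's case-folding data; as a simplifying assumption, everything else outside ASCII is passed through unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical switch between default and Turkic case rules.
enum class Rules { Default, Turkic };

// Fold one code point to lowercase. Only ASCII letters and the dotted and
// dotless I are handled; any other code point is returned unchanged.
char32_t fold(char32_t c, Rules rules) {
    if (rules == Rules::Turkic) {
        if (c == U'I') return U'\u0131';   // I folds to dotless ı
        if (c == U'\u0130') return U'i';   // İ folds to i
    }
    if (c >= U'A' && c <= U'Z') return c + 0x20;
    return c;
}

// Compare two code point strings for equality after folding.
bool equal_ignore_case(const std::u32string &a, const std::u32string &b, Rules rules) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (fold(a[i], rules) != fold(b[i], rules)) return false;
    return true;
}

// Usage: the same pair of strings is equal under one rule set only.
void check_examples() {
    assert(equal_ignore_case(U"i", U"I", Rules::Default));
    assert(!equal_ignore_case(U"i", U"I", Rules::Turkic));
    assert(equal_ignore_case(U"i", U"\u0130", Rules::Turkic));
}
```

Whichever rule set a collation hard-codes, it will be wrong for some language, which is the point the answer is making.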

UTF-8 strings add extra complexity to the question. They are perfectly valid multi-byte char* strings, provided you have an appropriate locale. But neither the C nor the C++ standard defines such a locale; check with your vendor (there are too many embedded vendors for a general answer here, sorry). So you have to pick a locale whose multi-byte encoding is UTF-8 for the mbscmp function to work. This of course influences the sort order, which is locale-dependent. And if you have no locale in which const char* is UTF-8, you can't use this trick at all. (As I understand it, Microsoft's CRT suffers from this: its multi-byte code only handles characters of up to 2 bytes, whereas UTF-8 needs up to 4, and 3 already within the Basic Multilingual Plane.)
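For reference, the length of a UTF-8 sequence can be read off its first byte, which is why a multi-byte implementation capped at 2 bytes per character cannot handle it. A minimal sketch follows; the helper names `utf8_sequence_length` and `check_examples` are hypothetical and not part of any library.

```cpp
#include <cassert>

// Number of bytes in the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` cannot start a sequence.
constexpr int utf8_sequence_length(unsigned char lead) {
    return lead < 0x80 ? 1    // 0xxxxxxx: ASCII
         : lead < 0xC2 ? 0    // continuation byte, or overlong lead 0xC0/0xC1
         : lead < 0xE0 ? 2    // 110xxxxx: U+0080..U+07FF
         : lead < 0xF0 ? 3    // 1110xxxx: U+0800..U+FFFF
         : lead < 0xF5 ? 4    // 11110xxx: U+10000..U+10FFFF
         : 0;                 // 0xF5..0xFF never appear in UTF-8
}

// Usage: lead bytes of "a", "ä" (C3 A4), "€" (E2 82 AC) and U+1F600 (F0 9F 98 80).
void check_examples() {
    assert(utf8_sequence_length('a') == 1);
    assert(utf8_sequence_length(0xC3) == 2);
    assert(utf8_sequence_length(0xE2) == 3);
    assert(utf8_sequence_length(0xF0) == 4);
}
```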

wchar_t is not the standard solution either. It is supposedly wide enough that you don't have to deal with multi-byte encodings, but your collation will still depend on the locale (LC_COLLATE). However, using wchar_t means you can now choose locales that do not use UTF-8 for const char*.

With this done, you can basically write your own ordering by converting strings to lowercase and comparing them. It's not perfect. Do you expect L"ß" == L"ss"? They're not even the same length. Yet for a German reader you have to consider them equal. Can you live with that?
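As a rough illustration of that last step, the sketch below decodes UTF-8 directly instead of going through wchar_t, folds each code point with a deliberately tiny table (ASCII and Latin-1 only, an assumption made for brevity), and compares the results by code point value. That ordering is not a linguistic collation. The function names are hypothetical; the comparator's parameter list is chosen to match the callback type that SQLite's sqlite3_create_collation expects, so it could presumably be registered with something like sqlite3_create_collation(db, "UTF8_NOCASE", SQLITE_UTF8, nullptr, utf8_nocase_compare), where the collation name is made up. Registration is left out so that the example does not need to link against SQLite.

```cpp
#include <cassert>
#include <cstring>

// Decode one code point from [p, end) and advance p. Malformed or truncated
// input yields U+FFFD and consumes one byte. Overlong 3- and 4-byte forms and
// surrogates are not rejected (simplification).
static char32_t next_code_point(const unsigned char *&p, const unsigned char *end) {
    unsigned char b = *p++;
    int extra;
    char32_t cp;
    if (b < 0x80) return b;
    else if (b >= 0xC2 && b < 0xE0) { extra = 1; cp = b & 0x1F; }
    else if (b >= 0xE0 && b < 0xF0) { extra = 2; cp = b & 0x0F; }
    else if (b >= 0xF0 && b < 0xF5) { extra = 3; cp = b & 0x07; }
    else return 0xFFFD;
    if (end - p < extra) return 0xFFFD;
    for (int i = 0; i < extra; ++i) {
        if ((p[i] & 0xC0) != 0x80) return 0xFFFD;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    p += extra;
    return cp;
}

// Lowercase ASCII A..Z and Latin-1 À..Þ; everything else is left unchanged.
static char32_t fold(char32_t c) {
    if (c >= U'A' && c <= U'Z') return c + 0x20;
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7) return c + 0x20;  // skip × (U+00D7)
    return c;
}

// Comparator with the same parameter list as SQLite's collation callback.
int utf8_nocase_compare(void * /*unused*/, int len1, const void *s1,
                        int len2, const void *s2) {
    const unsigned char *p1 = static_cast<const unsigned char *>(s1);
    const unsigned char *p2 = static_cast<const unsigned char *>(s2);
    const unsigned char *e1 = p1 + len1;
    const unsigned char *e2 = p2 + len2;
    while (p1 < e1 && p2 < e2) {
        char32_t a = fold(next_code_point(p1, e1));
        char32_t b = fold(next_code_point(p2, e2));
        if (a != b) return a < b ? -1 : 1;
    }
    if (p1 < e1) return 1;    // s1 has code points left, so it sorts after s2
    if (p2 < e2) return -1;
    return 0;
}

// Usage with the strings from the question, spelled out as UTF-8 bytes.
void check_examples() {
    const char *upper = "\xC3\x84" "aa";  // "Äaa"
    const char *lower = "\xC3\xA4" "aa";  // "äaa"
    int n1 = static_cast<int>(std::strlen(upper));
    int n2 = static_cast<int>(std::strlen(lower));
    assert(utf8_nocase_compare(nullptr, n1, upper, n2, lower) == 0);
    // "ß" and "ss" still compare unequal, as the answer warns.
    assert(utf8_nocase_compare(nullptr, 2, "\xC3\x9F", 2, "ss") != 0);
}
```

Extending `fold` to more scripts means carrying a larger table, which is the size-versus-coverage trade-off that rules out ICU on the embedded target in the first place.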
