How to compare Unicode strings ignoring accents?


Question


Hello, I am trying to find a way to compare Unicode strings while ignoring accents and case, so that, for example, the strings "áíé" and "AIE" are considered equal.

I have tried boost::locale and also Unicode normalisation, but cannot get either to work correctly.
I think that ICU would work, but my boss does not want to link against it because of its size.

locale: Slovak, charset: windows-1250
I am using Windows Vista, but compiling with _WIN32_WINNT = 0x0501
boost: 1.48.0
IDE: VS2005

This is what I do:

static boost::locale::generator gen;
std::locale::global(gen.generate(std::locale(""), ""));

// ...

std::wstring wstr_a = L"Dušan";
std::wstring wstr_b = L"Dusan";

std::wstring wstr_c = L"áíéúó";
std::wstring wstr_d = L"aieuo";

// rslt = 1 => INCORRECT
int rslt = std::use_facet<boost::locale::collator<wchar_t>>(std::locale()).compare (
  boost::locale::collator_base::primary, wstr_a, wstr_b);

// rslt1 = 0 => CORRECT
int rslt1 = std::use_facet<boost::locale::collator<wchar_t>>(std::locale()).compare (
  boost::locale::collator_base::primary, wstr_c, wstr_d);

std::wstring normalized_a = boost::locale::normalize(wstr_a, boost::locale::norm_nfd);
std::wstring normalized_b = boost::locale::normalize(wstr_b, boost::locale::norm_nfd);
std::wstring normalized_c = boost::locale::normalize(wstr_c, boost::locale::norm_nfd);
std::wstring normalized_d = boost::locale::normalize(wstr_d, boost::locale::norm_nfd);

// normalized_a = { 'D', 'u', 's', 0x030c, 'a', 'n' }
// normalized_b = { 'D', 'u', 's', 'a', 'n' }
// normalized_c = { 'a', 0x0301, 'i', 0x0301, 'e', 0x0301, 'u', 0x0301, 'o', 0x0301 }
// normalized_d = { 'a', 'i', 'e', 'u', 'o' }

// rslt2 = 1 => INCORRECT
int rslt2 = std::use_facet<boost::locale::collator<wchar_t>>(std::locale()).compare (
  boost::locale::collator_base::primary,
  normalized_a, normalized_b);

// rslt3 = 0 => CORRECT
int rslt3 = std::use_facet<boost::locale::collator<wchar_t>>(std::locale()).compare (
  boost::locale::collator_base::primary,
  normalized_c, normalized_d);






What am I doing wrong?
Should this work, or is it just a bug in Boost?

Edit: While debugging, I found that Boost uses the CompareStringW function in its collator...
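The NFD forms listed in the code comments above (for example 's' followed by 0x030c) suggest one lightweight alternative to a collator. The following is only a sketch, not a Boost or Windows API: it assumes the input has already been normalized to NFD, treats just the range U+0300 to U+036F as accents, and relies on std::towlower, which outside the default "C" locale depends on the global locale. The names accent_insensitive_key and equal_ignoring_accents are hypothetical.

```cpp
#include <cwctype>
#include <string>

// Hypothetical helper: builds a comparison key from an NFD-normalized string
// by dropping combining diacritical marks (U+0300..U+036F) and lowercasing.
// Assumption: the caller has already applied NFD normalization.
std::wstring accent_insensitive_key(const std::wstring& nfd)
{
    std::wstring key;
    key.reserve(nfd.size());
    for (std::wstring::size_type i = 0; i < nfd.size(); ++i) {
        const wchar_t ch = nfd[i];
        if (ch >= 0x0300 && ch <= 0x036F)
            continue;  // skip the combining mark, keep the base letter
        key += static_cast<wchar_t>(std::towlower(ch));
    }
    return key;
}

// Two NFD strings are treated as equal when their keys match.
bool equal_ignoring_accents(const std::wstring& nfd_a, const std::wstring& nfd_b)
{
    return accent_insensitive_key(nfd_a) == accent_insensitive_key(nfd_b);
}
```

Letters that do not decompose into a base letter plus a combining mark (such as "ß") pass through unchanged, so this approach would still need separate handling for them.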

Recommended answer


You can try using WideCharToMultiByte() to convert the strings to ASCII using precomposed characters, and then compare the converted ASCII strings:

char lpszAscii[128];
// 20127 is the US-ASCII code page
::WideCharToMultiByte(20127, WC_COMPOSITECHECK, L"Dušan áíéúó", -1, lpszAscii, 128, NULL, NULL);
// case-insensitive comparison; 0 means the strings are equal
int nCompare = _stricmp(lpszAscii, "Dusan aieuo");





But note that this fails for some characters and for all symbols, which are replaced by "?". Examples are the German "ß" and currency symbols such as "€". If you are limited to code page 1250, you can check every character in that code page and provide special handling for those that fail (e.g. replace the Euro symbol with "EUR").
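The special handling mentioned above could be done as a substitution pass that runs before the conversion. The sketch below is only an illustration: the function name replace_unmappable is hypothetical, the table holds just two entries, and the "ss" spelling for "ß" is an assumption (only the "EUR" replacement comes from the text above). A real table would have to be built by checking every character of code page 1250.

```cpp
#include <string>

// Hypothetical helper: replaces characters that have no accent-stripped ASCII
// equivalent with a hand-picked ASCII spelling before the conversion step.
// The table is illustrative only, not a complete list for code page 1250.
std::wstring replace_unmappable(const std::wstring& text)
{
    std::wstring out;
    out.reserve(text.size());
    for (std::wstring::size_type i = 0; i < text.size(); ++i) {
        switch (text[i]) {
        case 0x00DF: out += L"ss";   break;  // German sharp s (assumed spelling)
        case 0x20AC: out += L"EUR";  break;  // Euro sign
        default:     out += text[i]; break;  // everything else is kept as-is
        }
    }
    return out;
}
```

The result can then be passed to the conversion shown earlier, so that only characters with a usable ASCII fallback reach it.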

