Strange UTF8 string comparison

Question

I'm having a problem with UTF8 string comparison that I can't figure out, and it is starting to give me a headache. Please help me out.
Basically, I have this string from an XML document encoded in UTF8: 'Mina Tidigare anställningar'
When I compare that string with exactly the same string, which I typed myself: 'Mina Tidigare anställningar' (also in UTF8), the result is FALSE!
I have no idea why. It is so strange. Can someone help me out?

Recommended answer

This seems somewhat relevant. To simplify, there are several ways to represent the same text in Unicode (and therefore in UTF8): for example, ř can be written as one character, ř (U+0159), or as two characters: r followed by the combining caron (U+030C).

Your best bet would be the Normalizer class: normalize both strings to the same normalization form and compare the results.
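
The idea of normalizing before comparing can be sketched as follows. The answer is about PHP, where the intl extension's Normalizer::normalize() plays this role; the snippet below is only a Python illustration using the standard unicodedata module, and the helper name equal_normalized is made up for this example.

```python
import unicodedata

# Two ways to write the same visible text "ř".
precomposed = "\u0159"   # LATIN SMALL LETTER R WITH CARON, one code point
decomposed = "r\u030c"   # "r" followed by COMBINING CARON, two code points


def equal_normalized(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after converting both to the same normalization form.

    Hypothetical helper for illustration; `form` is any form accepted by
    unicodedata.normalize ("NFC", "NFD", "NFKC", "NFKD").
    """
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(precomposed == decomposed)                  # plain code point comparison
print(equal_normalized(precomposed, decomposed))  # comparison after NFC normalization
```

Which form is used matters less than using the same form on both sides; NFC is chosen here only because it keeps the precomposed characters most text already uses.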

In one of the comments, you show these hex representations of the strings:

4d696e61205469646967617265 20   616e7374 c3a4   6c6c6e696e676172  // from XML
4d696e61205469646967617265 c2a0 616e7374 61cc88 6c6c6e696e676172 // typed
        ^^-----------------^^^^1         ^^^^^^2

Note the parts I marked; there are apparently two parts to this problem.
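
To see the two marked differences as characters rather than bytes, the hex dumps can be decoded and inspected. This is a small Python sketch, not part of the original answer; the byte values are copied from the listing above, and describe is a hypothetical helper name.

```python
import unicodedata

# Hex dumps copied from the listing above, with the alignment spaces removed.
from_xml = bytes.fromhex(
    "4d696e61205469646967617265" "20" "616e7374" "c3a4" "6c6c6e696e676172"
).decode("utf-8")
typed = bytes.fromhex(
    "4d696e61205469646967617265" "c2a0" "616e7374" "61cc88" "6c6c6e696e676172"
).decode("utf-8")


def describe(text):
    """Return (code point, Unicode name) for every non-ASCII character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text if ord(ch) > 0x7F]


print(describe(from_xml))  # non-ASCII characters in the string from the XML file
print(describe(typed))     # non-ASCII characters in the typed string
```

Listing only the non-ASCII characters keeps the output short: everything else in the two strings is identical plain ASCII.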

  • For the first, see this question on the meaning of the byte sequence "c2a0": for some reason, what you typed contains a non-breaking space (U+00A0) where the XML file has a normal space (U+0020). Note that there is a normal space after "Mina" in both cases. I'm not sure what to do about that in PHP, except to replace all whitespace with a normal space.

  • As to the second, that is the case I outlined above: c3a4 is ä (U+00E4 "LATIN SMALL LETTER A WITH DIAERESIS", one character, two bytes), whereas 61 is a (U+0061 "LATIN SMALL LETTER A", one character, one byte) and cc88 is the combining umlaut (U+0308 "COMBINING DIAERESIS"), so 61cc88 is two characters in three bytes. Here, the normalization library should be useful.
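
Putting both parts together, a comparison that tolerates both differences could look roughly like this. It is a Python sketch under the assumptions of this answer (replace every whitespace character with a normal space, then normalize), not a PHP solution; the function name canonical is invented for the example.

```python
import re
import unicodedata


def canonical(text: str) -> str:
    """Normalize to NFC, then replace every whitespace character with U+0020."""
    composed = unicodedata.normalize("NFC", text)
    # For str patterns in Python 3, \s matches Unicode whitespace,
    # which includes the no-break space U+00A0.
    return re.sub(r"\s", " ", composed)


from_xml = "Mina Tidigare anst\u00e4llningar"      # normal space, precomposed ä
typed = "Mina Tidigare\u00a0ansta\u0308llningar"   # no-break space, a + combining diaeresis

print(from_xml == typed)                        # raw comparison
print(canonical(from_xml) == canonical(typed))  # comparison after cleanup
```

Compatibility normalization (NFKC) would also map U+00A0 to a normal space, but it rewrites other characters as well (ligatures, superscripts, and so on), so the two steps are kept explicit here.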
