strlen() and UTF-8 encoding

Question

Assuming UTF-8 encoding, and strlen() in PHP, is it possible that this string has a length of 4?

I'm only interested in strlen(), not other functions.

Here's the string:

$1ï¿½2

I have tested it on my own computer, and I have verified UTF-8 encoding, and the answer I get is 6.

I don't see anything in the manual for strlen or anything I've read on UTF-8 that would explain why some of the characters above would count for less than one.

PS: This question and answer (4) comes from a mock test for ZCE I bought on Ebay.

Recommended answer

The string you posted is six characters long: $1ï¿½2 (dollar sign, digit one, lowercase i with diaeresis, inverted question mark, one-half fraction, digit two).

If strlen() were called with a UTF-8 representation of that string, you would get a result of nine, because strlen() counts bytes and each of ï, ¿, and ½ takes two bytes in UTF-8. (Probably nine, that is: there are multiple representations with different lengths.)
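A minimal sketch of that byte count, assuming the six characters are the ones listed above. The bytes are written as hex escapes so the result does not depend on how the source file itself is encoded. The variable names are illustrative only, and the decomposed form is just one example of an alternative representation.

```php
<?php
// UTF-8 bytes for "$1ï¿½2" with ï, ¿ and ½ in precomposed form.
$precomposed = "\$1\xC3\xAF\xC2\xBF\xC2\xBD2";

// Same text, but with ï written as "i" followed by U+0308 COMBINING DIAERESIS.
$decomposed = "\$1i\xCC\x88\xC2\xBF\xC2\xBD2";

// strlen() counts bytes, not characters.
echo strlen($precomposed), "\n"; // 9
echo strlen($decomposed), "\n";  // 10
```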

However, if that string were stored as ISO 8859-1 or CP1252, it would be a six-byte sequence that is also legal UTF-8. Reinterpreting those six bytes as UTF-8 yields four characters: $1�2 (dollar sign, digit one, Unicode replacement character, digit two). That is, the UTF-8 encoding of the single character '�' (U+FFFD, bytes EF BF BD) is identical to the ISO 8859-1 encoding of the three characters "ï¿½".
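The byte-level coincidence can be checked with a short sketch. It assumes PHP 7 or later for the "\u{...}" escape, and it uses preg_match_all() with the /u modifier only as a convenient way to count UTF-8 code points; strlen() itself still reports bytes. Variable names are illustrative.

```php
<?php
// The six characters "$1ï¿½2" stored as ISO 8859-1: one byte per character.
$latin1Bytes = "\$1\xEF\xBF\xBD2";

// U+FFFD REPLACEMENT CHARACTER encodes to the bytes EF BF BD in UTF-8.
$utf8Text = "\$1\u{FFFD}2";

var_dump($latin1Bytes === $utf8Text); // bool(true): identical byte sequences
echo strlen($latin1Bytes), "\n";      // 6, because strlen() counts bytes

// With the /u modifier PCRE treats the subject as UTF-8, so "." matches
// one code point at a time.
echo preg_match_all('/./su', $latin1Bytes), "\n"; // 4
```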

The replacement character often gets inserted when a UTF-8 decoder reads data that's not valid UTF-8 data.

It appears that the original string went through multiple layers of misinterpretation: first a UTF-8 decoder was applied to non-UTF-8 data (producing $1�2), and then whatever you used to analyze that data read the result in a single-byte encoding (producing $1ï¿½2).
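The two layers can be reproduced in a rough sketch. The source does not say what the original bytes were, so the lone 0xBD byte below is purely a hypothetical starting point chosen for illustration. The sketch also assumes the mbstring extension is available, and it uses htmlspecialchars() with ENT_SUBSTITUTE only as a stand-in for "a UTF-8 decoder that substitutes U+FFFD".

```php
<?php
// Hypothetical starting point (an assumption): a lone 0xBD byte, which is
// not valid UTF-8 on its own.
$raw = "\$1\xBD2";

// Layer 1: a UTF-8 decoder replaces the invalid byte with U+FFFD.
// ENT_SUBSTITUTE makes htmlspecialchars() perform that substitution.
$decoded = htmlspecialchars($raw, ENT_SUBSTITUTE, 'UTF-8');

// Layer 2: a tool misreads those UTF-8 bytes as ISO 8859-1, turning each
// byte into a character of its own, and re-encodes the result as UTF-8.
$misread = mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1');

echo strlen($raw), "\n";     // 4
echo strlen($decoded), "\n"; // 6
echo strlen($misread), "\n"; // 9
```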
