strlen() and UTF-8 encoding

Views: 464 Published: 2020/7/2 22:40:29 Tags: php unicode utf-8 strlen

Question

Assuming UTF-8 encoding, and strlen() in PHP, is it possible that this string has a length of 4?

I'm only interested in strlen(), not in other functions.

This is the string:

$1ï¿½2

I have tested it on my own computer, and I have verified the UTF-8 encoding, and the answer I get is 6.

I don't see anything in the manual for strlen(), or in anything I've read on UTF-8, that would explain why some of the characters above would count for less than one.

PS: This question and its answer (4) come from a mock test for the ZCE that I bought on eBay.

Recommended answer

The string you posted is six characters long: $1ï¿½2 (dollar sign, digit one, lowercase i with diaeresis, inverted question mark, one-half fraction, digit two).

If strlen() were called with a UTF-8 representation of that string, you would get a result of nine (probably, though there are multiple representations with different lengths).
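To make the byte arithmetic concrete, here is a minimal sketch. It assumes the string is written out as explicit UTF-8 byte escapes, and `countCodePoints()` is a hypothetical helper introduced only for this illustration (it is not a PHP built-in). The second string shows one of the "multiple representations": the same text with ï stored as `i` plus a combining diaeresis.

```php
<?php
// Hypothetical helper: count Unicode code points without the mbstring extension.
// The /u modifier makes PCRE treat the subject as UTF-8; /s lets "." match newlines.
function countCodePoints(string $s): int
{
    return (int) preg_match_all('/./su', $s);
}

// "$1ï¿½2" as explicit UTF-8 bytes, so the result does not depend on this file's encoding.
// ï = C3 AF, ¿ = C2 BF, ½ = C2 BD (precomposed form).
$composed = "\$1\xC3\xAF\xC2\xBF\xC2\xBD2";

// Same text, but with ï written as "i" + U+0308 COMBINING DIAERESIS (CC 88).
$decomposed = "\$1i\xCC\x88\xC2\xBF\xC2\xBD2";

echo strlen($composed), "\n";          // 9: strlen() counts bytes
echo countCodePoints($composed), "\n"; // 6 code points
echo strlen($decomposed), "\n";        // 10: a different representation, a different length
```

The point of the sketch is only that strlen() reports bytes, so the result depends on how the characters happen to be encoded.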

However, if we were to store that string as ISO 8859-1 or CP1252, we would have a six-byte sequence that is also legal UTF-8. Reinterpreting those six bytes as UTF-8 would then result in four characters: $1�2 (dollar sign, digit one, Unicode replacement character, digit two). That is, the UTF-8 encoding of the single character '�' is identical to the ISO-8859-1 encoding of the three characters "ï¿½".
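The byte-level coincidence can be checked directly. This is a rough sketch that writes both encodings as explicit byte escapes; the variable names are made up for the example, and code points are counted with a `/u` regular expression rather than an extension function.

```php
<?php
// The six characters $ 1 ï ¿ ½ 2 encoded as ISO-8859-1: one byte per character
// (ï = EF, ¿ = BF, ½ = BD).
$latin1 = "\$1\xEF\xBF\xBD2";

// The four characters $ 1 U+FFFD 2 encoded as UTF-8.
// U+FFFD REPLACEMENT CHARACTER is the three bytes EF BF BD.
$replacement = "\xEF\xBF\xBD";
$utf8 = "\$1" . $replacement . "2";

var_dump($latin1 === $utf8);               // bool(true): identical byte sequences
var_dump(strlen($utf8));                   // int(6): strlen() counts bytes
var_dump(preg_match_all('/./su', $utf8));  // int(4): code points when read as UTF-8
```

Note that strlen() still reports 6 for these bytes; the count of 4 only appears when the bytes are decoded as UTF-8 characters.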

The replacement character often gets inserted when a UTF-8 decoder reads data that is not valid UTF-8.

It appears that the original string was processed through multiple layers of misinterpretation: first by the use of a UTF-8 decoder on non-UTF-8 data (producing $1�2), and then by whatever you used to analyze that data (producing $1ï¿½2).
