Can str_replace be safely used on a UTF-8 encoded string if it's only given valid UTF-8 encoded strings as arguments?

Question

PHP's str_replace() was intended only for ANSI strings and as such can mangle UTF-8 strings. However, given that it's binary-safe, would it work properly if it was only given valid UTF-8 strings as arguments?

I'm not looking for a replacement function; I would just like to know if this hypothesis is correct.

Recommended answer

Yes. UTF-8 is deliberately designed to allow this and other similar non-Unicode-aware processing.

In UTF-8, any non-ASCII byte sequence representing a valid character always begins with a lead byte in the range \xC0-\xFF (in well-formed UTF-8 only \xC2-\xF4 actually occur), and every following byte is a continuation byte in the range \x80-\xBF. A lead byte may not appear anywhere else in the sequence, and ASCII bytes (\x00-\x7F) never appear inside one, so you can't make a valid UTF-8 sequence that matches part of a character.
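As a minimal sketch of that property (the sample characters and variable names are chosen here for illustration, and the bytes are written out explicitly so the result does not depend on the file's encoding), a needle that shares a trailing byte with another character still cannot match inside it. The `preg_match('//u', ...)` check is a common idiom for testing UTF-8 validity and assumes PCRE was built with UTF-8 support:

```php
<?php
// UTF-8 encodings written as explicit bytes (illustrative sample characters).
$e_acute = "\xC3\xA9";       // U+00E9: lead byte C3, continuation byte A9
$euro    = "\xE2\x82\xAC";   // U+20AC: lead byte E2, continuation bytes 82 AC
$copy    = "\xC2\xA9";       // U+00A9: lead byte C2, continuation byte A9

$haystack = "caf" . $e_acute . " 5" . $euro;

// Replacing a whole character works even though str_replace only sees bytes.
$result = str_replace($e_acute, "e", $haystack);

// U+00A9 ends in the same byte (A9) as U+00E9, but a valid needle begins with
// its own lead byte (C2), so it cannot match in the middle of U+00E9.
$unchanged = str_replace($copy, "(c)", $haystack);

var_dump($result === "cafe 5" . $euro);      // bool(true)
var_dump($unchanged === $haystack);          // bool(true)
var_dump(preg_match('//u', $result) === 1);  // bool(true): result is still valid UTF-8
```

The guarantee only holds when the subject, search, and replacement strings are all valid UTF-8; a needle made of a lone continuation byte such as "\xA9" would match inside U+00E9 and leave a broken sequence behind.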

This is not the case for older multibyte encodings, where different parts of a byte sequence are indistinguishable. This caused a lot of problems, for example when trying to replace an ASCII backslash in a Shift-JIS string (where byte \x5C might be the second byte of a character sequence representing something else).
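The Shift-JIS problem can be reproduced with a small sketch. The character used here, katakana "ソ" (Shift-JIS bytes 83 5C, UTF-8 bytes E3 82 BD), and the path string are examples picked for illustration and are not from the original answer:

```php
<?php
// Shift-JIS: katakana SO is 0x83 0x5C, and 0x5C is also the ASCII backslash.
$sjis_so   = "\x83\x5C";
$sjis_path = "C:\\" . $sjis_so . ".txt";   // bytes: 43 3A 5C 83 5C 2E 74 78 74

// A byte-wise replace of "\" with "/" also rewrites the trail byte of the
// two-byte character, corrupting it.
$mangled = str_replace("\\", "/", $sjis_path);
var_dump(bin2hex($mangled));               // string(18) "433a2f832f2e747874"

// UTF-8: the same character (U+30BD) is E3 82 BD, with no 5C byte inside,
// so only the real backslash is replaced.
$utf8_so   = "\xE3\x82\xBD";
$utf8_path = "C:\\" . $utf8_so . ".txt";
$ok = str_replace("\\", "/", $utf8_path);
var_dump($ok === "C:/" . $utf8_so . ".txt"); // bool(true)
```

In the Shift-JIS case the second 2f in the hex dump is where the character's trail byte used to be; in the UTF-8 case the character's bytes pass through untouched.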
