Safe to use strpos with UTF-8 strings?
Question
I have a bunch of strings with different charsets. The $charset variable contains the charset of the current string.
$content = iconv($charset, 'UTF-8', $content);
With this done, is it safe to use strpos, strlen, substr, etcetera and not their multibyte equivalents? I'm asking because I use preg_match a lot as well. If I use PREG_OFFSET_CAPTURE to get the position of a word in the string, I can't use that value with mb_substr to remove everything before the word.
Recommended answer
That entirely depends on what you want to do. The core strlen and similar functions work on bytes. Every number they accept and return is a byte count or byte offset. The mb_* functions are encoding-aware and work on characters. All numbers they accept and return are character counts or character offsets.
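To make the byte/character distinction concrete, here is a minimal sketch using the same sample string as the example further down. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are illustrative only.

```php
<?php
// Assumption: this file is UTF-8 encoded and ext-mbstring is available.
$str = '漢字';

// Core functions count bytes. Each of these two characters is 3 bytes in UTF-8.
$byteLength = strlen($str);          // 6
$byteOffset = strpos($str, '字');    // 3

// mb_* functions count characters once they know the encoding.
$charLength = mb_strlen($str, 'UTF-8');            // 2
$charOffset = mb_strpos($str, '字', 0, 'UTF-8');   // 1

printf("bytes: length=%d offset=%d\n", $byteLength, $byteOffset);
printf("chars: length=%d offset=%d\n", $charLength, $charOffset);
```

The two families give different numbers for the same string, so an offset produced by one family should only be passed to functions of the same family.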
If you have a safe way of getting a byte offset in a string ("safe" meaning the offset is not in the middle of a multi-byte character) and then, for example, crop everything before that offset using substr, that'll work just fine. For instance:
$str = '漢字';
$offset = strpos($str, '字');
$cropped = substr($str, $offset);
This works fine.
However, this won't work:
$cropped = substr($str, $offset, 1);
You can't safely cut out a single byte without running the risk of cutting into a multi-byte character.
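Applied to the preg_match case from the question, the same reasoning can be sketched as follows. PREG_OFFSET_CAPTURE reports byte offsets, so the byte-based substr is the matching tool, while passing that offset to mb_substr mixes the two unit systems. The sample text and variable names are made up for illustration, and the sketch assumes UTF-8 source encoding and the mbstring extension.

```php
<?php
// Assumption: UTF-8 source file, ext-mbstring loaded; sample text is hypothetical.
$content = '日本語のテキスト with a keyword inside';

$fromWord = null;
$wrong = null;
if (preg_match('/keyword/u', $content, $matches, PREG_OFFSET_CAPTURE)) {
    // $matches[0][1] is a byte offset, even with the /u modifier.
    $byteOffset = $matches[0][1];

    // Byte offset + byte-based substr: drops everything before the word.
    $fromWord = substr($content, $byteOffset);

    // Byte offset + character-based mb_substr: cuts at the wrong place.
    $wrong = mb_substr($content, $byteOffset, null, 'UTF-8');
}

// Cutting out a single byte splits the 3-byte character '字'.
$str = '漢字';
$offset = strpos($str, '字');
$broken = substr($str, $offset, 1);
$brokenIsValid = mb_check_encoding($broken, 'UTF-8'); // false: not valid UTF-8

// One way to take a whole character: crop by bytes, then count by characters.
$oneChar = mb_substr(substr($str, $offset), 0, 1, 'UTF-8');

var_dump($fromWord, $wrong, $brokenIsValid, $oneChar);
```

Cropping with a byte offset that came from strpos or preg_match is safe when both the subject and the needle or pattern are valid UTF-8, because a match can then only start on a character boundary. Lengths are the risky part, which is why the last step switches to mb_substr once the string has been cropped.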