Safe to use strpos with UTF-8 strings?

Question

I have a bunch of strings with different charsets. The $charset variable contains the charset of the current string.

$content = iconv($charset, 'UTF-8', $content);

With this done, is it safe to use strpos, strlen, substr, etc. instead of their multibyte equivalents? I'm asking because I also use preg_match a lot. If I use PREG_OFFSET_CAPTURE to get the position of a word in the string, I can't use that value with mb_substr to remove everything before the word.

Recommended answer

That entirely depends on what you want to do. The core strlen and similar functions work on bytes: every number they accept and return is a byte count or byte offset. The mb_* functions are encoding-aware and work on characters: every number they accept and return is a character count or character offset.
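As a rough illustration of the byte/character distinction, the sketch below compares the two function families on the same string. It assumes the source file is saved as UTF-8 and that the mbstring extension is loaded; the variable names are only for illustration.

```php
<?php
// Assumption: this file is UTF-8 encoded and mbstring is available.
$str = '漢字'; // two characters, three bytes each in UTF-8

$byteLength = strlen($str);             // counts bytes
$charLength = mb_strlen($str, 'UTF-8'); // counts characters

$byteOffset = strpos($str, '字');                // byte offset of the match
$charOffset = mb_strpos($str, '字', 0, 'UTF-8'); // character offset of the match

printf("bytes: %d, characters: %d\n", $byteLength, $charLength);
printf("byte offset: %d, character offset: %d\n", $byteOffset, $charOffset);
```

The two families give different numbers for the same string, so a value obtained from one family should only be passed back to functions of that same family.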

If you have a safe way of getting a byte offset in a string ("safe" meaning the offset is not in the middle of a multi-byte character) and then, for example, crop everything before that offset using substr, that'll work just fine. For instance:

$str     = '漢字';
$offset  = strpos($str, '字');    // byte offset where '字' starts
$cropped = substr($str, $offset); // everything from that byte onward

works fine.

However, this will not work:

$cropped = substr($str, $offset, 1);

You can't safely cut out a single byte without running the risk of cutting into a multi-byte character.
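To make the failure concrete, here is a minimal sketch (again assuming UTF-8 source encoding and the mbstring extension) showing the broken one-byte cut next to two possible ways of cutting a whole character. It also covers the preg_match case from the question: PREG_OFFSET_CAPTURE reports byte offsets, so those offsets pair with substr rather than mb_substr. The variable names are hypothetical.

```php
<?php
// Assumption: this file is UTF-8 encoded and mbstring is available.
$str    = '漢字';
$needle = '字';
$offset = strpos($str, $needle); // byte offset

// One byte is only the first third of '字', so the result is not valid UTF-8.
$broken = substr($str, $offset, 1);
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)

// Option 1: the needle's byte length is known, so cut exactly that many bytes.
$byNeedle = substr($str, $offset, strlen($needle));

// Option 2: crop by byte offset first, then take one character with mb_substr.
$byChar = mb_substr(substr($str, $offset), 0, 1, 'UTF-8');

// preg_match with PREG_OFFSET_CAPTURE also reports byte offsets,
// so its offsets are used with substr, not mb_substr.
preg_match('/字/u', $str, $matches, PREG_OFFSET_CAPTURE);
$fromRegex = substr($str, $matches[0][1]);
```

Both options rely on the offset landing on a character boundary, which holds here because it came from a successful match against valid UTF-8.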
