PHP Multibyte String Functions




Today I ran into a problem with the PHP function strpos(): it returned FALSE even though the correct result was obviously 0. The cause was that one argument was encoded in UTF-8, while the other (which came from an HTTP GET parameter) obviously was not.


I then noticed that using the mb_strpos function instead solved my problem.


My question now is: is it wise to use the PHP multibyte string functions everywhere, to avoid these problems in the future? Should I avoid the traditional strpos, strlen, ereg, etc. functions altogether?


Note: I don't want to set mbstring.func_overload globally in php.ini, because this leads to other problems when using the PEAR library. I am using PHP 4.



It depends on the character encoding you are using. With a single-byte encoding, or with UTF-8 (where a byte inside a multi-byte character can never be mistaken for another character), you can continue to use the regular string search functions, as long as the string you are searching in and the string you are searching for are in the same encoding.
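As a minimal sketch of the UTF-8 case (the sample strings are made up for illustration, and the mb_strpos call assumes the mbstring extension is loaded), a byte-wise strpos still finds the right match when both strings are UTF-8. The one thing to keep in mind is that it reports a byte offset, whereas mb_strpos reports a character offset:

```php
<?php
// "naïve café" written as explicit UTF-8 bytes, so the example does not
// depend on the encoding this file happens to be saved in.
$haystack = "na\xC3\xAFve caf\xC3\xA9";
$needle   = "caf\xC3\xA9";

// Byte-wise search: correct match, offset counted in bytes.
$bytePos = strpos($haystack, $needle);
var_dump($bytePos); // int(7), because "ï" occupies two bytes

// Character-aware search: same match, offset counted in characters.
if (function_exists('mb_strpos')) {
    $charPos = mb_strpos($haystack, $needle, 0, 'UTF-8');
    var_dump($charPos); // int(6)
}

// A byte offset pairs correctly with other byte-wise functions.
$tail = substr($haystack, $bytePos);
```

Either offset is fine as long as it is only passed to functions from the same family: byte offsets to substr, character offsets to mb_substr.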


If you are using a multi-byte encoding other than UTF-8, one that does not prevent single bytes within a character from looking like other characters, then it is never safe to search with the regular string search functions: you may get false positives. This is because string comparison in PHP functions such as strpos is per-byte. With the exception of UTF-8, which is specifically designed to prevent this problem, multi-byte encodings allow the second or a later byte of a multi-byte character to match a byte belonging to a different character.
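A sketch of such a false positive, using Shift_JIS as the example encoding (the choice of encoding and character is mine, not the original poster's, and the code assumes the mbstring extension is available). In Shift_JIS the character 表 is encoded as the two bytes 0x95 0x5C, and 0x5C on its own is the backslash:

```php
<?php
// One Shift_JIS character (U+8868), two bytes: 0x95 0x5C.
$haystack = "\x95\x5C";

// A single backslash, byte 0x5C.
$needle = "\\";

// Byte-wise search matches the trailing byte of the two-byte character.
$naive = strpos($haystack, $needle);
var_dump($naive); // int(1): a false positive

// Encoding-aware search treats 0x95 0x5C as one character and finds nothing.
$aware = mb_strpos($haystack, $needle, 0, 'SJIS');
var_dump($aware); // bool(false)
```

The haystack contains no backslash character at all, yet strpos reports one at offset 1.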


If the string you are searching in and the string you are searching for are in different character encodings, then conversion is always necessary. Otherwise, for any string that is represented differently in the other encoding, the search will always return false. You should do this conversion on input: decide on one character encoding for your application, and be consistent within the application. Any time you receive input in a different encoding, convert it on the way in.
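A minimal sketch of converting on the way in, assuming the application standardises on UTF-8, the incoming value is known to be ISO-8859-1, and the mbstring extension is available. The helper name to_app_encoding is hypothetical, and in practice the source encoding has to come from what you know about the client or the form, since it cannot be reliably guessed from the bytes:

```php
<?php
// Hypothetical helper: normalise incoming text to the application's
// encoding (assumed here to be UTF-8).
function to_app_encoding($input, $sourceEncoding)
{
    return mb_convert_encoding($input, 'UTF-8', $sourceEncoding);
}

$haystack  = "\xC3\xA9t\xC3\xA9"; // "été" in UTF-8
$rawNeedle = "\xE9";              // "é" in ISO-8859-1, e.g. from a GET parameter

// Mismatched encodings: the byte 0xE9 never occurs in the UTF-8 haystack.
$before = strpos($haystack, $rawNeedle);
var_dump($before); // bool(false)

// Convert once at the boundary, then search as usual.
$needle = to_app_encoding($rawNeedle, 'ISO-8859-1');
$after  = strpos($haystack, $needle);
var_dump($after); // int(0)
```

Because a valid match at position 0 and a failed search are both falsy, compare the result with `=== false` rather than relying on a loose truth test.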


