Is PHP str_word_count() multibyte safe?


Question

I want to use str_word_count() on a UTF-8 string.

Is this safe in PHP? It seems to me that it should be (especially considering that there is no mb_str_word_count()).

But on php.net a lot of people are muddying the water by presenting their own 'multibyte compatible' versions of the function.

So I guess I want to know...

  1. Given that str_word_count simply counts all character sequences delimited by " " (space), it should be safe on multibyte strings, even though it's not necessarily aware of the character sequences, right?

  2. Are there any equivalent 'space' characters in UTF-8 which are not ASCII " " (space)?

This is where the problem might lie, I guess.

Recommended answer

I'd say you guess right. And indeed there are space characters in UTF-8 which are not part of US-ASCII. To give you an example of such a space:

  • Unicode Character 'NO-BREAK SPACE' (U+00A0): 2 Bytes in UTF-8: 0xC2 0xA0 (c2a0)

And perhaps also:

  • Unicode Character 'NEXT LINE (NEL)' (U+0085): 2 Bytes in UTF-8: 0xC2 0x85 (c285)
  • Unicode Character 'LINE SEPARATOR' (U+2028): 3 Bytes in UTF-8: 0xE2 0x80 0xA8 (e280a8)
  • Unicode Character 'PARAGRAPH SEPARATOR' (U+2029): 3 Bytes in UTF-8: 0xE2 0x80 0xA9 (e280a9)
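
The byte sequences listed above can be checked directly in PHP. This is a minimal sketch, assuming PHP 7.0 or later (for the "\u{...}" escape); the array name $spaces is only illustrative:

```php
<?php
// Map each character name to the UTF-8 encoding of its code point.
// The "\u{...}" escape in double-quoted strings requires PHP 7.0+.
$spaces = [
    'NO-BREAK SPACE (U+00A0)'      => "\u{00A0}",
    'NEXT LINE (U+0085)'           => "\u{0085}",
    'LINE SEPARATOR (U+2028)'      => "\u{2028}",
    'PARAGRAPH SEPARATOR (U+2029)' => "\u{2029}",
];

// Print the raw bytes of each character as lowercase hex.
foreach ($spaces as $name => $char) {
    printf("%-30s %s\n", $name, bin2hex($char));
}
// NO-BREAK SPACE (U+00A0)        c2a0
// NEXT LINE (U+0085)             c285
// LINE SEPARATOR (U+2028)        e280a8
// PARAGRAPH SEPARATOR (U+2029)   e280a9
```

None of these byte sequences contains the ASCII space byte 0x20, so a function that only splits on ASCII whitespace will not see them as separators.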

Anyway, the first one, the 'NO-BREAK SPACE' (U+00A0), is a good example because it is also part of the Latin-X charsets. And the PHP manual already hints that str_word_count is locale dependent.

If we want to put this to a test, we can set the locale to UTF-8 and pass in an invalid string containing a lone \xA0 byte. If that byte still counts as a word-breaking character, the function is clearly not UTF-8 safe, and hence not multibyte safe (a term the question leaves undefined):

<?php
/**
 * is PHP str_word_count() multibyte safe?
 * @link https://stackoverflow.com/q/8290537/367456
 */

echo 'New Locale: ', setlocale(LC_ALL, 'en_US.utf8'), "\n\n";

$test   = "aword\xA0bword aword";
$result = str_word_count($test, 2);

var_dump($result);

Output:

New Locale: en_US.utf8

array(3) {
  [0]=>
  string(5) "aword"
  [6]=>
  string(5) "bword"
  [12]=>
  string(5) "aword"
}

As this demo shows, the function completely fails to keep the locale promise made on its manual page (I neither wonder nor moan about this; most often, if you read that a function is locale specific in PHP, run for your life and find one that is not). I exploit that here to demonstrate that it does nothing whatsoever with regard to the UTF-8 character encoding.

For UTF-8 you should instead take a look at the PCRE extension.

PCRE has a good understanding of Unicode and UTF-8, in PHP specifically. It can also be quite fast if you craft the regular expression pattern carefully.
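
As a rough illustration of the PCRE route, here is a minimal sketch of a UTF-8 aware word counter. The function name utf8_word_count() is hypothetical (it is not part of PHP), and the definition of a "word" used here (a run of Unicode letters and combining marks, optionally joined by apostrophes or hyphens) is an assumption that may need adjusting for your text:

```php
<?php
/**
 * Hypothetical helper, not a PHP built-in: count words in a UTF-8 string.
 * Assumption: a word is a run of letters/marks, with single apostrophes
 * or hyphens allowed between such runs (e.g. "don't", "well-known").
 */
function utf8_word_count(string $text): int
{
    // \p{L} matches any Unicode letter, \p{M} any combining mark.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $pattern = '/\p{L}[\p{L}\p{M}]*(?:[\'-][\p{L}\p{M}]+)*/u';

    $count = preg_match_all($pattern, $text);

    // preg_match_all() returns false on failure, for example when the
    // subject is not valid UTF-8 and the /u modifier is in effect.
    if ($count === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8 or the match failed.');
    }

    return $count;
}

// NO-BREAK SPACE (U+00A0) is not a letter, so it separates words here.
echo utf8_word_count("aword\u{00A0}bword aword"), "\n"; // 3
echo utf8_word_count("naïve café"), "\n";               // 2
```

Unlike the str_word_count() test above, this sketch rejects the invalid byte string "aword\xA0bword aword" instead of silently counting it, because the /u modifier validates the subject as UTF-8.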
