UTF-8 validation in PHP without using preg_match()


Question

I need to validate some user input that is encoded in UTF-8. Many people recommend the following code:

preg_match('/\A(
     [\x09\x0A\x0D\x20-\x7E]
   | [\xC2-\xDF][\x80-\xBF]
   |  \xE0[\xA0-\xBF][\x80-\xBF]
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
   |  \xED[\x80-\x9F][\x80-\xBF]
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}
   | [\xF1-\xF3][\x80-\xBF]{3}
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}
  )*\z/x', $string);

The regular expression is taken from http://www.w3.org/International/questions/qa-forms-utf-8 . Everything was fine until I discovered a PHP bug that seems to have been around since at least 2006: preg_match() causes a segmentation fault if $string is too long. There doesn't seem to be any workaround. You can view the bug report here: http://bugs.php.net/bug.php?id=36463

To avoid preg_match(), I've written a function that is meant to do exactly the same thing as the regular expression above. I don't know whether this question is appropriate for Stack Overflow, but I would like to know if the function is correct. Here it is:

EDIT [13.01.2010]: In case anyone is interested, the previous version I posted had several bugs. Below is the final version of my function.

function check_UTF8_string(&$string) {
    $len = mb_strlen($string, "ISO-8859-1");
    $ok = 1;

    for ($i = 0; $i < $len; $i++) {
        $o = ord(mb_substr($string, $i, 1, "ISO-8859-1"));

        if ($o == 9 || $o == 10 || $o == 13 || ($o >= 32 && $o <= 126)) {

        }
        elseif ($o >= 194 && $o <= 223) {
            $i++;
            $o2 = ord(mb_substr($string, $i, 1, "ISO-8859-1"));
            if (!($o2 >= 128 && $o2 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif ($o == 224) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $i += 2;
            if (!($o2 >= 160 && $o2 <= 191) || !($o3 >= 128 && $o3 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif (($o >= 225 && $o <= 236) || $o == 238 || $o == 239) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $i += 2;
            if (!($o2 >= 128 && $o2 <= 191) || !($o3 >= 128 && $o3 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif ($o == 237) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $i += 2;
            if (!($o2 >= 128 && $o2 <= 159) || !($o3 >= 128 && $o3 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif ($o == 240) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $o4 = ord(mb_substr($string, $i + 3, 1, "ISO-8859-1"));
            $i += 3;
            if (!($o2 >= 144 && $o2 <= 191) ||
                !($o3 >= 128 && $o3 <= 191) ||
                !($o4 >= 128 && $o4 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif ($o >= 241 && $o <= 243) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $o4 = ord(mb_substr($string, $i + 3, 1, "ISO-8859-1"));
            $i += 3;
            if (!($o2 >= 128 && $o2 <= 191) ||
                !($o3 >= 128 && $o3 <= 191) ||
                !($o4 >= 128 && $o4 <= 191)) {
                $ok = 0;
                break;
            }
        }
        elseif ($o == 244) {
            $o2 = ord(mb_substr($string, $i + 1, 1, "ISO-8859-1"));
            $o3 = ord(mb_substr($string, $i + 2, 1, "ISO-8859-1"));
            $o4 = ord(mb_substr($string, $i + 3, 1, "ISO-8859-1"));
            $i += 3; // 0xF4 starts a 4-byte sequence: skip its 3 continuation bytes
            if (!($o2 >= 128 && $o2 <= 143) ||
                !($o3 >= 128 && $o3 <= 191) ||
                !($o4 >= 128 && $o4 <= 191)) {
                $ok = 0;
                break;
            }
        }
        else {
            $ok = 0;
            break;
        }
    }

    return $ok;
}

Yes, it's very long. I hope I've understood correctly how that regular expression works, and that this will be of help to others.
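For comparison, the same byte ranges can be checked more compactly by indexing the string directly, since PHP strings are byte arrays. This is only a sketch and not part of the original post: the function name is_w3c_utf8() is made up for illustration, and the scalar type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper (illustrative name): byte-wise check intended to accept
// the same inputs as the W3C pattern, without preg_match() or mb_substr().
function is_w3c_utf8(string $s): bool
{
    $len = strlen($s); // strlen() counts bytes, not characters
    $i = 0;
    while ($i < $len) {
        $b = ord($s[$i]);

        // Single byte: tab, LF, CR, or printable ASCII.
        if ($b === 0x09 || $b === 0x0A || $b === 0x0D || ($b >= 0x20 && $b <= 0x7E)) {
            $i++;
            continue;
        }

        // The lead byte fixes the number of continuation bytes ($n) and the
        // allowed range of the first one; later ones must be 0x80-0xBF.
        if ($b >= 0xC2 && $b <= 0xDF)     { $n = 1; $lo = 0x80; $hi = 0xBF; }
        elseif ($b === 0xE0)              { $n = 2; $lo = 0xA0; $hi = 0xBF; }
        elseif ($b === 0xED)              { $n = 2; $lo = 0x80; $hi = 0x9F; }
        elseif ($b >= 0xE1 && $b <= 0xEF) { $n = 2; $lo = 0x80; $hi = 0xBF; }
        elseif ($b === 0xF0)              { $n = 3; $lo = 0x90; $hi = 0xBF; }
        elseif ($b >= 0xF1 && $b <= 0xF3) { $n = 3; $lo = 0x80; $hi = 0xBF; }
        elseif ($b === 0xF4)              { $n = 3; $lo = 0x80; $hi = 0x8F; }
        else                              { return false; } // invalid lead byte

        // The sequence must not run past the end of the string.
        if ($i + $n >= $len) {
            return false;
        }

        for ($k = 1; $k <= $n; $k++) {
            $c = ord($s[$i + $k]);
            $min = ($k === 1) ? $lo : 0x80;
            $max = ($k === 1) ? $hi : 0xBF;
            if ($c < $min || $c > $max) {
                return false;
            }
        }
        $i += $n + 1; // lead byte plus its continuation bytes
    }
    return true;
}
```

Like the pattern, this sketch accepts the empty string and rejects ASCII control characters other than tab, LF and CR.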

Thanks in advance!

Recommended answer

You can always use the multibyte string (mbstring) functions. If you want to run the check a lot and might change the encoding at some point:

1) First set the encoding you want to use in your config file:

/* Set internal character encoding to UTF-8 */
mb_internal_encoding("UTF-8");

2) Check the string:

if(mb_check_encoding($string))
{
    // do something
}

Or, if you don't plan on changing it, you can pass the encoding straight to the function:

if(mb_check_encoding($string, 'UTF-8'))
{
    // do something
}
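As a rough illustration of how this check behaves on a few inputs, here is a small sketch. It assumes the mbstring extension is loaded and PHP 7 or later; the wrapper name is_valid_utf8() and the sample strings are made up for the example.

```php
<?php
// Hypothetical wrapper (illustrative name): strict UTF-8 check via mbstring.
function is_valid_utf8(string $s): bool
{
    return mb_check_encoding($s, 'UTF-8');
}

$samples = [
    'ascii'     => "hello",
    'two-byte'  => "caf\xC3\xA9", // "café" encoded as UTF-8
    'latin-1'   => "caf\xE9",     // "café" encoded as ISO-8859-1
    'truncated' => "\xE2\x82",    // first two bytes of the euro sign
    'nul byte'  => "\x00",        // well-formed UTF-8, but not allowed by the W3C pattern
];

foreach ($samples as $label => $bytes) {
    printf("%-10s %s\n", $label, is_valid_utf8($bytes) ? 'valid' : 'invalid');
}
```

Note that mb_check_encoding() only tests whether the bytes are well-formed UTF-8. The regular expression in the question is stricter about ASCII: it allows only tab, LF, CR and 0x20-0x7E, so control characters such as NUL pass this check but fail the pattern. If that restriction matters for your input, it has to be enforced separately.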
