Am I correctly supporting UTF-8 in my PHP apps?

Question

I would like to make sure that everything I know about UTF-8 is correct. I have been trying to use UTF-8 for a while now but I keep stumbling across more and more bugs and other weird things that make it seem almost impossible to have a 100% UTF-8 site. There is always a gotcha somewhere that I seem to miss. Perhaps someone here can correct my list or OK it so I don't miss anything important.

Database

Every site has to store its data somewhere. No matter what your PHP settings are, you must also configure the DB. If you can't access the config files, then make sure to run "SET NAMES 'utf8'" as soon as you connect. Also, make sure to use utf8_unicode_ci on all of your tables. This assumes MySQL as the database; you will have to adapt this for others.

Regex

I do a LOT of regex that is more complex than your average search-replace. I have to remember to use the "/u" modifier so that PCRE doesn't corrupt my strings. Yet, even then there are still problems apparently.
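As a rough illustration of what the "/u" modifier changes, here is a minimal sketch assuming PHP's bundled PCRE functions. The sample strings are made up for the example and are written as escaped bytes so the result does not depend on how the file itself is saved.

```php
<?php
// "café" with "é" spelled out as its two UTF-8 bytes.
$word = "caf\xC3\xA9";

// Without /u the dot matches one byte, so "é" is counted twice.
$byteCount = preg_match_all('/./s', $word);    // 5

// With /u the dot matches one code point.
$charCount = preg_match_all('/./su', $word);   // 4

// With /u, a subject that is not valid UTF-8 makes the call fail
// (it returns false) instead of silently matching garbage.
$bad    = "caf\xE9";                           // ISO-8859-1 bytes, not UTF-8
$result = preg_match('/./u', $bad);            // false
$error  = preg_last_error();                   // PREG_BAD_UTF8_ERROR
```

Checking for a `false` return (rather than just a falsy one) is what separates "no match" from "the input was not UTF-8".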

String Functions

All of the default string functions (strlen(), strpos(), etc.) should be replaced with Multibyte String Functions that look at the character instead of the byte.
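The difference between the byte-based and multibyte functions can be sketched as follows, assuming the mbstring extension is loaded. The encoding is passed explicitly so the example does not rely on the configured default.

```php
<?php
// "naïve" in UTF-8: 5 characters, 6 bytes ("ï" is \xC3\xAF).
$s = "na\xC3\xAFve";

$bytes = strlen($s);                        // 6 (bytes)
$chars = mb_strlen($s, 'UTF-8');            // 5 (characters)

// Byte offsets and character offsets diverge after the first
// multibyte character.
$bytePos = strpos($s, 'v');                 // 4
$charPos = mb_strpos($s, 'v', 0, 'UTF-8');  // 3

// substr() can cut a character in half; mb_substr() cuts on
// character boundaries.
$cut   = substr($s, 0, 3);                  // "na\xC3" (broken sequence)
$clean = mb_substr($s, 0, 3, 'UTF-8');      // "na\xC3\xAF"
```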

Headers

You should make sure that your server is returning the correct header for the browser to know what charset you are trying to use (just like you must tell MySQL).

header('Content-Type: text/html; charset=utf-8');

It is also a good idea to put the correct <meta> tag in the page head, though the actual header will override it should they differ.

<meta http-equiv="Content-Type" content="text/html;charset=utf-8">

Questions

Do I need to convert everything that I receive from the user agent (HTML forms & URIs) to UTF-8 when the page loads, or can I just leave the strings/values as they are and still run them through these functions without a problem?

If I do need to convert everything to UTF-8 - then what steps should I take? mb_detect_encoding seems to be built for this but I keep seeing people complain that it doesn't always work. mb_check_encoding also seems to have a problem telling a good UTF-8 string from a malformed one.

Does PHP store strings in memory differently depending on what encoding it is using (like file types), or is it still stored like a regular string with some of the chars being interpreted differently (like &amp; vs & in HTML)? chazomaticus answers this question:

In PHP (up to PHP5, anyway), strings are just sequences of bytes. There is no implied or explicit character set associated with them; that's something the programmer must keep track of.
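A small sketch of what "just sequences of bytes" means in practice, assuming the mbstring extension for the conversion step. The same character is two different byte strings depending on the encoding, and PHP itself records nothing about which one was meant.

```php
<?php
$latin1 = "\xE9";      // "é" in ISO-8859-1: one byte
$utf8   = "\xC3\xA9";  // "é" in UTF-8: two bytes

// Plain string functions only ever see the bytes.
$latin1Length = strlen($latin1);   // 1
$utf8Length   = strlen($utf8);     // 2
$utf8Hex      = bin2hex($utf8);    // "c3a9"

// Converting is an explicit step, and the caller has to state the
// source encoding because the string does not carry it.
$converted = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
```

After the conversion, `$converted` holds the same two bytes as `$utf8`.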

If I give a non-UTF-8 string to an mb_* function, will it ever cause a problem?

If a UTF string is improperly encoded will something go wrong (like a parsing error in regex?) or will it just mark an entity as bad (html)? Is there ever a chance that improperly encoded strings will result in a function returning FALSE because the string is bad?

I have heard that you should mark your forms as UTF-8 also (accept-charset="UTF-8") but I am not sure what the benefit is..?

Was UTF-16 written to address a limit in UTF-8? Like did UTF-8 run out of space for characters? (Y2(UTF)k?)

Functions

Here are a couple of the custom PHP functions I have found, but I haven't any way to verify that they actually work. Perhaps someone has an example which I can use. First is convertToUTF8() and then seems_utf8 from WordPress.

function seems_utf8($str) {
    $length = strlen($str);
    for ($i=0; $i < $length; $i++) {
        $c = ord($str[$i]);
        if ($c < 0x80) $n = 0; # 0bbbbbbb
        elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
        elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
        elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
        elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
        elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
        else return false; # Does not match any model
        for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
            if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
                return false;
        }
    }
    return true;
}

function is_utf8($str) {
    $c=0; $b=0;
    $bits=0;
    $len=strlen($str);
    for($i=0; $i<$len; $i++){
        $c=ord($str[$i]);
        if($c >= 128){ // 0x80 and above is non-ASCII; the original "> 128" let a lone 0x80 byte pass as valid
            if(($c >= 254)) return false;
            elseif($c >= 252) $bits=6;
            elseif($c >= 248) $bits=5;
            elseif($c >= 240) $bits=4;
            elseif($c >= 224) $bits=3;
            elseif($c >= 192) $bits=2;
            else return false;
            if(($i+$bits) > $len) return false;
            while($bits > 1){
                $i++;
                $b=ord($str[$i]);
                if($b < 128 || $b > 191) return false;
                $bits--;
            }
        }
    }
    return true;
}

If anyone is interested, I found a great example page to use when testing UTF-8.

Solution

Do I need to convert everything that I receive from the user agent (HTML forms & URIs) to UTF-8 when the page loads

No. The user agent should be submitting data in UTF-8 format; if not you are losing the benefit of Unicode.

The way to ensure a user-agent submits in UTF-8 format is to serve the page containing the form it's submitting in UTF-8 encoding. Use the Content-Type header (and meta http-equiv too if you intend the form to be saved and work standalone).
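Even when the page is served as UTF-8, nothing forces a client to send well-formed bytes, so it can still be worth rejecting bad input at the boundary rather than trying to guess and convert it. A minimal sketch, assuming the mbstring extension; the function name `require_utf8` is made up for this example.

```php
<?php
// Hypothetical helper: returns the value unchanged if it is well-formed
// UTF-8, and throws otherwise. It does not try to detect or convert
// other encodings.
function require_utf8(string $value): string
{
    if (!mb_check_encoding($value, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $value;
}

$ok = require_utf8("caf\xC3\xA9");   // valid UTF-8, returned as-is

$rejected = false;
try {
    require_utf8("caf\xE9");         // ISO-8859-1 bytes, not valid UTF-8
} catch (InvalidArgumentException $e) {
    $rejected = true;
}
```

In a real request handler the same check would be applied to each submitted form field and query parameter before it reaches the rest of the code.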

I have heard that you should mark your forms as UTF-8 also (accept-charset="UTF-8")

Don't. It was a nice idea in the HTML standard, but IE never got it right. It was supposed to state an exclusive list of allowable charsets, but IE treats it as a list of additional charsets to try, on a per-field basis. So if you have an ISO-8859-1 page and a form with accept-charset="UTF-8", IE will first try to encode a field as ISO-8859-1, and only if there's a non-8859-1 character in there will it resort to UTF-8.

But since IE does not tell you whether it has used ISO-8859-1 or UTF-8, that's of absolutely no use to you. You would have to guess, for each field separately, which encoding was in use! Not useful. Omit the attribute and serve your pages as UTF-8; that's the best you can do at the moment.

If a UTF string is improperly encoded will something go wrong

If you let such a sequence get through to the browser you could be in trouble. There are ‘overlong sequences’ which encode a low-numbered codepoint in a longer sequence of bytes than is necessary. This means if you are filtering ‘<’ by looking for that ASCII character in a sequence of bytes, you could miss one, and let a script element into what you thought was safe text.
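To make the overlong problem concrete, the sketch below uses the two bytes \xC0\xBC, an overlong encoding of ‘<’. A byte-level search does not see a ‘<’, a lax decoder would still produce one, and a strict validity check (here PCRE's "/u" mode, used only as one convenient validator) rejects the sequence outright.

```php
<?php
$overlong = "\xC0\xBC";  // overlong, invalid two-byte encoding of U+003C "<"

// A byte-level filter looking for "<" finds nothing.
$foundByByteSearch = strpos($overlong, '<') !== false;       // false

// What a lax two-byte decoder would compute from these bytes:
// 5 payload bits from the lead byte, 6 from the continuation byte.
$lead      = ord($overlong[0]);
$trail     = ord($overlong[1]);
$codepoint = (($lead & 0x1F) << 6) | ($trail & 0x3F);        // 0x3C
$decoded   = chr($codepoint);                                // "<"

// A strict validator refuses the sequence: preg_match() returns false
// for an invalid UTF-8 subject when /u is set.
$isValidUtf8 = preg_match('//u', $overlong) === 1;           // false
```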

Overlong sequences were banned back in the early days of Unicode, but it took Microsoft a very long time to get their shit together: IE would interpret the byte sequence ‘\xC0\xBC’ as a ‘<’ up until IE6 Service Pack 1. Opera also got it wrong up to (about, I think) version 7. Luckily these older browsers are dying out, but it's still worth filtering overlong sequences in case those browsers are still about now (or new idiot browsers make the same mistake in future). You can do this, and fix other bad sequences, with a regex that allows only proper UTF-8 through, such as this one from W3.
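The link to the W3 pattern did not survive on this page. What follows is a sketch of a byte-level pattern of that kind, adapted to PHP from the form commonly attributed to the W3C; treat it as an approximation rather than the exact pattern the answer linked to. The function name `is_strict_utf8` is made up for the example.

```php
<?php
// Returns true only if $s consists of tab, LF, CR, printable ASCII and
// well-formed multibyte UTF-8 sequences. Overlong forms, surrogates,
// values above U+10FFFF and other control characters are all rejected.
// The pattern is deliberately run WITHOUT /u: it inspects raw bytes.
function is_strict_utf8(string $s): bool
{
    $pattern = '/\A(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII: tab, LF, CR, printable
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1 to 3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4 to 15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*+\z/x';

    return preg_match($pattern, $s) === 1;
}

$valid     = is_strict_utf8("caf\xC3\xA9");       // true
$overlong  = is_strict_utf8("\xC0\xBC");          // false
$surrogate = is_strict_utf8("\xED\xA0\x80");      // false
$astral    = is_strict_utf8("\xF0\x9F\x98\x80");  // true (U+1F600)
```

The possessive `*+` quantifier stops the engine from backtracking into the repetition, which helps on long inputs; very large strings may still run into PCRE limits, so this is better suited to form fields than to whole files.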

If you are using mb_* functions in PHP, you might be insulated from these issues. I can't say for sure, as mb_* was unusably fragile when I was still writing PHP.

In any case, this is also a good time to remove control characters, which are a large and generally unappreciated source of bugs. I would remove chars 9 and 13 from submitted strings in addition to the others the W3 regex takes out; it is also worth removing plain newlines from strings you know aren't supposed to come from multiline textboxes.
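The control-character cleanup described above could look roughly like this. The function name and the `$multiline` flag are made up for the example. Because every byte being removed is below 0x80, and UTF-8 lead and continuation bytes are all 0x80 or above, stripping them byte-wise cannot damage a multibyte character.

```php
<?php
// Hypothetical helper: removes ASCII control characters (0x00-0x1F, 0x7F).
// For multiline input, LF (0x0A) is kept; tab (9) and CR (13) are still
// removed, as suggested in the answer. C1 controls (U+0080-U+009F) are
// not handled here.
function strip_controls(string $s, bool $multiline = false): string
{
    $pattern = $multiline
        ? '/[\x00-\x09\x0B-\x1F\x7F]/'   // everything except LF
        : '/[\x00-\x1F\x7F]/';           // every ASCII control character

    $result = preg_replace($pattern, '', $s);
    if ($result === null) {
        throw new RuntimeException('preg_replace failed');
    }
    return $result;
}

$single = strip_controls("a\tb\r\nc");          // "abc"
$multi  = strip_controls("a\tb\r\nc", true);    // "ab\nc"
$utf8   = strip_controls("caf\xC3\xA9\x07");    // "caf\xC3\xA9"
```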

Was UTF-16 written to address a limit in UTF-8?

No. UTF-16 is an encoding built on two-byte code units, used to make indexing Unicode strings easier in memory. It dates from the days when all of Unicode would fit in two bytes, so one unit was one codepoint; codepoints beyond that range now take a surrogate pair of two units. Systems like Windows and Java still work that way. Unlike UTF-8 it is not compatible with ASCII, and it is of little-to-no use on the Web. But you occasionally meet it in saved files, usually ones saved by Windows users who have been misled by Windows's description of UTF-16LE as "Unicode" in Save-As menus.
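The size and ASCII-compatibility differences can be seen by converting a couple of characters, assuming the mbstring extension. The sample characters are arbitrary.

```php
<?php
$ascii = "A";                  // U+0041
$emoji = "\xF0\x9F\x98\x80";   // U+1F600 in UTF-8 (4 bytes)

$asciiUtf16 = mb_convert_encoding($ascii, 'UTF-16LE', 'UTF-8');
$emojiUtf16 = mb_convert_encoding($emoji, 'UTF-16LE', 'UTF-8');

// "A" is one byte in UTF-8 but two in UTF-16, and the extra NUL byte
// is why UTF-16 text is not ASCII-compatible.
$asciiHex = bin2hex($asciiUtf16);   // "4100"

// A codepoint outside the first 65,536 needs a surrogate pair
// (D83D DE00), so it takes four bytes in UTF-16 as well.
$emojiHex   = bin2hex($emojiUtf16); // "3dd800de"
$emojiBytes = strlen($emojiUtf16);  // 4
```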

seems_utf8

This is very inefficient compared to the regex!

Also, make sure to use utf8_unicode_ci on all of your tables.

You can actually sort of get away without this, treating MySQL as a store for nothing but bytes and only interpreting them as UTF-8 in your script. The advantage of using utf8_unicode_ci is that it will collate (sort and do case-insensitive compares) with knowledge about non-ASCII characters, so e.g. ‘ŕ’ and ‘Ŕ’ compare as the same character. If you use a non-UTF8 collation you should stick to binary (case-sensitive) matching.

Whichever you choose, do it consistently: use the same character set for your tables as you do for your connection. What you want to avoid is a lossy character set conversion between your scripts and the database.
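One way to keep the connection and the tables on the same character set is sketched below, assuming the mysqli extension. The function name, the credentials passed in, and the `articles` table are all hypothetical. The sketch also swaps in `utf8mb4` / `utf8mb4_unicode_ci` for the `utf8` names used in the text, because MySQL's `utf8` charset stores at most three bytes per character and so cannot hold codepoints outside the Basic Multilingual Plane; use the `utf8` names instead if you need to match an existing schema.

```php
<?php
// Hypothetical helper: opens a connection and sets its character set.
// set_charset() is used rather than a raw "SET NAMES" query so that the
// client library also knows the charset when escaping strings.
function open_utf8_connection(string $host, string $user, string $pass, string $db): mysqli
{
    $mysqli = new mysqli($host, $user, $pass, $db);
    if ($mysqli->connect_errno) {
        throw new RuntimeException('Connect failed: ' . $mysqli->connect_error);
    }
    if (!$mysqli->set_charset('utf8mb4')) {
        throw new RuntimeException('Could not set charset: ' . $mysqli->error);
    }
    return $mysqli;
}

// Table definition using the same character set as the connection
// ("articles" is an invented example table).
const CREATE_TABLE_SQL = <<<'SQL'
CREATE TABLE articles (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
SQL;
```

With both sides declared the same way, MySQL has no reason to convert between character sets on the way in or out.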
