迁移php应用程序以处理UTF-8 [英] Migrating a php application to handle UTF-8

查看:68
本文介绍了迁移php应用程序以处理UTF-8的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用php开发多语言应用程序.

I am working on a multi-language app in php.

一切都很好,直到最近我被要求支持汉字.我为支持UTF-8字符而采取的措施如下:

All was fine until recently I was asked to support Chinese characters. The actions I took to support UTF-8 characters are the following:

  • 所有数据库表现在都是UTF-8

  • All DB tables are now UTF-8

HTML模板包含标签<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

HTML templates contain the tag <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

控制器发出一个标头,该标头指定用于http响应的编码(utf-8)

The controllers send out a header specifying the encoding (utf-8) to use for the http response

一切都很好,直到我开始进行一些字符串操作(substr之类的操作)

All was good until I started making some string manipulations (substr and the likes)

使用中文时,它将不起作用,因为中文表示为多字节,因此,如果您使用普通的子字符串(substr),则会在分配的字节之一的中间自动切出一个字母",然后按f * ck向上屏幕上的结果.

With chinese it won't work because the chinese is represented as multibytes and hence if you do a normal substring (substr) it will prolly cut a "letter" in the middle of one of the bytes allocated and f*ck up the result on screen.

我通过将其添加到引导程序中解决了所有问题

I fixed ALL my problems by adding this in the bootstrap

mb_internal_encoding("UTF-8");

,并将所有strlensubstrstrstr替换为它们的mb_.

and replacing all the strlen, substr, strstr with their mb_ counterparts.

要在php中完全支持UTF-8,我还需要做什么?

推荐答案

除了替换这些功能之外,还有更多其他功能.

There's a little more to it than just replacing those functions.

正则表达式

您应该将utf8标志添加到所有PCRE正则表达式中,这些正则表达式可以具有包含非Ascii字符的字符串,以便将模式解释为实际字符而不是字节.

You should add the utf8 flag to all of your PCRE regular expressions that can have strings which contain non-Ascii chars, so that the patterns are interpreted as the actual characters rather than bytes.

$subject = "Helló";
$pattern = '/(l|ó){2,3}/u'; //The u flag indicates the pattern is UTF8
preg_match($pattern, substr($subject,3), $matches, PREG_OFFSET_CAPTURE);

此外,您应该使用 Unicode字符类,而不是如果您希望正则表达式对非拉丁字母正确,那么标准Perl会是这样吗?

Also you should use the Unicode character classes rather than the standard Perl ones if you want your regular expressions to be correct for non-Latin alphabets?

  • \ p {L}代替\ w,代表任何字母"字符.
  • \ p {Z}(而不是\ s)代表任何空格"字符.
  • \ p {N}代替\ d表示任何数字"字符,例如阿拉伯数字

有许多不同的Unicode字符类,其中一些对于习惯于使用拉丁字母进行读写的人来说是非常不寻常的.例如,某些字符与前一个字符结合使用新字形.可以在此处阅读.

There are a lot of different Unicode character classes, some of which are quite unusual to someone used to reading and writing in a Latin alphabet. For example some characters combine with the previous character to make a new glyph. More explanation of them can be read here.

尽管mbstring扩展中包含正则表达式函数,但不建议使用它们.标准的PCRE功能与UTF8标志一起正常工作.

Although there are regular expression functions in the mbstring extension, they are not recommended for use. The standard PCRE functions work fine with the UTF8 flag.

功能替换

尽管您的列表是一个开始,但到目前为止,我发现需要用多字节版本替换的功能的列表更长.这是功能及其替换功能的列表,其中一些功能未在PHP中定义,但可从此处在中获得. Github为mb_extra .

Although your list is a start, the list of function I have found so far that need to be replaced with multibyte versions is longer. This is the list of functions with their replacement functions, some of which are not defined in PHP, but are available from here on Github as mb_extra.

$unsafeFunctions = array(
    'mail'      => 'mb_send_mail',
    'split'     => null, //'mb_split', deprecated function - just don't use it
    'stripos'   => 'mb_stripos',
    'stristr'   => 'mb_stristr',
    'strlen'    => 'mb_strlen',
    'strpos'    => 'mb_strpos',
    'strrpos'   => 'mb_strrpos',
    'strrchr'   => 'mb_strrchr',
    'strripos'  => 'mb_strripos',
    'strstr'    => 'mb_strstr',
    'strtolower'    => 'mb_strtolower',
    'strtoupper'    => 'mb_strtoupper',
    'substr_count'  => 'mb_substr_count',
    'substr'        => 'mb_substr',
    'str_ireplace'  => null,
    'str_split'     => 'mb_str_split', //TODO - check this works
    'strcasecmp'    => 'mb_strcasecmp', //TODO - check this works
    'strcspn'       => null, //TODO - implement alternative
    'strrev'        => 'mb_strrev', //TODO - check this works
    'strspn'        => null, //TODO - implement alternative
    'substr_replace'=> 'mb_substr_replace',
    'lcfirst'       => null,
    'ucfirst'       => 'mb_ucfirst',
    'ucwords'       => 'mb_ucwords',
    'wordwrap'      => null,
);

MySQL

尽管您会认为将字符类型设置为utf8会在MySQL中为您提供UTF-8支持,但事实并非如此.

Although you would have thought that setting the character type to utf8 would give you UTF-8 support in MySQL, it does not.

它仅提供对UTF-8的支持,UTF-8编码为3个字节,又称为基本的多语言平面.但是,人们正在积极使用需要4个字节编码的字符,包括大多数 Emoji字符,也称为补充性多语言平面

It only gives you support for UTF-8 that are encoded in up to 3 bytes aka the Basic Multi-lingual Plane. However people are actively using characters that require 4 bytes to encode, including most of the Emoji characters, also know as the Supplementary Multilingual Plane

要支持这些功能,通常应使用:

To support these you should in general use:

  • utf8mb4-用于字符编码.
  • utf8mb4_unicode_ci-用于字符排序.

在特定情况下,可能还有一些适合您的排序规则集,但总的来说,请遵循最正确的排序规则集.

For specific scenarios there are alternative collation sets that may be appropriate for you, but in general stick to the collation set that is most correct.

在MySQL配置文件中应设置字符集和排序规则的位置的列表为:

The list of places where you should set the character set and collation in your MySQL config file are:

[mysql]
default-character-set=utf8mb4

[client]
default-character-set=utf8mb4

[mysqld]
init-connect='SET NAMES utf8mb4'
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci

SET NAMES可能并非在所有情况下都必须使用-只是在很小的速度损失下就更安全了.

The SET NAMES may not be required in all circumstances - but it is safer on at only a small speed penalty.

PHP INI文件

尽管您说过在引导脚本中设置了mb_internal_encoding,但最好在PHP ini文件中进行设置,并设置所有推荐参数:

Although you said you have set mb_internal_encoding in your bootstrap script, it is much better to do this in the PHP ini file, and also set all the recommended parameters:

mbstring.language   = Neutral   ; Set default language to Neutral(UTF-8) (default)
mbstring.internal_encoding  = UTF-8 ; Set default internal encoding to UTF-8
mbstring.encoding_translation = On  ;  HTTP input encoding translation is enabled
mbstring.http_input     = auto  ; Set HTTP input character set dectection to auto
mbstring.http_output    = UTF-8 ; Set HTTP output encoding to UTF-8
mbstring.detect_order   = auto  ; Set default character encoding detection order to auto
mbstring.substitute_character = none ; Do not print invalid characters
default_charset      = UTF-8 ; Default character set for auto content type header

帮助浏览器为表单选择UTF8

  • 您需要在表单上设置 accept-charset UTF-8,告诉浏览器将它们提交为UTF8.

  • You need to set accept-charset on your forms to be UTF-8 to tell browsers to submit them as UTF8.

将UTF8字符添加到您的

Add a UTF8 character to your form in a hidden field, to stop Internet Explorer (5, 6, 7 and 8) from submitting a form as something other than UTF8.

其他

  • 如果您使用的是Apache,请设置"AddDefaultCharset utf-8"

  • If you're using Apache set "AddDefaultCharset utf-8"

就像您说的那样,只是为了提醒任何人阅读答案,请在标头中也设置meta内容类型.

As you said you're doing, but just to remind anyone reading the answer, set the meta content-type as well in the header.

应该就可以了.尽管值得阅读"每个程序员绝对需要了解的内容,但是肯定需要了解与文本配合使用的编码和字符集"页面,我认为最好在任何地方使用UTF-8 ,这样就不必花费很多精力努力处理不同的字符集.

That should be about it. Although it's worth reading the "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text" page, I think it is preferable to use UTF-8 everywhere and so not have to spend any mental effort on handling different character sets.

这篇关于迁移php应用程序以处理UTF-8的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆