UTF8 workflow PHP, MySQL summarized
Question
I am working for international clients who all use very different alphabets, so I am trying to finally get an overview of a complete workflow between PHP and MySQL that ensures all character encodings are inserted correctly. I have read a bunch of tutorials on this but still have questions (there is much to learn), so I thought I would put it all together here and ask.
PHP
header('Content-Type:text/html; charset=UTF-8');
mb_internal_encoding('UTF-8');
HTML
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<form accept-charset="UTF-8"> .. </form>
(The latter is optional and more of a suggestion, but I believe it is better than doing neither.)
MySQL
CREATE DATABASE database_name DEFAULT CHARACTER SET utf8;
or ALTER DATABASE database_name DEFAULT CHARACTER SET utf8;
and/or use utf8_general_ci as the MySQL connection collation.
(It is important to note here that this will increase the database size if it uses varchar.)
Connection
mysql_query("SET NAMES 'utf8'");
mysql_query("SET CHARACTER SET utf8");
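The mysql_* functions used in the connection step were removed in PHP 7. As a minimal sketch of the same step, assuming PDO with the pdo_mysql driver, the connection character set can be declared in the DSN (the charset option is honored from PHP 5.3.6 on). The helper name utf8_dsn and the host, database and credential values are hypothetical placeholders, not part of the original workflow. Note that MySQL's utf8 stores at most three bytes per character; utf8mb4 can be passed as the third argument where four-byte characters are needed.

```php
<?php
/**
 * Hypothetical helper (not from the original post): builds a PDO MySQL DSN
 * that declares the connection character set.
 */
function utf8_dsn(string $host, string $dbname, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

$dsn = utf8_dsn('localhost', 'database_name');

// With a live server the DSN would be used like this (placeholder credentials):
// $pdo = new PDO($dsn, 'user', 'password');
//
// With mysqli, the counterpart of SET NAMES is:
// $mysqli->set_charset('utf8');
```

Declaring the charset at connection time lets the client library know the encoding as well, which a plain SET NAMES query does not do.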
Business logic
Detect input that is not UTF8 with mb_detect_encoding() and convert it with iconv().
Validate overly long sequences of UTF8 and UTF16:
$body=preg_replace('/[\x00-\x08\x10\x0B\x0C\x0E-\x19\x7F]|(?<=^|[\x00-\x7F])[\x80-\xBF]+|([\xC0\xC1]|[\xF0-\xFF])[\x80-\xBF]*|[\xC2-\xDF]((?![\x80-\xBF])|[\x80-\xBF]{2,})|[\xE0-\xEF](([\x80-\xBF](?![\x80-\xBF]))|(?![\x80-\xBF]{2})|[\x80-\xBF]{3,})/','�',$body);
$body=preg_replace('/\xE0[\x80-\x9F][\x80-\xBF]|\xED[\xA0-\xBF][\x80-\xBF]/S','?', $body);
Questions
- Is mb_internal_encoding('UTF-8') necessary in PHP 5.3 and higher, and if so, does this mean I have to use the multibyte functions instead of the core functions, e.g. mb_substr() instead of substr()?
- Is it still necessary to check for malformed input strings, and if so, what is a reliable function/class to do so? I would rather not strip bad data, and I don't know enough about transliteration.
- Should it really be utf8_general_ci, or rather utf8_bin?
- Is there something missing in the above workflow?
Sources:
http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/
http://webcollab.sourceforge.net/unicode.html
http://stackoverflow.com/a/3742879/1043231
http://www.adayinthelifeof.nl/2010/12/04/about-using-utf-8-fields-in-mysql/
http://akrabat.com/php/utf8-php-and-mysql/
Recommended answer
- mb_internal_encoding('UTF-8') doesn't do anything by itself; it only sets the default encoding parameter for each mb_ function. If you're not using any mb_ function, it doesn't make any difference. If you are, it makes sense to set it so you don't have to pass the $encoding parameter each time individually.
- IMO mb_detect_encoding is mostly useless, since it's fundamentally impossible to accurately detect the encoding of unknown text. You should either know what encoding a blob of text is in because you have a specification about it, or you need to parse appropriate metadata like headers or meta tags where the encoding is specified.
- Using mb_check_encoding to check if a blob of text is valid in the encoding you expect it to be in is typically sufficient. If it's not, discard it and throw an appropriate error.

Regarding:
does this mean I have to use all multi byte functions instead of its core functions
If you are manipulating strings that contain multibyte characters, then yes, you need to use the mb_ functions to avoid getting wrong results. The core string functions only work on a byte level, not a character level, and the character level is what you typically want when working with strings.
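The points above can be sketched as follows. This is a minimal illustration, assuming PHP 7 or later with the mbstring extension loaded; the helper name require_utf8 is hypothetical. It validates input with mb_check_encoding and then contrasts byte-level core functions with their character-level mb_ counterparts.

```php
<?php
mb_internal_encoding('UTF-8'); // default encoding for the mb_ calls below

/**
 * Hypothetical helper: reject input that is not valid UTF-8
 * instead of trying to guess its encoding.
 */
function require_utf8(string $input): string
{
    if (!mb_check_encoding($input, 'UTF-8')) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $input;
}

// "Grüße": the escapes U+00FC and U+00DF each encode to two bytes in UTF-8.
$text = require_utf8("Gr\u{00FC}\u{00DF}e");

$byteLength = strlen($text);          // 7: counts bytes
$charLength = mb_strlen($text);       // 5: counts characters

$byteCut = substr($text, 0, 3);       // cuts after 3 bytes, splitting "ü" in half
$charCut = mb_substr($text, 0, 3);    // cuts after 3 characters: "Grü"
```

Here $byteCut is no longer valid UTF-8, which is the kind of wrong result the byte-level functions produce on multibyte text.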