PHP source code in UTF-8 files; how to interpret properly?


Question

I build tools to analyze source code. Such tools have to read the source code files correctly, especially as regards character encodings. For example, "What is the precise string of bytes in a string literal?" (both PHP literals, and HTML text).

My perhaps erroneous understanding is that PHP source files contain 8-bit characters only (that is, the PHP engine reads them that way, right?, since they are only supposed to contain 8-bit characters). But 8-bit characters in which encoding? I presume the intent is to match ISO-8859-1 (or ISO-8859-x?); can somebody quote chapter and verse? That is, an umlaut is intended to be an umlaut, right? If so, one can write PHP scripts with HTML and strings for most European nations and character sets straightforwardly.

But it is clear this is problematic with Unicode. As far as I can tell, most PHP applications deal with Unicode essentially by having strings containing UTF-8 byte sequences which can be inserted in 8-bit PHP strings. Following this, one can generate scripts whose HTML contains Unicode UTF-8 sequences, if you tell your server you are generating UTF-8 text.

For the above situations, one can read the PHP file as 8-bit character text, and this seems to me to match the language.

What puzzles me are PHP source files encoded as UTF-8 (the Joomla package has ~1800 source files, of which some 10 are UTF-8 and the rest are not). Any non-ASCII European characters that show correctly in a UTF-8 rendering are actually encoded as multibyte sequences. I suppose such pages, served as UTF-8, will have their HTML rendered correctly. But string comparisons involving European or other Unicode characters that appear to render correctly in a text editor simply won't work, and string literals will not contain what they appear to contain. Do programmers use UTF-8 files because that's what editors offer? Are they doing this on purpose? Or is it just an accident that doesn't matter for most work?

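So, how should one read PHP source files? (In particular, with what character encoding?) One possible answer is to always use ISO-8859-1 8-bit codes, regardless of the actual content or BOM (I see lots of PHP files with UTF-8 BOM marks). Another answer is UTF-8, if the file is marked as such.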

[Our tools read and write arbitrary encodings. A "trivial" tool reads a file in one character encoding and writes the identical code points in another encoding. Reading UTF-8 PHP files that way gets us into trouble when writing ISO 8859-1 equivalent files, because many Unicode code points (e.g., the euro symbol) cannot be encoded in ISO 8859-1.]

EDIT Aug 30: We now check PHP files to see if they have a UTF-8 BOM, or appear to contain only legal UTF-8 sequences. In either case we read the file as UTF-8; otherwise we read it as ISO8859-1 by default. We now also preserve the file's encoding if we modify it. (Getting all this right is quite a lot of work.) This seems to be a safe strategy, but it may differ from what PHP programmers expect.
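The detection strategy described in this edit could be sketched roughly as follows. This is a minimal Python illustration, not the asker's actual tool; the function name `detect_php_source_encoding` is hypothetical, and the choice to report pure-ASCII files as UTF-8 is an assumption (harmless, since ASCII decodes identically either way).

```python
import codecs


def detect_php_source_encoding(raw: bytes) -> str:
    """Guess how to decode a PHP source file given its raw bytes.

    Hypothetical helper: BOM first, then strict UTF-8 validation,
    then ISO-8859-1 as the catch-all default.
    """
    # A UTF-8 BOM is taken as an explicit marker.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # No BOM: accept UTF-8 only if every byte sequence is legal.
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so it never fails.
        return "iso-8859-1"
    return "utf-8"


if __name__ == "__main__":
    print(detect_php_source_encoding(b"<?php echo 'f\xc3\xbcr';"))
    print(detect_php_source_encoding(b"<?php echo 'f\xfcr';"))
```

Returning the codec name rather than decoded text lets the tool remember the encoding and write the file back the same way after modification.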

Recommended answer

TL;DR

ASCII

Until PHP 5.4, the PHP interpreter didn't care at all about the charset of PHP files, as evidenced by the fact that the zend.script_encoding ini directive only appeared in that version. Basically, it always treated the source as ASCII.

When PHP needs to identify, for example, a function name that happens to contain characters beyond 7-bit ASCII (well, any labeled entity with any label really, but you get my point...), it merely looks for a function in the symbol table with the same byte sequence: an umlaut (or whatever...) written in one way is treated differently from an umlaut written in another way. Try it. For backwards compatibility, if zend.script_encoding is not set, this is still the default behavior. Also take note of the regex showing what a valid identifier is: it is charset neutral (well... except for the Latin letters, which are in the 7-bit ASCII range) and is expressed in bytes instead.
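To make the "same byte sequence" point concrete from an analysis tool's side, here is a small Python sketch. It models the identifier rule with a byte-level pattern in the style of the PHP manual's label regex; the exact range used here (\x80-\xff) is an assumption, and the names `PHP_LABEL` and `is_php_label` are hypothetical.

```python
import re
import unicodedata

# Byte-level pattern modelled on the PHP manual's label regex.
# Assumption: the \x80-\xff form of the high-byte range.
PHP_LABEL = re.compile(rb"[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*")


def is_php_label(raw: bytes) -> bool:
    """True if the whole byte string matches the label pattern."""
    return PHP_LABEL.fullmatch(raw) is not None


precomposed = "f\u00fcr"                                # u-umlaut as U+00FC
decomposed = unicodedata.normalize("NFD", precomposed)  # "u" + U+0308

# Three spellings that all look like the same name in an editor.
variants = {
    "utf8_precomposed": precomposed.encode("utf-8"),
    "utf8_decomposed": decomposed.encode("utf-8"),
    "latin1": precomposed.encode("iso-8859-1"),
}

if __name__ == "__main__":
    for name, raw in variants.items():
        print(name, raw, is_php_label(raw))
```

All three byte strings satisfy the pattern, yet no two of them are equal, so a byte-wise symbol lookup would treat them as three different names.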

This also leads us to the declare(encoding) construct. If you see THAT in a file, that's the definitive charset to honor for that particular file (ONLY). Use something else until you encounter one, and if you see more than one, honor the second one from its declare statement onward.
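A static tool could look for such declarations with a plain byte scan. The sketch below is a deliberately naive Python illustration: it does not tokenize PHP, so a match inside a comment or string literal would be a false positive, and the function name `find_declared_encodings` is hypothetical.

```python
import re
from typing import List, Tuple

# Naive byte pattern for declare(encoding='...') or declare(encoding="...").
# PHP keywords are case-insensitive, hence re.IGNORECASE.
DECLARE_ENCODING = re.compile(
    rb"declare\s*\(\s*encoding\s*=\s*(['\"])([A-Za-z0-9_.:\-]+)\1\s*\)",
    re.IGNORECASE,
)


def find_declared_encodings(raw: bytes) -> List[Tuple[int, str]]:
    """Return (end_offset, encoding_name) for each declaration found.

    end_offset is the byte offset just past the closing parenthesis,
    i.e. the point from which the declared encoding would apply.
    """
    return [
        (m.end(), m.group(2).decode("ascii"))
        for m in DECLARE_ENCODING.finditer(raw)
    ]


if __name__ == "__main__":
    src = b"<?php\ndeclare(encoding='ISO-8859-1');\necho 'x';\n"
    print(find_declared_encodings(src))
```

Keeping the offset with each name lets the caller switch decoders at the right position if a file contains more than one declaration.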

If there isn't one...

In a static context (i.e. when you don't know the effective ini settings), you'd need to fall back to something else (something user defined, ideally) when the charset is important, or otherwise just treat characters beyond 7-bit ASCII as pure binary and display them in some uniform code-point-like fashion.
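One way to read "user-defined fallback, otherwise uniform display of raw bytes" is sketched below in Python. The function name `render_literal` and its parameter are hypothetical; the escape format is simply what Python's built-in `backslashreplace` error handler produces.

```python
from typing import Optional


def render_literal(raw: bytes, fallback_encoding: Optional[str] = None) -> str:
    """Render the bytes of a string literal for display.

    If the user supplied an encoding and the bytes are valid in it,
    decode with that. Otherwise keep ASCII as-is and show every other
    byte as a uniform \\xNN escape.
    """
    if fallback_encoding is not None:
        try:
            return raw.decode(fallback_encoding)
        except (UnicodeDecodeError, LookupError):
            # Invalid bytes or unknown codec name: fall through.
            pass
    return raw.decode("ascii", errors="backslashreplace")


if __name__ == "__main__":
    print(render_literal(b"f\xc3\xbcr"))
    print(render_literal(b"f\xc3\xbcr", "utf-8"))
```

The escaped form loses nothing: the original bytes can be recovered exactly, which matters if the tool later writes the file back.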

In a dynamic context (e.g. if you could rename the file for a moment; create a temporary file in its place, with the same name; have it echo the value of zend.script_encoding; then restore the original file), you should use the zend.script_encoding value if available, and fall back to something else (just as in a static context) otherwise.

The same treatment applies to strings, HTML fragments and any other contents of a PHP file: it's just read as a binary string, except for certain ASCII characters (i.e. bytes) that are important to the PHP lexer, such as the sequence "<?php" (notice that these are all ASCII characters...), an apostrophe within a single-quoted string, etc. The interpreter itself doesn't care about a string's charset, and if you must display a string's contents on screen, you should use the above means to figure out the best way to do so.

Edge cases (requested in comments):

Is there a restriction on which encodings are allowed?

There doesn't seem to be any list of allowed encodings anywhere, or at least I can't find one. Given that this is the successor of the --enable-zend-multibyte compile setting, UTF encodings of all flavors are sure to be in that list. Even if other (ANSI) encodings don't have an effect on PHP itself, that shouldn't deter you from using that value as a hint.

How does "declare(encoding)" work if the source file is UTF-16 (null 8-bit bytes between the 8-bit ASCII chars of the declaration)?

zend.script_encoding is used until a declare(encoding) is encountered. If it's not set, ASCII is assumed. This shouldn't be a problem even in a UTF-16 file... right? (I don't use UTF-16)
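For a tool that scans raw bytes, the concern behind the question can be seen directly. The Python snippet below only illustrates what the bytes of a declaration look like in UTF-16; it says nothing about how the PHP engine itself handles such a file, and the declaration text is a made-up example.

```python
# The text of a declaration as it might appear in a source file.
statement = "declare(encoding='UTF-16LE');"

as_utf8 = statement.encode("utf-8")
as_utf16 = statement.encode("utf-16-le")

# Searching for the ASCII byte sequence only works when the file's
# encoding is ASCII-compatible.
found_in_utf8 = b"declare" in as_utf8
found_in_utf16 = b"declare" in as_utf16

if __name__ == "__main__":
    print(as_utf16[:8])
    print(found_in_utf8, found_in_utf16)
```

In the UTF-16 bytes every ASCII letter is followed by a null byte, so a byte-level search for "declare" finds nothing; a scanner would have to decode the file first, which requires already knowing the encoding from a BOM or a configured default.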

If the .ini or the file setting is UTF-8 or otherwise, then identifiers are presumably taken only from code points in range x41-xFF, but not from code points x100 up?

I haven't tried supplying invalid UTF-8 bytes to tell you the answer to that one, nor does the manual ever state anything on the question. I would assume that PHP execution will fail with a parse error on that. Or at least it should. As far as your tool is concerned, it should report the invalid UTF-8 sequence anyway, since even if PHP allows it, that's still a QA problem.

For UTF encodings, are characters in strings represented as their UTF code points (that makes no sense, since PHP strings seem to have only 8-bit characters)?

No. Characters in strings and non-PHP content are still treated as just a sequence of bytes, which you can confirm by looking at the output of strlen(), and seeing how it differs from mb_strlen(), which is the one that respects encoding (well... it respects the mbstring.internal_encoding setting to be exact, but still).
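The byte-count versus character-count distinction can be reproduced outside PHP. The Python snippet below is only an analogue: `len()` on a bytes object counts bytes, which is the kind of number strlen() reports, while `len()` on the decoded string counts code points, which is closer to what mb_strlen() reports when its encoding matches the data.

```python
text = "f\u00fcr"  # three code points, one of them non-ASCII

utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

# Byte counts depend on how the source file was saved.
byte_len_utf8 = len(utf8_bytes)
byte_len_latin1 = len(latin1_bytes)

# The code point count is the same once the bytes are decoded correctly.
char_len = len(utf8_bytes.decode("utf-8"))

if __name__ == "__main__":
    print(byte_len_utf8, byte_len_latin1, char_len)
```

The same visible literal is 4 bytes when the file is UTF-8 and 3 bytes when it is ISO-8859-1, which is why an analyzer has to record the file's encoding before reporting anything about literal contents.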

If not, what does it mean to set the encoding to UTF something?

AFAIK, it affects lookups in the symbol table. With UTF set, umlauts written in different ways, or in different UTF flavors, that end up with the same UTF code points would all converge on the same symbol, as opposed to the case without declare(encoding), where a byte-by-byte comparison is done instead. And I say "AFAIK" here because, frankly, I've never run such experiments myself... I'm a "do-goody 'everything-as-valid-UTF-8'-er".
