How can I be sure of the file encoding?
Problem description
I have a PHP file that I created with Vim, but I'm not sure what its encoding is.
When I check the encoding in the terminal with the command file -bi foo (my operating system is Ubuntu 11.04), it gives me the following result:
text/html; charset=us-ascii
But when I open the file with gedit, it says the encoding is UTF-8.
Which one is correct? I want the file to be encoded in UTF-8.
My guess is that there's no BOM in the file, and that file -bi reads the file, doesn't find any non-ASCII characters, and so assumes it's ASCII, when in reality it's encoded in UTF-8.
Recommended answer
Well, first of all, note that ASCII is a subset of UTF-8. If your file contains only ASCII characters, it's correct to say that it's encoded in ASCII, and it's equally correct to say that it's encoded in UTF-8.
That being said, file typically examines only a short segment at the beginning of the file to determine its type, so it might declare the file us-ascii even if it contains non-ASCII characters, as long as they lie beyond that initial segment. On the other hand, gedit might say that the file is UTF-8 even if it's pure ASCII, because UTF-8 is gedit's preferred character encoding and it intends to save the file as UTF-8 if you add any non-ASCII characters during your editing session. Again, if that's what gedit is saying, it wouldn't be wrong.
Now, on to your problem:
Run this command:
tr -d '\000-\177' < your-file | wc -c
If the output is "0", then the file contains only ASCII characters. It's in ASCII (and it's also valid UTF-8). End of story.
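As a rough illustration of this first check, the sketch below wraps the same tr and wc pipeline in a small POSIX shell function. The function name count_non_ascii_bytes is hypothetical and not part of the original answer. LC_ALL=C is set on the assumption that tr should treat its input as raw bytes whatever the current locale is.

```shell
# Hypothetical helper (name chosen for illustration only):
# prints the number of bytes in the given file that fall outside
# the 7-bit ASCII range (octal 000 to 177).
count_non_ascii_bytes() {
    # tr deletes every ASCII byte; wc -c counts what is left.
    # The trailing tr strips the padding some wc implementations add.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

Calling count_non_ascii_bytes your-file prints 0 for a pure ASCII file. A file containing the single character é encoded in UTF-8 would report 2, because that character takes two bytes.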
Otherwise, run this command:
iconv -f utf-8 -t ucs-4 < your-file >/dev/null
If you get an error, the file does not contain valid UTF-8 (or at least, some part of it is corrupted).
If you get no error, the file is extremely likely to be UTF-8. That's because UTF-8 has properties that make it very hard to mistake typical text in any other commonly used character encoding for valid UTF-8.
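The two checks can be combined into one decision procedure. The following is a minimal sketch under stated assumptions, not a definitive tool: the function name classify_encoding and its three output labels are hypothetical, and it assumes an iconv that exits with a non-zero status on invalid input, as the common implementations do.

```shell
# Hypothetical helper (name and output labels are illustrative only).
# Prints "ascii", "utf-8", or "not-utf-8" for the file given as $1.
classify_encoding() {
    file_path=$1

    # Step 1: count the bytes outside the ASCII range.
    non_ascii=$(LC_ALL=C tr -d '\000-\177' < "$file_path" | wc -c | tr -d ' ')

    if [ "$non_ascii" -eq 0 ]; then
        # Pure ASCII, which is also valid UTF-8.
        echo "ascii"
    elif iconv -f utf-8 -t ucs-4 < "$file_path" > /dev/null 2>&1; then
        # Step 2: the whole file decodes cleanly as UTF-8.
        echo "utf-8"
    else
        # iconv rejected at least one byte sequence.
        echo "not-utf-8"
    fi
}
```

Running classify_encoding your-file reads the whole file rather than an initial segment, so non-ASCII characters near the end are taken into account. A result of "utf-8" means "very likely UTF-8", for the reason given above, rather than a proof.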