Python open("x", "r") function, how do I know or control which encoding the file is supposed to have?


Question

If a Python script uses the open("filename", "r") function to open, and subsequently read, the contents of a text file, how can I tell which encoding this file is supposed to have?

Note that since I'm executing this script from my own program, a way to control this through environment variables would be good enough for me.

This is Python 2.7, by the way.

The code in question comes from Mercurial. It can be given a list of files to, say, add to the repository through a file on disk, instead of having them passed on the command line.

So basically, instead of this:

hg add A B C

I can write out A, B and C to a file, with newlines between each, and then execute the following:

hg add listfile:input.txt

The code that ends up reading this file is this:

files = open(name, 'r').read().split(delimiter)

Hence my question. When I asked on IRC which encoding I should use, the answer I was given was this:


it is the same encoding than the one you use on command line when passing a file argument

I take this to mean that it is the same encoding I "use" when I execute Mercurial (hg). Since I have no idea which encoding that is (I just hand everything to the .NET Process object), I am asking here.

Recommended answer

You can't. Reading a file is independent of its encoding; you need to know the encoding in advance in order to properly interpret the bytes you read in.

For example, if you know the file is encoded in UTF-8:

with open('filename', 'rb') as f:
    contents = f.read().decode('utf-8-sig')    # -sig deals with BOM, if present
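
To see what the -sig variant changes, here is a small sketch that decodes the same BOM-prefixed bytes with both codecs. The sample bytes are made up for illustration; they are not taken from any real Mercurial input.

```python
import codecs

# Hypothetical file contents: a UTF-8 BOM followed by two file names.
raw = codecs.BOM_UTF8 + u"A\nB\n".encode("utf-8")

plain = raw.decode("utf-8")         # the BOM survives as U+FEFF at the start
stripped = raw.decode("utf-8-sig")  # the BOM is dropped if it is present

print(repr(plain))
print(repr(stripped))
```

With plain "utf-8" the first name would start with an invisible U+FEFF character, which is usually not what you want in a list of file names.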

Or if you know the file is ASCII only:

with open('filename', 'r') as f:
    contents = f.read()    # results in a str object

If you really don't know the encoding of the file, then there's obviously no guarantee that you can read it properly; however, you can guess at the encoding using a tool like chardet.
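
chardet guesses statistically. As a rough, dependency-free stand-in (this is not how chardet works), one can try a short list of candidate encodings in order and keep the first that decodes without error. The function name guess_decode and the candidate list are assumptions chosen for illustration.

```python
def guess_decode(data, candidates=("ascii", "utf-8-sig", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    A clean decode only shows the bytes are *valid* in that encoding,
    not that the guess is right, so treat the result as a best effort.
    """
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort never fails.
    return data.decode("latin-1"), "latin-1"
```

The order matters: strict encodings such as ASCII and UTF-8 go first, because permissive single-byte encodings accept almost any input and would mask them.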

Update

I think I understand your question now. I thought you had a file you needed to write code for, but it seems you have code you need to write a file for ;-)

The code in question probably only deals properly with plain ASCII (it's possible the strings are converted later, but I think that is unlikely). So you'll want to make a text file that contains only ASCII characters (code points below 128), and make sure it is saved in an ASCII encoding (i.e. not UTF-16 or anything like that). This is a little unfortunate, considering that Mercurial deals with filenames, which can contain Unicode characters.
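
If the list file is produced from Python, a minimal sketch of that advice could look like the following. The helper name write_listfile is hypothetical, the names are assumed to be Unicode strings, and "\n" as the separator is an assumption, since the delimiter Mercurial actually splits on is not shown above.

```python
import io


def write_listfile(path, names):
    """Write one name per line as ASCII, refusing anything outside ASCII."""
    for name in names:
        # encode() raises UnicodeEncodeError for any code point >= 128,
        # so a bad name is rejected before the file is touched.
        name.encode("ascii")
    # newline="\n" turns off platform newline translation (an assumption,
    # see the note above); names are joined with newlines *between* them.
    with io.open(path, "w", encoding="ascii", newline="\n") as f:
        f.write(u"\n".join(names))
```

Calling write_listfile("input.txt", [u"A", u"B", u"C"]) would produce a file suitable for hg add listfile:input.txt, while a name containing a non-ASCII character fails loudly instead of being written in some unexpected encoding.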
