Newline characters in non-ASCII encoded files


Question

I'm using Python 2.6 to read a latin2-encoded file with Windows line endings ('\r\n').

import codecs

file = codecs.open('stackoverflow_secrets.txt', encoding='latin2', mode='rt')
line = file.readline()
print(repr(line))

Output: u'login: yabcok\n'

file = codecs.open('stackoverflow_secrets.txt', encoding='latin2', mode='r')
line = file.readline()
print(repr(line))

file = codecs.open('stackoverflow_secrets.txt', encoding='latin2', mode='rb')
line = file.readline()
print(repr(line))

Output: u'password:l1x1%Dm\r\n'

My questions:

  1. Why is text mode not the default? The documentation states otherwise. Is the codecs module commonly used with binary files?
  2. Why aren't newline characters stripped from readline() output? This is annoying and redundant.
  3. Is there a way to specify the newline character for files that are not ASCII encoded?


Recommended answer

Are you sure that your examples are correct? The documentation of the codecs module says:


Note: Files are always opened in binary mode, even if no binary mode was specified. This is done to avoid data loss due to encodings using 8-bit values. This means that no automatic conversion of '\n' is done on reading and writing.

On my system, with a Latin-2 encoded file and DOS line endings, there is no difference between "rt", "r" and "rb" (disclaimer: I'm using Python 2.5 on Linux).
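The behavior described in the note can be checked with a small sketch. It is not the asker's setup: it assumes Python 3, writes a hypothetical CRLF sample to a temporary file, and compares only the "r" and "rb" modes of codecs.open. The names SAMPLE, first_line and compare_modes are made up for illustration.

```python
import codecs
import os
import tempfile
import warnings

# Hypothetical sample content: Latin-2 text with Windows (CRLF) line endings.
SAMPLE = u"login: yabcok\r\npassword: secret\r\n"


def first_line(path, mode):
    """Return the first line read through codecs.open() with the given mode."""
    with warnings.catch_warnings():
        # Assumption: newer Python 3 releases may flag codecs.open() as deprecated.
        warnings.simplefilter("ignore", DeprecationWarning)
        with codecs.open(path, encoding="latin2", mode=mode) as handle:
            return handle.readline()


def compare_modes():
    """Read the same file with 'r' and 'rb' and collect the first lines."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        # Write raw bytes so the CRLF endings reach the disk unchanged.
        with os.fdopen(fd, "wb") as raw:
            raw.write(SAMPLE.encode("latin2"))
        return dict((mode, first_line(path, mode)) for mode in ("r", "rb"))
    finally:
        os.remove(path)


results = compare_modes()
print(results)
```

If both modes give back the same line with '\r\n' still attached, that matches the documentation note: the mode string does not switch on any newline translation.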

The documentation for open also doesn't mention a "t" flag, so that behavior seems a little strange.

Newline characters are not stripped because not every line returned by readline ends in a newline: if the file does not end with a newline, the last line does not carry one. (I obviously can't come up with a better explanation.)
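A minimal sketch of that point, assuming Python 3 and an in-memory stream in place of a real file (the content is hypothetical). It also shows one common way to drop the line ending yourself: rstrip with an explicit character set, which is a no-op when the last line has no newline.

```python
import io

# Hypothetical content: a blank line in the middle, no newline at the end.
stream = io.StringIO(u"first\n\nlast")

raw_lines = []
while True:
    line = stream.readline()
    if line == u"":  # an empty string signals end of file, not a blank line
        break
    raw_lines.append(line)

# rstrip removes trailing '\r' and '\n' characters only where they are present.
stripped_lines = [line.rstrip(u"\r\n") for line in raw_lines]

print(raw_lines)
print(stripped_lines)
```

Because the newline is kept, a blank line ('\n') stays distinguishable from the empty string that readline returns at end of file.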

Newline characters do not differ based on the encoding (at least not among the encodings that use ASCII for 0-127), only based on the platform. You can specify "U" in the mode when opening the file, and Python will detect any form of newline: Windows, Mac or Unix.
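The "U" flag belongs to the Python 2 era built-in open and was removed from open() in Python 3.11, so the sketch below swaps in a different technique: io.open with newline=None, which enables universal newline handling on reading. It assumes Python 3, and the sample content and the helper name read_lines_universal are hypothetical. The first lines also check that a few ASCII-compatible encodings produce the same bytes for '\r\n'.

```python
import io
import os
import tempfile

# CRLF encodes to the same two bytes in these ASCII-compatible encodings.
ENCODINGS = ("ascii", "latin1", "latin2", "cp1250", "utf-8")
same_bytes = all(u"\r\n".encode(enc) == b"\r\n" for enc in ENCODINGS)

# Hypothetical sample mixing Windows, old Mac and Unix line endings.
SAMPLE = u"windows\r\nmac\runix\n"


def read_lines_universal(path, encoding="latin2"):
    """Read all lines, translating '\\r\\n' and '\\r' to '\\n' while decoding."""
    with io.open(path, mode="r", encoding=encoding, newline=None) as handle:
        return handle.readlines()


fd, path = tempfile.mkstemp(suffix=".txt")
try:
    # Write raw bytes so the mixed line endings are stored exactly as given.
    with os.fdopen(fd, "wb") as raw:
        raw.write(SAMPLE.encode("latin2"))
    lines = read_lines_universal(path)
finally:
    os.remove(path)

print(same_bytes)
print(lines)
```

Passing newline="" instead would keep the original line endings untouched, which is closer to what codecs.open does.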
