Python restrict newline characters for readlines()
I am trying to split a text which uses a mix of newline characters: LF, CRLF and NEL. I need the best way to keep the NEL character from being treated as a line break.
Is there an option to instruct readlines() to exclude NEL while splitting lines? I may be able to read() the whole file and match only the LF and CRLF split points in a loop.
Is there any better solution?
I open the UTF-8 text file with codecs.open(), and when I use readlines() it does split at NEL characters.
The file contents are:
"u'Line 1 \\x85 Line 1.1\\r\\nLine 2\\r\\nLine 3\\r\\n'"
file.readlines() will only ever split on \n, \r or \r\n, depending on the OS and on whether universal newline support is enabled.
U+0085 NEXT LINE (NEL) is not recognised as a line separator in that context, and you don't need to do anything special to have file.readlines() ignore it.
Quoting the open() function documentation:
Python is usually built with universal newlines support; supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'. All of these external representations are seen as '\n' by the Python program. If Python is built without universal newlines support a mode with 'U' is the same as normal text mode. Note that file objects so opened also have an attribute called newlines which has a value of None (if no newlines have yet been seen), '\n', '\r', '\r\n', or a tuple containing all the newline types seen.
and the universal newlines glossary entry:
A manner of interpreting text streams in which all of the following are recognized as ending a line: the Unix end-of-line convention '\n', the Windows convention '\r\n', and the old Macintosh convention '\r'. See PEP 278 and PEP 3116, as well as str.splitlines() for an additional use.
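The newlines attribute described in the quoted documentation can be observed with a small sketch like the one below. It assumes io.open() (the same function as the built-in open() on Python 3) rather than the Python 2 'U' mode the quote refers to; the helper name seen_newlines and the sample bytes are made up for illustration.

```python
import io
import os
import tempfile


def seen_newlines(raw_bytes, encoding="utf-8"):
    """Hypothetical helper: write raw_bytes to a temporary file, read it back
    in universal-newlines mode, and return (lines, value of f.newlines)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(raw_bytes)
        # newline=None (the default) turns on universal newlines translation
        with io.open(path, encoding=encoding, newline=None) as f:
            lines = f.readlines()
            newlines = f.newlines
    finally:
        os.remove(path)
    return lines, newlines


# A file that mixes LF and CRLF terminators
lines, newlines = seen_newlines(b"a\nb\r\nc\n")
print(lines)     # every terminator is presented to the program as '\n'
print(newlines)  # the terminator types actually seen in the file
```

Because two different terminators appear in the sample, newlines should come back as a tuple rather than a single string.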
Unfortunately, codecs.open() breaks with this rule; the documentation only vaguely alludes to line splitting being left to the specific codec:
Line-endings are implemented using the codec’s decoder method and are included in the list entries if keepends is true.
Instead of codecs.open(), use io.open() to open the file in the correct encoding, then process the lines one by one:
import io

with io.open(filename, encoding=correct_encoding) as f:
    lines = f.readlines()
io is the new I/O infrastructure that replaces the Python 2 system entirely in Python 3. It handles just \n, \r and \r\n:
>>> open('/tmp/test.txt', 'wb').write(u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'.encode('utf8'))
>>> import codecs
>>> codecs.open('/tmp/test.txt', encoding='utf8').readlines()
[u'Line 1 \x85', u' Line 1.1\r\n', u'Line 2\r\n', u'Line 3\r\n']
>>> import io
>>> io.open('/tmp/test.txt', encoding='utf8').readlines()
[u'Line 1 \x85 Line 1.1\n', u'Line 2\n', u'Line 3\n']
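The session above is from Python 2. A rough equivalent that should also run on Python 3 is sketched below; it uses a temporary file instead of /tmp/test.txt, and the variable names are illustrative only. It additionally passes newline='' to io.open(), which, as far as the io documentation describes it, still splits only on \n, \r and \r\n but leaves the terminators untranslated. This may be useful if the original CRLF endings need to be preserved.

```python
import codecs
import io
import os
import tempfile

text = u"Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n"

fd, path = tempfile.mkstemp()
try:
    # Write the sample as raw UTF-8 bytes so no newline translation happens
    with os.fdopen(fd, "wb") as out:
        out.write(text.encode("utf8"))

    # codecs.open(): splits on NEL as well as on CRLF
    with codecs.open(path, encoding="utf8") as f:
        codecs_lines = f.readlines()

    # io.open() with the default newline=None: universal newlines,
    # terminators are translated to '\n', NEL is left alone
    with io.open(path, encoding="utf8") as f:
        translated_lines = f.readlines()

    # io.open() with newline='': same split points, terminators kept as-is
    with io.open(path, encoding="utf8", newline="") as f:
        untranslated_lines = f.readlines()
finally:
    os.remove(path)

print(codecs_lines)
print(translated_lines)
print(untranslated_lines)
```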
The codecs.open() result is due to the code using str.splitlines(), which has a documentation bug: when splitting a unicode string, it will split on anything that the Unicode standard deems to be a line break (which is quite a complex issue). The documentation for this method falls short of explaining this; it claims to split only according to the universal newlines rules.
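If the text is already in memory as a unicode string, the difference can be seen directly, and the split points can be restricted by hand. The following is a minimal sketch, not part of the original answer; split_lf_crlf is a hypothetical helper that splits only on LF and CRLF, along the lines of what the question proposes.

```python
import re

text = u"Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n"

# splitlines() on a unicode string treats NEL (U+0085) as a line boundary
unicode_split = text.splitlines()

# Matches CRLF or a bare LF; a lone CR and NEL are deliberately not included
_LF_OR_CRLF = re.compile(u"\r?\n")


def split_lf_crlf(s):
    """Hypothetical helper: split only on LF and CRLF, dropping terminators."""
    parts = _LF_OR_CRLF.split(s)
    # A trailing terminator leaves one empty piece at the end; drop it
    if parts and parts[-1] == u"":
        parts.pop()
    return parts


restricted_split = split_lf_crlf(text)
print(unicode_split)
print(restricted_split)
```

Reading the file with io.open() as shown earlier is likely the simpler route when the data comes from disk; a helper like this is mainly relevant when the string arrives from somewhere else.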