Python restrict newline characters for readlines()

Problem description

I am trying to split text that uses a mix of the newline characters LF, CRLF and NEL. I need the best way to keep the NEL character from being treated as a line break.

Is there an option to instruct readlines() to exclude NEL while splitting lines? I may be able to read() and match only LF and CRLF split points in a loop.

Is there any better solution?

I open the file with codecs.open() to read a UTF-8 text file.

And while using readlines(), it does split at NEL characters:

The file contents are:

"u'Line 1 \\x85 Line 1.1\\r\\nLine 2\\r\\nLine 3\\r\\n'"

Solution

file.readlines() will only ever split on \n, \r or \r\n depending on the OS and if universal newline support is enabled.

U+0085 NEXT LINE (NEL) is not recognised as a newline splitter in that context, and you don't need to do anything special to have file.readlines() ignore it.

Quoting the open() function documentation:

Python is usually built with universal newlines support; supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'. All of these external representations are seen as '\n' by the Python program. If Python is built without universal newlines support a mode with 'U' is the same as normal text mode. Note that file objects so opened also have an attribute called newlines which has a value of None (if no newlines have yet been seen), '\n', '\r', '\r\n', or a tuple containing all the newline types seen.

and the universal newlines glossary entry:

A manner of interpreting text streams in which all of the following are recognized as ending a line: the Unix end-of-line convention '\n', the Windows convention '\r\n', and the old Macintosh convention '\r'. See PEP 278 and PEP 3116, as well as str.splitlines() for an additional use.
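
The universal newlines behaviour described in these two quotes can be sketched as follows. This is a minimal illustration, not part of the original answer: the sample bytes are made up, and an in-memory buffer wrapped in io.TextIOWrapper (the type io.open() returns in text mode) stands in for a real file.

```python
import io

# Made-up sample mixing old Macintosh (\r), Windows (\r\n) and Unix (\n) endings.
raw = b"mac\rwindows\r\nunix\n"

# newline=None (the default) turns on universal newlines: all three
# conventions end a line and are translated to '\n' on reading.
stream = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=None)
lines = stream.readlines()
print(lines)  # ['mac\n', 'windows\n', 'unix\n']

# The newlines attribute records which conventions were actually seen.
seen = stream.newlines
print(seen)
```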

Unfortunately, codecs.open() breaks with this rule; the documentation vaguely alludes to the specific codec being asked:

Line-endings are implemented using the codec’s decoder method and are included in the list entries if keepends is true.

Instead of codecs.open(), use io.open() to open the file in the correct encoding, then process the lines one by one:

import io

with io.open(filename, encoding=correct_encoding) as f:
    lines = f.readlines()  # splits only on \n, \r and \r\n

io is the new I/O infrastructure that replaces the Python 2 system entirely in Python 3. It handles just \n, \r and \r\n:

>>> open('/tmp/test.txt', 'wb').write(u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'.encode('utf8'))
>>> import codecs
>>> codecs.open('/tmp/test.txt', encoding='utf8').readlines()
[u'Line 1 \x85', u' Line 1.1\r\n', u'Line 2\r\n', u'Line 3\r\n']
>>> import io
>>> io.open('/tmp/test.txt', encoding='utf8').readlines()
[u'Line 1 \x85 Line 1.1\n', u'Line 2\n', u'Line 3\n']
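
The session above is Python 2. As a hedged Python 3 sketch of the same idea, the example below reads the sample text from an in-memory buffer instead of /tmp/test.txt (an assumption made only to keep it self-contained). The helper name read_lines is hypothetical. It compares the default newline handling with newline='', which still splits only on \n, \r and \r\n but leaves the original line endings untranslated.

```python
import io

# Same sample text as in the session above, encoded as UTF-8 bytes.
data = u"Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n".encode("utf-8")

def read_lines(raw, newline=None):
    """Hypothetical helper: decode raw UTF-8 bytes and return the lines
    as found by the io module's text layer."""
    with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline=newline) as f:
        return f.readlines()

translated = read_lines(data)        # '\r\n' is translated to '\n'
untranslated = read_lines(data, "")  # endings are kept as written

print(translated)
print(untranslated)
```

In both cases the NEL character stays inside the first line, so the choice of newline only affects whether the \r\n endings are preserved.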

The codecs.open() result comes from the codecs code using str.splitlines(), which has a documentation bug: when splitting a unicode string, it splits on anything that the Unicode standard deems to be a line break (which is quite a complex issue). The documentation for this method falls short of explaining this; it claims to split only according to the universal newlines rules.
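
The difference can be reproduced directly on a string. The sketch below also shows one possible way to follow the question's idea of calling read() and splitting by hand. The pattern LINE_RE and the helper split_universal are hypothetical names introduced for this illustration, and the helper treats a lone \r as a line ending as well, to match the universal newlines rules.

```python
import re

text = u"Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n"

# str.splitlines() treats U+0085 (NEL) as a line boundary.
by_splitlines = text.splitlines(True)

# Hypothetical pattern: a run of non-CR/LF characters followed by
# '\r\n', '\r' or '\n', or a final line with no terminator at all.
LINE_RE = re.compile(r"[^\r\n]*(?:\r\n|\r|\n)|[^\r\n]+")

def split_universal(s):
    """Split s after '\r\n', '\r' or '\n' only, keeping the line endings."""
    return LINE_RE.findall(s)

by_regex = split_universal(text)

print(by_splitlines)
print(by_regex)
```

Here by_splitlines breaks the first line at NEL, while by_regex keeps it whole. For files, opening with io.open() as described above is usually simpler than splitting by hand.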
