Detecting line endings

Views: 67 Published: 2019/6/5 4:33:49 python


Problem description

Hello all,

I'm trying to detect line endings used in text files. I *might* be
decoding the files into unicode first (which may be encoded using
multi-byte encodings) - which is why I'm not letting Python handle the
line endings.

Is the following safe and sane:

text = open('test.txt', 'rb').read()
if encoding:
    text = text.decode(encoding)
ending = '\n'  # default
if '\r\n' in text:
    text = text.replace('\r\n', '\n')
    ending = '\r\n'
elif '\n' in text:
    ending = '\n'
elif '\r' in text:
    text = text.replace('\r', '\n')
    ending = '\r'

My worry is that if '\n' *doesn't* signify a line break on the Mac,
then it may exist in the body of the text - and trigger ``ending =
'\n'`` prematurely?

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
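
To make the worry concrete, here is a minimal sketch of the same membership checks wrapped in a function. The function name `first_match_ending` and the sample string are made up for illustration; the sample imitates a file whose lines end in '\r' but which contains one stray '\n' in the body.

```python
def first_match_ending(text):
    """Return the ending chosen by the in-order membership checks."""
    if '\r\n' in text:
        return '\r\n'
    elif '\n' in text:
        return '\n'
    elif '\r' in text:
        return '\r'
    return '\n'  # default, as in the snippet above


# Hypothetical sample: four '\r' line breaks and a single stray '\n'.
sample = 'line one\rline two\rtab\nbed\rline four\r'

# The single '\n' is enough to win, although '\r' is far more common.
print(repr(first_match_ending(sample)))
```

With this input the check reports '\n', which is the premature trigger described in the question.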

Solution

Fuzzyman enlightened us with:
> My worry is that if '\n' *doesn't* signify a line break on the Mac,
> then it may exist in the body of the text - and trigger ``ending =
> '\n'`` prematurely ?

I'd count the number of occurrences of '\r\n', '\n' without a preceding
'\r' and '\r' without following '\n', and let the majority decide.

Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don't we just take the
safety labels off of everything and let the problem solve itself?
Frank Zappa
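
One possible way to do the counting Sybren describes, sketched with plain `str.count` rather than regular expressions. The function names `count_endings` and `majority_ending` and the `default` parameter are assumptions made for this example, not something from the thread; ties are not handled deliberately here and simply fall to whichever entry `max` sees first.

```python
def count_endings(text):
    """Count each line-ending style in text."""
    crlf = text.count('\r\n')
    # Each '\r\n' pair contains one '\n' and one '\r', so subtract the
    # pairs to get the '\n' with no preceding '\r' and the '\r' with no
    # following '\n'.
    lone_lf = text.count('\n') - crlf
    lone_cr = text.count('\r') - crlf
    return {'\r\n': crlf, '\n': lone_lf, '\r': lone_cr}


def majority_ending(text, default='\n'):
    """Return the most frequent ending, or default if none is present."""
    counts = count_endings(text)
    ending, best = max(counts.items(), key=lambda item: item[1])
    return ending if best else default


print(count_endings('a\r\nb\rc\n'))
print(repr(majority_ending('a\r\nb\r\nc\n')))
```

The subtraction works because '\r\n' is the only way a '\r' can be directly followed by a '\n', so every '\r' or '\n' not in such a pair stands alone.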

Sybren Stuvel wrote:
> Fuzzyman enlightened us with:
>> My worry is that if '\n' *doesn't* signify a line break on the Mac,
>> then it may exist in the body of the text - and trigger ``ending =
>> '\n'`` prematurely ?
> I'd count the number of occurrences of '\r\n', '\n' without a preceding
> '\r' and '\r' without following '\n', and let the majority decide.

Sounds reasonable, edge cases for small files be damned. :-)

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml

Sybren Stuvel wrote:
> Fuzzyman enlightened us with:
>> My worry is that if '\n' *doesn't* signify a line break on the Mac,
>> then it may exist in the body of the text - and trigger ``ending =
>> '\n'`` prematurely ?
> I'd count the number of occurrences of '\r\n', '\n' without a preceding
> '\r' and '\r' without following '\n', and let the majority decide.

This is what I came up with. As you can see from the docstring, it
attempts to do sensible(-ish) things in the event of a tie, or no line
endings at all.

Comments/corrections welcomed. I know the tests aren't very useful
(because they make no *assertions* they won't tell you if it breaks),
but you can see what's going on:

import re
import os

rn = re.compile('\r\n')
r = re.compile('\r(?!\n)')    # '\r' not followed by '\n'
n = re.compile('(?<!\r)\n')   # '\n' not preceded by '\r'

# Sequence of (regex, literal, priority) for each line ending
line_ending = [(n, '\n', 3), (rn, '\r\n', 2), (r, '\r', 1)]


def find_ending(text, default=os.linesep):
    """
    Given a piece of text, use a simple heuristic to determine the line
    ending in use.

    Returns the value assigned to default if no line endings are found.
    This defaults to ``os.linesep``, the native line ending for the
    machine.

    If there is a tie between two endings, the priority chain is
    ``'\n', '\r\n', '\r'``.
    """
    # Sorting (count, priority, literal) tuples puts the most frequent
    # ending last; the priority breaks ties between equal counts.
    results = [(len(exp.findall(text)), priority, literal) for
               exp, literal, priority in line_ending]
    results.sort()
    print(results)
    if not sum([m[0] for m in results]):
        return default
    else:
        return results[-1][-1]


if __name__ == '__main__':
    tests = [
        'hello\ngoodbye\nmy fish\n',
        'hello\r\ngoodbye\r\nmy fish\r\n',
        'hello\rgoodbye\rmy fish\r',
        'hello\rgoodbye\n',
        '',
        '\r\r\r \n\n',
        '\n\n \r\n\r\n',
        '\n\n\r \r\r\n',
        '\n\r \n\r \n\r',
    ]
    for entry in tests:
        print(repr(entry))
        print(repr(find_ending(entry)))
        print()

All the best,
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
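
Since the post notes that the tests make no assertions, here is a sketch of the same cases with expected values attached. It assumes Python 3 and restates `find_ending` without the debugging print so the block runs on its own. The names `LINE_ENDINGS` and `CASES` are introduced here, and the expected values were worked out by hand from the counts and the priority chain described in the docstring.

```python
import os
import re

# Same three patterns and priorities as in the post.
LINE_ENDINGS = [
    (re.compile(r'(?<!\r)\n'), '\n', 3),
    (re.compile(r'\r\n'), '\r\n', 2),
    (re.compile(r'\r(?!\n)'), '\r', 1),
]


def find_ending(text, default=os.linesep):
    """Return the most frequent line ending, or default if there is none."""
    results = sorted(
        (len(exp.findall(text)), priority, literal)
        for exp, literal, priority in LINE_ENDINGS
    )
    if not any(count for count, _, _ in results):
        return default
    return results[-1][-1]


# (text, expected ending) pairs; ties go to the higher priority.
CASES = [
    ('hello\ngoodbye\nmy fish\n', '\n'),
    ('hello\r\ngoodbye\r\nmy fish\r\n', '\r\n'),
    ('hello\rgoodbye\rmy fish\r', '\r'),
    ('hello\rgoodbye\n', '\n'),     # one '\r', one '\n': tie
    ('\r\r\r \n\n', '\r'),          # three '\r', two '\n'
    ('\n\n \r\n\r\n', '\n'),        # two '\n', two '\r\n': tie
    ('\n\n\r \r\r\n', '\n'),        # two '\n', two '\r', one '\r\n': tie
    ('\n\r \n\r \n\r', '\n'),       # three '\n', three '\r': tie
]

for text, expected in CASES:
    assert find_ending(text) == expected, repr(text)

# With no line endings at all, the default is returned unchanged.
assert find_ending('') == os.linesep
assert find_ending('', default='\r\n') == '\r\n'
```

If the heuristic or the priorities are changed later, a failing assertion names the offending input instead of relying on reading the printed output.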
