Detecting line endings


Problem description


Hello all,

I'm trying to detect line endings used in text files. I *might* be
decoding the files into unicode first (which may be encoded using
multi-byte encodings) - which is why I'm not letting Python handle the
line endings.
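
To illustrate why decoding before searching matters with multi-byte encodings, here is a minimal sketch. The choice of UTF-16-LE and of the character U+010A is an assumption made purely for illustration; neither appears in the post.

```python
# 'Ċ' (U+010A) is encoded in UTF-16-LE as the two bytes 0x0A 0x01,
# so the raw byte stream contains 0x0A although the text has no newline.
raw = "\u010a".encode("utf-16-le")

byte_hit = b"\n" in raw                      # search on the undecoded bytes
text_hit = "\n" in raw.decode("utf-16-le")   # search on the decoded text

print(byte_hit, text_hit)
```

Searching the undecoded bytes reports a newline that is not there, while searching the decoded text does not.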

Is the following safe and sane:

    text = open('test.txt', 'rb').read()
    if encoding:
        text = text.decode(encoding)
    ending = '\n' # default
    if '\r\n' in text:
        text = text.replace('\r\n', '\n')
        ending = '\r\n'
    elif '\n' in text:
        ending = '\n'
    elif '\r' in text:
        text = text.replace('\r', '\n')
        ending = '\r'

My worry is that if '\n' *doesn't* signify a line break on the Mac,
then it may exist in the body of the text - and trigger ``ending =
'\n'`` prematurely?

All the best,

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
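
The worry in the question can be reproduced with a small sketch. The function name `first_match_ending` and the sample string are hypothetical; the function only mirrors the presence-based `if`/`elif` chain from the question, without the replacement step.

```python
def first_match_ending(text):
    """Presence-based check mirroring the question's if/elif chain."""
    if "\r\n" in text:
        return "\r\n"
    elif "\n" in text:
        return "\n"
    elif "\r" in text:
        return "\r"
    return "\n"  # default when no line ending is present


# Three bare '\r' line endings and a single stray '\n'.
mostly_cr = "one\rtwo\rthree\rfour\n"
print(repr(first_match_ending(mostly_cr)))
```

Because the chain tests for presence rather than frequency, a single '\n' is enough to win over any number of bare '\r' endings.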

Solution

Fuzzyman enlightened us with:

> My worry is that if '\n' *doesn't* signify a line break on the Mac,
> then it may exist in the body of the text - and trigger ``ending =
> '\n'`` prematurely ?

I'd count the number of occurrences of '\r\n', '\n' without a preceding
'\r' and '\r' without following '\n', and let the majority decide.

Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don't we just take the
safety labels off of everything and let the problem solve itself?
Frank Zappa
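
The counting idea above can be sketched without regular expressions, assuming plain `str.count` arithmetic is acceptable. The names `count_endings` and `majority_ending` are hypothetical, and ties here simply fall to whichever ending comes first in the dictionary, which is an arbitrary choice.

```python
def count_endings(text):
    """Count each line-ending style; every CRLF pair is counted once."""
    crlf = text.count("\r\n")
    lf = text.count("\n") - crlf   # '\n' not preceded by '\r'
    cr = text.count("\r") - crlf   # '\r' not followed by '\n'
    return {"\r\n": crlf, "\n": lf, "\r": cr}


def majority_ending(text, default="\n"):
    """Return the most frequent ending, or default if none occur."""
    counts = count_endings(text)
    ending, hits = max(counts.items(), key=lambda item: item[1])
    return ending if hits else default


print(count_endings("a\r\nb\nc\rd\r\n"))
print(repr(majority_ending("a\rb\rc\n")))
```

Subtracting the CRLF count works because each '\r\n' pair contributes exactly one '\r' and one '\n' to the raw totals.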



Sybren Stuvel wrote:

> Fuzzyman enlightened us with:
>
>> My worry is that if '\n' *doesn't* signify a line break on the Mac,
>> then it may exist in the body of the text - and trigger ``ending =
>> '\n'`` prematurely ?
>
> I'd count the number of occurrences of '\r\n', '\n' without a preceding
> '\r' and '\r' without following '\n', and let the majority decide.



Sounds reasonable, edge cases for small files be damned. :-)

Fuzzyman
http://www.voidspace.org.uk/python/index.shtml





Sybren Stuvel wrote:

> Fuzzyman enlightened us with:
>
>> My worry is that if '\n' *doesn't* signify a line break on the Mac,
>> then it may exist in the body of the text - and trigger ``ending =
>> '\n'`` prematurely ?
>
> I'd count the number of occurrences of '\r\n', '\n' without a preceding
> '\r' and '\r' without following '\n', and let the majority decide.



This is what I came up with. As you can see from the docstring, it
attempts to do sensible(-ish) things in the event of a tie, or no line
endings at all.

Comments/corrections welcomed. I know the tests aren't very useful
(because they make no *assertions* they won't tell you if it breaks),
but you can see what's going on:

    import re
    import os

    rn = re.compile('\r\n')
    r = re.compile('\r(?!\n)')
    n = re.compile('(?<!\r)\n')

    # Sequence of (regex, literal, priority) for each line ending
    line_ending = [(n, '\n', 3), (rn, '\r\n', 2), (r, '\r', 1)]

    def find_ending(text, default=os.linesep):
        """
        Given a piece of text, use a simple heuristic to determine the line
        ending in use.

        Returns the value assigned to default if no line endings are found.
        This defaults to ``os.linesep``, the native line ending for the
        machine.

        If there is a tie between two endings, the priority chain is
        ``'\n', '\r\n', '\r'``.
        """
        results = [(len(exp.findall(text)), priority, literal) for
                   exp, literal, priority in line_ending]
        # Sort by count, then priority, so the winner ends up last
        results.sort()
        print(results)
        if not sum([m[0] for m in results]):
            return default
        else:
            return results[-1][-1]

    if __name__ == '__main__':
        tests = [
            'hello\ngoodbye\nmy fish\n',
            'hello\r\ngoodbye\r\nmy fish\r\n',
            'hello\rgoodbye\rmy fish\r',
            'hello\rgoodbye\n',
            '',
            '\r\r\r \n\n',
            '\n\n \r\n\r\n',
            '\n\n\r \r\r\n',
            '\n\r \n\r \n\r',
        ]
        for entry in tests:
            print(repr(entry))
            print(repr(find_ending(entry)))
            print()

All the best,
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
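
Since the post notes that its tests make no assertions, one possible way to turn them into checks is sketched below. It restates the heuristic in compact form without the debugging print so that the block is self-contained. The upper-case names and the `CASES` table are assumptions for illustration, and the expected values were worked out by hand from the stated priority chain.

```python
import os
import re

CRLF = re.compile("\r\n")
CR = re.compile("\r(?!\n)")    # '\r' not followed by '\n'
LF = re.compile("(?<!\r)\n")   # '\n' not preceded by '\r'

# (regex, literal, priority); a higher priority wins a tie
LINE_ENDINGS = [(LF, "\n", 3), (CRLF, "\r\n", 2), (CR, "\r", 1)]


def find_ending(text, default=os.linesep):
    """Return the most frequent line ending in text, or default if none."""
    results = sorted((len(exp.findall(text)), priority, literal)
                     for exp, literal, priority in LINE_ENDINGS)
    if not any(count for count, _, _ in results):
        return default
    return results[-1][-1]


# (input, expected ending) pairs built from the test strings in the post
CASES = [
    ("hello\ngoodbye\nmy fish\n", "\n"),
    ("hello\r\ngoodbye\r\nmy fish\r\n", "\r\n"),
    ("hello\rgoodbye\rmy fish\r", "\r"),
    ("hello\rgoodbye\n", "\n"),     # 1-1 tie, '\n' has priority
    ("\r\r\r \n\n", "\r"),          # three '\r' beat two '\n'
    ("\n\n \r\n\r\n", "\n"),        # 2-2 tie between '\n' and '\r\n'
    ("\n\n\r \r\r\n", "\n"),        # 2-2 tie between '\n' and '\r'
    ("\n\r \n\r \n\r", "\n"),       # 3-3 tie between '\n' and '\r'
]

for text, expected in CASES:
    assert find_ending(text) == expected, (text, expected)

# With no line endings at all, the default is returned unchanged.
assert find_ending("", default="\n") == "\n"
print("all checks passed")
```

Passing an explicit `default` in the empty-string check keeps the result independent of the platform's `os.linesep`.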



