检测行结尾 [英] Detecting line endings
问题描述
大家好,
我正在尝试检测文本文件中使用的行结尾。我*可能*是* b $ b首先将文件解码为unicode(可以使用
多字节编码进行编码) - 这就是为什么我不让Python处理
行结尾。
以下是安全和理智的:
text = open(''test .txt'',''rb'')。read()
如果编码:
text = text.decode(encoding)
结束=''\ n''#default
if''\\\\ n''in text:
text = text.replace(''\\ \\\''',''\ n'')
结束=''\\\\ n'
elif''\ n ''在文中:
结束=''\ n''
elif''\ r''in text:
text = text.replace(''\''',''\ n'')
结束=''\ r''
我担心的是如果''\ n''*并不表示Mac上的换行符,
那么它可能存在于人体中文本的y - 并提前触发``结束=
''\ n''``?
一切顺利,
Fuzzyman
http:/ /www.voidspace.org.uk/python/index.shtml
Fuzzyman启发我们:我担心的是,如果''\ n''*并不表示Mac上的换行符,那么它可能存在于文本正文中 - 并触发` '结束=
''\ n''``过早?
我会计算出''\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\ \\ n'',''\ n''没有先前的
''\ r''和''\'''没有关注''\ n'',并让多数决定。
Sybren
-
世界的问题是愚蠢。并不是说应该对愚蠢的死刑进行处罚,但为什么我们不要仅仅拿掉
安全标签来解决问题呢? br />
Frank Zappa
Sybren Stuvel写道:Fuzzyman启发我们:< blockquote class =post_quotes>我担心的是,如果''\ n''*并不表示Mac上的换行符,那么它可能存在于文本正文中 - 并触发'`结束=
''\ n''``过早?
我会计算''\\ n''''''\\'''的出现次数'没有先前的'/'''和''\\''''''''''''''''''''''''''''''''''''''''''''''''''''''' >
听起来很合理,小文件的边缘情况应该被诅咒。 :-)
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
Sybren
-
世界的问题是愚蠢。不是说应该对愚蠢的死刑进行处罚,但为什么我们不把所有的安全标签都拿走,让问题自行解决?
Frank Zappa
Sybren Stuvel写道:Fuzzyman启发我们:我的担心是,如果''\ n''*并不表示Mac上的换行符,那么它可能存在于文本正文中 - 并触发``结束=
''\ n''``过早?
我会计算''\\ n'',''\ n''的出现次数而没有先行
''\''和''\'''没有关注''\ n'',让多数人决定。
这就是我提出的。正如您从文档字符串中看到的那样,
会在出现平局时尝试合理(-ish)的事情,或者根本没有行$
结尾。
欢迎评论/更正。我知道测试不是很有用
(因为他们没有*断言*他们不会告诉你它是否会中断),
但你可以看到发生了什么:
导入重新
导入os
rn = re.compile(' '\\'n'')
r = re.compile(''\ r(?!\ n)'')
n = re.compile (''(?<!\r)\ n'')
#每行结束的(正则表达式,文字,优先级)序列
line_ending = [(n,''\ n'',3),(rn,''\ r \ n'',2),(r,''\ r'',1)]
def find_ending(text,default = os.linesep):
"""
给定一段文字,使用简单的启发式确定行结束使用
。
如果没有找到行结尾,则返回分配给默认值的值。
这默认为``os.linesep``,结尾为
机器的原生行。
如果两个结局之间有一个平局,优先链是
``''\ n'',''\\\ n'n'',''\ r''``` 。
"""
results = [(len(exp.findall(text)),priority,literal)
exp,literal,line_ending中的优先级]
results.sort()
打印结果
如果不是总和(m的[m [0])结果]):
返回默认值
否则:
返回结果[-1] [ - 1]
>
if __name__ ==''__ main__'':
tests = [
''hello\\\
goodbye \ nmy fish \ n'',
''hello \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\ fish \''',
''hello\rgoodbye \ n'',
'''',
'' \\\\\\'n',
''\ n \ n \\\\\\\ n br />
''\ n \ nn \\ r \\\ n',
''\ n \\\\\\\\\''',
]
参加测试:
print repr(entry)
print repr(find_ending(entry))
打印
一切顺利,
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml Sybren
-
世界的问题是愚蠢。不是说应该对愚蠢的死刑进行处罚,但为什么我们不把所有的安全标签都拿走,让问题自行解决?
Frank Zappa
Hello all,
I''m trying to detect line endings used in text files. I *might* be
decoding the files into unicode first (which may be encoded using
multi-byte encodings) - which is why I''m not letting Python handle the
line endings.
Is the following safe and sane :
text = open(''test.txt'', ''rb'').read()
if encoding:
text = text.decode(encoding)
ending = ''\n'' # default
if ''\r\n'' in text:
text = text.replace(''\r\n'', ''\n'')
ending = ''\r\n''
elif ''\n'' in text:
ending = ''\n''
elif ''\r'' in text:
text = text.replace(''\r'', ''\n'')
ending = ''\r''
My worry is that if ''\n'' *doesn''t* signify a line break on the Mac,
then it may exist in the body of the text - and trigger ``ending =
''\n''`` prematurely ?
All the best,
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
Fuzzyman enlightened us with:My worry is that if ''\n'' *doesn''t* signify a line break on the Mac,
then it may exist in the body of the text - and trigger ``ending =
''\n''`` prematurely ?
I''d count the number of occurences of ''\r\n'', ''\n'' without a preceding
''\r'' and ''\r'' without following ''\n'', and let the majority decide.
Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don''t we just take the
safety labels off of everything and let the problem solve itself?
Frank Zappa
Sybren Stuvel wrote:Fuzzyman enlightened us with:My worry is that if ''\n'' *doesn''t* signify a line break on the Mac,
then it may exist in the body of the text - and trigger ``ending =
''\n''`` prematurely ?
I''d count the number of occurences of ''\r\n'', ''\n'' without a preceding
''\r'' and ''\r'' without following ''\n'', and let the majority decide.
Sounds reasonable, edge cases for small files be damned. :-)
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml
Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don''t we just take the
safety labels off of everything and let the problem solve itself?
Frank Zappa
Sybren Stuvel wrote:Fuzzyman enlightened us with:My worry is that if ''\n'' *doesn''t* signify a line break on the Mac,
then it may exist in the body of the text - and trigger ``ending =
''\n''`` prematurely ?
I''d count the number of occurences of ''\r\n'', ''\n'' without a preceding
''\r'' and ''\r'' without following ''\n'', and let the majority decide.
This is what I came up with. As you can see from the docstring, it
attempts to sensible(-ish) things in the event of a tie, or no line
endings at all.
Comments/corrections welcomed. I know the tests aren''t very useful
(because they make no *assertions* they won''t tell you if it breaks),
but you can see what''s going on :
import re
import os
rn = re.compile(''\r\n'')
r = re.compile(''\r(?!\n)'')
n = re.compile(''(?<!\r)\n'')
# Sequence of (regex, literal, priority) for each line ending
line_ending = [(n, ''\n'', 3), (rn, ''\r\n'', 2), (r, ''\r'', 1)]
def find_ending(text, default=os.linesep):
"""
Given a piece of text, use a simple heuristic to determine the line
ending in use.
Returns the value assigned to default if no line endings are found.
This defaults to ``os.linesep``, the native line ending for the
machine.
If there is a tie between two endings, the priority chain is
``''\n'', ''\r\n'', ''\r''``.
"""
results = [(len(exp.findall(text)), priority, literal) for
exp, literal, priority in line_ending]
results.sort()
print results
if not sum([m[0] for m in results]):
return default
else:
return results[-1][-1]
if __name__ == ''__main__'':
tests = [
''hello\ngoodbye\nmy fish\n'',
''hello\r\ngoodbye\r\nmy fish\r\n'',
''hello\rgoodbye\rmy fish\r'',
''hello\rgoodbye\n'',
'''',
''\r\r\r \n\n'',
''\n\n \r\n\r\n'',
''\n\n\r \r\r\n'',
''\n\r \n\r \n\r'',
]
for entry in tests:
print repr(entry)
print repr(find_ending(entry))
All the best,
Fuzzyman
http://www.voidspace.org.uk/python/index.shtml Sybren
--
The problem with the world is stupidity. Not saying there should be a
capital punishment for stupidity, but why don''t we just take the
safety labels off of everything and let the problem solve itself?
Frank Zappa
这篇关于检测行结尾的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!