Why does Python not see all the rows in a file?


Question

I count the number of rows (lines) in a file using Python as follows:

n = 0
for line in file('input.txt'):
   n += 1
print n

I run this script under Windows.

Then I count the number of rows in the same file using the Unix command:

wc -l input.txt

Counting with the Unix command gives a significantly larger number of rows.

So, my question is: why does Python not see all the rows in the file? Or is it a question of definition?

Recommended answer

You most likely have a file with one or more DOS EOF (CTRL-Z) characters in it, ASCII codepoint 0x1A. When Windows opens a file in text mode, it'll still honour the old DOS semantics and end a file whenever it reads that character. See Line reading chokes on 0x1A.
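
To check whether a file really contains such a character, one option is to scan it in binary mode and report where the first 0x1A byte sits. The following is only a minimal sketch: the helper name find_ctrl_z and the sample data are hypothetical and not part of the original answer.

```python
import os
import tempfile

CTRL_Z = b"\x1a"  # DOS end-of-file marker (ASCII 0x1A)

def find_ctrl_z(path, buf_size=2 ** 15):
    """Return the byte offset of the first 0x1A in the file, or -1 if none."""
    offset = 0
    with open(path, "rb") as f:  # binary mode: 0x1A is ordinary data
        while True:
            buf = f.read(buf_size)
            if not buf:
                return -1
            pos = buf.find(CTRL_Z)
            if pos != -1:
                return offset + pos
            offset += len(buf)

# Demo on a throwaway file (hypothetical sample data).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"first\r\nsecond\x1a\r\nthird\r\n")
demo_offset = find_ctrl_z(demo_path)
os.remove(demo_path)
print(demo_offset)
```

A result other than -1 suggests that a text-mode read honouring the DOS semantics would stop at that offset.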

Only by opening a file in binary mode can you bypass this behaviour. To do so and still count lines, you have two options:

  • Read in chunks, then count the number of line separators in each chunk:

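import os
from functools import partial

def bufcount(filename, linesep=os.linesep.encode('ascii'), buf_size=2 ** 15):
    """Count line separators in a file opened in binary mode."""
    lines = 0
    with open(filename, 'rb') as f:
        last = b''
        # read buf_size bytes at a time until read() returns an empty string
        for buf in iter(partial(f.read, buf_size), b''):
            lines += buf.count(linesep)
            if last and last + buf[:1] == linesep:
                # count line separators straddling a chunk boundary
                lines += 1
            if len(linesep) > 1:
                # remember the final byte so a split two-byte separator is seen
                last = buf[-1:]
    return lines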

Take into account that on Windows os.linesep is set to \r\n; adjust the linesep argument as needed for your file. In binary mode line separators are not translated to \n.
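
If the goal is to match wc -l, which counts newline (\n) bytes, a simpler variation is to count \n alone in binary mode, since a single-byte separator cannot straddle a chunk boundary. A minimal sketch, with a hypothetical helper name count_newlines and made-up sample data:

```python
import os
import tempfile
from functools import partial

def count_newlines(path, buf_size=2 ** 15):
    """Count LF bytes in binary mode; 0x1A bytes are treated as plain data."""
    with open(path, "rb") as f:
        chunks = iter(partial(f.read, buf_size), b"")
        return sum(buf.count(b"\n") for buf in chunks)

# Demo on a throwaway file mixing \r\n and \n endings (hypothetical data).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"a\r\nb\x1a\r\nc\n")
demo_count = count_newlines(demo_path)
os.remove(demo_path)
print(demo_count)
```

Because \r\n also ends in \n, this counts both Windows and Unix line endings, but it would miss bare \r endings.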

  • Use io.open(); the io file objects always open the file in binary mode and then do the newline translation themselves:

import io

with io.open(filename) as f:
    lines = sum(1 for line in f)
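
A self-contained sketch of this option, wrapping the snippet above in a hypothetical count_lines helper and running it on a throwaway file that contains a 0x1A byte. The explicit encoding is an assumption made for the demo; use whatever matches your file.

```python
import io
import os
import tempfile

def count_lines(path, encoding="latin-1"):
    """Count lines via io.open(), which does its own newline translation."""
    # latin-1 is an assumption: it decodes any byte value without errors.
    with io.open(path, encoding=encoding) as f:
        return sum(1 for line in f)

# Demo file with a 0x1A byte in the middle (hypothetical sample data).
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"first\r\nsecond\x1a\r\nthird\r\n")
demo_lines = count_lines(demo_path)
os.remove(demo_path)
print(demo_lines)
```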
