Why is it faster to read a file without line breaks?


Problem description

In Python 3.6, it takes longer to read a file if there are line breaks. If I have two files, one with line breaks and one without line breaks (but otherwise they have the same text), then the file with line breaks will take around 100-200% of the time to read. I have provided a specific example.

Step 1: Create two files with identical text, one without line breaks and one with line breaks

sizeMB = 128
sizeKB = 1024 * sizeMB

with open(r'C:\temp\bigfile_one_line.txt', 'w') as f:
    for i in range(sizeKB):
        f.write('Hello World!\t'*73)  # There are roughly 73 phrases in one KB

with open(r'C:\temp\bigfile_newlines.txt', 'w') as f:
    for i in range(sizeKB):  
        f.write('Hello World!\n'*73)

Step 2: Read the file with a single line and time the performance

IPython

%%timeit
with open(r'C:\temp\bigfile_one_line.txt', 'r') as f:
    text = f.read()

Output

1 loop, best of 3: 368 ms per loop

Step 3: Read the file with many lines and time the performance

IPython

%%timeit
with open(r'C:\temp\bigfile_newlines.txt', 'r') as f:
    text = f.read()

Output

1 loop, best of 3: 589 ms per loop

This is just one example. I have tested this in many different situations, and the result is the same:

  1. Different file sizes, from 1 MB to 2 GB
  2. Using file.readlines() instead of file.read()
  3. Using a space (' ') instead of a tab ('\t') in the single-line file (i.e. 'Hello World! ')

My conclusion is that files with newline characters ('\n') take longer to read than files without them. However, I would expect all characters to be treated the same. This can have important consequences for performance when reading a lot of files. Does anyone know why this happens?

I am using Python 3.6.1, Anaconda 4.3.24, and Windows 10.
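
The timings above can also be reproduced outside IPython with the standard-library timeit module. Below is a minimal sketch, assuming the two files from Step 1 already exist at the paths shown; it is an equivalent measurement, not the asker's original code.

import timeit

def read_file(path):
    # Read the whole file in text mode (the default), exactly as in the question.
    with open(path, 'r') as f:
        return f.read()

for path in (r'C:\temp\bigfile_one_line.txt', r'C:\temp\bigfile_newlines.txt'):
    # Time three full reads and keep the best, similar to what %%timeit reports.
    best = min(timeit.repeat(lambda: read_file(path), number=1, repeat=3))
    print(f'{path}: {best:.3f} s per read')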

Recommended answer

When you open a file in Python in text mode (the default), it uses what it calls "universal newlines" (introduced with PEP 278, but somewhat changed later with the release of Python 3). What universal newlines means is that regardless of what kind of newline characters are used in the file, you'll see only \n in Python. So a file containing foo\nbar would appear the same as a file containing foo\r\nbar or foo\rbar (since \n, \r\n and \r are all line ending conventions used on some operating systems at some time).
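
The translation is easy to observe directly. The snippet below is only a demonstration (the scratch file name is made up for the example): the same bytes read back differently in text mode and in binary mode.

import os

path = 'newline_demo.txt'  # hypothetical scratch file, used only for this demo

# Write raw bytes containing three different line-ending conventions.
with open(path, 'wb') as f:
    f.write(b'foo\nbar\r\nbaz\rend')

# Text mode (universal newlines, the default): every convention shows up as '\n'.
with open(path, 'r') as f:
    print(repr(f.read()))   # 'foo\nbar\nbaz\nend'

# Binary mode: the bytes come back untouched.
with open(path, 'rb') as f:
    print(repr(f.read()))   # b'foo\nbar\r\nbaz\rend'

os.remove(path)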

The logic that provides that support is probably what causes your performance differences. Even if the \n characters in the file are not being transformed, the code needs to examine them more carefully than it does non-newline characters.

I suspect the performance difference you see will disappear if you opened your files in binary mode where no such newline support is provided. You can also pass a newline parameter to open in Python 3, which can have various meanings depending on exactly what value you give. I have no idea what impact any specific value would have on performance, but it might be worth testing if the performance difference you're seeing actually matters to your program. I'd try passing newline="" and newline="\n" (or whatever your platform's conventional line ending is).
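
One way to test this is sketched below, assuming the bigfile_newlines.txt file from Step 1 exists: time the same read in default text mode, in text mode with newline='' (which disables newline translation), and in binary mode (which returns bytes rather than str, so it is not a drop-in replacement if you need text).

import timeit

path = r'C:\temp\bigfile_newlines.txt'

def read_text_default():
    # Default text mode: universal newline translation is active.
    with open(path, 'r') as f:
        return f.read()

def read_text_untranslated():
    # newline='' keeps text mode (and decoding) but turns off newline translation.
    with open(path, 'r', newline='') as f:
        return f.read()

def read_binary():
    # Binary mode: raw bytes, no decoding and no newline handling at all.
    with open(path, 'rb') as f:
        return f.read()

for func in (read_text_default, read_text_untranslated, read_binary):
    best = min(timeit.repeat(func, number=1, repeat=3))
    print(f'{func.__name__}: {best:.3f} s')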
