Python writelines() and write() huge time difference

Question

I am working on a script that reads a folder of files (each between 20 MB and 100 MB), modifies some data in each line, and writes the result to a copy of the file.

with open(inputPath, 'r+') as myRead:
    my_list = myRead.readlines()
    new_my_list = clean_data(my_list)
with open(outPath, 'w+') as myWrite:
    tempT = time.time()
    myWrite.writelines('\n'.join(new_my_list) + '\n')
    print(time.time() - tempT)
print(inputPath, 'Cleaning Complete.')

On running this code with a 90 MB file (~900,000 lines), it printed 140 seconds as the time taken to write to the file. Here I used writelines(). So I searched for different ways to improve file writing speed, and in most of the articles that I read, it said write() and writelines() should not show any difference since I am writing a single concatenated string. I also checked the time taken for only the following statement:

new_string = '\n'.join(new_my_list) + '\n'

That took only 0.4 seconds, so the long write time was not caused by building the string. Just to try out write(), I ran this code:

with open(inputPath, 'r+') as myRead:
    my_list = myRead.readlines()
    new_my_list = clean_data(my_list)
with open(outPath, 'w+') as myWrite:
    tempT = time.time()
    myWrite.write('\n'.join(new_my_list) + '\n')
    print(time.time() - tempT)
print(inputPath, 'Cleaning Complete.')

And it printed 2.5 seconds. Why is there such a large difference in the file writing time for write() and writelines() even though it is the same data? Is this normal behaviour or is there something wrong in my code? The output file seems to be the same for both cases, so I know that there is no loss in data.

Recommended answer

file.writelines() expects an iterable of strings. It then loops over that iterable and calls file.write() for each string. Expressed in Python, the method does this:

def writelines(self, lines):
    for line in lines:
        self.write(line)

You are passing in a single large string, and a string is an iterable of strings too. When iterating you get individual characters, strings of length 1. So in effect you are making len(data) separate calls to file.write(). And that is slow, because you are building up a write buffer a single character at a time.
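The effect can be seen by counting calls. The following is a minimal sketch, not the real file implementation: CountingWriter is a hypothetical stand-in for a file object, and its writelines() simply repeats the loop shown earlier in this answer.

```python
class CountingWriter:
    """Hypothetical stand-in for a file object that counts write() calls."""

    def __init__(self):
        self.calls = 0
        self.chunks = []

    def write(self, text):
        self.calls += 1
        self.chunks.append(text)

    def writelines(self, lines):
        # One write() call per item yielded by the iterable.
        for line in lines:
            self.write(line)


lines = ["alpha", "beta", "gamma"]
data = "\n".join(lines) + "\n"

as_string = CountingWriter()
as_string.writelines(data)  # a str iterates character by character

as_list = CountingWriter()
as_list.writelines(line + "\n" for line in lines)  # iterates line by line

print(as_string.calls, len(data))  # one call per character
print(as_list.calls, len(lines))   # one call per line
```

Both writers end up holding the same text; only the number of write() calls differs.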

Don't pass in a single string to file.writelines(). Pass in a list or tuple or other iterable instead.

You could send in individual lines with added newline in a generator expression, for example:

myWrite.writelines(line + '\n' for line in new_my_list)
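To check the difference on your own machine, a rough timing harness along these lines may help. It is only a sketch under assumptions: the data is synthetic and far smaller than the 90 MB file in the question, timed_write() is a hypothetical helper, and absolute timings will vary with hardware and Python version.

```python
import os
import tempfile
import time


def timed_write(path, writer):
    """Open path for writing, run writer(f), and return the elapsed seconds."""
    with open(path, "w") as f:
        start = time.perf_counter()
        writer(f)
        return time.perf_counter() - start


# Synthetic data (an assumption for illustration only).
lines = ["row %d" % i for i in range(100_000)]
data = "\n".join(lines) + "\n"

variants = {
    "writelines(str)": lambda f: f.writelines(data),
    "write(str)": lambda f: f.write(data),
    "writelines(generator)": lambda f: f.writelines(l + "\n" for l in lines),
}

outputs = {}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.txt")
    for name, writer in variants.items():
        elapsed = timed_write(path, writer)
        with open(path) as f:
            outputs[name] = f.read()
        print(f"{name}: {elapsed:.3f} s")

# All three variants should produce the same file content.
all_same = len(set(outputs.values())) == 1
print("identical output:", all_same)
```

The variant that passes a single string to writelines() should be noticeably slower than the other two, while all three produce the same file.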

Now, if you could make clean_data() a generator, yielding cleaned lines, you could stream data from the input file, through your data cleaning generator, and out to the output file without using any more memory than is required for the read and write buffers and however much state is needed to clean your lines:

with open(inputPath, 'r+') as myRead, open(outPath, 'w+') as myWrite:
    myWrite.writelines(line + '\n' for line in clean_data(myRead))

In addition, I'd consider updating clean_data() to emit lines with newlines included.
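As a sketch of what such a generator might look like: the cleaning rule below (stripping trailing whitespace) is purely hypothetical, since the real clean_data() is not shown in the question, and io.StringIO objects stand in for the input and output files.

```python
import io


def clean_data(lines):
    """Hypothetical cleaner: yield each line without trailing whitespace,
    with exactly one newline appended."""
    for line in lines:
        yield line.rstrip() + "\n"


# In-memory stand-ins for the real input and output files.
source = io.StringIO("a  \nb\t\nc\n")
target = io.StringIO()

# Each line is read, cleaned, and written in turn; no full list is built.
target.writelines(clean_data(source))

print(repr(target.getvalue()))
```

Because the generator already emits the trailing newline, the result can be handed to writelines() directly, without a wrapping generator expression.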
