What is the fastest/most efficient way to loop through a large collection of files and save a plot of the data?


Problem description


So I have this program that loops through about 2000+ data files, performs a Fourier transform, plots the transform, then saves the figure. It feels like the longer the program runs, the slower it gets. Is there any way to make it run faster or cleaner with a simple change in the code below?

Previously, I had the Fourier transform defined as a function, but I read somewhere here that Python has high function-call overhead, so I did away with the function and am running straight through now. Also, I read that clf() keeps a steady log of previous figures that can get quite large and slow down the process if you loop through a lot of plots, so I've changed that to close(). Were these good changes as well?

from numpy import *
from pylab import *

for filename in filelist:

    t,f = loadtxt(filename, unpack=True)

    dt = t[1]-t[0]
    fou = absolute(fft.fft(f))
    frq = absolute(fft.fftfreq(len(t),dt))

    ymax = median(fou)*30

    figure(figsize=(15,7))
    plot(frq,fou,'k')

    xlim(0,400)
    ylim(0,ymax)

    iname = filename.replace('.dat','.png')
    savefig(iname,dpi=80)
    close()

Solution

Have you considered using the multiprocessing module to parallelize processing the files? Assuming that you're actually CPU-bound here (meaning it's the Fourier transform that's eating up most of the running time, not reading/writing the files), that should speed up execution time without actually needing to speed up the loop itself.
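Before reaching for multiprocessing, it may be worth confirming that CPU-bound assumption by timing each stage separately on a handful of files. A minimal sketch (the `profile_one` helper is hypothetical, not part of the original code):

```python
import time
import numpy as np

def profile_one(filename):
    """Return (io_seconds, fft_seconds) for a single two-column data file."""
    t0 = time.perf_counter()
    t, f = np.loadtxt(filename, unpack=True)   # disk I/O + text parsing
    t1 = time.perf_counter()
    np.abs(np.fft.fft(f))                      # the transform itself
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1
```

If the I/O time dominates, adding worker processes will mostly just add contention on the disk.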

Edit:

For example, something like this (untested, but should give you the idea):

import multiprocessing

from numpy import *
from pylab import *

def do_transformation(filename):
    t, f = loadtxt(filename, unpack=True)

    dt = t[1] - t[0]
    fou = absolute(fft.fft(f))
    frq = absolute(fft.fftfreq(len(t), dt))

    ymax = median(fou) * 30

    figure(figsize=(15, 7))
    plot(frq, fou, 'k')

    xlim(0, 400)
    ylim(0, ymax)

    iname = filename.replace('.dat', '.png')
    savefig(iname, dpi=80)
    close()

# Guard the pool setup so child processes can safely re-import this module.
if __name__ == '__main__':
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    for filename in filelist:
        pool.apply_async(do_transformation, (filename,))
    pool.close()
    pool.join()

You may need to tweak what work actually gets done in the worker processes. Trying to parallelize the disk I/O portions may not help you much (or even hurt you), for example.
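A complementary (or alternative) speedup that stays single-process: create one figure up front and reuse it, instead of paying the figure()/close() cost on every file, and force the non-interactive Agg backend to skip GUI overhead. A sketch, assuming matplotlib and the same two-column `.dat` layout as the question (the `save_spectrum` name is mine, not from the original):

```python
import matplotlib
matplotlib.use("Agg")            # must run before pyplot is imported
import matplotlib.pyplot as plt
import numpy as np

# One figure, created once and reused for every file.
fig, ax = plt.subplots(figsize=(15, 7))

def save_spectrum(filename):
    t, f = np.loadtxt(filename, unpack=True)
    dt = t[1] - t[0]
    fou = np.abs(np.fft.fft(f))
    frq = np.abs(np.fft.fftfreq(len(t), dt))

    ax.clear()                   # wipe the previous plot, keep the figure
    ax.plot(frq, fou, 'k')
    ax.set_xlim(0, 400)
    ax.set_ylim(0, np.median(fou) * 30)
    fig.savefig(filename.replace('.dat', '.png'), dpi=80)
```

This avoids the steadily growing memory footprint the asker noticed, since no new figure objects are ever allocated inside the loop.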
