Benchmarks: does Python have a faster way of walking a network folder?

Question

I need to walk a folder containing roughly ten thousand files. My old VBScript handles this very slowly. Since I have started using Ruby and Python in the meantime, I benchmarked the three scripting languages to see which is the best fit for this job.

The results of the tests below, run on a subset of 4,500 files on a network share, are:

Python: 106 seconds
Ruby: 5 seconds
Vbscript: 124 seconds

That VBScript would be slowest is no surprise, but I can't explain the difference between Ruby and Python. Is my Python test suboptimal? Is there a faster way to do this in Python?

The check for thumbs.db is only there for the benchmark; in reality there are more checks to do.

I needed something that checks every file on the path and doesn't produce so much output that it distorts the timing. The results differ a bit from run to run, but not by much.

#python2.7.0
import os

def recurse(path):
  for (path, dirs, files) in os.walk(path):
    for file in files:
      if file.lower() == "thumbs.db":
        print (path+'/'+file)

if __name__ == '__main__':
  import timeit
  path = '//server/share/folder/'
  print(timeit.timeit('recurse("'+path+'")', setup="from __main__ import recurse", number=1))

'vbscript5.7
set oFso = CreateObject("Scripting.FileSystemObject")
const path = "\\server\share\folder"
start = Timer
myLCfilename="thumbs.db"

sub recurse(folder)
  for each file in folder.Files
    if lCase(file.name) = myLCfilename then
      wscript.echo file
    end if
  next
  for each subfolder in folder.SubFolders
    call Recurse(subfolder)
  next
end Sub

set folder = oFso.getFolder(path)
recurse(folder)
wscript.echo Timer-start

#ruby1.9.3
require 'benchmark'

def recursive(path, bench)
  bench.report(path) do
    Dir["#{path}/**/**"].each{|file| puts file if File.basename(file).downcase == "thumbs.db"}
  end
end

path = '//server/share/folder/'
Benchmark.bm {|bench| recursive(path, bench)}

Since I suspected that printing caused the delay, I tested the scripts both printing all 4,500 files and printing none. The difference remains: Ruby 5 s versus Python 107 s in the first case, and Ruby 4.5 s versus Python 107 s in the second.

Based on the answers and comments here, a Python version that in some cases could run faster by skipping folders:

import os

def recurse(path):
  for (path, dirs, files) in os.walk(path):
    for file in files:
      if file.lower() == "thumbs.db":
        print (path+'/'+file)

def recurse2(path):
    for (path, dirs, files) in os.walk(path):
        # Prune in place so os.walk never descends into skipped folders
        dirs[:] = [d for d in dirs if d != 'comics']
        for file in files:
            if file.lower() == "thumbs.db":
                print (path+'/'+file)


if __name__ == '__main__':
  import timeit
  path = 'f:/'
  print(timeit.timeit('recurse("'+path+'")', setup="from __main__ import recurse", number=1)) 
#6.20102692
  print(timeit.timeit('recurse2("'+path+'")', setup="from __main__ import recurse2", number=1)) 
#2.73848228
#ruby 5.7

Recommended answer

The Ruby implementation of Dir is in C (the file dir.c, according to this documentation). However, the Python equivalent is implemented in Python.

It's not surprising that Python is less performant than C, but the approach used in Python gives a little more flexibility: for example, you could skip entire subtrees named '.svn', '.git' or '.hg' while traversing a directory hierarchy.

Most of the time, the Python implementation is fast enough.
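For readers on a newer interpreter than the 2.7 used in the question: since Python 3.5, os.walk is itself built on os.scandir, which can usually take the file type from the directory listing instead of making a separate stat call per entry. Those per-entry calls are a plausible cause of the slowness over a network share, although that was not measured in this thread. A minimal sketch of a scandir-based search, assuming Python 3.6 or later; find_named is a hypothetical helper name, not part of the standard library:

```python
import os

def find_named(root, target="thumbs.db"):
    """Yield paths under root whose file name equals target, ignoring case."""
    stack = [root]  # directories still to be listed
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    # is_dir() normally reuses type info from the listing itself
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.name.lower() == target:
                        yield entry.path
        except OSError:
            # Unreadable directory: skip it, much as os.walk does by default
            continue
```

Calling list(find_named('//server/share/folder/')) would collect the matches without printing inside the loop, which keeps output out of the timing.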

Update: Skipping files or subdirectories doesn't affect the traversal rate at all, but it can certainly reduce the overall time taken to process a directory tree, because you avoid traversing potentially large subtrees of the main tree. The time saved is of course proportional to how much you skip. In your case, which looks like folders of images, it's unlikely you would save much time (unless the images were under revision control, in which case skipping the subtrees owned by the revision control system might have some impact).
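To make the "proportional to how much you skip" point concrete, the sketch below counts how many directories os.walk visits with and without pruning. The folder layout and the count_visited helper are made up for illustration; only the os and tempfile calls come from the standard library.

```python
import os
import tempfile

def count_visited(root, skip=()):
    """Return how many directories os.walk visits under root, pruning names in skip."""
    visited = 0
    for _current, dirs, _files in os.walk(root):
        visited += 1
        # Slice assignment edits the list os.walk holds, so pruned names are not entered
        dirs[:] = [d for d in dirs if d not in skip]
    return visited

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: one small folder and one large subtree named "comics"
    os.makedirs(os.path.join(root, "photos"))
    for i in range(20):
        os.makedirs(os.path.join(root, "comics", "issue%d" % i))
    full = count_visited(root)
    pruned = count_visited(root, skip=("comics",))
    print(full, pruned)  # prints "23 2"
```

The per-directory cost is unchanged; the saving comes only from the 21 directories that are never listed.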

Additional update: Skipping folders is done by changing the dirs value in place:

for root, dirs, files in os.walk(path):
    for skip in ('.hg', '.git', '.svn', '.bzr'):
        if skip in dirs:
            dirs.remove(skip)
    # Now process other stuff at this level, i.e.
    # in directory "root". The skipped folders
    # won't be recursed into.
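Combined with the thumbs.db check from the question, the pruning loop could look roughly like this. It is a sketch, not the answerer's code: find_thumbs and SKIP are names chosen here, and it uses slice assignment as an alternative in-place edit that behaves the same as the remove calls above.

```python
import os

# Folder names to prune; taken from the snippet above
SKIP = {".hg", ".git", ".svn", ".bzr"}

def find_thumbs(path, target="thumbs.db"):
    """Yield matching file paths under path, never descending into SKIP folders."""
    for root, dirs, files in os.walk(path):
        # Must modify dirs in place; rebinding the name would not affect os.walk
        dirs[:] = [d for d in dirs if d not in SKIP]
        for name in files:
            if name.lower() == target:
                yield os.path.join(root, name)
```

Pruning only works with the default top-down traversal; with topdown=False the subdirectories have already been visited by the time dirs is seen.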
