Benchmarks: does Python have a faster way of walking a network folder?

Question

I need to walk through a folder with approximately ten thousand files. My old VBScript handles this very slowly. Since I have started using Ruby and Python in the meantime, I benchmarked the three scripting languages to see which would be the best fit for this job.

The results of the tests below, run on a subset of 4500 files on a shared network, are:

Python: 106 seconds
Ruby: 5 seconds
Vbscript: 124 seconds

That VBScript would be slowest was no surprise, but I can't explain the difference between Ruby and Python. Is my test for Python not optimal? Is there a faster way to do this in Python?

The check for thumbs.db is only there for the benchmark; in reality there are more checks to do.

I needed something that checks every file on the path and doesn't produce so much output that it disturbs the timing. The results differ a bit on each run, but not by much.

#python2.7.0
import os

def recurse(path):
  for (path, dirs, files) in os.walk(path):
    for file in files:
      if file.lower() == "thumbs.db":
        print (path+'/'+file)

if __name__ == '__main__':
  import timeit
  path = '//server/share/folder/'
  print(timeit.timeit('recurse("'+path+'")', setup="from __main__ import recurse", number=1))

'vbscript5.7
set oFso = CreateObject("Scripting.FileSystemObject")
const path = "\\server\share\folder"
start = Timer
myLCfilename="thumbs.db"

sub recurse(folder)
  for each file in folder.Files
    if lCase(file.name) = myLCfilename then
      wscript.echo file
    end if
  next
  for each subfolder in folder.SubFolders
    call Recurse(subfolder)
  next
end Sub

set folder = oFso.getFolder(path)
recurse(folder)
wscript.echo Timer-start

#ruby1.9.3
require 'benchmark'

def recursive(path, bench)
  bench.report(path) do
    Dir["#{path}/**/**"].each{|file| puts file if File.basename(file).downcase == "thumbs.db"}
  end
end

path = '//server/share/folder/'
Benchmark.bm {|bench| recursive(path, bench)}

Since I suspected that printing caused the delay, I tested the scripts both printing all 4500 files and printing none. The difference remains: Ruby 5 s vs. Python 107 s in the first case, and Ruby 4.5 s vs. Python 107 s in the latter.

Based on the answers and comments here, a Python version that in some cases could run faster by skipping folders:

import os

def recurse(path):
  for (path, dirs, files) in os.walk(path):
    for file in files:
      if file.lower() == "thumbs.db":
        print (path+'/'+file)

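def recurse2(path):
    for (path, dirs, files) in os.walk(path):
        # Prune in place so os.walk never descends into the skipped folders.
        # ('comics',) is a one-element tuple; ('comics') is just a string.
        dirs[:] = [d for d in dirs if d not in ('comics',)]
        for file in files:
            if file.lower() == "thumbs.db":
                print (path+'/'+file)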


if __name__ == '__main__':
  import timeit
  path = 'f:/'
  print(timeit.timeit('recurse("'+path+'")', setup="from __main__ import recurse", number=1))
  # 6.20102692
  print(timeit.timeit('recurse2("'+path+'")', setup="from __main__ import recurse2", number=1))
  # 2.73848228
  # ruby 5.7

Recommended answer

The Ruby implementation for Dir is in C (the file dir.c, according to this documentation). However, the Python equivalent, os.walk, is implemented in Python.
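
To make the "implemented in Python" point concrete, here is a simplified sketch of the listdir-plus-isdir approach that, to my understanding, the Python 2.7 os.walk follows. The name simple_walk is made up for this illustration, and error handling and symlink checks are left out, so treat it as an approximation rather than the standard library source. The point to notice is that every entry gets its own os.path.isdir() query, which on a network share is likely to mean an extra round trip per file.

```python
import os

def simple_walk(top):
    """Simplified top-down walk (hypothetical helper, not the stdlib code)."""
    names = os.listdir(top)  # one directory listing per folder
    dirs, files = [], []
    for name in names:
        # One separate file-system query per entry to classify it.
        if os.path.isdir(os.path.join(top, name)):
            dirs.append(name)
        else:
            files.append(name)
    yield top, dirs, files
    # Recurse only into the names still present in dirs, so a caller
    # that removes entries from dirs prunes those subtrees.
    for name in dirs:
        for result in simple_walk(os.path.join(top, name)):
            yield result
```

All of this looping and classifying runs as interpreted Python code, whereas Ruby's Dir does the equivalent work in C.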

It's not surprising that Python is less performant than C, but the approach used in Python gives a little more flexibility - for example, you could skip entire subtrees named e.g. '.svn', '.git', '.hg' while traversing a directory hierarchy.

Most of the time, the Python implementation is fast enough.
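
If it is not fast enough and a newer interpreter is an option, os.scandir may help. This is an assumption beyond the question's setup: os.scandir exists from Python 3.5 (the with-statement form used below needs 3.6), not in the 2.7 used in the question. It returns directory entries that usually already know whether they are files or folders, so the per-entry isdir query can often be avoided. The function name find_by_name and its wanted parameter are made up for this sketch, and any speed-up on a particular share would need to be measured.

```python
import os

def find_by_name(top, wanted="thumbs.db"):
    """Yield paths under top whose file name equals wanted, ignoring case."""
    stack = [top]  # folders still to be listed
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                # is_dir() can usually be answered from the listing itself.
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.name.lower() == wanted:
                    yield entry.path
```

Usage would be along the lines of `for p in find_by_name('//server/share/folder/'): print(p)`. From Python 3.5 on, os.walk itself is built on os.scandir, so rerunning the original script on a newer interpreter is worth trying first.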

Update: The skipping of files/subdirs doesn't affect the traversal rate at all, but the overall time taken to process a directory tree could certainly be reduced because you avoid having to traverse potentially large subtrees of the main tree. The time saved is of course proportional to how much you skip. In your case, which looks like folders of images, it's unlikely you would save much time (unless the images were under revision control, when skipping subtrees owned by the revision control system might have some impact).

Additional update: Skipping folders is done by changing the dirs value in place:

for root, dirs, files in os.walk(path):
    for skip in ('.hg', '.git', '.svn', '.bzr'):
        if skip in dirs:
            dirs.remove(skip)
    # Now process other stuff at this level, i.e.
    # in directory "root". The skipped folders
    # won't be recursed into.
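
The same pruning can be wrapped in a small generator so the skip list lives in one place. This is only a sketch: walk_pruned and SKIP_NAMES are names invented here, not part of the os module. It relies on os.walk's default top-down mode, where changes made to dirs in place decide which subfolders are visited next.

```python
import os

# Folder names to skip; taken from the example above.
SKIP_NAMES = ('.hg', '.git', '.svn', '.bzr')

def walk_pruned(path, skip=SKIP_NAMES):
    """Like os.walk, but never descends into folders named in skip."""
    for root, dirs, files in os.walk(path):
        # Slice assignment changes the list object os.walk holds,
        # so the filtered-out folders are not recursed into.
        dirs[:] = [d for d in dirs if d not in skip]
        yield root, dirs, files
```

Rebinding the name instead (`dirs = [...]`) would have no effect, because os.walk would still hold the original list.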
