Iterating a large amount of files with Batch

Problem description

I wrote a short batch script that iterates through the files of a directory and its subdirectories. In total there are more than a million files. My batch is working as intended if I use it for smaller numbers of files and directories. But if I try to use it for all of them, it just seems to never stop working. My impression is that the script needs to "check" every file before I get an output. So my question is: Is there a way to get this done faster, or at least to test if the batch is working at all?

Here is my sample code:

FOR /F "delims=*" %%i IN ('dir /s /b *.txt') do echo "test"

Thanks in advance!

Recommended Answer

EDITED to include information discussed in comments

The original answer to this question was

for /r "c:\startingPoint" %%a in (*.txt) do echo %%~fa

which works as intended by the OP: it will recursively process files as they are located on disk, with no wait or pause, or at least with no unnecessary pause (of course the first file needs to be found).
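As a minimal sketch, assuming the files live under a placeholder folder c:\startingPoint, the same command could be used inside a batch file like this (%%~fa expands to the full path of each file found):

@echo off
rem Recursively visit every *.txt file under c:\startingPoint (placeholder path)
for /r "c:\startingPoint" %%a in (*.txt) do (
    echo %%~fa
)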

What is the difference between the answer and the original code

FOR /F "delims=*" %%i IN ('dir /s /b *.txt') do echo "test"

in the question?

In general, for /f is used to iterate over a set of lines instead of a set of files, executing the code in the body of the for command for each of the lines. The in clause of the command defines from "where" to retrieve the set of lines.

此"where"可以是磁盘上要读取的文件,也可以是要执行的命令或命令集,并且将处理其输出.在这两种情况下,所有数据都将在开始处理之前被完全检索.在所有数据都存储在内存缓冲区中之前,for命令主体中的代码不会执行.

This "where" can be a file on disk to be read or a command or set of commands to execute and whose output will be processed. In both cases, all the data is fully retrieved before start processing it. Until all the data is in a memory buffer, the code in the body of the for command is not executed.

And this is where a difference appears.

When a file on disk is read, for /f gets the size of the file, allocates a memory buffer big enough to accommodate the full file in memory, reads the file into the buffer and starts to process the buffer (and of course, you cannot use for /f to process a file bigger than the available memory).

But when for /f processes a command, it allocates a starting buffer and appends data into it from the stdout stream of the executed command. When the buffer is full, a new larger buffer is allocated, the data from the old buffer is copied into the new one, the old buffer is discarded, and new data is retrieved at the appropriate point of the new buffer. This process is repeated each time the buffer gets full, and the situation is exacerbated by the fact that the buffer grows in small increments.

So, when the data generated by the command is very large, a lot of memory allocation, copying and freeing is done. And this takes time. For large data, a lot of time.

Summarizing: if for /f is used to process the output of a command and the data to process is large, the time needed to do it will increase exponentially.

How to avoid it? The problem (in this case) is retrieving the data from the command, not processing it. So, when the volume of data is really big, instead of the usual for /f %%a in (' command ' ) .... syntax, it is better to execute the command redirecting its output to a temporary file and then use for /f to process that file. Generating the data will take the same amount of time, but the difference in data-processing delay can go from hours to seconds or minutes.
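A minimal sketch of that temporary-file approach, assuming the same *.txt search as in the question (the temporary file name filelist.tmp is arbitrary):

rem Enumerate the files once, writing the list to a temporary file
dir /s /b *.txt > "%temp%\filelist.tmp"

rem Process the list; the file is read into a single buffer of known size
for /f "usebackq delims=" %%a in ("%temp%\filelist.tmp") do echo %%a

del "%temp%\filelist.tmp"

The dir command still needs the same time to enumerate the files, but for /f now reads a file whose size is known up front instead of growing a buffer piece by piece from a command's output.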
