Slow processing of a for loop that uses findstr

Problem description

I've got a somewhat weird case where a for loop is incredibly slow when I use findstr as the command it iterates over.

It's worth mentioning that the file (old-file.xml) I'm processing contains about 200 000 lines.

This part is blazing fast, but can be rendered slower if I remove | find /c ":"

rem find total number of lines in xml-file
findstr /n ^^ old-file.xml | find /c ":" > "temp-count.txt"
set /p lines=< "temp-count.txt"

The slow code looks like this, and I can't use the pipe trick above. The slow part seems to be the for itself, as I'm not seeing any progress in the title bar until after 10 minutes.

setlocal DisableDelayedExpansion
rem start replacing wrong dates with correct date
for /f "usebackq Tokens=1* Delims=:" %%i in (`"findstr /n ^^ old-file.xml"`) do (
    rem cache the value of each line in a variable
    set read-line=%%j
    set line=%%i
    rem restore delayed expansion
    setlocal EnableDelayedExpansion
    rem write progress in title bar
    title Processing line: !line!/%lines%
    rem remove trailing line number
    rem set read-line=!read-line:*:=!
    for /f "usebackq" %%i in ("%tmpfile%") do (
        rem replace all wrong dates with correct dates
        set read-line=!read-line:%%i=%correctdate%!
    )
    rem write results to new file
    echo(!read-line!>>"Updated-file.xml"
    rem end local
    endlocal
)

EDIT:

Further investigation showed that this single line, which should display the current line number being looped over, takes about 10 minutes on my 8 MB file of 200 000 lines. That's just to get it to start displaying the lines.

for /f "usebackq Tokens=1* Delims=:" %%i in (`"findstr /n ^^ old-file.xml"`) do echo %%i

So it seems like findstr is writing screen output that is hidden from the user but visible to the for loop. How can I prevent that from happening while still getting the same results?

EDIT 2: Solution

The solution as proposed by Aacini and finally revised by me.

This is a snippet from a much bigger script. The wrong dates are retrieved in another loop, and the total number of lines is also retrieved in another loop.

setlocal enabledelayedexpansion
rem this part is for snippet only, dates are generated from another loop in final script 
echo 2069-04-29 > dates-tmp.txt
echo 2069-04-30 >> dates-tmp.txt

findstr /n ^^ Super-Large-File.xml > out.tmp

set tmpfile=dates-tmp.txt
set correctdate=2011-11-25
set wrong-dates=
rem hardcoded total number of lines
set lines=186442
for /F %%i in (%tmpfile%) do (
    set wrong-dates=!wrong-dates! %%i
)
rem process each line in out.tmp and loop them through :ProcessLines
call :ProcessLines < out.tmp
rem when finished with above call for each line in out.tmp, goto exit
goto ProcessLinesEnd
:ProcessLines
for /L %%l in (1,1,%lines%) do (
    set /P read-line=
    rem write progress in title bar
    title Processing line: %%l/%lines%
    for %%i in (%wrong-dates%) do (
        rem replace all wrong dates with correct dates
        set read-line=!read-line:%%i=%correctdate%!
    )
    rem write results to new file
    echo(!read-line:*:=!>>"out2.tmp"
)
rem end here and continue below
goto :eof

:ProcessLinesEnd
echo this should not be printed until call has ended

:exit
exit /b

Solution

Two points here:

1- The setlocal EnableDelayedExpansion command is executed for every line of the file. This means that the complete environment must be copied to a new local memory area about 200000 times. This may cause several problems.

2- I suggest you start with the most basic part. How much time does findstr take to execute? Run findstr /n ^^ old-file.xml alone and check this before trying to fix any other part. If this process is fast, add a single step to it and test again, and repeat until you discover the cause of the slowdown. I suggest you use neither pipes nor for /f over the execution of findstr, but run for /f over a file generated by a previous redirection.
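The staged test described in point 2 could be sketched roughly as follows. This is only an illustration, not the original poster's script: the temp-file name numbered.tmp is a made-up name, and the loop body only updates the title bar so that the timing of the loop itself can be observed.

```batch
@echo off
rem Sketch only: numbered.tmp is a hypothetical temp-file name.
setlocal DisableDelayedExpansion

rem Step 1: run findstr on its own and redirect its output to a file.
echo Start:        %time%
findstr /n ^^ old-file.xml > numbered.tmp
echo findstr done: %time%

rem Step 2: iterate over the generated file instead of over the command output.
for /f "usebackq tokens=1* delims=:" %%i in ("numbered.tmp") do (
    rem %%i holds the line number, %%j holds the rest of the line
    title Processing line: %%i
)
echo Loop done:    %time%
endlocal
```

Comparing the three timestamps shows whether the time goes into findstr itself or into the loop, which is the isolation step the answer recommends before changing anything else.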

EDIT: A faster solution

There is another way to do this. You can feed the findstr output into a Batch subroutine through redirection, so the lines can be read with the SET /P command. This method processes the lines entirely via delayed expansion rather than via the command-line substitution of FOR /F, so the pair of setlocal EnableDelayedExpansion and endlocal commands is no longer necessary. However, if you still want to display the line number, it has to be calculated again.

Also, it is faster to load the wrong dates into a variable than to process %tmpfile% for every line of the big file.

setlocal EnableDelayedExpansion
rem load wrong dates from tmpfile
set wrong-dates=
for /F %%i in (%tmpfile%) do (
    set wrong-dates=!wrong-dates! %%i
)
echo creating findstr output, please wait...
findstr /n ^^ old-file.xml > findstr.txt
echo :EOF>> findstr.txt
rem start replacing wrong dates with correct date
call :ProcessLines < findstr.txt
goto :eof
:ProcessLines
set line=0
:read-next-line
set /P read-line=
rem check if the input file ends
if !read-line! == :EOF goto :eof
rem write progress in title bar
set /A line+=1
title Processing line: %line%/%lines%
for %%i in (%wrong-dates%) do (
    rem replace all wrong dates with correct dates
    set read-line=!read-line:%%i=%correctdate%!
)
rem write results to new file
echo(!read-line:*:=!>>"Updated-file.xml"
rem go back for next line
goto read-next-line

SECOND EDIT: An even faster modification

The previous method may be slightly sped up if the loop is done with a for /L command instead of a goto.

:ProcessLines
for /L %%l in (1,1,%lines%) do (
    set /P read-line=
    rem write progress in title bar
    title Processing line: %%l/%lines%
    for %%i in (%wrong-dates%) do (
        rem replace all wrong dates with correct dates
        set read-line=!read-line:%%i=%correctdate%!
    )
    rem write results to new file
    echo(!read-line:*:=!>>"Updated-file.xml"
)

This modification also omits the :EOF comparison and the calculation of the line number, so the time gain may be significant after repeating it 200000 times. If you use this method, don't forget to delete the echo :EOF>> findstr.txt line in the first part.
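The for /L version depends on %lines% holding the number of lines to read. One possible way to fill it, instead of hardcoding the value, is sketched below. It assumes findstr.txt has already been created and that the :EOF marker line has been left out; find /c /v "" is a common counting idiom and was not part of the original answer.

```batch
@echo off
rem Sketch only: stores the number of lines in findstr.txt in the variable "lines".
rem find /v "" selects every line (no line contains the empty string) and /c prints the count.
set lines=0
for /f %%c in ('find /c /v "" ^< findstr.txt') do set lines=%%c
echo Total lines: %lines%
```

Because the input is redirected into find, the output is just the number with no file-name prefix, so the for /f can assign it directly.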
