How to look for the difference between two large files in Tcl?

Question

I have two files, A.txt and B.txt, and some of their contents may be common to both. Both files are sorted. I need to get the difference of A.txt and B.txt, i.e., a file C.txt that has the contents of A.txt except for the lines it shares with B.txt.

I used the typical search-and-print algorithm: take a line from A.txt and search for it in B.txt; if it is found, print nothing to C.txt, otherwise print that line to C.txt. But I am dealing with files with a huge number of lines, and it throws an error: failed to load too many files. (It works fine for smaller files.)

Can anybody suggest a more efficient way of getting C.txt? Script to be used: Tcl only!

Recommended answer

First off, the too many files error is an indication that you're not closing a channel, probably in the B.txt scanner. Fixing that is probably your first goal. If you've got Tcl 8.6, try this helper procedure:

proc scanForLine {searchLine filename} {
    set f [open $filename]
    try {
        while {[gets $f line] >= 0} {
            if {$line eq $searchLine} {
                return true
            }
        }
        return false
    } finally {
        close $f
    }
}
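
As a rough sketch of how that helper might be wired into the original search-and-print loop: the procedure name subtractByScan is hypothetical, and the code assumes scanForLine from above has already been defined and that Tcl 8.6 is available for try/finally. It still rescans B.txt once per line of A.txt, so it stays slow for large inputs, but it should no longer run out of channels.

```tcl
# Hypothetical wrapper (not part of the original answer).
# Relies on the scanForLine helper defined above.
proc subtractByScan {aFile bFile outFile} {
    set in [open $aFile]
    set out [open $outFile w]
    try {
        while {[gets $in line] >= 0} {
            # Keep the line only when bFile does not contain it
            if {![scanForLine $line $bFile]} {
                puts $out $line
            }
        }
    } finally {
        close $in
        close $out
    }
}
```

It could then be called as: subtractByScan A.txt B.txt C.txt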

However, if one of the files is small enough to fit into memory reasonably, you'd be far better off reading it into a hash table (e.g., a dictionary or array):

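# Load every line of B.txt into an array used as a hash set
set f [open B.txt]
while {[gets $f line] >= 0} {
    set B($line) "any dummy value; we'll ignore it"
}
close $f

# Copy to C.txt only those lines of A.txt that are not in B.txt
set in [open A.txt]
set out [open C.txt w]
while {[gets $in line] >= 0} {
    if {![info exists B($line)]} {
        puts $out $line
    }
}
close $in
close $out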

This is much more efficient, but depends on B.txt being small enough.

If both A.txt and B.txt are too large for that, you are probably best off doing some sort of processing by stages, writing things out to disk in between. This is getting rather more complex!

set filter [open B.txt]
set fromFile A.txt

for {set tmp 0} {![eof $filter]} {incr tmp} {
    # Filter by a million lines at a time; that'll probably fit OK
    for {set i 0} {$i < 1000000} {incr i} {
        if {[gets $filter line] < 0} break
        set B($line) "dummy"
    }

    # Do the filtering
    if {$tmp} {set fromFile $toFile}
    set from [open $fromFile]
    set to [open [set toFile /tmp/[pid]_$tmp.txt] w]
    while {[gets $from line] >= 0} {
        if {![info exists B($line)]} {
            puts $to $line
        }
    }
    close $from
    close $to

    # Keep control of temporary files and data
    if {$tmp} {file delete $fromFile}
    # B does not exist if the last batch read no lines, so don't complain
    unset -nocomplain B
}
close $filter
# -force replaces any C.txt left over from an earlier run
file rename -force $toFile C.txt

Warning! I've not tested this code…
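
Since the question says both files are sorted, a streaming merge is another option that needs no hash table and no temporary files. This is not the method from the answer above, only a minimal sketch under assumptions: the procedure name sortedDifference is hypothetical, Tcl 8.6 is assumed for try/finally, and both files must be sorted in the order that string compare uses (for plain ASCII data this matches sort run with LC_ALL=C). If the files were sorted under a different collation, the result may be wrong.

```tcl
# Hypothetical procedure: writes to outFile every line of aFile that does
# not occur in bFile. ASSUMES both inputs are sorted in [string compare] order.
proc sortedDifference {aFile bFile outFile} {
    set a [open $aFile]
    set b [open $bFile]
    set out [open $outFile w]
    try {
        # haveB is false once bFile has been exhausted
        set haveB [expr {[gets $b lineB] >= 0}]
        while {[gets $a lineA] >= 0} {
            # Skip lines of bFile that sort before the current line of aFile
            while {$haveB && [string compare $lineB $lineA] < 0} {
                set haveB [expr {[gets $b lineB] >= 0}]
            }
            # Emit the line unless bFile holds exactly the same line
            if {!$haveB || $lineB ne $lineA} {
                puts $out $lineA
            }
        }
    } finally {
        close $a
        close $b
        close $out
    }
}
```

It could be called as: sortedDifference A.txt B.txt C.txt. Each file is read once from start to finish, and only one line from each is held in memory at a time.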
