How to find the difference between two large files in Tcl?
Problem description
I have two files, A.txt and B.txt, and some of their lines may be common to both. Both files are sorted.
I need to get the difference of A.txt and B.txt, i.e., a file C.txt that contains the lines of A.txt except those that also appear in B.txt.
I used the typical search-and-print algorithm: take a line from A.txt, search for it in B.txt, and if it is found, print nothing to C.txt; otherwise print that line to C.txt.
However, I am dealing with files that have a huge number of lines, and the script throws the error: failed to load too many files. (It works fine for smaller files.)
Can anybody suggest a more efficient way of getting C.txt?
Script to be used: Tcl only!
Recommended answer
First off, the too many files error is an indication that you're not closing a channel, probably in the B.txt scanner. Fixing that is probably your first goal. If you've got Tcl 8.6, try this helper procedure:
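    proc scanForLine {searchLine filename} {
        set f [open $filename]
        try {
            # Read line by line; gets returns -1 at end of file
            while {[gets $f line] >= 0} {
                if {$line eq $searchLine} {
                    return true
                }
            }
            return false
        } finally {
            # Runs on every exit path, so the channel is always closed
            close $f
        }
    }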
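For interpreters older than 8.6, which lack try/finally, the same guarantee can be approximated with catch. This is only a sketch: the name scanForLine85 is made up for illustration, and it assumes Tcl 8.5 or later, since it relies on the options dictionary that catch can return.

```tcl
# Hypothetical Tcl 8.5 counterpart of scanForLine (name is illustrative only).
proc scanForLine85 {searchLine filename} {
    set f [open $filename]
    # Run the scan inside catch so that an error cannot skip the close below.
    catch {
        set found false
        while {[gets $f line] >= 0} {
            if {$line eq $searchLine} {
                set found true
                break
            }
        }
        set found
    } result options
    close $f
    # Hand back the scan result, or re-raise the error it produced.
    return -options $options $result
}
```

Calling it as `scanForLine85 $line B.txt` for every line of A.txt would still rescan B.txt each time, so it addresses the leaked channels rather than the speed.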
However, if one of the files is small enough to fit into memory reasonably, you'd be far better off reading it into a hash table (e.g., a dictionary or array):
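    # Load every line of B.txt as an array key
    set f [open B.txt]
    # Compare against 0: gets returns -1 at end of file and 0 for an empty line
    while {[gets $f line] >= 0} {
        set B($line) "any dummy value; we'll ignore it"
    }
    close $f
    # Copy only the lines of A.txt that are not keys of B
    set in [open A.txt]
    set out [open C.txt w]
    while {[gets $in line] >= 0} {
        if {![info exists B($line)]} {
            puts $out $line
        }
    }
    close $in
    close $out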
This is much more efficient, but depends on B.txt being small enough.
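Since a dictionary is mentioned as an alternative to an array, here is a rough sketch of the same idea using dict, wrapped in a procedure. The name subtractLines and its three parameters are invented for this example; it assumes Tcl 8.6 (for try/finally) and that the lines of the second file fit in memory.

```tcl
# Hypothetical helper: write to fileC the lines of fileA that are absent from fileB.
proc subtractLines {fileA fileB fileC} {
    # Collect the lines of fileB as dictionary keys; the value is a placeholder.
    set seen [dict create]
    set f [open $fileB]
    try {
        while {[gets $f line] >= 0} {
            dict set seen $line 1
        }
    } finally {
        close $f
    }
    # Stream fileA and keep only lines that were not seen in fileB.
    set in [open $fileA]
    set out [open $fileC w]
    try {
        while {[gets $in line] >= 0} {
            if {![dict exists $seen $line]} {
                puts $out $line
            }
        }
    } finally {
        close $in
        close $out
    }
}
```

It would be called as `subtractLines A.txt B.txt C.txt`. Neither file needs to be sorted for this variant.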
If both A.txt and B.txt are too large for that, you are probably best doing some sort of processing by stages, writing things out to disk in between. This is getting rather more complex!
    set filter [open B.txt]
    set fromFile A.txt
    for {set tmp 0} {![eof $filter]} {incr tmp} {
        # Filter by a million lines at a time; that'll probably fit OK
        for {set i 0} {$i < 1000000} {incr i} {
            if {[gets $filter line] < 0} break
            set B($line) "dummy"
        }
        # Do the filtering; after the first pass, read the previous pass's output
        if {$tmp} {set fromFile $toFile}
        set from [open $fromFile]
        set to [open [set toFile /tmp/[pid]_$tmp.txt] w]
        while {[gets $from line] >= 0} {
            if {![info exists B($line)]} {
                puts $to $line
            }
        }
        close $from
        close $to
        # Keep control of temporary files and data
        if {$tmp} {file delete $fromFile}
        # -nocomplain: B is never created if a pass reads no lines
        # (empty B.txt, or a line count that is an exact multiple of a million)
        unset -nocomplain B
    }
    close $filter
    # -force replaces any C.txt left over from an earlier run
    file rename -force $toFile C.txt
Warning! I've not tested this code…
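Because the question states that both files are already sorted, a single merge-style pass is another option that keeps only one line from each file in memory. This is not part of the approaches above, just a minimal sketch under some assumptions: the procedure name sortedDifference is made up, Tcl 8.6 is available, and both files are sorted in the same order that string compare uses (plain character-code order, as opposed to a locale-aware or numeric sort).

```tcl
# Hypothetical helper: both inputs must be sorted in [string compare] order.
proc sortedDifference {fileA fileB fileC} {
    set a [open $fileA]
    set b [open $fileB]
    set c [open $fileC w]
    try {
        # haveB is false once fileB has been exhausted.
        set haveB [expr {[gets $b lineB] >= 0}]
        while {[gets $a lineA] >= 0} {
            # Skip lines of fileB that sort before the current line of fileA.
            while {$haveB && [string compare $lineB $lineA] < 0} {
                set haveB [expr {[gets $b lineB] >= 0}]
            }
            # Keep the line unless fileB currently holds an identical one.
            if {!$haveB || $lineB ne $lineA} {
                puts $c $lineA
            }
        }
    } finally {
        close $a
        close $b
        close $c
    }
}
```

It would be called as `sortedDifference A.txt B.txt C.txt`. Each file is read exactly once, so no temporary files are needed, but the result is only meaningful if the ordering assumption holds.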