Find thousands of files efficiently with exact match from a directory containing millions of files (bash/python/perl)

Problem description

I am on Linux and I am trying to find thousands of files from a directory (SOURCE_DIR) that contains millions of files. I have a list of file names that I need to find, stored in a single text file (FILE_LIST). Each line of this file contains a single name corresponding to a file in SOURCE_DIR, and there are thousands of lines in the file.

## FILE_LIST contains single-word file names, one per line
#Name0001
#Name0002
#..
#Name9999

I want to copy the files to another directory (DESTINATION_DIR). I wrote the loop below, with a nested loop that finds and copies them one by one.

#!/bin/bash
FILE_LIST='file.list'
## FILE_LIST contains single-word file names, one per line
#Name0001
#Name0002
#..
#Name9999

SOURCE_DIR='/path/to/source/files' # Contains millions of files in sub-directories
DESTINATION_DIR='/path/to/destination/files' # Files will be copied to here


while read -r FILE_NAME
do
    echo "$FILE_NAME"
    # Search the whole source tree anew for every single name -- this is the slow part
    for FILE_NAME_WITH_PATH in `find "$SOURCE_DIR" -maxdepth 3 -name "$FILE_NAME*" -type f -exec readlink -f {} \;`;
    do
        echo "$FILE_NAME_WITH_PATH"
        cp -pv "$FILE_NAME_WITH_PATH" "$DESTINATION_DIR";
    done
done < "$FILE_LIST"

This loop is taking a lot of time and I was wondering whether there is a better way to achieve my goal. I searched, but did not find a solution to my problem. Please direct me to a solution if one already exists, or kindly suggest any tweak to the above code. I am also fine with another approach, or even a python/perl solution. Thanks for your time and help!

Recommended answer

Note

The files to copy need to be found as they aren't given with a path (we don't know which directories they are in), but searching the tree anew for each one is extremely wasteful and increases complexity greatly.

Instead, build a hash with a full-path name for each filename first.

One way, with Perl, utilizing the fast core module File::Find

use warnings;
use strict;
use feature 'say';

use File::Find;
use File::Copy qw(copy);

my $source_dir = shift // '/path/to/source';  # give at invocation or default

my $copy_to_dir = '/path/to/destination';

my $file_list = 'file_list_to_copy.txt';  
open my $fh, '<', $file_list or die "Can't open $file_list: $!";
my @files = <$fh>;
chomp @files;


my %fqn;    
find( sub { $fqn{$_} = $File::Find::name  unless -d }, $source_dir );

# Now copy the ones from the list to the given location        
foreach my $fname (@files) { 
    copy $fqn{$fname}, $copy_to_dir  
        or do { 
            warn "Can't copy $fqn{$fname} to $copy_to_dir: $!";
            next;
        };
}

The remaining problem is about filenames that may exist in multiple directories, but we need to be given a rule for what to do then.

I disregard the maximal depth used in the question, since it is unexplained and seemed to me to be a fix related to extreme runtimes (?). Also, files are copied into a "flat" structure (without restoring their original hierarchy), taking the cue from the question.

Finally, I skip only directories, while various other file types come with their own issues (copying links around needs care). To accept only plain files, change unless -d to if -f, as shown below.
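
For instance, the find call from the first script would then read:

# Record only plain files; directories, links, and other types are skipped
find( sub { $fqn{$_} = $File::Find::name if -f }, $source_dir );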

A clarification came that, indeed, there may be files with the same name in different directories. Those should be copied to the same name, suffixed with a sequential number before the extension.

For this we need to check whether a name exists already, and to keep track of the duplicate ones while building the hash, so this will take a little longer. Then there is the small conundrum of how to account for the duplicate names: I use another hash where only duped names are kept, in arrayrefs; this simplifies and speeds up both parts of the job.

my (%fqn, %dupe_names);
find( sub {
    return if -d;
    (exists $fqn{$_})
        ? push( @{ $dupe_names{$_} }, $File::Find::name )
        : ( $fqn{$_} = $File::Find::name );
}, $source_dir );

To my surprise, this runs barely a little slower than the code with no concern for duplicate names, on a quarter million files spread over a sprawling hierarchy, even though a test now runs for each item.

The parens around the assignment in the ternary operator are needed because the ternary operator itself can be assigned to (if the last two arguments are valid "lvalues," as they are here), so without the parens the assignment would bind to the whole ternary rather than to the last branch. One needs to be careful with assignments inside the branches.
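
Here is a minimal, self-contained illustration of that precedence pitfall (the scalars are made up just for the example):

use strict;
use warnings;
use feature 'say';

my ($x, $y) = (1, 2);
my $cond = 0;

# The ternary binds tighter than assignment, so without parentheses this
# parses as ( $cond ? $x : $y ) = 10; the ternary itself is the lvalue.
$cond ? $x : $y = 10;

say "$x $y";   # prints "1 10": $y was assigned, which may not be intended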

Then, after copying %fqn as in the main part of the post, also copy the other files with the same name. We need to break up the filenames so as to add the enumeration before .ext; I use the core File::Basename

use File::Basename qw(fileparse);

foreach my $fname (@files) { 
    next if not exists $dupe_names{$fname};  # no dupe (and copied already)
    my $cnt = 1;
    foreach my $fqn (@{$dupe_names{$fname}}) { 
        my ($name, $path, $ext) = fileparse($fqn, qr/\.[^.]*/); 
        copy $fqn, "$copy_to_dir/${name}_$cnt$ext"
            or do { 
                warn "Can't copy $fqn to $copy_to_dir: $!";
                next;
            };
        ++$cnt;
    }
}

(basic testing done but not much more)

I'd perhaps use undef instead of $path above, to indicate that the path is unused (while that also avoids allocating and populating a scalar), but I left it this way for clarity for those unfamiliar with what the module's sub returns.
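
That alternative would look like this:

# Same call, but the unused path component is simply discarded
my ($name, undef, $ext) = fileparse($fqn, qr/\.[^.]*/);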

Note.   For files with duplicates there'll be copies fname.ext, fname_1.ext, etc. If you'd rather have them all indexed, then first rename fname.ext (in the destination, where it has already been copied via %fqn) to fname_1.ext, and change counter initialization to my $cnt = 2;.
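
Here is a sketch of that variant, assuming the same @files, %dupe_names, %fqn copies, and $copy_to_dir as above (move, like copy, is provided by the core File::Copy module):

use File::Copy qw(copy move);
use File::Basename qw(fileparse);

foreach my $fname (@files) {
    next if not exists $dupe_names{$fname};   # no dupes for this name
    my ($name, undef, $ext) = fileparse($fname, qr/\.[^.]*/);
    # The first copy already sits in the destination under its plain name
    # (copied via %fqn); rename it so that every duplicate ends up indexed
    move "$copy_to_dir/$fname", "$copy_to_dir/${name}_1$ext"
        or warn "Can't rename $copy_to_dir/$fname: $!";
    my $cnt = 2;
    foreach my $fqn (@{$dupe_names{$fname}}) {
        copy $fqn, "$copy_to_dir/${name}_$cnt$ext"
            or do {
                warn "Can't copy $fqn to $copy_to_dir: $!";
                next;
            };
        ++$cnt;
    }
}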

Note that these by no means need to be the same files.
