One-liner to split very large directory into smaller directories on Unix


Problem Description

How do you split a very large directory, containing potentially millions of files, into smaller directories, each holding at most some custom-defined number of files, such as 100 per directory, on UNIX?

Bonus points if you know of a way to have wget download files into these subdirectories automatically. So if there are 1 million .html pages at the top-level path at www.example.com, such as

/1.html
/2.html
...
/1000000.html

and we only want 100 files per directory, it will download them to folders something like

./www.example.com/1-100/1.html
...
./www.example.com/999901-1000000/1000000.html

I only really need to be able to run the UNIX command on the folder after wget has downloaded the files, but if it's possible to do this with wget as it's downloading, I'd love to know!

Solution

You can run this through a couple of loops, which should do the trick (at least for the numeric part of the file name). I think that doing this as a one-liner is over-optimistic.

#! /bin/bash
for hundreds in {0..99}
do
    min=$(($hundreds*100+1))
    max=$(($hundreds*100+100))
    current_dir="$min-$max"
    mkdir $current_dir
    for ones_tens in {1..100}
    do
        current_file="$(($hundreds*100+$ones_tens)).html"
        #touch $current_file 
        mv $current_file $current_dir
    done
done

I did performance testing by first commenting out mkdir $current_dir and mv $current_file $current_dir and uncommenting touch $current_file. This created 10000 files (one-hundredth of your target of 1000000 files). Once the files were created, I reverted to the script as written:

$ time bash /tmp/test.bash 2>&1 

real        0m27.700s
user        0m26.426s
sys         0m17.653s
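For reference, the 10000-file fixture for such a test can also be created directly, without editing the script; a minimal sketch, assuming the purely numeric file names from the question:

#!/bin/bash
# Create 10000 empty N.html files as a benchmarking fixture
# (the same names the splitting script above expects to find).
for ((n=1; n<=10000; n++))
do
    touch "$n.html"
done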

As long as you aren't moving files across file systems, the time for each mv command should be constant, so you should see similar or better performance. Scaling this up to a million files would give you around 2770 seconds, i.e. 46 minutes. There are several avenues for optimization, such as moving all files for a given directory in one command, or removing the inner for loop.
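A minimal sketch of the "one mv per directory" optimization mentioned above, assuming the files still sit flat in the current directory with the numeric names from the question:

#!/bin/bash
# Move files in batches: one mv invocation per target directory
# instead of one per file.
total=10000        # set to 1000000 for the full-size run
per_dir=100
for ((start=1; start<=total; start+=per_dir))
do
    end=$((start + per_dir - 1))
    dir="$start-$end"
    mkdir -p "$dir"
    files=()
    for ((n=start; n<=end; n++))
    do
        files+=("$n.html")
    done
    mv "${files[@]}" "$dir"/    # single mv for the whole bucket
done

This replaces ten thousand mv invocations with one hundred, which should remove most of the per-file process overhead.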

Doing the 'wget' to grab a million files is going to take far longer than this, and is almost certainly going to require some optimization; saving the bandwidth spent on HTTP headers alone will cut hours off the run time. I don't think a shell script is the right tool for that job; using a library such as WWW::Curl on CPAN will be much easier to optimize.
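On the bonus part of the question, downloading straight into the numbered subdirectories can at least be sketched with wget's -P (--directory-prefix) option; the host name, file count, and bucket size below are assumptions copied from the example in the question, and spawning one wget per file is left unoptimized on purpose:

#!/bin/bash
# Download N.html directly into its numbered bucket directory.
# Assumes the flat /N.html layout from the question.
base="http://www.example.com"
total=1000000
per_dir=100
for ((start=1; start<=total; start+=per_dir))
do
    end=$((start + per_dir - 1))
    dir="www.example.com/$start-$end"
    mkdir -p "$dir"
    for ((n=start; n<=end; n++))
    do
        wget -q -P "$dir" "$base/$n.html"    # -P sets the download directory
    done
done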
