One-liner to split very large directory into smaller directories on Unix
Question
How do you split a very large directory, containing potentially millions of files, into smaller directories with some custom-defined maximum number of files, such as 100 per directory, on UNIX?

Bonus points if you know of a way to have wget download files into these subdirectories automatically. So if there are 1 million .html pages at the top-level path at www.example.com, such as

/1.html
/2.html
...
/1000000.html

and we only want 100 files per directory, it should download them to folders something like

./www.example.com/1-100/1.html
...
./www.example.com/999901-1000000/1000000.html

I only really need to be able to run the UNIX command on the folder after wget has downloaded the files, but if it's possible to do this with wget as it's downloading, I'd love to know!
You can run this through a couple of loops, which should do the trick (at least for the numeric part of the file name). I think that doing this as a one-liner is over-optimistic.
#!/bin/bash
# Split files 1.html .. 10000.html into directories of 100 files each.
for hundreds in {0..99}
do
    min=$((hundreds*100+1))
    max=$((hundreds*100+100))
    current_dir="$min-$max"
    mkdir "$current_dir"
    for ones_tens in {1..100}
    do
        current_file="$((hundreds*100+ones_tens)).html"
        #touch "$current_file"    # uncomment (and comment out mkdir/mv) to generate test files
        mv "$current_file" "$current_dir"
    done
done
I did performance testing by first commenting out mkdir $current_dir and mv $current_file $current_dir and uncommenting touch $current_file. This created 10000 files (one-hundredth of your target of 1000000 files). Once the files were created, I reverted to the script as written:
$ time bash /tmp/test.bash 2>&1
real 0m27.700s
user 0m26.426s
sys 0m17.653s
As long as you aren't moving files across file systems, the time for each mv command should be constant, so you should see similar or better performance. Scaling this up to a million files (100 times as many) would give you around 2770 seconds, i.e. about 46 minutes. There are several avenues for optimization, such as moving all the files for a given directory in one command, or removing the inner for loop.
Doing the wget to grab a million files is going to take far longer than this, and is almost certainly going to require some optimization; saving the bandwidth spent on HTTP headers alone would cut hours off the run time. I don't think a shell script is the right tool for that job; a library such as WWW::Curl on CPAN will be much easier to optimize.
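For the wget half of the question, one low-tech possibility (a hypothetical sketch, not from the answer) is to generate one wget command per page, using wget's -P/--directory-prefix option to drop each page straight into its batch directory. Printing the commands first lets you inspect them before piping the output to sh:

```shell
#!/bin/bash
# Hypothetical sketch: emit one wget command per page, targeting the
# 100-file batch directories directly via -P (--directory-prefix),
# which creates the directory if needed. Shown for the first 200 pages
# only; www.example.com is the placeholder host from the question.
set -e
gen_wget_cmds() {
    for batch in 0 1
    do
        min=$((batch*100+1))
        max=$((batch*100+100))
        dir="./www.example.com/$min-$max"
        for n in $(seq "$min" "$max")
        do
            echo wget -q -P "$dir" "http://www.example.com/$n.html"
        done
    done
}
gen_wget_cmds    # inspect the commands; pipe to sh to actually download
```

Even so, as the answer notes, this issues one connection per file; for a million pages a proper HTTP client library with keep-alive connections would still be the better tool.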