bash - shuffle a file that is too large to fit in memory
Question
I've got a file that's too large to fit in memory. shuf seems to run in RAM, and sort -R doesn't shuffle (identical lines end up next to each other; I need all of the lines to be shuffled). Are there any options other than rolling my own solution?
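The sort -R behaviour described in the question can be reproduced with a small input. This is only an illustrative sketch: it assumes a sort that supports -R (GNU coreutils does), and the helper name demo_sort_R is made up for this example.

```shell
# demo_sort_R is a hypothetical helper name used only for this illustration.
# sort -R orders lines by a random hash of each key, so identical lines
# receive the same hash and end up next to each other.
demo_sort_R() {
    printf 'a\nb\na\nb\na\nb\n' | sort -R
}

# Collapsing adjacent duplicates leaves exactly one line per distinct value,
# which shows that the duplicates were grouped rather than interleaved.
demo_sort_R | uniq
```

The order of the two groups changes between runs, but the three a lines and the three b lines always stay together.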
Recommended answer
Using a form of the decorate-sort-undecorate pattern and awk, you can do something like:
$ seq 10 | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8-
8
5
1
9
6
3
7
2
10
4
For a file, you would do:
$ awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' SORTED.TXT | sort -n | cut -c8- > SHUFFLED.TXT
Alternatively, cat the file into the start of the pipeline.
This works by generating a column of random numbers between 000000 and 999999 inclusive (decorate); sorting on that column (sort); then deleting the column (undecorate). Because the column is zero-padded to a fixed width, a lexicographic sort produces the same order as a numeric one, so this should also work on platforms where sort does not understand numerics.
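The three stages can be wrapped in a small function and sanity-checked. This is a minimal sketch rather than part of the original answer: the name shuffle_lines is hypothetical, and the check only confirms that the output is a permutation of the input, not that it is well randomized.

```shell
# shuffle_lines is a hypothetical helper name, not from the original answer.
# It reads lines on stdin and writes them in random order on stdout.
#   decorate:   awk prefixes each line with a 6-digit zero-padded random key and a space
#   sort:       sort -n orders the lines by that key
#   undecorate: cut -c8- drops the 7-character prefix (6 digits + 1 space)
shuffle_lines() {
    awk 'BEGIN { srand() } { printf "%06d %s\n", rand() * 1000000, $0 }' |
        sort -n |
        cut -c8-
}

# Sanity check: sorting the shuffled output must give back the sorted input.
original=$(seq 100 | sort)
shuffled=$(seq 100 | shuffle_lines | sort)
[ "$original" = "$shuffled" ] && echo "permutation check passed"
```

Note that srand() with no argument seeds from the time of day, so two runs started within the same second may produce the same order.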
You can increase that randomization, if desired, in several ways:
- If your platform's sort understands numerical values (POSIX, GNU and BSD do), you can use a near double-precision float as the random key:
  $ awk 'BEGIN{srand();} {printf "%0.15f\t%s\n", rand(), $0;}' FILE.TXT | sort -n | cut -f 2-
- If you are limited to a lexicographic sort, just combine two calls to rand into one column, which gives a composite 12 digits of randomization:
  $ awk 'BEGIN{srand();} {printf "%06d%06d\t%s\n", rand()*1000000,rand()*1000000, $0;}' FILE.TXT | sort -n | cut -f 2-
  Since the key is zero-padded to a fixed width, a plain sort without -n orders it the same way.
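For a file that genuinely does not fit in memory, the approach depends on sort spilling to temporary files, which GNU sort does once its buffer fills. The sketch below assumes a sort that accepts -S (buffer size) and -T (temporary directory), as GNU coreutils sort does. The function name big_shuffle, the 256M default and the positional arguments are illustrative choices, not part of the original answer.

```shell
# big_shuffle is a hypothetical helper; assumes a sort with -S and -T (e.g. GNU sort).
#   $1 = input file
#   $2 = output file
#   $3 = directory for sort's temporary files (optional, defaults to $TMPDIR or /tmp)
#   $4 = sort memory buffer size (optional, defaults to 256M)
# LC_ALL=C keeps the decimal point in the key consistent between awk and sort -n,
# since sort -n uses the locale's radix character.
big_shuffle() {
    LC_ALL=C awk 'BEGIN { srand() } { printf "%0.15f\t%s\n", rand(), $0 }' "$1" |
        LC_ALL=C sort -n -S "${4:-256M}" -T "${3:-${TMPDIR:-/tmp}}" |
        cut -f 2- > "$2"
}

# Example call (paths are placeholders):
#   big_shuffle SORTED.TXT SHUFFLED.TXT /var/tmp 512M
```

The temporary directory needs enough free space to hold the decorated copy of the file, so pointing -T at a large disk is the main tuning knob here.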