bash - shuffle a file that is too large to fit in memory

Question

I've got a file that's too large to fit in memory. shuf seems to run in RAM, and sort -R doesn't shuffle (identical lines end up next to each other; I need all of the lines to be shuffled). Are there any options other than rolling my own solution?

Recommended answer

Using a form of the decorate-sort-undecorate pattern with awk, you can do something like:

$ seq 10 | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8-
8
5
1
9
6
3
7
2
10
4

For a file, you would do:

$ awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' SORTED.TXT | sort -n | cut -c8- > SHUFFLED.TXT

Or cat the file at the start of the pipeline.

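This works by generating a column of random numbers between 000000 and 999999 inclusive (decorate), sorting on that column (sort), then deleting the column (undecorate). Because the column is zero-padded to a fixed width, lexicographic and numeric order agree, so it should also work on platforms where sort does not understand numeric keys (use plain sort there instead of sort -n). Typical sort implementations (GNU sort, for example) spill to temporary files when the input does not fit in memory, which is what lets this approach handle files that shuf cannot.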
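The three stages can be wrapped in a small shell function. This is only a sketch of the pipeline above: the name shuffle_lines is hypothetical, and it assumes an awk whose rand() returns a value in [0, 1) and a sort that accepts -n.

```shell
# shuffle_lines is a hypothetical helper name, not a standard tool.
# It reads lines on stdin and writes them in random order on stdout.
#   decorate:   awk prefixes each line with a 6-digit random key and a space
#   sort:       sort -n orders the lines by that key
#   undecorate: cut drops the key and the space (the first 7 characters)
shuffle_lines() {
    awk 'BEGIN { srand() } { printf "%06d %s\n", rand() * 1000000, $0 }' |
        sort -n |
        cut -c8-
}
```

It would be used as shuffle_lines < SORTED.TXT > SHUFFLED.TXT. Note that srand() with no argument seeds from the time of day, so two runs started within the same second can produce the same order.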

You can increase that randomization, if desired, in several ways:

  1. If your platform's sort understands numerical values (POSIX, GNU and BSD do), you can do awk 'BEGIN{srand();} {printf "%0.15f\t%s\n", rand(), $0;}' FILE.TXT | sort -n | cut -f 2- which prints the random key with 15 decimal places, close to the precision of a double.

  2. If you are limited to a lexicographic sort, combine two calls to rand into one column, like so: awk 'BEGIN{srand();} {printf "%06d%06d\t%s\n", rand()*1000000,rand()*1000000, $0;}' FILE.TXT | sort | cut -f 2- which gives a composite 12 digits of randomization.
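A wider key matters mostly for very large inputs: with only six digits, a file with well over a million lines will contain many equal keys, and sort usually breaks such ties by comparing the rest of the line, so tied lines come out in sorted order. As a rough sketch of the 12-digit lexicographic variant (the function name shuffle_lines_wide is hypothetical, and LC_ALL=C is an assumption added here to force plain byte-wise comparison):

```shell
# shuffle_lines_wide is a hypothetical helper name, not a standard tool.
# It decorates each line with a fixed-width 12-digit key and a tab, sorts
# the decorated lines as plain text, then keeps everything after the first tab.
shuffle_lines_wide() {
    awk 'BEGIN { srand() }
         { printf "%06d%06d\t%s\n", rand() * 1000000, rand() * 1000000, $0 }' |
        LC_ALL=C sort |
        cut -f 2-
}
```

Because the key has a fixed width, plain text order and numeric order agree on it, so no -n is needed. Using cut -f 2- rather than a character offset keeps any tabs that appear inside the original lines.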
