Shuffle the lines of a large file


Problem description

Hello,

I am looking for a method to "shuffle" the lines of a large file.

I have a corpus of sorted and "uniqed" English sentences that has been
produced with (1):

(1) sort corpus | uniq > corpus.uniq

corpus.uniq is 80G large. The fact that every sentence appears only
once in corpus.uniq plays an important role in the processes
that I run on my corpus. Yet, the alphabetical order is an
unwanted side effect of (1): Very often, I do not want (or rather, I
do not have the computational capacities) to apply a program to all of
corpus.uniq. Yet, any series of lines of corpus.uniq is obviously a
very lopsided set of English sentences.

So, it would be very useful to do one of the following things:

- produce corpus.uniq in such a way that it is not sorted in any way
- shuffle corpus.uniq > corpus.uniq.shuffled

Unfortunately, none of the machines that I may use has 80G RAM.
So, using a dictionary will not help.

Any ideas?

Joerg Schuster

Solutions

There was a thread a while ago about choosing random lines from a file without reading the whole
file into memory. Would that help? Instead of shuffling the file, shuffle the users. I can't find
the thread though...

Kent
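
One common way to choose random lines without loading the file is reservoir sampling. The sketch below is only an illustration of that idea, not the code from the thread Kent mentions; the function name sample_lines and its parameters are made up for this example. It reads the file once and keeps at most k lines in memory.

```python
import random

def sample_lines(path, k, seed=None):
    """Return up to k lines chosen uniformly at random from the file at path.

    Reservoir sampling (Algorithm R): one sequential pass, O(k) memory.
    """
    rng = random.Random(seed)
    reservoir = []
    with open(path, "rb") as f:
        for n, line in enumerate(f):
            if n < k:
                # Fill the reservoir with the first k lines.
                reservoir.append(line)
            else:
                # Line n replaces a reservoir slot with probability k / (n + 1).
                j = rng.randrange(n + 1)
                if j < k:
                    reservoir[j] = line
    return reservoir
```

The result is a uniform random subset of the lines, but its internal order is not a uniform shuffle; call random.shuffle() on the returned list if the order matters.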

Instead of shuffling the file itself maybe you could index it (with dbm for
instance) and select random lines by using random indexes whenever you need a
sample.

Eddie
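
A minimal sketch of what Eddie describes could look like the following, assuming Python 3 and the standard dbm module. The names build_index and random_lines and the file paths in the usage note are hypothetical. The index maps a line number to the byte offset of that line, so a sample only needs one seek per requested line.

```python
import dbm
import random

def build_index(data_path, index_path):
    """Store the byte offset of every line in a dbm file; return the line count."""
    count = 0
    offset = 0
    with open(data_path, "rb") as data, dbm.open(index_path, "n") as index:
        for line in data:
            # Keys and values are stored as bytes: line number -> byte offset.
            index[str(count).encode()] = str(offset).encode()
            offset += len(line)
            count += 1
    return count

def random_lines(data_path, index_path, count, k, seed=None):
    """Return k distinct lines picked via random line numbers."""
    rng = random.Random(seed)
    picks = rng.sample(range(count), k)
    out = []
    with open(data_path, "rb") as data, dbm.open(index_path, "r") as index:
        for i in picks:
            data.seek(int(index[str(i).encode()]))
            out.append(data.readline())
    return out
```

Usage would be along the lines of n = build_index("corpus.uniq", "corpus.idx") once, followed by random_lines("corpus.uniq", "corpus.idx", n, 1000) whenever a sample is needed.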


On Monday 07 March 2005 14:36, Joerg Schuster wrote:

Any ideas?



The following program should do the trick (filenames are hardcoded, look at
top of file):

### shuffle.py

import random
import shelve

# Open external files needed for data storage. Binary mode keeps tell() and
# seek() positions as plain byte offsets.
lines = open("test.dat", "rb")
lineindex = shelve.open("test.idx")
newlines = open("test.new.dat", "wb")

# Create an index of all lines of the file in an external flat file DB.
# This means that nothing actually remains in memory, but in an extremely
# efficient (g)dbm flatfile DB.
def makeIdx():
    i = 0
    lastpos = 0
    while lines.readline():
        # This is after the (\r)\n, which will be stripped and rewritten
        # by writeNewLines().
        curpos = lines.tell()
        # Key: line number in hex. Value: "offset:length", both in hex.
        lineindex["%x" % i] = "%x:%x" % (lastpos, curpos - lastpos)
        lastpos = curpos
        i += 1
    return i

maxidx = makeIdx()

# To shuffle the file, just shuffle the index. Problem being: there is no
# random number generator which even remotely has the possibility of yielding
# all possible permutations. Thus, for simplicity: just exchange every element
# in order 0..end with a randomly chosen other element of the file. This is
# certainly no perfect shuffle, and in case the shuffling is too bad, just
# rerun shuffleIdx() a couple of times.
def shuffleIdx():
    if maxidx < 2:
        # Nothing to swap with; the inner loop below would never terminate.
        return
    for oldi in range(maxidx):
        oi = "%x" % oldi
        while True:
            ni = "%x" % random.randrange(maxidx)
            if ni != oi:
                break
        lineindex[oi], lineindex[ni] = lineindex[ni], lineindex[oi]

shuffleIdx()

# Write out the shuffled file. Do this by just walking the index 0..end.
def writeNewLines():
    for i in range(maxidx):
        # Extract line offset and line length from the index file.
        lidx, llen = [int(x, 16) for x in lineindex["%x" % i].split(":")]
        lines.seek(lidx)
        line = lines.read(llen).rstrip(b"\r\n")
        newlines.write(line + b"\n")

writeNewLines()

# Flush and close everything.
newlines.close()
lineindex.close()
lines.close()

### End shuffle.py

I don't know how fast this program will run, but at least, it does as
told... ;)

--
--- Heiko.
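
The comments in shuffle.py point out that its swapping scheme is not a perfect shuffle. For comparison, here is a rough sketch of the Fisher-Yates (Knuth) shuffle applied to a list of line offsets. It is not Heiko's code: the function names are made up, and it assumes that one 8-byte offset per line fits in RAM, which may or may not hold for an 80G corpus depending on the number of lines.

```python
import random
from array import array

def line_offsets(path):
    """Return an array holding the byte offset of every line in the file."""
    offsets = array("Q")  # unsigned 64-bit integers, 8 bytes per line
    pos = 0
    with open(path, "rb") as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def fisher_yates(seq, rng=random):
    """Shuffle seq in place. Every permutation is equally likely, up to the
    limits of the random number generator."""
    n = len(seq)
    for i in range(n - 1):
        # Pick j with i <= j < n; j == i (no swap) must be allowed.
        j = rng.randrange(i, n)
        seq[i], seq[j] = seq[j], seq[i]

def write_shuffled(src, dst, seed=None):
    """Write the lines of src to dst in shuffled order."""
    offsets = line_offsets(src)
    fisher_yates(offsets, random.Random(seed))
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for pos in offsets:
            fin.seek(pos)
            line = fin.readline()
            if not line.endswith(b"\n"):
                # The last line of src may lack a trailing newline.
                line += b"\n"
            fout.write(line)
```

The difference from shuffleIdx() is the range of the random pick: position i is swapped only with positions from i onwards, which is what makes the result unbiased. The output pass still performs one random seek per line, so it is likely to be slow on a file that does not fit in the page cache.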


