How to shuffle a text file on disk in Python
Question
I am working with a text file of about 12*10^6 rows stored on my hard disk. The structure of the file is:
data|data|data|...|data\n
data|data|data|...|data\n
data|data|data|...|data\n
...
data|data|data|...|data\n
There's no header, and there's no id to uniquely identify the rows.
Since I want to use it for machine learning purposes, I need to make sure that there's no order in the text file which may affect the stochastic learning.
Usually I load such files into memory and shuffle them before rewriting them to disk. Unfortunately this time that is not possible, due to the size of the file, so I have to manage the shuffling directly on disk (assume I don't have a problem with disk space). Any idea about how to effectively (with the lowest possible complexity, i.e. writes to disk) manage such a task with Python?
Answer
All but one of these ideas use O(N) memory, but if you use an array.array or numpy.ndarray we're talking around N*4 bytes, which is significantly smaller than the whole file. (I'll use a plain list for simplicity; if you need help converting to a more compact type, I can show that too.)
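As a sketch of the compact-index idea, here is how an array.array can hold the shuffled index list in roughly N*4 bytes (the `linecount` value below is a stand-in for the real ~12*10^6 line count):

```python
import random
from array import array

# Hypothetical size; the real file has about 12 * 10**6 lines.
linecount = 100_000

# A plain Python list of ints costs dozens of bytes per entry;
# an array of 4-byte signed ints keeps the index list near N*4 bytes.
indices = array('i', range(linecount))

# random.shuffle works on any mutable sequence, including array.array.
random.shuffle(indices)
```

The shuffled `indices` can then be used exactly like the plain list in the snippets below.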
Using a temporary database and an index list:
import contextlib
import dbm
import os
import random

with contextlib.closing(dbm.open('temp.db', 'n')) as db:
    with open(path) as f:
        linecount = 0
        for i, line in enumerate(f):
            db[str(i)] = line
            linecount = i + 1
    # shuffle in place; random.shuffle returns None
    shuffled = list(range(linecount))
    random.shuffle(shuffled)
    # dbm returns bytes, so write the output in binary mode
    with open(path + '.shuffled', 'wb') as f:
        for i in shuffled:
            f.write(db[str(i)])
os.remove('temp.db')
This is 2N single-line disk operations and 2N single-dbm-key disk operations, which should be the equivalent of about 2NlogN single disk operations, so the total complexity is O(NlogN).
If you use a relational database like sqlite3 instead of a dbm, you don't even need the index list, because you can just do this:
SELECT * FROM Lines ORDER BY RANDOM()
In theory this has the same time complexity as the above, and the space complexity is O(1) instead of O(N). In practice, you need an RDBMS that can feed you a row at a time from a 100M row set without storing that 100M on either side.
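A minimal sketch of the sqlite3 variant, using the `Lines` table name from the query above (the sample lines and the in-memory database are illustrative; for the real file you'd use an on-disk temporary database):

```python
import sqlite3

# Made-up sample lines standing in for the real 12M-line file.
lines = ['data|a\n', 'data|b\n', 'data|c\n', 'data|d\n']

conn = sqlite3.connect(':memory:')  # use an on-disk temp file for 12M lines
conn.execute('CREATE TABLE Lines (line TEXT)')
conn.executemany('INSERT INTO Lines (line) VALUES (?)',
                 ((l,) for l in lines))
conn.commit()

# The cursor yields one row at a time; SQLite orders by a per-row random key.
shuffled = [line for (line,) in
            conn.execute('SELECT line FROM Lines ORDER BY RANDOM()')]
conn.close()
```

In a real run you would iterate the cursor and write each row straight to the output file instead of collecting them into a list.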
A different option, without using a temporary database: in theory O(N**2), but in practice maybe faster if you happen to have enough memory for the line cache to be helpful:
import linecache
import random

with open(path) as f:
    linecount = sum(1 for _ in f)
# linecache.getline uses 1-based line numbers
shuffled = list(range(1, linecount + 1))
random.shuffle(shuffled)
with open(path + '.shuffled', 'w') as f:
    for i in shuffled:
        f.write(linecache.getline(path, i))
Finally, by doubling the size of the index list, we can eliminate the temporary disk storage. But in practice this might be a lot slower, because you're doing a lot more random-access reads, which drives aren't nearly as good at.
import random

# binary mode: tell() is disabled during text-mode iteration
with open(path, 'rb') as f:
    linestarts = [0]
    for line in f:
        linestarts.append(f.tell())
    lineranges = list(zip(linestarts, linestarts[1:]))
    random.shuffle(lineranges)
    with open(path + '.shuffled', 'wb') as f1:
        for start, stop in lineranges:
            f.seek(start)
            f1.write(f.read(stop - start))