如何在Python中随机播放磁盘上的文本文件 [英] How to shuffle a text file on disk in Python

查看:57
本文介绍了如何在Python中随机播放磁盘上的文本文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用大约12 * 10 ^ 6行的文本文件,该文件存储在我的硬盘上.该文件的结构为:

I am working with a text file of about 12*10^6 rows which is stored on my hard disk. The structure of the file is:

data|data|data|...|data\n
data|data|data|...|data\n
data|data|data|...|data\n
...
data|data|data|...|data\n

没有标题,也没有ID来唯一标识行.

There's no header, and there's no id to uniquely identify the rows.

由于我想将其用于机器学习,因此我需要确保文本文件中没有顺序可能影响随机学习.

Since I want to use it for machine learning purposes, I need to make sure that there's no order in the text file which may affect the stochastic learning.

通常,我将此类文件上传到内存中,并在将它们重写到磁盘之前先对其进行洗牌.不幸的是,由于文件的大小,这一次是不可能的,所以我必须直接在磁盘上管理改组(假设我磁盘空间没有问题).关于如何使用Python有效(以最低的复杂度,即写入磁盘)来管理此类任务的任何想法?

Usually I upload such kind of files into memory, and I shuffle them before rewriting them to disk. Unfortunately this time it is not possible, due to the size of the file, so I have to manage the shuffling directly on disk(assume I don't have problem with the disk space). Any idea about how to effectively (with lowest possible complexity, i.e. writing to the disk) manage such task with Python?

推荐答案

除了上述想法之一外,所有想法都使用O(N)内存-但是,如果您使用的是 array.array numpy.ndarray 我们谈论的是N * 4个字节,该字节明显小于整个文件.(为简单起见,我将使用简单列表;如果您需要转换为更紧凑类型的帮助,我也可以展示出来.)

All but one of these ideas use O(N) memory—but if you use an array.array or numpy.ndarray we're talking around N*4 bytes, which is significantly smaller than the whole file. (I'll use a plain list for simplicity; if you need help converting to a more compact type, I can show that too.)

使用临时数据库和索引列表:

Using a temporary database and an index list:

with contextlib.closing(dbm.open('temp.db', 'n')) as db:
    with open(path) as f:
        for i, line in enumerate(f):
            db[str(i)] = line
    linecount = i
    shuffled = random.shuffle(range(linecount))
    with open(path + '.shuffled', 'w') as f:
        for i in shuffled:
            f.write(db[str(i)])
os.remove('temp.db')

这是2N个单行磁盘操作和2N个单dbm键磁盘操作,它们应该是2NlogN个单磁盘磁盘操作等效的操作,因此总复杂度为O(NlogN).

This is 2N single-line disk operations, and 2N single-dbm-key disk operations, which should be 2NlogN single-disk-disk-operation-equivalent operations, so the total complexity is O(NlogN).

如果使用诸如 sqlite3 之类的关系数据库而不是dbm,则您甚至不需要索引列表,因为您可以这样做:

If you use a relational database like sqlite3 instead of a dbm, you don't even need the index list, because you can just do this:

SELECT * FROM Lines ORDER BY RANDOM()

从理论上讲,这具有与上面相同的时间复杂度,并且空间复杂度是O(1)而不是O(N).实际上,您需要一个RDBMS,它可以一次从100M行集中为您提供一行,而又不会在那一侧存储该100M.

This has the same time complexity as the above, and the space complexity is O(1) instead of O(N)—in theory. In practice, you need an RDBMS that can feed you a row at a time from a 100M row set without storing that 100M on either side.

另一种选择,不使用临时数据库-理论上为O(N ** 2),但实际上,如果您碰巧有足够的内存来使行高速缓存发挥作用,则可能会更快:

A different option, without using a temporary database—in theory O(N**2), but in practice maybe faster if you happen to have enough memory for the line cache to be helpful:

with open(path) as f:
    linecount = sum(1 for _ in f)
shuffled = random.shuffle(range(linecount))
with open(path + '.shuffled', 'w') as f:
    for i in shuffled:
        f.write(linecache.getline(path, i))


最后,通过将索引列表的大小加倍,我们可以消除临时磁盘存储.但是在实践中,这可能会慢很多,因为您要进行更多的随机访问读取,而这些驱动器的性能却差强人意.


Finally, by doubling the size of the index list, we can eliminate the temporary disk storage. But in practice, this might be a lot slower, because you're doing a lot more random-access reads, which drives aren't nearly as good at.

with open(path) as f:
    linestarts = [f.tell() for line in f]
    lineranges = zip(linestarts, linestarts[1:] + [f.tell()])
    shuffled = random.shuffle(lineranges)
    with open(path + '.shuffled', 'w') as f1:
        for start, stop in shuffled:
            f.seek(start)
            f1.write(f.read(stop-start))

这篇关于如何在Python中随机播放磁盘上的文本文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆