Read random lines from huge CSV file in Python

Question

I have this quite big CSV file (15 GB) and I need to read about 1 million random lines from it. As far as I can see, and implement, the CSV utility in Python only allows iterating sequentially through the file.

Reading the whole file into memory in order to pick lines at random is very memory consuming, and going through the whole file while discarding some values and choosing others is very time consuming. So, is there any way to choose a random line from the CSV file and read only that line?

I tried this without success:

    import csv

    with open('linear_e_LAN2A_F_0_435keV.csv') as file:
        reader = csv.reader(file)
        # fails: a csv.reader is an iterator and cannot be indexed
        print(reader[someRandomInteger])

Example of the CSV file:

331.093,329.735 
251.188,249.994 
374.468,373.782 
295.643,295.159 
83.9058,0 
380.709,116.221 
352.238,351.891 
183.809,182.615 
257.277,201.302
61.4598,40.7106


Recommended answer

import os
import random

path = 'really_big_file'
filesize = os.path.getsize(path)    # size of the really big file, in bytes
offset = random.randrange(filesize)

# binary mode, so that seek() accepts an arbitrary byte offset
with open(path, 'rb') as f:
    f.seek(offset)                  # go to random position
    f.readline()                    # discard - bound to be partial line
    random_line = f.readline()      # bingo! (a bytes object in binary mode)

    # extra to handle last/first line edge cases
    if len(random_line) == 0:       # we have hit the end
        f.seek(0)
        random_line = f.readline()  # so we'll grab the first line instead
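The question asks for about a million lines rather than one. A minimal sketch of repeating the seek above on a single open file might look like the following. The function name `sample_lines` and its parameters are hypothetical, the decoding step assumes UTF-8-compatible text, and the approach assumes one CSV record per line. Lines are drawn with replacement, so the same line can appear more than once, and the selection is still biased by line length.

```python
import os
import random


def sample_lines(path, k, rng=random):
    """Return k lines picked by seeking to random byte offsets (with replacement)."""
    filesize = os.path.getsize(path)
    picked = []
    with open(path, 'rb') as f:          # one open file, many seeks
        for _ in range(k):
            f.seek(rng.randrange(filesize))
            f.readline()                 # discard the partial line
            line = f.readline()
            if not line:                 # landed in the last line: wrap to the first
                f.seek(0)
                line = f.readline()
            # assumption: the file is UTF-8-compatible text
            picked.append(line.decode().rstrip('\r\n'))
    return picked
```

Because `csv.reader` accepts any iterable of strings, the result can be parsed with something like `rows = list(csv.reader(sample_lines('really_big_file', 1000000)))`.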

As @AndreBoos pointed out, this approach leads to biased selection: the random offset is more likely to land in a long line, so a line that follows a long line is picked more often. If you know the minimum and maximum line lengths, you can remove this bias as follows:

Let's assume (in this case) we have min=3 and max=15

1) Find the length (Lp) of the previous line, i.e. the line the random offset landed in.

If Lp = 3, the line is the most biased against, so we should take it 100% of the time. If Lp = 15, the line is the most biased towards: it is 5 times more likely to be selected, so we should take it only 20% of the time.

We accomplish this by randomly keeping the line with probability X, where:

X = min / Lp

If we don't keep the line, we make another random pick, repeating until the dice roll comes good. :-)
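The steps above can be sketched as a small rejection-sampling function. This is only an illustration under stated assumptions: the name `unbiased_random_line` and its parameters are hypothetical, `min_len` and `max_len` are taken to be line lengths in bytes including the newline, and every line is assumed to end in `\n`. It uses `max_len` to look backwards for the start of the line the offset landed in, which is one possible way of finding Lp.

```python
import os
import random


def unbiased_random_line(path, min_len, max_len, rng=random):
    """Return one line of `path` (as bytes), correcting the line-length bias.

    min_len and max_len are the shortest and longest line lengths in bytes,
    newline included (assumed to be known in advance).
    """
    filesize = os.path.getsize(path)
    with open(path, 'rb') as f:
        while True:
            offset = rng.randrange(filesize)

            # Find the start of the line that contains `offset` by looking
            # back at most max_len bytes for the preceding newline.
            window_start = max(0, offset - max_len)
            f.seek(window_start)
            window = f.read(offset - window_start)
            newline_at = window.rfind(b'\n')
            if newline_at < 0 and window_start > 0:
                raise ValueError('found a line longer than max_len')
            line_start = window_start + newline_at + 1

            f.seek(line_start)
            lp = len(f.readline())      # Lp: length of the previous line
            candidate = f.readline()    # the line the plain seek would return
            if not candidate:           # previous line was the last one: wrap
                f.seek(0)
                candidate = f.readline()

            # Keep the candidate with probability X = min / Lp, else re-roll.
            if rng.random() < min_len / lp:
                return candidate
```

The offset lands in a given line with probability Lp / filesize, and the following line is then kept with probability min / Lp, so each attempt returns any particular line with the same probability min / filesize. The price is extra attempts: the closer typical line lengths are to the maximum, the more picks are rejected.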
