Read random lines from a huge CSV file in Python
Question
I have a fairly large CSV file (15 GB) and I need to read about 1 million random lines from it. As far as I can tell, the csv module in Python only allows iterating sequentially through the file.
Reading the whole file into memory in order to pick random lines uses a lot of memory, and going through the whole file, discarding some values and keeping others, takes a lot of time. Is there any way to choose a random line from the CSV file and read only that line?
I tried this, without success:
import csv
with open('linear_e_LAN2A_F_0_435keV.csv') as file:
    reader = csv.reader(file)
    # fails: a csv.reader object is an iterator and cannot be indexed
    print(reader[someRandomInteger])
Example of the CSV file:
331.093,329.735
251.188,249.994
374.468,373.782
295.643,295.159
83.9058,0
380.709,116.221
352.238,351.891
183.809,182.615
257.277,201.302
61.4598,40.7106
Recommended answer
import os
import random

path = 'really_big_file'
filesize = os.path.getsize(path)  # size of the really big file, in bytes
offset = random.randrange(filesize)

# binary mode, so that seeking to an arbitrary byte offset is well defined
with open(path, 'rb') as f:
    f.seek(offset)              # go to a random position
    f.readline()                # discard: bound to be a partial line
    random_line = f.readline()  # bingo! (a bytes object)

    # extra to handle last/first line edge cases
    if len(random_line) == 0:   # we have hit the end
        f.seek(0)
        random_line = f.readline()  # so we'll grab the first line instead
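Since the question asks for about 1 million lines, the same seek-and-discard trick can be repeated in a loop. The following is only a sketch under some assumptions: the helper name `sample_rows` is made up for illustration, the file is assumed to be UTF-8 with no quoted fields spanning several lines, and rows are drawn with replacement, so duplicates are possible.

```python
import csv
import os
import random


def sample_rows(path, n, rng=random):
    """Draw n rows (with replacement) by seeking to random byte offsets.

    Hypothetical helper: rows that follow long lines are picked more often.
    """
    filesize = os.path.getsize(path)
    rows = []
    with open(path, 'rb') as f:
        while len(rows) < n:
            f.seek(rng.randrange(filesize))  # jump to a random byte
            f.readline()                     # discard the partial line
            line = f.readline()
            if not line:                     # ran off the end: wrap around
                f.seek(0)
                line = f.readline()
            # parse the single line with the csv module
            rows.append(next(csv.reader([line.decode('utf-8')])))
    return rows
```

For example, `sample_rows('linear_e_LAN2A_F_0_435keV.csv', 1000000)` would return a list of 1,000,000 parsed rows while keeping only the sampled rows in memory.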
As @AndreBoos pointed out, this approach leads to biased selection: a line is returned whenever the random offset lands in the line before it, so lines that follow long lines are picked more often. If you know the minimum and maximum line lengths, you can remove this bias as follows:
Let's assume (in this case) we have min = 3 and max = 15.
1) Find the length (Lp) of the previous line.
If Lp = 3, the line is most biased against, so we should take it 100% of the time. If Lp = 15, the line is most biased towards, so we should take it only 20% of the time, as it is 5 times more likely to be selected.
We accomplish this by randomly keeping the line a fraction X of the time, where:
X = min / Lp
If we don't keep the line, we do another random pick until our dice roll comes good. :-)
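The correction described above could be sketched as follows. This is an illustration, not a definitive implementation: the function name `unbiased_random_line` and its parameters are hypothetical, lengths are counted in bytes including the newline, and every line is assumed to be between `min_len` and `max_len` bytes long. The offset lands in the previous line with probability Lp / filesize, and keeping the pick with probability min / Lp makes every line equally likely (min / filesize per attempt).

```python
import os
import random


def unbiased_random_line(path, min_len, max_len, rng=random):
    """Return one line (as bytes), each line being equally likely.

    Hypothetical helper; assumes min_len <= line length <= max_len (bytes).
    """
    filesize = os.path.getsize(path)
    with open(path, 'rb') as f:
        while True:
            offset = rng.randrange(filesize)
            # Look back at most max_len bytes to find where the line
            # containing `offset` (the "previous line") starts.
            lo = max(0, offset - max_len)
            f.seek(lo)
            head = f.read(offset - lo)
            start = lo + head.rfind(b'\n') + 1  # rfind gives -1 if no newline
            tail = f.readline()                 # rest of the previous line
            lp = (offset - start) + len(tail)   # Lp, newline included
            # Keep the pick with probability min / Lp, otherwise roll again.
            if rng.random() >= min_len / lp:
                continue
            line = f.readline()
            if not line:                        # previous line was the last one
                f.seek(0)
                line = f.readline()
            return line
```

The look-back window relies on `max_len` being a true upper bound; if a line is longer than that, the computed Lp is too small and the bias is only partly removed. An empty file raises `ValueError` from `randrange`.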