Fastest way to load huge .dat into array

Question

I have searched extensively on Stack Exchange for a neat solution to loading a huge (~2GB) .dat file into a numpy array, but didn't find a proper one. So far I have managed to load it as a list in a really fast way (<1 min):

lines = []  # avoid shadowing the built-in name "list"
f = open('myhugefile0')
for line in f:
    lines.append(line)  # each entry is still a raw text string, not parsed floats
f.close()

Using np.loadtxt freezes my computer and takes several minutes to load (~10 min). How can I open the file as an array without the allocation issue that seems to bottleneck np.loadtxt?

The input data is a (200000, 5181) float array. One example line:

2.27069e-15 2.40985e-15 2.22525e-15 2.1138e-15 1.92038e-15 1.54218e-15 1.30739e-15 1.09205e-15 8.53416e-16 7.71566e-16 7.58353e-16 7.58362e-16 8.81664e-16 1.09204e-15 1.27305e-15 1.58008e-15

and so on.

Thanks

Answer

Looking at the source, it appears that numpy.loadtxt contains a lot of code to handle many different formats. If you have a well-defined input file, it is not too difficult to write your own function optimized for your particular file format. Something like this (untested):

import numpy as np

def load_big_file(fname):
    '''Only works for a well-formed text file of space-separated doubles.'''
    rows = []  # unknown number of lines, so accumulate row vectors in a list
    with open(fname) as f:
        for line in f:
            values = [float(s) for s in line.split()]  # parse one row of floats
            rows.append(np.array(values, dtype=np.double))
    return np.vstack(rows)  # stack the row vectors into a 2D array
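
For reference, calling the helper might look like this (a minimal usage sketch; the filename and shape are the ones mentioned in the question):

data = load_big_file('myhugefile0')
print(data.shape)   # expected (200000, 5181) for the file described in the question
print(data.dtype)   # float64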

An alternative solution, if the number of rows and columns is known beforehand, might be:

def load_known_size(fname, nrow, ncol):
    x = np.empty((nrow, ncol), dtype=np.double)  # preallocate the full array
    with open(fname) as f:
        for irow, line in enumerate(f):
            for icol, s in enumerate(line.split()):
                x[irow, icol] = float(s)  # parse and store each value directly
    return x

In this way, you don't have to allocate all the intermediate lists.

EDIT: It seems that the second solution is a bit slower; the list comprehension is probably faster than the explicit for loop. Combining the two solutions, and using the trick that NumPy does an implicit conversion from string to float (I only just discovered that), this might possibly be faster:

def load_known_size(fname, nrow, ncol):
    x = np.empty((nrow, ncol), dtype=np.double)  # preallocate the full array
    with open(fname) as f:
        for irow, line in enumerate(f):
            x[irow, :] = line.split()  # NumPy converts the strings to floats on assignment
    return x
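
To illustrate the string-to-float assignment trick this version relies on, here is a small sketch (not from the original answer) using values taken from the question's example line:

row = np.empty(4, dtype=np.double)
row[:] = '2.27069e-15 2.40985e-15 2.22525e-15 2.1138e-15'.split()  # strings parsed to floats
print(row)  # -> [2.27069e-15 2.40985e-15 2.22525e-15 2.1138e-15]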

To get any further speedup, you would probably have to use some code written in C or Cython. I would be interested to know how much time these functions take to open your files.
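
A rough way to time the two helpers on the file from the question might look like this (a sketch, assuming the functions above are defined and that the (200000, 5181) shape from the question is correct):

import time

t0 = time.perf_counter()
data = load_big_file('myhugefile0')
print('load_big_file:   %.1f s, shape %s' % (time.perf_counter() - t0, data.shape))

t0 = time.perf_counter()
data = load_known_size('myhugefile0', 200000, 5181)
print('load_known_size: %.1f s, shape %s' % (time.perf_counter() - t0, data.shape))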
