Reading a few lines as Numpy array from a huge file


Problem description


I have a text file that contains a billion words and their corresponding 300-dimensional word vectors. I need to extract a few thousand words and their word vectors from the file and store them as Numpy arrays. The size of the text file is around 1 GB.
Naively, I tried to load the whole file into an array using genfromtxt, but that did not work. Then I tried to read the whole file line by line (each line in the file consists of a word and its word vector), looking for each word and extracting its word vector, but I guess that requires one pass over the file per word, and as I need thousands of words, it would mean iterating over the whole file thousands of times.
What would be the fastest and most efficient way to do this?

Recommended answer


I don't know if this is the fastest way (probably not), but it works reasonably well (I tested it on a file with >100,000 lines):

import numpy as np

# word_set is the set of words to extract; fn is the path to the vector file.
# The filter keeps only non-empty lines whose first token is in word_set,
# so the whole file is scanned exactly once.
F = filter(lambda s: s.strip().split()[0] in word_set if s.strip() else False,
           open(fn, 'rt'))
x = np.genfromtxt(F, *yourargs, **yourkwds)


This is for Python 2. In Python 3 it seems one has to .encode() the input.
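The Python 3 variant of the same single-pass idea can be sketched as follows. The file path, the words, and the 3-dimensional vectors here are hypothetical stand-ins (the real file uses 300 dimensions); each kept line is .encode()d before being handed to genfromtxt, as the answer notes:

```python
import os
import tempfile

import numpy as np

# Hypothetical sample file: one word per line followed by its vector
# (3 dimensions here as a stand-in for the real 300).
fn = os.path.join(tempfile.mkdtemp(), "vectors.txt")
with open(fn, "wt") as fh:
    fh.write("apple 0.1 0.2 0.3\n"
             "pear 0.4 0.5 0.6\n"
             "banana 0.7 0.8 0.9\n")

word_set = {"apple", "banana"}  # the words whose vectors we want

# Single pass over the file: keep only non-empty lines whose first token
# is in word_set, and .encode() each kept line for genfromtxt on Python 3.
with open(fn, "rt") as fh:
    F = (line.encode() for line in fh
         if line.strip() and line.split()[0] in word_set)
    # usecols skips the leading word column, leaving a float array
    # with one row per extracted word, in file order.
    vectors = np.genfromtxt(F, usecols=(1, 2, 3))

print(vectors)
```

Because the generator filters lines before genfromtxt ever sees them, only the matching rows are parsed, which is what keeps this a single cheap pass over the 1 GB file.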

