Is there a more efficient way to append lines from a large file to a numpy array? - MemoryError


Problem Description

I'm trying to use this lda package to process a term-document matrix CSV file with 39568 rows and 27519 columns, containing only counts (natural numbers).

Problem: I get a MemoryError with my approach of reading the file and storing it in a numpy array.

Goal: Get the numbers from the TDM CSV file and convert them to a numpy array that I can use as input for lda.

with open("Results/TDM - Matrix Only.csv", 'r') as matrix_file:
    matrix = np.array([[int(value) for value in line.strip().split(',')] for line in matrix_file])

I've also tried numpy's append, vstack, and concatenate, and I still get the MemoryError.

Is there a way to avoid the MemoryError?

Edit:

I've tried using dtype int32 and int and it gives me:

WindowsError: [Error 8] Not enough storage is available to process this command

I've also tried using dtype float64 and it gives me:

OverflowError: cannot fit 'long' into an index-sized integer

With these two snippets:

fp = np.memmap("Results/TDM-memmap.txt", dtype='float64', mode='w+', shape=(len(documents), len(vocabulary)))
matrix = np.genfromtxt("Results/TDM.csv", dtype='float64', delimiter=',', skip_header=1)
fp[:] = matrix[:]

and

with open("Results/TDM.csv", 'r') as tdm_file:
    vocabulary = [value for value in tdm_file.readline().strip().split(',')]
    fp = np.memmap("Results/TDM-memmap.txt", dtype='float64', mode='w+', shape=(len(documents), len(vocabulary)))
    for idx, line in enumerate(tdm_file):
        fp[idx] = np.array(line.strip().split(','))

Other info that might matter:

  • Win10 64bit
  • 8GB RAM (7.9 usable) | memory climbs from more or less 3GB (around 2GB used) to a 5.5GB peak before it reports MemoryError
  • Python 2.7.10 [MSC v.1500 32 bit (Intel)]
  • Using PyCharm Community Edition 5.0.3
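For scale: the dense array the code above tries to build needs about 4GB even at int32, roughly double the ~2GB of address space a 32-bit Python process can use, so a MemoryError (or WindowsError) is expected. A quick check of the arithmetic:

# Back-of-envelope memory cost of a dense 39568 x 27519 matrix.
rows, cols = 39568, 27519
for name, nbytes in [('int32', 4), ('float64', 8)]:
    print('%s: %.1f GB' % (name, rows * cols * nbytes / float(2 ** 30)))
# int32: 4.1 GB
# float64: 8.1 GB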

Solution

Since your word counts will be almost all zeros, it would be much more efficient to store them in a scipy.sparse matrix. For example:

from scipy import sparse
import textmining
import lda

# a small example matrix
tdm = textmining.TermDocumentMatrix()
tdm.add_doc("here's a bunch of words in a sentence")
tdm.add_doc("here's some more words")
tdm.add_doc("and another sentence")
tdm.add_doc("have some more words")

# tdm.sparse is a list of dicts, where each dict contains {word:count} for a single
# document
ndocs = len(tdm.sparse)
nwords = len(tdm.doc_count)
words = tdm.doc_count.keys()

# initialize output sparse matrix
X = sparse.lil_matrix((ndocs, nwords), dtype=int)

# iterate over documents, fill in rows of X
for ii, doc in enumerate(tdm.sparse):
    for word, count in doc.iteritems():
        jj = words.index(word)
        X[ii, jj] = count

X is now an (ndocs, nwords) scipy.sparse.lil_matrix, and words is a list corresponding to the columns of X:

print(words)
# ['a', 'and', 'another', 'sentence', 'have', 'of', 'some', 'here', 's', 'words', 'in', 'more', 'bunch']

print(X.todense())
# [[2 0 0 1 0 1 0 1 1 1 1 0 1]
#  [0 0 0 0 0 0 1 1 1 1 0 1 0]
#  [0 1 1 1 0 0 0 0 0 0 0 0 0]
#  [0 0 0 0 1 0 1 0 0 1 0 1 0]]
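
One note on the fill loop above: words.index(word) rescans the whole word list on every lookup, which is fine for this toy example but slow for a vocabulary the size of the question's (27519 words). A precomputed word-to-column dict, a minor variation on the same loop, avoids that:

word_index = {word: jj for jj, word in enumerate(words)}
for ii, doc in enumerate(tdm.sparse):
    for word, count in doc.iteritems():
        X[ii, word_index[word]] = count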

You could pass X directly to lda.LDA.fit, although it will probably be faster to convert it to a scipy.sparse.csr_matrix first:

X = X.tocsr()
model = lda.LDA(n_topics=2, random_state=0, n_iter=100)
model.fit(X)
# INFO:lda:n_documents: 4
# INFO:lda:vocab_size: 13
# INFO:lda:n_words: 21
# INFO:lda:n_topics: 2
# INFO:lda:n_iter: 100
# INFO:lda:<0> log likelihood: -126
# INFO:lda:<10> log likelihood: -102
# INFO:lda:<20> log likelihood: -99
# INFO:lda:<30> log likelihood: -97
# INFO:lda:<40> log likelihood: -100
# INFO:lda:<50> log likelihood: -100
# INFO:lda:<60> log likelihood: -104
# INFO:lda:<70> log likelihood: -108
# INFO:lda:<80> log likelihood: -98
# INFO:lda:<90> log likelihood: -98
# INFO:lda:<99> log likelihood: -99
