Text classification with SciKit-learn and a large dataset

Problem description

First of all, I started with Python yesterday. I'm trying to do text classification with SciKit and a large dataset (250,000 tweets). For the algorithm, every tweet will be represented as a 4000 x 1 vector, so this means the input is 250,000 rows and 4000 columns. When I try to construct this in Python, I run out of memory after 8500 tweets (when working with a list and appending it), and when I preallocate the memory I just get the error: MemoryError (np.zeros(4000, 2500000)). Is SciKit not able to work with these large datasets? Am I doing something wrong (as it is my second day with Python)? Is there another way of representing the features so that it can fit in my memory?

edit: I want to use the Bernoulli NB.

edit2: Maybe it is possible with online learning? Read a tweet, let the model learn from it, remove it from memory, read another, let the model learn... but I don't think Bernoulli NB allows for online learning in scikit-learn.
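(For reference, scikit-learn's BernoulliNB does expose a partial_fit method, so a streaming loop along these lines is possible. Below is a minimal sketch under that assumption; the tweet_batches generator is a hypothetical stand-in that yields random, already-vectorized mini-batches in place of real data.)

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.naive_bayes import BernoulliNB

N_FEATURES = 4000  # vocabulary size from the question

def tweet_batches(n_batches=10, batch_size=1000):
    """Stand-in for reading and vectorizing tweets batch by batch.
    Here it just yields random sparse 0/1 feature rows and random labels."""
    rng = np.random.RandomState(0)
    for _ in range(n_batches):
        X = sparse_random(batch_size, N_FEATURES, density=0.003,
                          format="csr", random_state=rng)
        X.data[:] = 1.0                     # binary presence features
        y = rng.randint(0, 2, size=batch_size)
        yield X, y

clf = BernoulliNB()
classes = np.array([0, 1])  # all labels must be declared on the first call

for X_batch, y_batch in tweet_batches():
    # Only one batch is held in memory at a time; the model is updated in place.
    clf.partial_fit(X_batch, y_batch, classes=classes)
```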

Recommended answer

I assume that these 4000 x 1 vectors are bag-of-words representations. If that is the case, then that 250,000 by 4000 matrix has a lot of zeros, because each tweet contains only very few words. This kind of matrix is called a sparse matrix, and there are efficient ways of storing them in memory. See the Scipy documentation and the SciKit documentation for sparse matrices to get started; if you need more help after reading those links, post again.
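As an illustration (not part of the original answer), here is a minimal sketch of the sparse route, assuming the raw tweet strings are available in a Python list: CountVectorizer returns a scipy.sparse matrix directly, and BernoulliNB accepts it without ever building a dense 250,000 x 4000 array. The tweets and labels below are toy stand-ins.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy stand-ins: in practice `tweets` would hold the 250,000 tweet strings
# and `labels` their classes.
tweets = ["scikit-learn makes text classification easy",
          "running out of memory with dense matrices",
          "sparse matrices store only the non-zero entries",
          "bag of words vectors are mostly zeros"]
labels = [1, 0, 0, 1]

# binary=True gives 0/1 presence features (what Bernoulli NB expects);
# max_features caps the vocabulary at 4000 terms, as in the question.
vectorizer = CountVectorizer(binary=True, max_features=4000)
X = vectorizer.fit_transform(tweets)   # scipy.sparse CSR matrix, never dense

clf = BernoulliNB()
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["memory usage of dense vectors"])))
```

The key point is that X is stored in CSR format, which keeps only the non-zero entries, so the full dataset fits comfortably in memory.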
