Text classification with SciKit-learn and a large dataset

Problem description

First of all, I started with Python yesterday. I'm trying to do text classification with SciKit and a large dataset (250,000 tweets). For the algorithm, every tweet will be represented as a 4000 x 1 vector, so this means the input is 250,000 rows and 4000 columns. When I try to construct this in Python, I run out of memory after 8500 tweets (when working with a list and appending it), and when I preallocate the memory I just get the error: MemoryError (np.zeros(4000, 2500000)). Is SciKit not able to work with these large datasets? Am I doing something wrong (as it is my second day with Python)? Is there another way of representing the features so that it can fit in my memory?

edit: I want to use the Bernoulli NB.

edit2: Maybe it is possible with online learning? Read a tweet, let the model learn from it, remove it from memory, read another, let the model learn... but I don't think Bernoulli NB allows for online learning in scikit-learn.
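(For reference, scikit-learn's BernoulliNB does expose a partial_fit method, so a streaming loop along these lines is possible. Below is a minimal sketch under that assumption; the tweet_batches generator is a hypothetical stand-in that yields random, already-vectorized mini-batches in place of real data.)

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.naive_bayes import BernoulliNB

N_FEATURES = 4000  # vocabulary size from the question

def tweet_batches(n_batches=10, batch_size=1000):
    """Stand-in for reading and vectorizing tweets batch by batch.
    Here it just yields random sparse 0/1 feature rows and random labels."""
    rng = np.random.RandomState(0)
    for _ in range(n_batches):
        X = sparse_random(batch_size, N_FEATURES, density=0.003,
                          format="csr", random_state=rng)
        X.data[:] = 1.0                     # binary presence features
        y = rng.randint(0, 2, size=batch_size)
        yield X, y

clf = BernoulliNB()
classes = np.array([0, 1])  # all labels must be declared on the first call

for X_batch, y_batch in tweet_batches():
    # Only one batch is held in memory at a time; the model is updated in place.
    clf.partial_fit(X_batch, y_batch, classes=classes)
```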

Recommended answer

I assume that these 4000 x 1 vectors are bag-of-words representations. If that is the case, then that 250,000 by 4000 matrix has a lot of zeros, because each tweet contains only very few words. This kind of matrix is called a sparse matrix, and there are efficient ways of storing them in memory. See the Scipy documentation and the SciKit documentation for sparse matrices to get started; if you need more help after reading those links, post again.
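As an illustration (not part of the original answer), here is a minimal sketch of the sparse route, assuming the raw tweet strings are available in a Python list: CountVectorizer returns a scipy.sparse matrix directly, and BernoulliNB accepts it without ever building a dense 250,000 x 4000 array. The tweets and labels below are toy stand-ins.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Toy stand-ins: in practice `tweets` would hold the 250,000 tweet strings
# and `labels` their classes.
tweets = ["scikit-learn makes text classification easy",
          "running out of memory with dense matrices",
          "sparse matrices store only the non-zero entries",
          "bag of words vectors are mostly zeros"]
labels = [1, 0, 0, 1]

# binary=True gives 0/1 presence features (what Bernoulli NB expects);
# max_features caps the vocabulary at 4000 terms, as in the question.
vectorizer = CountVectorizer(binary=True, max_features=4000)
X = vectorizer.fit_transform(tweets)   # scipy.sparse CSR matrix, never dense

clf = BernoulliNB()
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["memory usage of dense vectors"])))
```

The key point is that X is stored in CSR format, which keeps only the non-zero entries, so the full dataset fits comfortably in memory.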
