Working with text classification and big sparse matrices in R


Question

I'm working on a multi-class text classification project, and I need to build the document-term matrices and do the training and testing in R.

My datasets don't fit in the limited dimensions of R's base matrix class, so I need to build big sparse matrices to be able to classify, for example, 100k tweets. I am using the quanteda package, which has so far been more useful and reliable than tm, where creating a DocumentTermMatrix with a dictionary makes the process incredibly memory-hungry even on small datasets. Currently, as I said, I use quanteda to build the equivalent document-term matrix container, which I later convert into a data.frame to perform the training.
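For reference, a minimal sketch of the quanteda pipeline described above, using a tiny made-up corpus (the variable names are illustrative; function names are per quanteda v3+). The key point is that the resulting dfm is already sparse, whereas converting it to a data.frame densifies it:

```r
# Build a sparse document-feature matrix (dfm) with quanteda.
# A dfm extends the Matrix package's sparse classes, so it scales far
# beyond base R's dense matrices; avoid convert()/as.data.frame(),
# which materialise a dense copy.
library(quanteda)

txts <- c(d1 = "sparse matrices save memory",
          d2 = "classify one hundred thousand tweets")
toks <- tokens(txts, remove_punct = TRUE)
m    <- dfm(toks)

dim(m)   # documents x features
is(m, "dfm")
```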

I want to know if there is a way to build such big matrices. I have been reading about the bigmemory package, which allows this kind of container, but I am not sure it will work with caret for the later classification. Overall, I want to understand the problem and find a workaround so I can work with bigger datasets. RAM is not a (big) problem (32 GB), but I'm trying to find a way to do it and I feel completely lost.

Answer

At what point did you hit RAM constraints?

quanteda is a good package for NLP on medium-sized datasets, but I also suggest trying my text2vec package. It is generally very memory-friendly and does not require loading all of the raw text into RAM (for example, it can create a DTM for a Wikipedia dump on a 16 GB laptop).
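A sketch of the text2vec streaming pipeline on toy data (the corpus here is hypothetical): an iterator over tokens feeds the vocabulary pass and the DTM pass separately, so the full raw text never has to sit in RAM, and the result is a sparse dgCMatrix:

```r
# text2vec DTM construction: iterate, build vocabulary, vectorise.
library(text2vec)

txts  <- c("sparse matrices save memory",
           "classify one hundred thousand tweets")
it    <- itoken(txts, preprocessor = tolower, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it)
dtm   <- create_dtm(it, vocab_vectorizer(vocab))

class(dtm)   # a sparse Matrix class, ready for glmnet etc.
```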

Second, I strongly recommend not converting your data into a data.frame. Try to work with sparseMatrix objects directly.
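To illustrate why, here is a small example with the Matrix package (which ships with R and provides the sparse classes that quanteda and text2vec build on). A dgCMatrix stores only the non-zero entries, which is what makes a 100k-document DTM feasible; the dimensions below are toy values:

```r
# Sparse matrices store only non-zero entries; a data.frame would
# store every cell, including the overwhelming majority of zeros.
library(Matrix)

m <- sparseMatrix(i = c(1, 2, 2), j = c(1, 1, 3),
                  x = c(5, 1, 2), dims = c(3, 3))
m            # printed with "." for structural zeros
sum(m)       # ordinary arithmetic works without densifying
```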

The following methods work well for text classification:


  1. Logistic regression with an L1 penalty (see the glmnet package)
  2. Linear SVM (see LiblineaR, but it is worth searching for alternatives)
  3. xgboost is also worth trying; I would prefer linear models, so you can try its linear booster.
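As a sketch of option 1, glmnet accepts sparse matrices directly, so the DTM never needs to be densified. The data below is randomly generated purely for illustration (`rsparsematrix` stands in for a real DTM); `alpha = 1` selects the L1 (lasso) penalty:

```r
# L1-penalised multinomial logistic regression on a sparse matrix.
library(Matrix)
library(glmnet)

set.seed(1)
x <- rsparsematrix(nrow = 200, ncol = 50, density = 0.05)   # stand-in DTM
y <- factor(sample(c("a", "b", "c"), 200, replace = TRUE))  # class labels

fit  <- cv.glmnet(x, y, family = "multinomial", alpha = 1)
pred <- predict(fit, x, s = "lambda.min", type = "class")
table(pred, y)
```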
