Working with text classification and big sparse matrices in R


Question

I'm working on a multi-class text classification project, and I need to build the document-term matrices and do the training and testing in R.

My datasets don't fit in the limited dimensions of R's base matrix class, so I need to build big sparse matrices to be able to classify, for example, 100k tweets. I am using the quanteda package, which has so far been more useful and reliable than tm, where creating a DocumentTermMatrix with a dictionary makes the process incredibly memory-hungry even on small datasets. Currently, as I said, I use quanteda to build the equivalent document-term matrix container, which I later convert into a data.frame to perform the training.
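For reference, a minimal sketch of the quanteda pipeline described above, using a tiny made-up corpus (the variable names are illustrative; function names are per quanteda v3+). The key point is that the resulting dfm is already sparse, whereas converting it to a data.frame densifies it:

```r
# Build a sparse document-feature matrix (dfm) with quanteda.
# A dfm extends the Matrix package's sparse classes, so it scales far
# beyond base R's dense matrices; avoid convert()/as.data.frame(),
# which materialise a dense copy.
library(quanteda)

txts <- c(d1 = "sparse matrices save memory",
          d2 = "classify one hundred thousand tweets")
toks <- tokens(txts, remove_punct = TRUE)
m    <- dfm(toks)

dim(m)   # documents x features
is(m, "dfm")
```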

I want to know if there is a way to build such big matrices. I have been reading about the bigmemory package, which allows this kind of container, but I am not sure it will work with caret for the later classification. Overall, I want to understand the problem and find a workaround so I can work with bigger datasets. RAM is not a (big) problem (32 GB), but I'm trying to find a way to do it and I feel completely lost.

Answer

At what point did you hit RAM constraints?

quanteda is a good package for NLP on medium-sized datasets, but I also suggest trying my text2vec package. It is generally very memory-friendly and does not require loading all of the raw text into RAM (for example, it can create a DTM for a Wikipedia dump on a 16 GB laptop).
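A sketch of the text2vec streaming pipeline on toy data (the corpus here is hypothetical): an iterator over tokens feeds the vocabulary pass and the DTM pass separately, so the full raw text never has to sit in RAM, and the result is a sparse dgCMatrix:

```r
# text2vec DTM construction: iterate, build vocabulary, vectorise.
library(text2vec)

txts  <- c("sparse matrices save memory",
           "classify one hundred thousand tweets")
it    <- itoken(txts, preprocessor = tolower, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it)
dtm   <- create_dtm(it, vocab_vectorizer(vocab))

class(dtm)   # a sparse Matrix class, ready for glmnet etc.
```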

Second, I strongly recommend not converting your data into a data.frame. Try to work with sparseMatrix objects directly.
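To illustrate why, here is a small example with the Matrix package (which ships with R and provides the sparse classes that quanteda and text2vec build on). A dgCMatrix stores only the non-zero entries, which is what makes a 100k-document DTM feasible; the dimensions below are toy values:

```r
# Sparse matrices store only non-zero entries; a data.frame would
# store every cell, including the overwhelming majority of zeros.
library(Matrix)

m <- sparseMatrix(i = c(1, 2, 2), j = c(1, 1, 3),
                  x = c(5, 1, 2), dims = c(3, 3))
m            # printed with "." for structural zeros
sum(m)       # ordinary arithmetic works without densifying
```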

The following methods work well for text classification:


  1. Logistic regression with an L1 penalty (see the glmnet package)
  2. Linear SVM (see LiblineaR, but it is worth searching for alternatives)
  3. xgboost is also worth trying; I would prefer linear models, so you can try its linear booster.
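As a sketch of option 1, glmnet accepts sparse matrices directly, so the DTM never needs to be densified. The data below is randomly generated purely for illustration (`rsparsematrix` stands in for a real DTM); `alpha = 1` selects the L1 (lasso) penalty:

```r
# L1-penalised multinomial logistic regression on a sparse matrix.
library(Matrix)
library(glmnet)

set.seed(1)
x <- rsparsematrix(nrow = 200, ncol = 50, density = 0.05)   # stand-in DTM
y <- factor(sample(c("a", "b", "c"), 200, replace = TRUE))  # class labels

fit  <- cv.glmnet(x, y, family = "multinomial", alpha = 1)
pred <- predict(fit, x, s = "lambda.min", type = "class")
table(pred, y)
```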
