Create sparse matrix from tweets

Question

I have some tweets and other variables that I would like to convert into a sparse matrix.

This is basically what my data looks like. Right now it is saved in a data.table with one column that contains the tweet and one column that contains the score.

Tweet               Score
Sample Tweet :)        1
Different Tweet        0

I would like to convert this into a matrix that looks like this:

Score Sample Tweet Different :)
    1      1     1         0  1
    0      0     1         1  0

Where there is one row in the sparse matrix for each row in my data.table. Is there an easy way to do this in R?

Recommended answer

This is close to what you want:

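library(Matrix)
library(data.table)  # dt is assumed to be the data.table from the question

# Vocabulary: every distinct space-separated token across all tweets
words = unique(unlist(strsplit(dt[, Tweet], ' ')))

# Sparse all-zero matrix: one row per tweet, one column per word
M = Matrix(0, nrow = NROW(dt), ncol = length(words))
colnames(M) = words

# Flag each tweet that contains the word (each word is used as a regex pattern)
for(j in seq_along(words)){
  M[, j] = grepl(paste0('\\b', words[j], '\\b'), dt[, Tweet])
}

# Append the remaining columns (here only Score) to the word indicators
M = cbind(M, as.matrix(dt[, setdiff(names(dt), 'Tweet'), with = FALSE]))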

#2 x 5 sparse Matrix of class "dgCMatrix"
#     Sample Tweet :) Different Score
#[1,]      1     1  .         .     1
#[2,]      .     1  .         1     .

The only small issue is that the regex is not recognising ':)' as a word. Maybe someone who knows regex better can advise how to fix this.
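
One possible way around this, offered as a sketch rather than as the answerer's method, is to skip the regex and match whole tokens instead. `\b` only matches between a word character and a non-word character, and ':)' is made up entirely of non-word characters with a space before it, so `\b:)\b` can never match. The example below splits each tweet once, then builds the matrix from (row, column) index pairs with `sparseMatrix()`. The toy `dt` is an assumption reconstructed from the question's sample data.

```r
library(Matrix)
library(data.table)

# Toy data mirroring the question (assumed, not the asker's real data)
dt <- data.table(Tweet = c("Sample Tweet :)", "Different Tweet"),
                 Score = c(1, 0))

# Split each tweet on single spaces, keeping one token vector per row
tokens <- strsplit(dt$Tweet, " ", fixed = TRUE)
words  <- unique(unlist(tokens))

# i: the tweet each token came from; j: the token's column in `words`
i <- rep(seq_along(tokens), lengths(tokens))
j <- match(unlist(tokens), words)

# Drop repeated (tweet, word) pairs so each cell is 0/1 rather than a count
keep <- !duplicated(cbind(i, j))

M <- sparseMatrix(i = i[keep], j = j[keep], x = 1,
                  dims = c(nrow(dt), length(words)),
                  dimnames = list(NULL, words))

# Append the non-tweet columns (here only Score) as a sparse block
extra <- as.matrix(dt[, setdiff(names(dt), "Tweet"), with = FALSE])
M <- cbind(M, Matrix(extra, sparse = TRUE))
```

Because tokens are compared literally, punctuation-only tokens such as ':)' get their own column. The trade-off is that a word directly followed by punctuation (for example "Tweet!") counts as a different token from "Tweet".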
