Create sparse matrix from tweets
Question
I have some tweets and other variables that I would like to convert into a sparse matrix.
This is basically what my data looks like. Right now it is saved in a data.table with one column that contains the tweet and one column that contains the score.
Tweet            Score
Sample Tweet :)  1
Different Tweet  0
I would like to convert this into a matrix that looks like this:
Score  Sample  Tweet  Different  :)
1      1       1      0          1
0      0       1      1          0
The sparse matrix should have one row for each row in my data.table. Is there an easy way to do this in R?
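For anyone who wants to experiment, the two sample rows from the question can be rebuilt as a small data.table. This is only a sketch of the data as described: the name `dt` and the numeric type of `Score` are assumptions, since the question does not show how the table was created.

```r
library(data.table)

# Assumed reconstruction of the two example rows from the question.
# The object name `dt` and the numeric Score type are illustrative choices.
dt <- data.table(
  Tweet = c("Sample Tweet :)", "Different Tweet"),
  Score = c(1, 0)
)
```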
Recommended answer
This comes close to what you want:
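library(Matrix)
# Vocabulary: every distinct space-separated token across all tweets
words = unique(unlist(strsplit(dt[, Tweet], ' ')))
# All-zero sparse matrix: one row per tweet, one column per word
M = Matrix(0, nrow = NROW(dt), ncol = length(words))
colnames(M) = words
# Mark the tweets in which each word occurs as a whole word
for (j in seq_along(words)) {
  M[, j] = grepl(paste0('\\b', words[j], '\\b'), dt[, Tweet])
}
# Append the remaining columns (here only Score)
M = cbind(M, as.matrix(dt[, setdiff(names(dt), 'Tweet'), with = FALSE]))
M
#2 x 5 sparse Matrix of class "dgCMatrix"
#     Sample Tweet :) Different Score
#[1,]      1     1  .         .     1
#[2,]      .     1  .         1     .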
The only small issue is that the regex does not recognise ':)' as a word. Maybe someone who knows regex better can advise how to fix this.
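One possible way around the ':)' problem is to avoid regular expressions altogether. `\b` only matches at a boundary between a word character and a non-word character, so it cannot anchor a token made purely of punctuation. The sketch below is not part of the original answer: it splits each tweet on literal spaces, looks up every token in the vocabulary with `match()`, and passes the resulting index pairs to `Matrix::sparseMatrix()`. The vectors `tweets` and `score` are hypothetical stand-ins for the two columns of the data.table.

```r
library(Matrix)

# Hypothetical stand-ins for the Tweet and Score columns of the question's data.
tweets <- c("Sample Tweet :)", "Different Tweet")
score  <- c(1, 0)

# Split on literal spaces (fixed = TRUE, so nothing is treated as a regex),
# drop empty tokens, and keep each token at most once per tweet.
tokens <- lapply(strsplit(tweets, " ", fixed = TRUE),
                 function(x) unique(x[nzchar(x)]))
words  <- unique(unlist(tokens))

# One (row, column) index pair per distinct token in each tweet.
i <- rep(seq_along(tokens), lengths(tokens))
j <- match(unlist(tokens), words)

# Build the 0/1 indicator matrix directly from the index pairs.
M <- sparseMatrix(i = i, j = j, x = 1,
                  dims = c(length(tweets), length(words)),
                  dimnames = list(NULL, words))

# Append the score as one more sparse column.
M <- cbind(M, Matrix(score, ncol = 1, sparse = TRUE,
                     dimnames = list(NULL, "Score")))
```

Because tokens are compared as exact strings, ':)' gets its own column like any other token. The trade-off is that punctuation attached to a word (for example 'Tweet!') counts as a separate token, so real tweets may need a cleaning step first.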