Tf-idf of strings from a csv file
Problem description
My test.csv file is (without header):
very good, very bad, you are great
very bad, good restaurent, nice place to visit
I want to split my corpus on commas so that my final DocumentTermMatrix becomes:
      terms
docs  very good  very bad  you are great  good restaurent  nice place to visit
doc1  tf-idf     tf-idf    tf-idf         0                0
doc2  0          tf-idf    0              tf-idf           tf-idf
I am able to produce the above DTM correctly if I don't load the documents from a csv file, like below:
library(tm)
docs <- c(D1 = "very good, very bad, you are great",
          D2 = "very bad, good restaurent, nice place to visit")
dd <- Corpus(VectorSource(docs))
# Split each document on commas, then join the words of each term with "~"
dd <- tm_map(dd, function(x) {
  PlainTextDocument(
    gsub("\\s+", "~", strsplit(x, ",\\s*")[[1]]),
    id = ID(x)
  )
})
inspect(dd)
# A corpus with 2 text documents
#
# The metadata consists of 2 tag-value pairs and a data frame
# Available tags are:
# create_date creator
# Available variables in the data frame are:
# MetaID
# $D1
# very~good
# very~bad
# you~are~great
#
# $D2
# very~bad
# good~restaurent
# nice~place~to~visit
dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
as.matrix(dtm)
This yields:
# Docs good~restaurent nice~place~to~visit very~bad very~good you~are~great
# D1         0.0000000           0.0000000        0 0.3333333     0.3333333
# D2         0.3333333           0.3333333        0 0.0000000     0.0000000
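To see where the 0.3333333 and 0 values come from, the weights can be recomputed by hand in base R. This is only a sketch: it assumes the usual normalised formula (term count divided by document length, times log2 of the number of documents over the number of documents containing the term), which reproduces the numbers printed above. The variable names are illustrative and not part of tm.

```r
# Terms per document after splitting on commas and joining words with "~"
docs <- list(
  D1 = c("very~good", "very~bad", "you~are~great"),
  D2 = c("very~bad", "good~restaurent", "nice~place~to~visit")
)

terms <- sort(unique(unlist(docs)))

# Raw term counts: one row per document, one column per term
tf <- t(sapply(docs, function(d) table(factor(d, levels = terms))))

# Normalised term frequency: counts divided by document length
tf_norm <- tf / rowSums(tf)

# Inverse document frequency: log2(number of docs / docs containing the term)
idf <- log2(nrow(tf) / colSums(tf > 0))

# Scale each term column by its idf
tfidf <- sweep(tf_norm, 2, idf, "*")
round(tfidf, 7)
```

Each document has three terms, so every normalised frequency is 1/3. A term that occurs in only one of the two documents has idf log2(2/1) = 1, while very~bad occurs in both and gets log2(2/2) = 0, which is why its column is all zeros.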
If I load the documents from a csv file, only the first term of each document gets joined, like below:
> file_loc <- "testdata.csv"
> require(tm)
Loading required package: tm
> x <- read.csv(file_loc, header = FALSE)
> x <- data.frame(lapply(x, as.character), stringsAsFactors=FALSE)
> dd <- Corpus(DataframeSource(x))
> dd <- tm_map(dd, stripWhitespace)
> dd <- tm_map(dd, tolower)
> dd <- tm_map(dd, function(x) {
PlainTextDocument(
gsub("\\s+","~",strsplit(x,",\\s*")[[1]]),
id=ID(x)
)
})
> inspect(dd)
Only the first term of each document is joined:
# $D1
# very~good
#
# $D2
# very~bad
How can I join all the terms and create a DocumentTermMatrix like the one above?
Recommended answer
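You are reading the data incorrectly. read.csv has already split each line at the commas into separate columns, so strsplit(...)[[1]] only ever processes the first field. I use scan instead, which reads each line as a single string. The following works: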
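library(tm)
# Read each line of the file as one string (one document per line)
docs <- scan("testdata.csv", what = "character", sep = "\n")
dd <- Corpus(VectorSource(docs))
# Split each document on commas, then join the words of each term with "~"
dd <- tm_map(dd, function(x) {
  PlainTextDocument(
    gsub("\\s+", "~", strsplit(x, ",\\s*")[[1]]),
    id = ID(x)
  )
})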
inspect(dd)
dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
as.matrix(dtm)
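The difference between the two ways of reading the file can be checked without tm. The sketch below writes the two sample lines to a temporary file (a stand-in for testdata.csv; the variable names are illustrative) and compares what read.csv and scan return.

```r
# Write the two sample lines to a temporary file (stand-in for testdata.csv)
path <- tempfile(fileext = ".csv")
writeLines(c("very good, very bad, you are great",
             "very bad, good restaurent, nice place to visit"), path)

# read.csv splits every line at the commas: 2 rows, 3 columns
df <- read.csv(path, header = FALSE, stringsAsFactors = FALSE)
dim(df)

# scan with sep = "\n" keeps each line as one string: 2 elements
raw_lines <- scan(path, what = "character", sep = "\n", quiet = TRUE)
length(raw_lines)

# Splitting whole lines on commas recovers all three terms per document
terms <- lapply(strsplit(raw_lines, ",\\s*"),
                function(p) gsub("\\s+", "~", p))
terms[[1]]
# "very~good" "very~bad" "you~are~great"
```

With the data frame, the comma-separated terms are already in separate columns, so a later split on commas has nothing left to split. Reading whole lines keeps the commas in place for strsplit.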