TF-IDF of strings from a CSV file

Question

My test.csv file is (without header):

very good, very bad, you are great
very bad, good restaurent, nice place to visit

I want to split my corpus on commas, so that each comma-separated phrase is one term and my final DocumentTermMatrix becomes:

      terms
docs   very good   very bad   you are great   good restaurent   nice place to visit
doc1   tf-idf      tf-idf     tf-idf          0                 0
doc2   0           tf-idf     0               tf-idf            tf-idf

I can produce the above DTM correctly if I don't load the documents from a csv file, as shown below:

library(tm)
docs <- c(D1 = "very good, very bad, you are great", 
    D2 = "very bad, good restaurent, nice place to visit")

dd <- Corpus(VectorSource(docs))
dd <- tm_map(dd, function(x) {
    PlainTextDocument(
       gsub("\\s+","~",strsplit(x,",\\s*")[[1]]), 
       id=ID(x)
     )
})
inspect(dd)

# A corpus with 2 text documents
# 
# The metadata consists of 2 tag-value pairs and a data frame
# Available tags are:
#   create_date creator 
# Available variables in the data frame are:
#   MetaID 

# $D1
# very~good
# very~bad
# you~are~great
# 
# $D2
# very~bad
# good~restaurent
# nice~place~to~visit

dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
as.matrix(dtm)

This produces:

# Docs good~restaurent nice~place~to~visit very~bad very~good you~are~great
#   D1       0.0000000           0.0000000        0 0.3333333     0.3333333
#   D2       0.3333333           0.3333333        0 0.0000000     0.0000000
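The values in this matrix can be checked by hand. The following is a minimal base-R sketch, not tm's own code. It assumes that `weightTfIdf` with its default normalization computes (term count / number of terms in the document) × log2(number of documents / number of documents containing the term). The variable names are illustrative only.

```r
# Phrases per document after splitting on commas and joining words with "~"
docs <- list(
  D1 = c("very~good", "very~bad", "you~are~great"),
  D2 = c("very~bad", "good~restaurent", "nice~place~to~visit")
)

terms <- sort(unique(unlist(docs)))

# Raw term counts: one row per document, one column per term
counts <- t(sapply(docs, function(d) table(factor(d, levels = terms))))

# Term frequency normalized by document length (assumed to match tm's default)
tf <- counts / rowSums(counts)

# Inverse document frequency: log2(N / number of documents containing the term)
idf <- log2(nrow(counts) / colSums(counts > 0))

# Scale each term's column of tf by that term's idf
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 7)
```

Under this assumption, "very~bad" appears in both documents, so its idf is log2(2/2) = 0 and its weight is 0 everywhere. Every other phrase appears in one document of three terms, giving (1/3) × log2(2/1) = 0.3333333.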

If I load the documents from the csv file, only the first term of each document gets joined, as shown below:

> file_loc <- "testdata.csv"
> require(tm)
  Loading required package: tm
> x <- read.csv(file_loc, header = FALSE)
> x <- data.frame(lapply(x, as.character), stringsAsFactors=FALSE)
> dd <- Corpus(DataframeSource(x))
> dd <- tm_map(dd, stripWhitespace)
> dd <- tm_map(dd, tolower)
>  dd <- tm_map(dd, function(x) {
            PlainTextDocument(
            gsub("\\s+","~",strsplit(x,",\\s*")[[1]]), 
            id=ID(x)
            )
          })
> inspect(dd)

Only the first term is joined:

# $D1
# very~good

# 
# $D2
# very~bad

How can I join all the terms and create a DocumentTermMatrix like the one above?

Recommended answer

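You are reading the data incorrectly: read.csv splits each line into separate columns at the commas, so the later strsplit(...)[[1]] call most likely only sees the first field. I use scan to read each line as a single string instead. The following works: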

library(tm)

# Read each line as one string so the commas stay inside the document text
docs <- scan("testdata.csv", what = "character", sep = "\n")

dd <- Corpus(VectorSource(docs))
dd <- tm_map(dd, function(x) {
  PlainTextDocument(
    gsub("\\s+","~",strsplit(x,",\\s*")[[1]]), 
    id=ID(x)
  )
})
inspect(dd)

dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
as.matrix(dtm)
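To see the difference between the two ways of reading the file without needing tm or testdata.csv, here is a small base-R sketch. It passes the two sample lines through the `text` argument of `read.csv` and `scan` instead of a file; `csv_text` and `split_phrases` are hypothetical names introduced only for this illustration.

```r
csv_text <- "very good, very bad, you are great
very bad, good restaurent, nice place to visit"

# read.csv splits every line at the commas: 2 rows, 3 columns (V1, V2, V3)
as_table <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)
dim(as_table)

# scan with sep = "\n" keeps each line as a single string: 2 elements
as_lines <- scan(text = csv_text, what = "character", sep = "\n", quiet = TRUE)
length(as_lines)

# Splitting a whole line on commas recovers every phrase,
# with the words inside each phrase joined by "~"
split_phrases <- function(line) gsub("\\s+", "~", strsplit(line, ",\\s*")[[1]])
phrases <- lapply(as_lines, split_phrases)
phrases[[1]]
```

With the data frame, the first column holds only "very good" and "very bad", which matches the truncated output in the question. With the scanned lines, all three phrases of each document are available to the splitting step.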
