Corpus build with phrases


Problem description

I have my documents as:

 doc1 = very good, very bad, you are great
 doc2 = very bad, good restaurent, nice place to visit

I want to split the corpus on "," so that my final DocumentTermMatrix becomes:

      terms
 docs       very good      very bad        you are great   good restaurent   nice place to visit
  doc1       tf-idf          tf-idf         tf-idf          0                    0
  doc2       0                tf-idf         0                tf-idf             tf-idf

I know how to calculate the DocumentTermMatrix of individual words, but I don't know how to build the corpus separated for each phrase in R. A solution in R is preferred, but a Python solution is also welcome.

What I have tried:

> library(tm)
> library(RWeka)
> BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
> options(mc.cores=1)
> texts <- c("very good, very bad, you are great","very bad, good restaurent, nice place to visit")
> corpus <- Corpus(VectorSource(texts))
> a <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
> as.matrix(a)

What I get:

                          Docs
  Terms                   1 2
    bad good restaurent   0 1
    bad you are           1 0
    good restaurent nice  0 1
    good very bad         1 0
    nice place to         0 1
    place to visit        0 1
    restaurent nice place 0 1
    very bad good         0 1
    very bad you          1 0
    very good very        1 0
    you are great         1 0

What I want is not these combinations of words, but only the phrases shown in my desired matrix.

Solution

Here's one approach using the qdap + tm packages. The idea: sub_holder temporarily swaps out the ", " separators, gsub glues the words of each phrase together with "~~" so each phrase is counted as a single term, and unhold restores the separators:

library(qdap); library(tm); library(qdapTools)

dat <- list2df(list(doc1 = "very good, very bad, you are great",
 doc2 = "very bad, good restaurent, nice place to visit"), "text", "docs")

x <- sub_holder(", ", dat$text)

m <- dtm(wfm(x$unhold(gsub(" ", "~~", x$output)), dat$docs))
weightTfIdf(m)

inspect(weightTfIdf(m))

## A document-term matrix (2 documents, 5 terms)
## 
## Non-/sparse entries: 4/6
## Sparsity           : 60%
## Maximal term length: 19 
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## 
##       Terms
## Docs   good restaurent nice place to visit very bad very good you are great
##   doc1       0.0000000           0.0000000        0 0.3333333     0.3333333
##   doc2       0.3333333           0.3333333        0 0.0000000     0.0000000

Note that "very bad" gets a weight of 0 in both documents: it appears in every document, so its idf is zero.

You could also do it in one fell swoop and return a DocumentTermMatrix, but this may be harder to understand:

x <- sub_holder(", ", dat$text)

apply_as_tm(t(wfm(x$unhold(gsub(" ", "~~", x$output)), dat$docs)), 
    weightTfIdf, to.qdap=FALSE)


