Impossible to see results of `RTextTools::toLower()` text in Document-Term-Matrix

Question

I am trying to create a document-term matrix, and I would like to lowercase the text as part of that. I use this R instruction:

matrix = create_matrix(tweets[,1], toLower = TRUE, language="english", 
                      removeStopwords=FALSE, removeNumbers=TRUE, 
                      stemWords=TRUE) 

Here is the R code:

library(RTextTools)
library(e1071)

pos_tweets =  rbind(
  c('j AIME la voiture', 'positive'),
  c('cette machine est performante', 'positive'),
  c('je me sens en bonne forme ce matin', 'positive'),
  c('je suis super excitée d aller voir le spectacle de demain', 'positive'),
  c('il est mon meilleur ami', 'positive')
)



neg_tweets = rbind(
  c('je déteste cette voiture', 'negative'),
  c('ce film est horrible', 'negative'),
  c('je suis fatiguée ce matin', 'negative'),
  c('je déteste ce concert', 'negative'),
  c('il n est pas mon ami', 'negative')
)

test_tweets = rbind(
  c('je suis heureuse ce matin', 'negative'),
  c('un bon ami', 'negative'),
  c('je me sens triste', 'positive'),
  c('pas belle cette maison', 'negative'),
  c('mauvaise chanson', 'negative')
)

tweets = rbind(pos_tweets, neg_tweets, test_tweets)

# build dtm
matrix= create_matrix(tweets[,1], toLower = TRUE, language="french", 
                      removeStopwords=FALSE, removeNumbers=TRUE, 
                      stemWords=TRUE) 

The problem is that I notice there are still words with capital letters in the matrix.

Can you please explain why I get this problem?

Thanks

Recommended answer

As @chateaur said, it does perform the toLower internally; it just doesn't expose the contents of the pipeline to you at arbitrary points. RTextTools + tm build in severe structural limitations on what you can do, where, when, and in what sequence in your pipeline. It's really frustrating. Avoid that...

I recommend you write your own pipeline, and the best open-source package I found for pipelines when I was investigating this recently was quanteda. To illustrate the point, it has an overloaded toLower() method you can use on strings, corpora, tokens - wherever you like, no restrictions, before or after stopword removal, punctuation removal and stemming. And it has tons of other useful methods for constructing your pipeline in whatever arbitrary sequence of steps you want, unlike RTextTools + tm. (You can also measure the usefulness of a package like quanteda by looking at the number/rate of active maintainers, commits, issues, fixes, releases, hits on GitHub, SO and Google, and the cleanness of the code and the API...)
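To make "write your own pipeline" concrete, here is a minimal sketch in base R only. It deliberately does not use quanteda, so nothing here is that package's API. The toy `texts` vector and the names `lowered`, `tokens`, `vocab` and `dtm` are made up for illustration. The point is that every intermediate step, including the lowercasing, can be inspected before the matrix is built.

```r
# Toy input: two tweets in the style of the question (illustrative only).
texts <- c("j AIME la voiture", "ce film est horrible")

# Step 1: lowercase explicitly, so the result can be inspected right away.
lowered <- tolower(texts)
print(lowered)

# Step 2: split on whitespace into one token vector per document.
tokens <- strsplit(lowered, "\\s+")

# Step 3: build a dense document-term count matrix by hand.
vocab <- sort(unique(unlist(tokens)))
counts <- vapply(
  tokens,
  function(doc) as.integer(table(factor(doc, levels = vocab))),
  integer(length(vocab))
)
dtm <- t(counts)  # rows = documents, columns = terms
dimnames(dtm) <- list(paste0("doc", seq_along(texts)), vocab)

# Check: does any term still contain an uppercase letter?
any(grepl("[[:upper:]]", colnames(dtm)))
```

Stopword removal, stemming and the like could be slotted in between steps 2 and 3 in whatever order you need; they are left out here to keep the sketch short.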

Using RTextTools + tm on the frontend is sometimes painful, and often limiting. I simply found too many bugs, limitations, syntax quirks and annoyances with them - it killed my productivity and constantly drove me nuts. And it wasn't too performant either. You can still use (RTextTools +) tm for constructing and manipulating the DTM (and TF/TFIDF) matrices, and e1071 for the classifier.
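As a rough sketch of that split, assuming the tm package is installed: the text is lowercased up front with base R, and tm is used only to build the DTM. The toy `texts` vector is made up for illustration, and the `wordLengths` setting is my own choice to keep short tokens, not something from the question.

```r
library(tm)

# Toy input (illustrative only).
texts <- c("j AIME la voiture", "ce film est horrible")

# Lowercase before tm sees the text, so this step does not depend on tm options.
lowered <- tolower(texts)

corpus <- VCorpus(VectorSource(lowered))

# wordLengths = c(1, Inf) keeps short tokens such as "j";
# tm's default drops words shorter than 3 characters.
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# Inspect the terms that ended up as columns of the matrix.
Terms(dtm)
```

The resulting `dtm` can then be converted with `as.matrix(dtm)` and passed to whichever classifier you use.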

Also: an honorable mention to the qdap package, for similarly adding useful tools at the document/discourse level.

(PS: it's truly sad that R text-processing packages are so balkanized... so many people working at cross-purposes and furiously reinventing wheels... but sometimes that happens, for several reasons.)
