Stock Tweets, Text Mining, Emoticon Errors

Question

I was hoping you could assist with a text mining exercise. I am interested in 'AAPL' tweets and was able to pull 500 tweets from the API. I cleared several hurdles on my own, but need help with the last part. For some reason, the tm package is not removing stopwords. Could you please take a look and see what the problem might be? Are emoticons causing an issue?

After plotting term_frequency, the most frequent terms are "AAPL", "Apple", "iPhone", "Price", "Stock".

Thanks in advance!

Munckinn

#Transform into data frame
tweets.df <- twListToDF(tweets)

#Isolate text from tweets
aapl_tweets <- tweets.df$text

#Deal with emoticons
tweets2 <- data.frame(text = iconv(aapl_tweets, "latin1", "ASCII", "bye"), stringsAsFactors = FALSE)

#Make a vector source:
aapl_source <- VectorSource(tweets2)

#make a volatile corpus
aapl_corpus <- VCorpus(aapl_source)
aapl_cleaned <- clean_corpus(aapl_source)

#create my list to remove words
myList <- c("aapl", "apple", "stock", "stocks", stopwords("en"))

#clean corpus function 

clean_corpus <- function(corpus){
  corpus <- tm_map(corpus, stripWhitespace, mc.cores = 1)
  corpus <- tm_map(corpus, removePunctuation, mc.cores = 1)
  corpus <- tm_map(corpus, removeWords, myList, mc.cores = 1)
  return(corpus)
}

#clean aapl corpus
aapl_cleaned <- clean_corpus(aapl_corpus)

#convert to TDM
aapl.tdm <- TermDocumentMatrix(aapl_cleaned)

aapl.tdm

#Convert as Matrix
aapl_m <- as.matrix(aapl.tdm)

#Create Frequency tables
term_frequency <- rowSums(aapl_m)
term_frequency <- sort(term_frequency, decreasing = TRUE)
term_frequency[1:10]

barplot(term_frequency[1:10])

Recommended answer

I think your problem is in the iconv call: change "bye" to "byte".

tweets2 <- data.frame(
  text = iconv(aapl_tweets, "latin1", "ASCII", "byte"),
  stringsAsFactors = FALSE)
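
As a rough illustration of what the fourth argument of iconv (sub) does, here is a minimal base R sketch. The sample string x is made up for this example and is not taken from the tweets.

```r
# Base R only. Build "caf" followed by the single Latin-1 byte 0xE9 (e-acute).
# x is a hypothetical sample, not data from the question.
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# sub = "byte": each byte that cannot be converted is written as its hex code
# in angle brackets.
as_byte <- iconv(x, "latin1", "ASCII", sub = "byte")

# sub = "bye": any other string is inserted literally for each such byte,
# so the word "bye" ends up inside the text.
as_bye <- iconv(x, "latin1", "ASCII", sub = "bye")

# sub = "": bytes that cannot be converted are simply dropped.
dropped <- iconv(x, "latin1", "ASCII", sub = "")

print(c(byte = as_byte, bye = as_bye, dropped = dropped))
```

Tweets are usually UTF-8, so with from = "latin1" an emoticon spans several bytes and each byte is replaced separately. Separately, and going beyond the answer above: removeWords matches case-sensitively and stopwords("en") is all lowercase, so lowercasing the corpus first, for example with tm_map(corpus, content_transformer(tolower)), may also be needed before "AAPL" and "Apple" are removed.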
