Converting a data frame to a tibble with word counts


Question


I'm attempting to perform sentiment analysis based on http://tidytextmining.com/sentiment.html#the-sentiments-dataset. Before running the sentiment analysis, I need to convert my dataset into a tidy format.

My dataset is in this format:

x <- c("test1", "test2")
y <- c("this is test text1", "this is test text2")
res <- data.frame(url = x, text = y)
res
    url               text
1 test1 this is test text1
2 test2 this is test text2


In order to convert to one observation per row, I need to process the text column and add new columns containing each word and the number of times it appears for that url. The same url will appear in multiple rows.

Here is my attempt:

library(tidyverse)

x <- c("test1", "test2")
y <- c("this is test text1", "this is test text2")
res <- data.frame(url = x, text = y)
res

res_1 <- data.frame(res$text)
res_2 <- as_tibble(res_1)
res_2 %>% count(res.text, sort = TRUE)

This returns:

# A tibble: 2 x 2
            res.text     n
              <fctr> <int>
1 this is test text1     1
2 this is test text2     1


How can I count the words in res$text while keeping the url column, so that I can perform sentiment analysis?

Update:

x <- c("test1", "test2")
y <- c("this is test text1", "this is test text2")
res <- data.frame(url = x, text = y)
res

res %>%
  group_by(url) %>%
  transform(text = strsplit(text, " ", fixed = TRUE)) %>%
  unnest() %>%
  count(url, text)

This returns an error:

Error in strsplit(text, " ", fixed = TRUE) : non-character argument
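The non-character argument error happens because data.frame() converted the text column to a factor (stringsAsFactors defaulted to TRUE in R versions before 4.0), and strsplit() only accepts character vectors. A minimal sketch of a fix, assuming a tidyr version whose unnest() accepts the column to unnest: build the data frame with character columns (or coerce with as.character()) before splitting.

```r
library(dplyr)
library(tidyr)

x <- c("test1", "test2")
y <- c("this is test text1", "this is test text2")
# stringsAsFactors = FALSE keeps text as character, so strsplit() works
res <- data.frame(url = x, text = y, stringsAsFactors = FALSE)

word_counts <- res %>%
  mutate(text = strsplit(text, " ", fixed = TRUE)) %>%  # list column of words
  unnest(text) %>%                                      # one word per row
  count(url, text)
word_counts
```

This yields one row per url/word pair with its count, which is the shape the question asks for.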


I'm attempting to convert to a tibble, as this appears to be the format required for tidytext sentiment analysis: http://tidytextmining.com/sentiment.html#the-sentiments-dataset

Answer


Are you looking for something like this? To do sentiment analysis with the tidytext package, you need to split each character string into words with unnest_tokens(). (The function can do more than split text into words; have a look at its documentation later if you like.) Once you have one word per row, you can count how many times each word appears in each text with count(). Then you want to remove stop words; the tidytext package ships the stop_words data, so you can call it directly. Finally, you need sentiment information. Here I chose the AFINN lexicon, but you can pick another if you prefer. I hope this helps.

x <- c("text1", "text2")
y <- c("I am very happy and feeling great.", "I am very sad and feeling low")
res <- data.frame(url = x, text = y, stringsAsFactors = FALSE)

#    url                               text
#1 text1 I am very happy and feeling great.
#2 text2      I am very sad and feeling low

library(tidytext)
library(dplyr)

data(stop_words)
afinn <- get_sentiments("afinn")

unnest_tokens(res, input = text, output = word) %>%
  count(url, word) %>%
  filter(!word %in% stop_words$word) %>%
  inner_join(afinn, by = "word")

#    url    word     n score
#  <chr>   <chr> <int> <int>
#1 text1 feeling     1     1
#2 text1   happy     1     3
#3 text2 feeling     1     1
#4 text2     sad     1    -2
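A common follow-up to the joined table above is a single sentiment score per url. The sketch below (not part of the original answer) weights each word's score by its count and sums within url. It uses a tiny hand-made lexicon as a stand-in for AFINN, so it runs without downloading the real lexicon; note that current tidytext releases name the AFINN column value rather than score, as shown in the output above.

```r
library(tidytext)
library(dplyr)
library(tibble)

res <- data.frame(url = c("text1", "text2"),
                  text = c("I am very happy and feeling great.",
                           "I am very sad and feeling low"),
                  stringsAsFactors = FALSE)

# toy lexicon standing in for get_sentiments("afinn")
lex <- tibble(word = c("happy", "sad", "feeling"),
              score = c(3, -2, 1))

scores <- unnest_tokens(res, input = text, output = word) %>%
  count(url, word) %>%
  inner_join(lex, by = "word") %>%
  group_by(url) %>%
  summarize(sentiment = sum(score * n))  # count-weighted total per url
scores
```

With this toy lexicon, text1 scores happy (3) + feeling (1) = 4, and text2 scores sad (-2) + feeling (1) = -1.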

