Missing rows after sentiment analysis using dplyr in R


Question

When doing a sentiment analysis in R using dplyr, as described in this post, some of my rows go missing. I've provided a set of 6 Dutch sentences. As can be seen, rows 3 and 6 do not appear in the new df that includes the sentiment analysis.

I tried changing the "drop" setting to "keep", "drop" and NULL. I also tried commenting out (with #) certain parts of the df %>% pipeline, but neither made a difference.

Can someone explain this behavior to me, and how can I fix it?

library(tidyverse)
library(xml2)
library(tidytext)

#Example data set
text = c("Slechte bediening, van begin tot eind",
         "Het eten was heerlijk en de bediening was fantastisch",
         "Geweldige service en beleefde bediening",
         "Verschrikkelijk. Ik had een vlieg in mijn soep", 
         "Het was oké. De bediening kon wat beter, maar het eten was wel lekker. Leuk sfeertje wel!",
         "Ondanks dat het druk was toch op tijd ons eten gekregen. Complimenten aan de kok voor het op smaak brengen van mijn biefstuk")
identifier <- c("3", "4", "6", "7", "1", "5")
df <- data.frame(identifier, text)

#Sentiment analysis Dutch
sentiment_nl <- read_xml(
  "https://raw.githubusercontent.com/clips/pattern/master/pattern/text/nl/nl-sentiment.xml"
) %>% 
  as_list() %>% 
  .[[1]] %>% 
  map_df(function(x) {
    tibble::enframe(attributes(x))
  }) %>% 
  mutate(id = cumsum(name == "form")) %>%  # start a new id at each "form" attribute
  unnest(value) %>% 
  pivot_wider(id_cols = id) %>% 
  mutate(polarity = as.numeric(polarity),
         subjectivity = as.numeric(subjectivity),
         intensity = as.numeric(intensity),
         confidence = as.numeric(confidence))

df <- df %>% 
  mutate(identifier = identifier) %>% 
  unnest_tokens(output = word, input = text, drop = FALSE) %>% 
  inner_join(sentiment_nl, by = c("word" = "form")) %>%
  group_by(identifier) %>% 
  summarise(text = head(text, 1),
            polarity = mean(polarity),
            subjectivity = mean(subjectivity),
            .groups = "drop")


Recommended answer

As pointed out in @Bas's comment, some word forms are missing from the dictionary. You can solve this by getting a better dictionary, or by stemming or lemmatization.
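
To see why a missing word form removes a whole row: inner_join() keeps only tokens that have a match in the dictionary, so a sentence with no matching token contributes nothing to the later group_by() and summarise(). The sketch below reproduces this with a tiny made-up lexicon (the `reviews` and `lexicon` tables are hypothetical, not the real nl-sentiment.xml data), uses anti_join() to list the identifiers that would disappear, and shows that a left_join() keeps them, with NaN as the score.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: two sentences and a two-word lexicon
reviews <- tibble(
  identifier = c("a", "b"),
  text = c("Het eten was heerlijk", "Geweldige service")
)
lexicon <- tibble(
  form = c("heerlijk", "geweldig"),
  polarity = c(0.8, 0.7)
)

# One row per (lowercased) word, keeping the original text column
tokens <- reviews %>%
  unnest_tokens(output = word, input = text, drop = FALSE)

# Only tokens found in the lexicon survive the inner join
matched <- tokens %>%
  inner_join(lexicon, by = c("word" = "form"))

# Identifiers without any matching token; these vanish after summarise()
lost <- reviews %>%
  anti_join(matched, by = "identifier")
print(lost)

# A left join keeps every identifier; unmatched sentences get NaN
kept <- tokens %>%
  left_join(lexicon, by = c("word" = "form")) %>%
  group_by(identifier) %>%
  summarise(polarity = mean(polarity, na.rm = TRUE), .groups = "drop")
print(kept)
```

Here "geweldige" does not equal the lexicon entry "geweldig", so sentence "b" is the one that gets lost. A left join only makes the gap visible; matching on stems or lemmas is what actually recovers a score.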

Ideally, you would use a lemmatizer, which is generally superior to a stemmer. However, in the example you've given, a stemmer works fine. So you can use one to construct the dictionary:

library(tidyverse)
library(xml2)
library(tidytext)
library(textstem)

sentiment_nl <- read_xml(
  "https://raw.githubusercontent.com/clips/pattern/master/pattern/text/nl/nl-sentiment.xml"
) %>% 
  as_list() %>% 
  .[[1]] %>% 
  map_df(function(x) {
    tibble::enframe(attributes(x))
  }) %>% 
  mutate(id = cumsum(name == "form")) %>%  # start a new id at each "form" attribute
  unnest(value) %>% 
  pivot_wider(id_cols = id) %>% 
  mutate(form = tolower(form),
         stem = textstem::stem_words(form), # this is the new line
         polarity = as.numeric(polarity),
         subjectivity = as.numeric(subjectivity),
         intensity = as.numeric(intensity),
         confidence = as.numeric(confidence))

Then also stem the words in the text before matching on the stems:

df %>% 
  unnest_tokens(output = word, input = text, drop = FALSE) %>% 
  mutate(stem = textstem::stem_words(word)) %>% 
  inner_join(sentiment_nl, by = "stem") %>%
  group_by(identifier) %>% 
  summarise(text = head(text, 1),
            polarity = mean(polarity),
            subjectivity = mean(subjectivity),
            .groups = "drop")
#> # A tibble: 6 x 4
#>   identifier text                                          polarity subjectivity
#>   <chr>      <chr>                                            <dbl>        <dbl>
#> 1 1          Het was oké. De bediening kon wat beter, maa…    0.6          0.98 
#> 2 3          Slechte bediening, van begin tot eind           -0.7          0.9  
#> 3 4          Het eten was heerlijk en de bediening was fa…    0.56         0.72 
#> 4 5          Ondanks dat het druk was toch op tijd ons et…   -0.233        0.767
#> 5 6          Geweldige service en beleefde bediening          0.7          0.95 
#> 6 7          Verschrikkelijk. Ik had een vlieg in mijn so…   -0.3          0.733
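
One caveat, offered as an assumption rather than part of the answer above: textstem::stem_words() appears to wrap SnowballC::wordStem() with the English Porter stemmer as its default. SnowballC also ships a Dutch stemmer, so for Dutch text it may be worth comparing the two. A minimal sketch, assuming the SnowballC package is installed (the word list is only an illustration):

```r
library(SnowballC)

# Inflected and base forms taken from the example sentences
words <- c("geweldige", "geweldig", "beleefde", "beleefd")

# Stems from the English Porter stemmer and from the Dutch Snowball stemmer
stems_porter <- wordStem(words, language = "porter")
stems_dutch  <- wordStem(words, language = "dutch")

# Side-by-side comparison; inspect whether inflected forms collapse together
comparison <- data.frame(words, stems_porter, stems_dutch)
print(comparison)
```

Whichever stemmer is chosen, it has to be applied identically to the dictionary and to the text, otherwise the join on stem will miss matches again.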
