Dutch sentiment analysis using R

Question

In RStudio, I have a column containing Dutch sentences to which I would like to add a polarity score between -1.0 and +1.0 via sentiment analysis. I've already tried to use the pattern.nlp package from jwfijffels, but this didn't work for me. I found an instruction on https://github.com/bnosac/pattern.nlp which explains that, in order for the package to work, you should download a specific version of Python and perform some additional steps. However, these steps are a bit vague to me.

Is there someone who can explain this installation process to me in more detail? Actually, the whole section under "Installation" is a bit of a mystery to me. What should I download specifically? Where do I run the command pip install pattern? How do I properly set the PATH? It would be much appreciated if someone could guide me through it step by step.

Or: if someone knows another way to perform sentiment analysis on text, I would of course be open to it, e.g. translating the Dutch sentences to English and then performing the sentiment analysis. Or would such a translation be a bad idea?

Here is a set of 6 Dutch sentences.

text <- c("Slechte bediening, van begin tot eind",
         "Het eten was heerlijk en de bediening was fantastisch",
         "Geweldige service en beleefde bediening",
         "Verschrikkelijk. Ik had een vlieg in mijn soep", 
         "Het was oké. De bediening kon wat beter, maar het eten was wel lekker. Leuk sfeertje wel!",
         "Ondanks dat het druk was toch op tijd ons eten gekregen. Complimenten aan de kok voor het op smaak brengen van mijn biefstuk")
identifier <- c("3", "4", "6", "7", "1", "5")
df <- data.frame(identifier, text)

Answer

Sentiment analysis (using a dictionary) is basically just a pattern-matching task. I think this becomes clear when using the tidytext package and reading the book about it.
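
To make that concrete, here is a toy illustration (not part of the original answer) with a hypothetical two-word dictionary: scoring really is just matching tokens against a lookup table and averaging the matched values.

library(tidyverse)
library(tidytext)

# hypothetical two-word dictionary, purely for illustration
toy_dict <- tibble(word = c("heerlijk", "slechte"), polarity = c(0.8, -0.7))

tibble(text = "Slechte bediening, maar het eten was heerlijk") %>% 
  unnest_tokens(word, text) %>%           # split into lowercase word tokens
  inner_join(toy_dict, by = "word") %>%   # keep only tokens found in the dictionary
  summarise(polarity = mean(polarity))    # average the matched scores: (-0.7 + 0.8) / 2 = 0.05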

So I wouldn't bother with such a complex setup here. Instead, I would convert the dictionary they are using (which is from here) into a data.frame and then use tidytext. Unfortunately, the dictionary is stored in XML format, which I'm not very familiar with, so the code looks a little hacky:

library(tidyverse)
library(xml2)
library(tidytext)

sentiment_nl <- read_xml(
  "https://raw.githubusercontent.com/clips/pattern/master/pattern/text/nl/nl-sentiment.xml"
) %>% 
  as_list() %>%                                        # turn the XML document into a nested list
  .[[1]] %>%                                           # drop down to the root node's children (the word entries)
  map_df(function(x) {                                 # stack all word attributes into one long name/value tibble
    tibble::enframe(attributes(x))
  }) %>% 
  mutate(id = cumsum(str_detect("form", name))) %>%    # increment the id each time a "form" attribute appears, i.e. a new word starts
  unnest(value) %>% 
  pivot_wider(id_cols = id) %>%                        # reshape to one row per word, one column per attribute
  mutate(form = tolower(form), # lowercase all words to ignore case during matching
         polarity = as.numeric(polarity),
         subjectivity = as.numeric(subjectivity),
         intensity = as.numeric(intensity),
         confidence = as.numeric(confidence))

But the output is correct:

head(sentiment_nl)
#> # A tibble: 6 x 11
#>      id form  cornetto_id cornetto_synset… wordnet_id pos   sense polarity
#>   <int> <chr> <chr>       <chr>            <chr>      <chr> <chr>    <dbl>
#> 1     1 amst… r_a-16677   ""               ""         JJ    van …      0  
#> 2     2 ange… r_a-8929    ""               ""         JJ    Enge…      0.1
#> 3     3 arab… r_a-16693   ""               ""         JJ    van …      0  
#> 4     4 arde… r_a-17252   ""               ""         JJ    van …      0  
#> 5     5 arnh… r_a-16698   ""               ""         JJ    van …      0  
#> 6     6 asse… r_a-16700   ""               ""         JJ    van …      0  
#> # … with 3 more variables: subjectivity <dbl>, intensity <dbl>,
#> #   confidence <dbl>
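
If the as_list() detour feels too hacky, an alternative sketch (not from the original answer) pulls the attributes of each <word> entry directly with xml2's XPath helpers; under the assumption that the lexicon is a flat list of <word> elements, it should yield an equivalent table:

library(tidyverse)
library(xml2)

doc <- read_xml(
  "https://raw.githubusercontent.com/clips/pattern/master/pattern/text/nl/nl-sentiment.xml"
)

sentiment_nl <- xml_find_all(doc, "//word") %>%        # every <word> node in the lexicon
  map_dfr(~ as_tibble(as.list(xml_attrs(.x)))) %>%     # one row per node, one column per attribute
  mutate(form = tolower(form),                         # lowercase for case-insensitive matching
         across(c(polarity, subjectivity, intensity, confidence), as.numeric))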

Now we can use the functions from tidytext and the broader tidyverse to look up the words in the dictionary and attach a score to each word. summarise() is used to get exactly one value per text (that's also why you need the text_id).

df <- data.frame(text = c("Het eten was heerlijk en de bediening was fantastisch", 
                          "Verschrikkelijk. Ik had een vlieg in mijn soep", 
                          "Het was oké. De bediening kon wat beter, maar het eten was wel lekker. Leuk sfeertje wel!",
                          "Ondanks dat het druk was toch op tijd ons eten gekregen. Complimenten aan de kok voor het op smaak brengen van mijn biefstuk"))

df %>% 
  mutate(text_id = row_number()) %>%                            # one id per sentence
  unnest_tokens(output = word, input = text, drop = FALSE) %>%  # one row per word, original text kept
  inner_join(sentiment_nl, by = c("word" = "form")) %>%         # attach dictionary scores to matched words
  group_by(text_id) %>% 
  summarise(text = head(text, 1),
            polarity = mean(polarity),
            subjectivity = mean(subjectivity),
            .groups = "drop")
#> # A tibble: 4 x 4
#>   text_id text                                             polarity subjectivity
#>     <int> <chr>                                               <dbl>        <dbl>
#> 1       1 Het eten was heerlijk en de bediening was fanta…    0.56         0.72 
#> 2       2 Verschrikkelijk. Ik had een vlieg in mijn soep     -0.5          0.9  
#> 3       3 Het was oké. De bediening kon wat beter, maar h…    0.6          0.98 
#> 4       4 Ondanks dat het druk was toch op tijd ons eten …   -0.233        0.767
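
Note that inner_join() keeps only sentences containing at least one dictionary word, so a text with no matches silently disappears from the result. A small sketch, not part of the original answer and assuming the identifier and text vectors from the question are still in the workspace, that scores all six sentences and keeps every row via a left_join:

library(tidyverse)
library(tidytext)

df_all <- tibble(identifier, text)   # the six sentences from the question

scores <- df_all %>% 
  mutate(text_id = row_number()) %>% 
  unnest_tokens(output = word, input = text, drop = FALSE) %>% 
  inner_join(sentiment_nl, by = c("word" = "form")) %>% 
  group_by(text_id) %>% 
  summarise(polarity = mean(polarity),
            subjectivity = mean(subjectivity),
            .groups = "drop")

# left_join keeps sentences without any matched words; their scores stay NA
df_all %>% 
  mutate(text_id = row_number()) %>% 
  left_join(scores, by = "text_id")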

As I said, more on this (and on NLP) is explained at tidytextmining.com, so don't worry if this looks complicated to you now.
