Mapping the topic of the review in R

Question

I have two datasets: Review Data and Topic Data.

My Review Data:

structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved", 
"Sports and physical exercise need to be given importance"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

My Topic Data:

structure(list(word = structure(2:1, .Label = c("canteen food", 
"sports and physical"), class = "factor"), Topic = structure(2:1, .Label = c("Canteen", 
"Sports "), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

Below is the dput of my desired output. I want to look up the words that appear in Topic Data and map the corresponding topic onto the Review Data:

structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved", 
"Sports and physical exercise need to be given importance"), class = "factor"), 
    Topic = structure(2:1, .Label = c("Canteen", "Sports "), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))

Recommended answer

What you want is something like a fuzzy join. Here's a brute-force approach that looks for a strict (but case-insensitive) substring match:

library(dplyr)
review %>%
  full_join(topic, by = character()) %>% # full cartesian expansion; in dplyr >= 1.1.0, cross_join(topic) is preferred
  group_by(word) %>%
  mutate(matched = grepl(word[1], Review, ignore.case = TRUE)) %>%
  ungroup() %>%
  filter(matched) %>%
  select(-word, -matched)
# # A tibble: 2 x 2
#   Review                                                   Topic    
#   <fct>                                                    <fct>    
# 1 Sports and physical exercise need to be given importance "Sports "
# 2 Canteen Food could be improved                           "Canteen"

It's a little brute-force in that it does a cartesian join of the two frames before testing with grepl, but you can't really avoid comparing every word against every review in some form.
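For comparison, here is a minimal base R sketch of the same lookup that skips the joined frame but still tests every word against every review. It rebuilds the question's data as plain character columns, and it assumes literal matching (`fixed = TRUE` on lower-cased text) rather than the regex matching used above; the helper name `hits` is made up for illustration.

```r
# The question's data, rebuilt with character columns instead of factors.
review <- data.frame(
  Review = c("Sports and physical exercise need to be given importance",
             "Canteen Food could be improved"),
  stringsAsFactors = FALSE
)
topic <- data.frame(
  word  = c("sports and physical", "canteen food"),
  Topic = c("Sports ", "Canteen"),
  stringsAsFactors = FALSE
)

# For each review, find the index of the first topic word that occurs in it
# as a literal, case-insensitive substring; NA when no word matches.
hits <- vapply(review$Review, function(r) {
  found <- vapply(topic$word, function(w) {
    grepl(tolower(w), tolower(r), fixed = TRUE)
  }, logical(1), USE.NAMES = FALSE)
  idx <- which(found)
  if (length(idx) > 0) idx[1] else NA_integer_
}, integer(1), USE.NAMES = FALSE)

# Map the matched topic back onto the reviews.
review$Topic <- topic$Topic[hits]
print(review)
```

Unlike the join, this keeps exactly one row per review and takes only the first matching word, so a review that matches several topics would need a different design.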

You can also use the fuzzyjoin package, which is meant for joins on fuzzy things (appropriately named).

fuzzyjoin::regex_left_join(review, topic, by = c(Review = "word"), ignore_case = TRUE)
# Warning: Coercing `pattern` to a plain character vector.
#                                                     Review                word   Topic
# 1 Sports and physical exercise need to be given importance sports and physical Sports 
# 2                           Canteen Food could be improved        canteen food Canteen

The warning appears because your columns are factors rather than character; it should be harmless. To hide the warning, wrap the call in suppressWarnings (a little strong). To prevent it, convert all applicable columns from factor to character, e.g. topic[] <- lapply(topic, as.character), and likewise for review$Review. Modify that if you have numeric columns, since it would convert them to character as well.
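One way to handle the numeric-column caveat is to convert only the columns that are actually factors. This is a sketch under assumptions: the column `n` is a hypothetical numeric column added for illustration and is not part of the question's data.

```r
# The question's topic data with factor columns, plus a hypothetical
# numeric column (n) to show that non-factor columns are left alone.
topic <- data.frame(
  word  = factor(c("sports and physical", "canteen food")),
  Topic = factor(c("Sports ", "Canteen")),
  n     = c(1, 2)
)

# Flag the factor columns, then convert only those to character.
is_fct <- vapply(topic, is.factor, logical(1))
topic[is_fct] <- lapply(topic[is_fct], as.character)

str(topic)
```

The same two lines can be applied to review before calling the join, so that no factor columns reach it.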
