Mapping the topic of the review in R
Question
I have two datasets, Review Data and Topic Data.
Dput of my Review Data:
structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved",
"Sports and physical exercise need to be given importance"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
Dput of my Topic Data:
structure(list(word = structure(2:1, .Label = c("canteen food",
"sports and physical"), class = "factor"), Topic = structure(2:1, .Label = c("Canteen",
"Sports "), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
Dput of my desired output. I want to look up the words that appear in Topic Data and map the matching Topic onto the Review Data:
structure(list(Review = structure(2:1, .Label = c("Canteen Food could be improved",
"Sports and physical exercise need to be given importance"), class = "factor"),
Topic = structure(2:1, .Label = c("Canteen", "Sports "), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
Recommended answer
What you want is something like a fuzzy join. Here's a brute-force approach that looks for a strict (but case-insensitive) substring match, assuming the two frames are stored as review and topic:
library(dplyr)
review %>%
full_join(topic, by = character()) %>% # full cartesian expansion (dplyr >= 1.1.0 prefers cross_join(topic))
group_by(word) %>%
mutate(matched = grepl(word[1], Review, ignore.case = TRUE)) %>%
ungroup() %>%
filter(matched) %>%
select(-word, -matched)
# # A tibble: 2 x 2
# Review Topic
# <fct> <fct>
# 1 Sports and physical exercise need to be given importance "Sports "
# 2 Canteen Food could be improved "Canteen"
It's a little brute-force in that it does a cartesian join of the frames before testing with grepl, but ... you can't really avoid some parts of that.
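The same expand-then-filter idea can be sketched in base R, without dplyr. This is only an illustration under a few assumptions: the two frames are re-created here as character columns (rather than the factors in the question), and each word is matched as literal text via tolower() and fixed = TRUE instead of as a regular expression, which is a deliberate swap from the grepl(..., ignore.case = TRUE) call above.

```r
# Data re-created from the question, stored as character instead of factor
review <- data.frame(
  Review = c("Sports and physical exercise need to be given importance",
             "Canteen Food could be improved"),
  stringsAsFactors = FALSE
)
topic <- data.frame(
  word  = c("sports and physical", "canteen food"),
  Topic = c("Sports ", "Canteen"),  # trailing space kept as in the question
  stringsAsFactors = FALSE
)

# merge() with by = NULL returns the cartesian product of the two frames
crossed <- merge(review, topic, by = NULL)

# For each row, test whether Review contains word as a literal,
# case-insensitive substring
hit <- mapply(
  function(w, r) grepl(tolower(w), tolower(r), fixed = TRUE),
  crossed$word, crossed$Review,
  USE.NAMES = FALSE
)

# Keep only the matching pairs and drop the helper column
result <- crossed[hit, c("Review", "Topic")]
rownames(result) <- NULL
result
```

Matching literally avoids surprises when a word contains regex metacharacters such as "." or "("; if the words are meant to be patterns, the regex form in the dplyr version is the one to keep.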
You can also use the fuzzyjoin package, which is meant for joins on fuzzy things (appropriately named).
fuzzyjoin::regex_left_join(review, topic, by = c(Review = "word"), ignore_case = TRUE)
# Warning: Coercing `pattern` to a plain character vector.
# Review word Topic
# 1 Sports and physical exercise need to be given importance sports and physical Sports
# 2 Canteen Food could be improved canteen food Canteen
The warning is because your columns are factors, not character; it should be harmless. If you want to hide the warning, you can use suppressWarnings (a little strong). If you want to prevent the warning, convert all applicable columns from factor to character (e.g., topic[] <- lapply(topic, as.character), and the same for review$Review). Modify that call if you have numeric columns, since as.character would turn them into strings as well.
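One way to handle the numeric-column caveat is to convert only the columns that are actually factors. The sketch below is a minimal illustration, not part of the original answer; the weight column is hypothetical and added only to show that numeric columns are left untouched.

```r
# Topic data from the question, plus a hypothetical numeric column `weight`
topic <- data.frame(
  word   = factor(c("sports and physical", "canteen food")),
  Topic  = factor(c("Sports ", "Canteen")),
  weight = c(1.5, 2.0)  # hypothetical, for illustration only
)

# Find the factor columns, then convert only those to character
is_fct <- vapply(topic, is.factor, logical(1))
topic[is_fct] <- lapply(topic[is_fct], as.character)

# Inspect the resulting column types
str(topic)
```

After this, word and Topic are character and weight is still numeric, so the pattern column passed to the regex join should no longer need coercing.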