Is there a way to use for loops within dplyr to reduce the number of str_detect terms needed?
Question
I'm currently working on a project in which I need to classify about a hundred thousand strings based on their content.
The goal of this code is to identify whether a string matches a pattern, classify it into a particular bucket, and then save the end result to a csv. No string contains more than one matching pattern.
I realise that after a certain point my code gets a little unreadable, mostly because if I have to change one of, say, two hundred str_detect calls with the same format, I then have to find it in my case_when, and so on.
I'm looking for a way to integrate for loops and if conditionals into my function to improve readability and make modifying the str_detect calls easier.
I've tried swapping out the case_when/str_detect combination by defining a tibble that includes all my string classes, string terms and classifications. Following that, I've swapped out the case_when for a for loop that uses the tibble within str_detect, pulling out a specific string condition on each iteration.
# Working case_when version
library(dplyr)
library(readr)    # provides read_csv() and write_csv()
library(stringr)
# (?i) makes each pattern case-insensitive
a.str <- "(?i)Apple"
b.str <- "(?i)Banana"
c.str <- "(?i)Corn"
food_set <- read_csv("Food.csv")
# Test the food column (not the whole data frame) against each pattern;
# the column name food is taken from the output described below
food_identified <- food_set %>% mutate(
  food.type = case_when(
    str_detect(food, a.str) ~ "A",
    str_detect(food, b.str) ~ "B",
    str_detect(food, c.str) ~ "C"
  )
)
food_classified <- write_csv(food_identified, "Food_Classified.csv")
# Failing for loop version
library(dplyr)
library(stringr)
str_options <- tribble(
  ~variety.str, ~String,      ~Classification,
  #-----------/-------------/-------------------
  "a.str",      "(i?)Apple",  "A",
  "b.str",      "(i?)Banana", "B",
  "c.str",      "(i?)Corn",   "C"
)
food_set <- read_csv("Food.csv")
food_identified <- food_set %>% mutate(
  for (k in 1:3) {
    if(str_detect(food_set, str_options[k,2]) == TRUE) {
      food.type = str_options[k,3]
    }
    break
  }
)
food_classified <- write_csv(food_identified,"Food_Classified.csv")
The case_when code runs fine: it produces a table with two columns (food, food.type).
The for loop doesn't work: it fails with the error 'no applicable method for 'type' applied to an object of class "c('tbl_df', 'tbl', 'data.frame')"'.
Does anyone have an idea as to how I might be able to get this working?
Recommended answer
This could also be done with fuzzyjoin. One potential advantage, and also something to watch out for, is that it joins each string to every regex it matches, so a string that matches several patterns produces several rows.
library(tidyverse); library(fuzzyjoin)
food_set <- tibble(
food_set = c("sadgad(i?)Apple", "(i?)Bananaasdgas", "hgjdndg(i?)Cornadfba")
)
food_set %>%
regex_left_join(str_options, by = c("food_set" = "String"))
# A tibble: 3 x 4
  food_set             variety.str String     Classification
  <chr>                <chr>       <chr>      <chr>
1 sadgad(i?)Apple      a.str       (i?)Apple  A
2 (i?)Bananaasdgas     b.str       (i?)Banana B
3 hgjdndg(i?)Cornadfba c.str       (i?)Corn   C
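For comparison, the table-driven loop that the question was reaching for can also be written with only dplyr and stringr. The following is a minimal sketch, not the answer's method: the helper name classify_first_match and the sample data are hypothetical, the string column is assumed to be called food, and the patterns use the (?i) flag for case-insensitive matching (the (i?) form in the question's tribble is an optional-"i" group, not a flag).

```r
library(dplyr)
library(stringr)

# Lookup table: one row per pattern, in priority order.
str_options <- tribble(
  ~String,      ~Classification,
  "(?i)apple",  "A",
  "(?i)banana", "B",
  "(?i)corn",   "C"
)

# Hypothetical sample data standing in for Food.csv.
food_set <- tibble(food = c("Green APPLE", "banana bread", "Sweetcorn", "Rice"))

# Hypothetical helper: loop over the patterns and label each string with
# the classification of the first pattern it matches (NA if none match).
classify_first_match <- function(x, patterns, labels) {
  out <- rep(NA_character_, length(x))
  for (k in seq_along(patterns)) {
    # Only strings that are still unlabelled can receive label k.
    hit <- which(is.na(out) & str_detect(x, patterns[k]))
    out[hit] <- labels[k]
  }
  out
}

food_identified <- food_set %>%
  mutate(food.type = classify_first_match(
    food, str_options$String, str_options$Classification
  ))
```

The loop runs over the patterns rather than over the rows, so each str_detect call stays vectorised across the whole column, and adding or changing a pattern only means editing a row of str_options. Unlike the join, this keeps one row per input string and takes the first matching pattern.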