How to search for a string in one column in other columns of a data frame


Problem description

I have a table, call it df, with 3 columns: the 1st is the title of a product, the 2nd is the description of a product, and the 3rd is a one-word string. What I need to do is run an operation on the entire table, creating 2 new columns (call them 'exists_in_title' and 'exists_in_description') that have either a 1 or 0 indicating whether the 3rd column exists in the 1st or 2nd column. I need it to be a simple 1:1 operation. For example, calling row 1 'A', I need to check whether cell A3 exists in A1 and use that to create the column exists_in_title, then check whether A3 exists in A2 and use that to create the column exists_in_description. Then move on to row B and go through the same operation. I have thousands of rows of data, so it's not realistic to do these one at a time, writing individual functions for each row; I definitely need a function or method that will run through every row in the table in one shot.

I've played around with grepl, pmatch, and str_count, but none seem to really do what I need. I think grepl is probably the closest to what I need. Here's an example of 2 lines of code I wrote that logically do what I want, but didn't seem to work:

df$exists_in_title <- grepl(df$A3, df$A1)

df$exists_in_description <- grepl(df$A3, df$A2)

However, when I run those I get the following message, which leads me to believe it did not work properly: "argument 'pattern' has length > 1 and only the first element will be used"

Any help on how to do this would be greatly appreciated. Thanks!

Recommended answer

grepl will work with mapply:

Sample data frame:

title <- c('eggs and bacon','sausage biscuit','pancakes')
description <- c('scrambled eggs and thickcut bacon','homemade biscuit with breakfast pattie', 'stack of sourdough pancakes')
keyword <- c('bacon','sausage','sourdough')
df <- data.frame(title, description, keyword, stringsAsFactors=FALSE)

Using grepl to search for matches:

df$exists_in_title <- mapply(grepl, pattern=df$keyword, x=df$title)
df$exists_in_description <- mapply(grepl, pattern=df$keyword, x=df$description)

Result:

            title                            description   keyword exists_in_title exists_in_description
1  eggs and bacon      scrambled eggs and thickcut bacon     bacon            TRUE                  TRUE
2 sausage biscuit homemade biscuit with breakfast pattie   sausage            TRUE                 FALSE
3        pancakes            stack of sourdough pancakes sourdough           FALSE                  TRUE
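
The question asked for 1/0 rather than TRUE/FALSE, and for plain keywords rather than regular expressions. As a minimal sketch under those assumptions (not part of the original answer), the same mapply call can pass fixed = TRUE through MoreArgs so the keyword is matched literally, and wrap the result in as.integer. The sample data is repeated so the block runs on its own:

```r
# Sample data repeated from above so this block is self-contained
df <- data.frame(
  title = c("eggs and bacon", "sausage biscuit", "pancakes"),
  description = c("scrambled eggs and thickcut bacon",
                  "homemade biscuit with breakfast pattie",
                  "stack of sourdough pancakes"),
  keyword = c("bacon", "sausage", "sourdough"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats each keyword as a literal string, not a regex.
# USE.NAMES = FALSE stops mapply from naming the result after the keywords.
# as.integer() turns TRUE/FALSE into 1/0.
df$exists_in_title <- as.integer(
  mapply(grepl, pattern = df$keyword, x = df$title,
         MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
)
df$exists_in_description <- as.integer(
  mapply(grepl, pattern = df$keyword, x = df$description,
         MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)
)

print(df)
```

Whether literal matching is the right choice depends on the data; if the keywords are meant to be patterns, drop the MoreArgs argument.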



Update I

You can also use dplyr and stringr:

library(dplyr)
# rowwise() makes grepl see one keyword and one title per call
df %>%
  rowwise() %>%
  mutate(exists_in_title = grepl(keyword, title),
         exists_in_description = grepl(keyword, description))

library(stringr)
# same idea with stringr; str_detect takes the string first, then the pattern
df %>%
  rowwise() %>%
  mutate(exists_in_title = str_detect(title, keyword),
         exists_in_description = str_detect(description, keyword))
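
A side note on the stringr version: str_detect is documented as vectorised over both string and pattern, so the rowwise() step may not be needed there. The sketch below is a variation on the answer, not part of it; the name res is only illustrative, and the sample data is repeated so the block runs on its own:

```r
library(dplyr)
library(stringr)

# Sample data repeated from above so this block is self-contained
df <- data.frame(
  title = c("eggs and bacon", "sausage biscuit", "pancakes"),
  description = c("scrambled eggs and thickcut bacon",
                  "homemade biscuit with breakfast pattie",
                  "stack of sourdough pancakes"),
  keyword = c("bacon", "sausage", "sourdough"),
  stringsAsFactors = FALSE
)

# str_detect pairs element i of the string with element i of the pattern,
# so each row's keyword is checked against that row's title/description.
res <- df %>%
  mutate(exists_in_title = str_detect(title, keyword),
         exists_in_description = str_detect(description, keyword))

print(res)
```

This does not apply to the grepl version, which only uses the first element of a multi-element pattern and therefore still needs rowwise() or mapply.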



Update II

Map is also an option, or, using more of the tidyverse, another option is purrr with stringr:

library(tidyverse)
df %>%
  mutate(exists_in_title = unlist(Map(function(x, y) grepl(x, y), keyword, title))) %>% 
  mutate(exists_in_description = map2_lgl(description, keyword,  str_detect))
