Subset a df using partial match with multiple criteria


Problem description

Here is the dataset:

company <- c("Coca-Cola Inc.", "DF, CocaCola",
             "COCA-COLA", "PepsiCo Inc.", "Beverages Distribution")
brand <- c("Coca-Cola Zero", "N/A", "Coca-Cola", "Pepsi", "soft drink")
vol <- c("2456", "1653", "19", "2766", "167")
data <- data.frame(company, brand, vol)
data

Result:

                 company             brand    vol
1         Coca-Cola Inc.    Coca-Cola Zero   2456
2           DF, CocaCola               N/A   1653
3              COCA-COLA         Coca-Cola     19
4           PepsiCo Inc.             Pepsi   2766
5 Beverages Distribution        soft drink    167

Let's say this is the imported volume by brand.

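The task is to subset the data frame so that it shows only the observations related to Coca-Cola and no other brand.

  • The problem is that Coca-Cola is written in several different ways.
  • In addition, we know that the company Beverages Distribution imports only Coca-Cola, even though the table above does not say so.

We need to partially match the company and brand variables against a list of criteria (keys):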

company_key <- c("coca-", "cocacola", "coca cola", "beverages distribution")
brand_key <- c("coca-", "cocacola", "coca cola")

I'm struggling to implement this idea:

Subset the data if brand partially matches any key from the brand_key vector OR company partially matches any key from company_key.

So, keep only the rows in which:

(brand observation partially matches "coca-" OR "cocacola" OR "coca cola")

OR

(company observation partially matches "coca-" OR "cocacola" OR "coca cola" OR "beverages distribution")

Note: the matching needs to be case-insensitive.

Ideal output:

                 company             brand    vol
1         Coca-Cola Inc.    Coca-Cola Zero   2456
2           DF, CocaCola               N/A   1653
3              COCA-COLA         Coca-Cola     19
4 Beverages Distribution        soft drink    167

Any ideas? Thanks in advance :)

Recommended answer

Use a regular expression with its | (or) operator to combine the keys into a single pattern. The ignore.case argument makes the match case-insensitive.
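To make the | step concrete, here is a small sketch of what the collapsed pattern looks like and how grepl treats it. The variable name pattern is introduced here only for illustration; the keys are the ones from the question.

```r
company_key <- c("coca-", "cocacola", "coca cola", "beverages distribution")

# Collapse the keys into a single alternation: "key1|key2|..."
pattern <- paste0(company_key, collapse = "|")
pattern
# [1] "coca-|cocacola|coca cola|beverages distribution"

# grepl() returns TRUE wherever any one of the alternatives occurs
result <- grepl(pattern, c("COCA-COLA", "PepsiCo Inc."), ignore.case = TRUE)
result
# [1]  TRUE FALSE
```

Because the keys are pasted into a regular expression, any key containing a regex metacharacter (such as . or +) would be interpreted as a pattern rather than as literal text.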

# TRUE for rows where company matches a company key OR brand matches a brand key
index <- grepl(paste0(company_key, collapse = "|"), data$company, ignore.case = TRUE) |
    grepl(paste0(brand_key, collapse = "|"), data$brand, ignore.case = TRUE)

data[index,]  

#                 company          brand  vol
#1         Coca-Cola Inc. Coca-Cola Zero 2456
#2           DF, CocaCola            N/A 1653
#3              COCA-COLA      Coca-Cola   19
#5 Beverages Distribution     soft drink  167
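If the keys should be treated as literal text rather than as regular expressions, one possible variation is sketched below. The helper matches_any is hypothetical (it is not part of the answer above), and it assumes that lower-casing both sides is an acceptable way to ignore case, since grepl ignores ignore.case when fixed = TRUE.

```r
company <- c("Coca-Cola Inc.", "DF, CocaCola",
             "COCA-COLA", "PepsiCo Inc.", "Beverages Distribution")
brand <- c("Coca-Cola Zero", "N/A", "Coca-Cola", "Pepsi", "soft drink")
vol <- c("2456", "1653", "19", "2766", "167")
data <- data.frame(company, brand, vol)

company_key <- c("coca-", "cocacola", "coca cola", "beverages distribution")
brand_key <- c("coca-", "cocacola", "coca cola")

# Hypothetical helper: TRUE where x contains any key as literal text.
# Both sides are lower-cased to get a case-insensitive comparison.
matches_any <- function(x, keys) {
  x <- tolower(as.character(x))
  # One logical vector per key, each the same length as x
  hits <- lapply(tolower(keys), function(k) grepl(k, x, fixed = TRUE))
  # Combine the per-key results with OR, starting from all FALSE
  Reduce(`|`, hits, rep(FALSE, length(x)))
}

index <- matches_any(data$company, company_key) |
  matches_any(data$brand, brand_key)
subset_data <- data[index, ]
subset_data
```

With the data from the question this keeps the same rows as the regex version (1, 2, 3 and 5); the difference only shows up when a key contains characters that are special in regular expressions.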
