Removing similar but longer duplicates from vector

Question

For database cleanup, I have a vector of, say, dishes, and I want to remove all variants of a "base" dish, keeping only the base dish. For instance, if I have...

dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE", 
            "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA", 
            "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE")

... I want to remove all entries that already have a shorter matching version in the vector. The resulting vector would thus only include: "DAL BHAT", "HAMBURGER", "PIZZA".

Using a nested for loop and checking every entry against all the others works for this example, but it will take a long time on the large dataset at hand, and it is ugly code, I'd say.
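For reference, the brute-force check described above might look like the following sketch. It assumes that "shorter matching version" means "another entry that is a proper prefix of this one"; the helper name `has_shorter_base` is made up for illustration. Every entry is compared against every other entry, so the cost grows quadratically with the length of the vector.

```r
dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE",
            "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA",
            "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE")

# Hypothetical helper: TRUE if some *other* entry is a prefix of x
# (assumption: "matching" means prefix matching)
has_shorter_base <- function(x, all_items) {
  any(all_items != x & startsWith(x, all_items))
}

# One pass over all entries per entry: quadratic in length(dishes)
drop <- vapply(dishes, has_shorter_base, logical(1),
               all_items = dishes, USE.NAMES = FALSE)
baseline <- dishes[!drop]
print(baseline)
```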

It can be assumed that all entries are in caps and that the vector is already sorted. It cannot be assumed that the first item of the next base dish is always shorter than the previous entry.

Any suggestions on how to solve this in an efficient way?

BONUS QUESTION: Ideally, I only want to remove items from the initial vector if they are at least 3 characters longer than their shorter counterpart. In the above case, that would mean that "HAMBURGER2" would also be retained in the resulting vector.

Recommended answer

Here's the approach I'd take with this. I'd create a function that covers the conditions I need to consider, and use that on the input. I've added comments to explain what's happening in the function.

The function has 4 arguments:

  • invec: The input character vector.
  • thresh: How many leading characters to use to determine the "base" dish. Default = 5.
  • minlen: Your "BONUS" question: entries up to this many characters longer than their base dish are kept. Default = 3.
  • strict: Logical. If there are base dishes with nchar shorter than your thresh, do you want to lower the thresh or be strict about what you're looking at for the base? Default = FALSE. See the last example for how strict might work.

myfun <- function(invec, thresh = 5, minlen = 3, strict = FALSE) {
  # Bookkeeping -- sort, unique, all upper case
  invec <- sort(unique(toupper(invec)))
  # More bookkeeping -- thresh should not be longer
  # than the shortest dish unless strict = TRUE
  thresh <- if (isTRUE(strict)) thresh else min(min(nchar(invec)), thresh)
  # Use `thresh` to get the `stubs`
  stubs <- invec[!duplicated(substr(invec, 1, thresh))]
  # loop through the stubs and do two things:
  #   - Match the dish with the stub
  #   - Return the base dish and any dishes within the minlen
  unlist(
    lapply(stubs, function(x) {
      temp <- grep(x, invec, value = TRUE, fixed = TRUE)
      temp[temp == x | nchar(temp) <= nchar(x) + minlen]
      }), 
    use.names = FALSE)
}
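
To make the stub step easier to follow, here is a small sketch that runs just the first half of the function on the sample data. It only repeats calls already used inside `myfun`; the intermediate variable `prefixes` is introduced here for illustration.

```r
dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE",
            "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA",
            "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE")

# Same bookkeeping as in myfun()
invec <- sort(unique(toupper(dishes)))
thresh <- min(min(nchar(invec)), 5)

# First `thresh` characters of every dish
prefixes <- substr(invec, 1, thresh)

# The first dish seen for each distinct prefix becomes the stub;
# because the vector is sorted, that is the shortest dish of its group
stubs <- invec[!duplicated(prefixes)]
print(stubs)
```

The stubs are then matched back against the full vector, and `minlen` decides which of the longer matches survive.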

Your sample data:

dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE", 
            "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA", 
            "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE")    

Here are the results:

myfun(dishes, minlen = 0)
# [1] "DAL BHAT"  "HAMBURGER" "PIZZA" 

myfun(dishes)
# [1] "DAL BHAT"   "HAMBURGER"  "HAMBURGER2" "PIZZA" 
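
One thing to keep in mind: `grep(x, invec, fixed = TRUE)` matches the stub anywhere in a string, not only at the start. If your data can contain a stub in the middle of another dish name, a prefix-anchored variant may be safer. The sketch below is not part of the original function; `myfun_prefix` is a hypothetical name, and the only change is swapping `grep` for base R's `startsWith`.

```r
# Hypothetical variant of myfun(): identical except that stubs are
# matched only at the start of each entry
myfun_prefix <- function(invec, thresh = 5, minlen = 3, strict = FALSE) {
  invec <- sort(unique(toupper(invec)))
  thresh <- if (isTRUE(strict)) thresh else min(min(nchar(invec)), thresh)
  stubs <- invec[!duplicated(substr(invec, 1, thresh))]
  unlist(
    lapply(stubs, function(x) {
      # startsWith() anchors the match at the first character
      temp <- invec[startsWith(invec, x)]
      temp[temp == x | nchar(temp) <= nchar(x) + minlen]
    }),
    use.names = FALSE)
}

dishes <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE",
            "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA",
            "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE")

res_dishes <- myfun_prefix(dishes)

# "ADAL" contains "DAL" but does not start with it, so it is
# reported once instead of being picked up again by the "DAL" stub
res_edge <- myfun_prefix(c("DAL", "ADAL"))

print(res_dishes)
print(res_edge)
```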

Here's some more sample data. Note that in "dishes2" the data are no longer sorted and there's a new item "DAL", and in "dishes3" you also have lowercase dishes.

dishes2 <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE", 
             "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA", 
             "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE", "DAL")

dishes3 <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE", 
             "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA", 
             "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE", "DAL", "pizza!!")

Here's the function on those vectors:

myfun(dishes2, 4)
# [1] "DAL"        "HAMBURGER"  "HAMBURGER2" "PIZZA"   

myfun(dishes3)
# [1] "DAL"        "HAMBURGER"  "HAMBURGER2" "PIZZA"      "PIZZA!!"  

myfun(dishes3, strict = TRUE)
# [1] "DAL"        "DAL BHAT"   "HAMBURGER"  "HAMBURGER2" "PIZZA"      "PIZZA!!"  
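
To see why `strict = TRUE` brings "DAL BHAT" back, it may help to compare the prefixes that the two settings produce for "dishes3". This is only an illustrative sketch of the `thresh` bookkeeping inside the function; the variable names `thresh_default`, `thresh_strict`, `prefixes_default` and `prefixes_strict` are made up for the example.

```r
dishes3 <- c("DAL BHAT", "DAL BHAT-(SPICY)", "DAL BHAT WITH EXTRA RICE",
             "HAMBURGER", "HAMBURGER-BIG", "HAMBURGER2", "PIZZA",
             "PIZZA (PROSCIUTO)", "PIZZA_BOLOGNESE", "DAL", "pizza!!")

invec <- sort(unique(toupper(dishes3)))

# strict = FALSE: thresh is lowered to the shortest dish ("DAL", 3 chars)
thresh_default <- min(min(nchar(invec)), 5)
# strict = TRUE: thresh stays at the requested value
thresh_strict <- 5

# With 3 characters, "DAL" and "DAL BHAT" share one prefix
prefixes_default <- unique(substr(invec, 1, thresh_default))
# With 5 characters, "DAL" and "DAL B" are different prefixes,
# so "DAL BHAT" becomes a base dish of its own
prefixes_strict <- unique(substr(invec, 1, thresh_strict))

print(prefixes_default)
print(prefixes_strict)
```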
