Group together levels with similar names in R


Question

I have a variable q with various levels. Some of the levels are actually the same but were recorded inconsistently.

> length(q)
[1] 13490
> levels(q)
  [1] ""                          " "                        
  [3] "?"                         "."                        
  [5] "Activelle"                 "CERACETT"                 
  [7] "CERACETTE"                 "CERASETTE"                
  [9] "cerazette"                 "Cerazette"                
 [11] "CERAZETTE"                 "CERAZETTI"                
 [13] "CEVAZETTE"                 "cilest"                   
 [15] "Cilest"                    "Cileste"                  
 [17] "Conludag"                  "COPALETTA?"               
 [19] "DEPO..."                   "Depo-Provera"             
 [21] "Depo. Pro Vera"            "DEPOPROVERA"              
 [23] "DEPO PROVERA"              "depoprovin"               
 [25] "DEPROVERA"                 "DESOLETT"                 
 [27] "desorelle"                 "Diane"                    
 [29] "Diane mite"                "Divana"                   
 [31] "ENDEVINA"                  "Estradot"                 
 [33] "ETHISYLESTRA,LEVONORGESTR" "Evra"                     
 [35] "EXCLUTENA"                 "EXKLUTENA"                
 [37] "EXLUENTA 0,5MG"            "EXLUTENA"                 
 [39] "Femanest"                  "femenest"                 
 [41] "gastonette"                "Harmonet"                 
 [43] "hormon"                    "Hormonspiral"             
 [45] "IMPLANON"                  "INPLANON"                 
 [47] "KOMMER EJ IH\xc5G"         "LEBONOVA"                 
 [49] "LEMINOVA"                  "lemonora"                 
 [51] "LENONOVA"                  "LENOR"                    
 [53] "lenova"                    "Lenova"                   
 [55] "LENOVA"                    "LENOVA?"                  
 [57] "Leonova"                   "Levanova"                 
 [59] "LEVENOVA"                  "LEVINA"                   
 [61] "Levinova"                  "LEVINOVA"                 
 [63] "LEVIONOVA"                 "Levnova"                  
 [65] "levonova"                  "Levonova"                 
 [67] "LEVONOVA"                  "Levonova hormonspiral"    
 [69] "Levonova lykkja"           "Lindinette"               
 [71] "lindynette"                "Lindynette"               
 [73] "loette"                    "lyndynette"               
 [75] "malonetta"                 "Marvelon"                 
 [77] "Meniva"                    "Mercilon"                 
 [79] "Mereilom"                  "merivan"                  
 [81] "Microgyn"                  "microgynon"               
 [83] "Microgynon"                "Mikrogyn"                 
 [85] "Milvane"                   "MINERVA/LEVONORG."        
 [87] "MINI P"                    "MINI-P"                   
 [89] "Mini-pe"                   "mini-pl"                  
 [91] "MINIRA"                    "MINNS EJ"                 
 [93] "minulet"                   "Minulet"                  
 [95] "minulet p-piller"          "MIRANDA"                  
 [97] "Mircne"                    "mirena"                   
 [99] "Mirena"                    "MIRENA"                   
[101] "mirena levonorge"          "MIRENA LEVONORGESTREL"    
[103] "Modina p-piller"           "Mod turner: milv"         
[105] "NEOULETTA"                 "NEOVLETTA"                
[107] "NORLEVO"                   "NOV?"                     
[109] "Novaring"                  "novynette"                
[111] "Novynette"                 "nuva ring"                
[113] "Nuva ring"                 "NUVARING"                 
[115] "Østradiol dlf 2"           "Østradiolgel"             
[117] "P-plaster"                 "PROVERA"                  
[119] "RESTOVAR"                  "spiral"                   
[121] "Spiral"                    "Synfase"                  
[123] "T-GYN"                     "triminetta sando"         
[125] "TRIMORDIOL"                "TRINOVUM"                 
[127] "TRIONETTA 28"              "TRIREGOL"                 
[129] "T-spiral"                  "Vagifem"                  
[131] "VET EJ"                    "yas, bayer"               
[133] "yasmin"                    "Yasmin"                   
[135] "YASMINELL"                 "yasminelle"               
[137] "Yasminelle"                "YAZ"                      
[139] "ZYRONA"   

I would like to group all similar levels. For example, in this case I want to group together cerazetti, cerasete, ceracett... How can I do that?

> dput(levels(q))
c("", " ", "?", ".", "Activelle", "CERACETT", "CERACETTE", "CERASETTE", 
"cerazette", "Cerazette", "CERAZETTE", "CERAZETTI", "CEVAZETTE", 
"cilest", "Cilest", "Cileste", "Conludag", "COPALETTA?", "DEPO...", 
"Depo-Provera", "Depo. Pro Vera", "DEPOPROVERA", "DEPO PROVERA", 
"depoprovin", "DEPROVERA", "DESOLETT", "desorelle", "Diane", 
"Diane mite", "Divana", "ENDEVINA", "Estradot", "ETHISYLESTRA,LEVONORGESTR", 
"Evra", "EXCLUTENA", "EXKLUTENA", "EXLUENTA 0,5MG", "EXLUTENA", 
"Femanest", "femenest", "gastonette", "Harmonet", "hormon", "Hormonspiral", 
"IMPLANON", "INPLANON", "KOMMER EJ IH\xc5G", "LEBONOVA", "LEMINOVA", 
"lemonora", "LENONOVA", "LENOR", "lenova", "Lenova", "LENOVA", 
"LENOVA?", "Leonova", "Levanova", "LEVENOVA", "LEVINA", "Levinova", 
"LEVINOVA", "LEVIONOVA", "Levnova", "levonova", "Levonova", "LEVONOVA", 
"Levonova hormonspiral", "Levonova lykkja", "Lindinette", "lindynette", 
"Lindynette", "loette", "lyndynette", "malonetta", "Marvelon", 
"Meniva", "Mercilon", "Mereilom", "merivan", "Microgyn", "microgynon", 
"Microgynon", "Mikrogyn", "Milvane", "MINERVA/LEVONORG.", "MINI P", 
"MINI-P", "Mini-pe", "mini-pl", "MINIRA", "MINNS EJ", "minulet", 
"Minulet", "minulet p-piller", "MIRANDA", "Mircne", "mirena", 
"Mirena", "MIRENA", "mirena levonorge", "MIRENA LEVONORGESTREL", 
"Modina p-piller", "Mod turner: milv", "NEOULETTA", "NEOVLETTA", 
"NORLEVO", "NOV?", "Novaring", "novynette", "Novynette", "nuva ring", 
"Nuva ring", "NUVARING", "Østradiol dlf 2", "Østradiolgel", 
"P-plaster", "PROVERA", "RESTOVAR", "spiral", "Spiral", "Synfase", 
"T-GYN", "triminetta sando", "TRIMORDIOL", "TRINOVUM", "TRIONETTA 28", 
"TRIREGOL", "T-spiral", "Vagifem", "VET EJ", "yas, bayer", "yasmin", 
"Yasmin", "YASMINELL", "yasminelle", "Yasminelle", "YAZ", "ZYRONA"
)
> 

Recommended answer

You can use the function agrep, which searches for approximate matches. It uses the generalized Levenshtein edit distance, and the argument max.distance sets the maximum distance allowed for a match.
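
As a quick illustration of how max.distance behaves, here is a minimal sketch on a toy vector. The vector name drugs is made up for this example; the strings come from the levels in the question. A value below 1 is interpreted as a fraction of the pattern length, so 0.1 on a 9-character pattern allows roughly one edit.

```r
# Toy vector (hypothetical name); the strings are levels from the question.
drugs <- c("CERAZETTE", "cerazette", "CERAZETTI", "Mirena", "Yasmin")

# Indices of the elements that approximately match the pattern.
# max.distance < 1 is read as a fraction of the pattern length.
idx <- agrep("CERAZETTE", drugs, ignore.case = TRUE, max.distance = 0.1)

# value = TRUE returns the matching strings instead of their indices.
hits <- agrep("CERAZETTE", drugs, ignore.case = TRUE,
              max.distance = 0.1, value = TRUE)
print(hits)
```

Raising max.distance makes the matching more tolerant, at the cost of pulling unrelated names into the same match, so it is worth trying a few values on your own data.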

Taking this vector (the one you posted, without entries such as the empty string "" and "KOMMER EJ IH\xc5G"):

x <- c("Activelle", "CERACETTE", "cerazette", "CERAZETTE", "CEVAZETTE", 
"Cilest", "Conludag", "DEPO...", "Depo. Pro Vera", "DEPO PROVERA", 
"DEPROVERA", "desorelle", "Diane mite", "ENDEVINA", "ETHISYLESTRA,LEVONORGESTR", 
"EXCLUTENA", "EXLUENTA 0,5MG", "Femanest", "gastonette", "hormon", 
"IMPLANON", "LEMINOVA", "LENONOVA", "lenova", "LENOVA", "Leonova", 
"LEVENOVA", "Levinova", "LEVIONOVA", "levonova", "LEVONOVA", 
"Levonova lykkja", "lindynette", "loette", "malonetta", "Meniva", 
"Mereilom", "Microgyn", "Microgynon", "Milvane", "MINI P", "Mini-pe", 
"MINIRA", "minulet", "minulet p-piller", "Mircne", "Mirena", 
"mirena levonorge", "Modina p-piller", "NEOULETTA", "NORLEVO", 
"Novaring", "Novynette", "Nuva ring", "Østradiol dlf 2", "P-plaster", 
"RESTOVAR", "Spiral", "T-GYN", "TRIMORDIOL", "TRIONETTA 28", 
"T-spiral", "VET EJ", "yasmin", "YASMINELL", "Yasminelle", "ZYRONA", 
"CERACETT", "CERASETTE", "Cerazette", "CERAZETTI", "cilest", 
"Cileste", "COPALETTA?", "Depo-Provera", "DEPOPROVERA", "depoprovin", 
"DESOLETT", "Diane", "Divana", "Estradot", "EXKLUTENA", "EXLUTENA", 
"femenest", "Harmonet", "Hormonspiral", "INPLANON", "LEBONOVA", 
"lemonora", "LENOR", "Lenova", "LENOVA?", "Levanova", "LEVINA", 
"LEVINOVA", "Levnova", "Levonova", "Levonova hormonspiral", "Lindinette", 
"Lindynette", "lyndynette", "Marvelon", "Mercilon", "merivan", 
"microgynon", "Mikrogyn", "MINERVA/LEVONORG.", "MINI-P", "mini-pl", 
"MINNS EJ", "Minulet", "MIRANDA", "mirena", "MIRENA", "MIRENA LEVONORGESTREL", 
"Mod turner: milv", "NEOVLETTA", "novynette", "nuva ring", "NUVARING", 
"Østradiolgel", "PROVERA", "spiral", "Synfase", "triminetta sando", 
"TRINOVUM", "TRIREGOL", "Vagifem", "yas, bayer", "Yasmin", "yasminelle")

You can do this:

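groups <- list()
i <- 1
# Repeat until every element of x has been assigned to a group
while(length(x) > 0)
{
  # Indices of all remaining elements that approximately match the first one
  id <- agrep(x[1], x, ignore.case = TRUE, max.distance = 0.1)
  groups[[i]] <- x[id]
  # Drop the grouped elements so they are not matched again
  x <- x[-id]
  i <- i + 1
}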

The first groups are as follows:

head(groups)
[[1]]
[1] "Activelle"

[[2]]
[1] "CERACETTE" "cerazette" "CERAZETTE" "CERACETT"  "CERASETTE" "Cerazette"

[[3]]
[1] "CEVAZETTE"

[[4]]
[1] "Cilest"  "cilest"  "Cileste"

[[5]]
[1] "Conludag"

[[6]]
[1] "DEPO..."

Be aware that the code above removes elements from x as it goes. When the loop finishes, the vector x is empty.
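
If you would rather keep the original vector and also merge the levels of the factor itself, something along these lines might work. This is only a sketch, not part of the answer above: group_similar, lookup and q2 are hypothetical names, and the small factor q stands in for the real data. It assumes the levels contain no NA values.

```r
# Hypothetical helper: same greedy agrep grouping, but it works on a copy,
# so the caller's vector is left untouched.
group_similar <- function(x, max.distance = 0.1) {
  # Drop empty or blank entries, which are not usable as patterns
  remaining <- unique(x[nzchar(trimws(x))])
  groups <- list()
  while (length(remaining) > 0) {
    id <- agrep(remaining[1], remaining, ignore.case = TRUE,
                max.distance = max.distance)
    # Always consume the pattern itself so the loop is guaranteed to end
    id <- union(1L, id)
    groups[[length(groups) + 1]] <- remaining[id]
    remaining <- remaining[-id]
  }
  groups
}

# Small stand-in for the factor q from the question
q <- factor(c("cerazette", "CERAZETTE", "Cerazette",
              "Mirena", "mirena", "Yasmin"))

groups <- group_similar(levels(q))

# Label each group with its first member and build an old -> new lookup
lookup <- unlist(lapply(groups, function(g) setNames(rep(g[1], length(g)), g)))

# Assigning duplicated level names to a factor merges those levels
q2 <- q
levels(q2) <- unname(lookup[levels(q)])
print(table(q2))
```

Because the grouping is greedy, the result depends on the order of the levels and on max.distance, so it is worth inspecting the groups by hand before overwriting the real data.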
