Assign unique ID to distinct values within Group with dplyr


Problem description


Problem: I need to make a unique ID field for data that has two levels of grouping. In the example code here, it is Emp and Color. The ID needs to be structured as:

Emp + unique number of each Color + sequential number for duplicated Colors.

These values are separated by periods.

Example data:

dat <- data.frame(Emp = c("A","A","A","B","B","C"), 
              Color = c("Red","Green","Green","Orange","Yellow","Brown"),
              stringsAsFactors = FALSE)

The ID is supposed to appear as this:

ID <- c("A.01.001", "A.02.001", "A.02.002", "B.01.001", "B.02.001", "C.01.001")

> ID
[1] "A.01.001" "A.02.001" "A.02.002" "B.01.001" "B.02.001" "C.01.001"

The three character suffix to the ID to record the duplicates can be done as:

library(dplyr)    # group_by(), mutate(), row_number(), %>%
library(stringr)  # str_pad()
group_by(dat, Emp, Color) %>%
  mutate(suffix = str_pad(row_number(), width = 3, side = "left", pad = "0"))

But I am unable to assign sequential numbers to each unique occurrence of Color within each Emp group.

I prefer a dplyr solution, but any method would be appreciated.

Solution

Using data.table and sprintf:

library(data.table)
setDT(dat)[, ID := sprintf('%s.%02d.%03d', 
                           Emp, rleid(Color), rowid(rleid(Color))), 
           by = Emp]

You get:

> dat
   Emp  Color       ID
1:   A    Red A.01.001
2:   A  Green A.02.001
3:   A  Green A.02.002
4:   B Orange B.01.001
5:   B Yellow B.02.001
6:   C  Brown C.01.001

How this works:

  • Convert dat to a data.table with setDT().
  • Group by Emp.
  • Create the ID variable with the sprintf function, which pastes several vectors together according to a specified format. Within each Emp group, rleid(Color) numbers each run of consecutive identical colors, and rowid() numbers the rows within each run.
  • The use of := means that the data.table is updated by reference.
  • %s indicates that a string is used in the first part (which is Emp). %02d and %03d indicate that a number is padded with leading zeros to two or three digits when needed. The dots in between are taken literally and are thus included in the resulting string.
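
To make the steps above concrete, here is a minimal sketch of the three building blocks applied to a single group outside of a data.table. The vector `color` and the literal `"A"` are stand-ins for one Emp group from the example data; the variable names are illustrative only.

```r
library(data.table)

# Colors for a single Emp group ("A" in the example data)
color <- c("Red", "Green", "Green")

# rleid() numbers each run of consecutive identical values
run_id <- rleid(color)    # 1 2 2

# rowid() counts the occurrences of each distinct value, in order
seq_id <- rowid(run_id)   # 1 1 2

# sprintf() is vectorised: %s inserts the string, while %02d and %03d
# zero-pad the integers to two and three digits
ids <- sprintf("%s.%02d.%03d", "A", run_id, seq_id)
print(ids)
# [1] "A.01.001" "A.02.001" "A.02.002"
```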

Addressing the comment of @jsta: if the values in the Color column are not sequential (that is, the same color recurs in non-adjacent rows within an Emp group), you can use:

setDT(dat)[, r := as.integer(factor(Color, levels = unique(Color))), by = Emp
           ][, ID := sprintf('%s.%02d.%03d', 
                             Emp, r, rowid(r)), 
             by = Emp][, r:= NULL]

This will also maintain the order in which the Color column is presented. Instead of as.integer(factor(Color, levels = unique(Color))) you can also use match(Color, unique(Color)) as shown by akrun.
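
The difference between the run-based and the value-based numbering can be seen on a small vector in which a color reappears after another one. This is only a sketch; the vector `color` is made up for illustration and is not part of the example data.

```r
library(data.table)

# "Red" reappears after "Green", so the values are not contiguous
color <- c("Red", "Green", "Red")

# rleid() starts a new number for every run, so the second "Red" gets 3
by_run <- rleid(color)

# match() against unique() numbers each value by its first appearance,
# so both "Red" rows get 1
by_value <- match(color, unique(color))

# The factor-based variant gives the same numbering as match()
by_factor <- as.integer(factor(color, levels = unique(color)))

print(by_run)     # 1 2 3
print(by_value)   # 1 2 1
print(by_factor)  # 1 2 1
```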

Applying the above to a slightly larger dataset to illustrate:

dat2 <- rbindlist(list(dat,dat))
dat2[, r := match(Color, unique(Color)), by = Emp
     ][, ID := sprintf('%s.%02d.%03d', 
                     Emp, r, rowid(r)), 
     by = Emp]

gets you:

> dat2
    Emp  Color r       ID
 1:   A    Red 1 A.01.001
 2:   A  Green 2 A.02.001
 3:   A  Green 2 A.02.002
 4:   B Orange 1 B.01.001
 5:   B Yellow 2 B.02.001
 6:   C  Brown 1 C.01.001
 7:   A    Red 1 A.01.002
 8:   A  Green 2 A.02.003
 9:   A  Green 2 A.02.004
10:   B Orange 1 B.01.002
11:   B Yellow 2 B.02.002
12:   C  Brown 1 C.01.002
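
Since the question asked for a dplyr approach, the same match()-based idea can probably be expressed with dplyr verbs as well. The following is a sketch under that assumption, not part of the original answer; the helper column `r` and the result name `res` are illustrative.

```r
library(dplyr)

dat <- data.frame(Emp = c("A", "A", "A", "B", "B", "C"),
                  Color = c("Red", "Green", "Green", "Orange", "Yellow", "Brown"),
                  stringsAsFactors = FALSE)

res <- dat %>%
  group_by(Emp) %>%
  # Number each distinct Color by its first appearance within Emp
  mutate(r = match(Color, unique(Color))) %>%
  # Regroup so that row_number() counts rows within each Emp/Color pair
  group_by(Emp, r) %>%
  mutate(ID = sprintf("%s.%02d.%03d", Emp, r, row_number())) %>%
  ungroup() %>%
  select(-r)

print(res$ID)
# [1] "A.01.001" "A.02.001" "A.02.002" "B.01.001" "B.02.001" "C.01.001"
```

Regrouping by Emp and r takes the place of rowid(r) from the data.table version: row_number() restarts at 1 for every color within an employee.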
