Assign unique ID to distinct values within Group with dplyr
Problem: I need to create a unique ID field for data that has two levels of grouping. In the example code here, the levels are Emp and Color. The ID needs to be structured as:
Emp + a unique number for each Color + a sequential number for duplicated Colors.
These values are separated by periods.
Example data:
dat <- data.frame(Emp = c("A","A","A","B","B","C"),
Color = c("Red","Green","Green","Orange","Yellow","Brown"),
stringsAsFactors = FALSE)
The ID is supposed to appear as this:
ID <- c("A.01.001", "A.02.001", "A.02.002", "B.01.001", "B.02.001", "C.01.001")
ID
[1] "A.01.001" "A.02.001" "A.02.002" "B.01.001" "B.02.001" "C.01.001"
The three-character suffix of the ID, which numbers the duplicates, can be created with:
library(dplyr)
library(stringr)
group_by(dat, Emp, Color) %>%
mutate(suffix = str_pad(row_number(), width=3, side="left", pad="0"))
But I am unable to assign sequential numbers to the unique occurrences of Color within each Emp group.
I prefer a dplyr solution, but any method would be appreciated.
Solution: using data.table and sprintf:
library(data.table)
setDT(dat)[, ID := sprintf('%s.%02d.%03d',
Emp, rleid(Color), rowid(rleid(Color))),
by = Emp]
This gives:
> dat
Emp Color ID
1: A Red A.01.001
2: A Green A.02.001
3: A Green A.02.002
4: B Orange B.01.001
5: B Yellow B.02.001
6: C Brown C.01.001
How this works:
- You convert dat to a data.table with setDT().
- Group by Emp.
- Create the ID variable with the sprintf function. With sprintf you can easily paste several vectors together according to a specified format.
- The use of := means that the data.table is updated by reference.
- %s indicates that a string is to be used in the first part (which is Emp). %02d and %03d indicate that a number needs to have two or three digits, with leading zeros when needed. The dots in between are taken literally and thus included in the resulting string.
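To see what the format string does in isolation, here is a minimal sketch in base R. The input values are made up for illustration and are not taken from the example data.

```r
# Format one Emp value and two integer counters.
# %s inserts the string; %02d and %03d zero-pad the integers to 2 and 3 digits.
single <- sprintf("%s.%02d.%03d", "A", 2L, 1L)
print(single)

# sprintf is vectorised, so whole columns can be formatted in one call.
many <- sprintf("%s.%02d.%03d", c("A", "B"), c(1L, 12L), c(7L, 123L))
print(many)
```

Note that the width is only a minimum: a counter with more digits than the width is printed in full rather than truncated.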
Addressing the comment of @jsta: if equal values in the Color column do not appear in consecutive rows, you can use:
setDT(dat)[, r := as.integer(factor(Color, levels = unique(Color))), by = Emp
][, ID := sprintf('%s.%02d.%03d',
Emp, r, rowid(r)),
by = Emp][, r:= NULL]
This will also maintain the order in which the Color column is presented. Instead of as.integer(factor(Color, levels = unique(Color))) you can also use match(Color, unique(Color)), as shown by akrun.
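The difference between the two numbering approaches can be sketched on a small made-up table in which the same Color recurs in non-adjacent rows. The table demo and the column names by_run and by_value are hypothetical and only serve this illustration.

```r
library(data.table)

# Hypothetical data: "Red" appears twice for the same Emp, but not in adjacent rows.
demo <- data.table(Emp = c("A", "A", "A"),
                   Color = c("Red", "Green", "Red"))

# rleid() starts a new number at every change of value,
# so the second "Red" is treated as a new colour.
demo[, by_run := rleid(Color), by = Emp]

# match() against unique() numbers each distinct value by first appearance,
# so both "Red" rows share the same number.
demo[, by_value := match(Color, unique(Color)), by = Emp]

# rowid() counts the repeats of each number, which gives the final suffix.
demo[, ID := sprintf("%s.%02d.%03d", Emp, by_value, rowid(by_value)), by = Emp]

print(demo)
```

With rleid() the two "Red" rows would receive different colour numbers, which is why the match() variant is the safer choice when the data are not sorted by Color within Emp.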
Applying the above to a slightly larger dataset to illustrate:
dat2 <- rbindlist(list(dat,dat))
dat2[, r := match(Color, unique(Color)), by = Emp
][, ID := sprintf('%s.%02d.%03d',
Emp, r, rowid(r)),
by = Emp]
This gives:
> dat2
Emp Color r ID
1: A Red 1 A.01.001
2: A Green 2 A.02.001
3: A Green 2 A.02.002
4: B Orange 1 B.01.001
5: B Yellow 2 B.02.001
6: C Brown 1 C.01.001
7: A Red 1 A.01.002
8: A Green 2 A.02.003
9: A Green 2 A.02.004
10: B Orange 1 B.01.002
11: B Yellow 2 B.02.002
12: C Brown 1 C.01.002
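Since the question asked for dplyr, the same idea can be sketched there as well. This is not part of the original answer. It reuses the match(Color, unique(Color)) numbering within each Emp, then counts rows within each Emp and colour number using row_number(). The result name res and the helper column r are assumptions made for this sketch.

```r
library(dplyr)

# Example data from the question.
dat <- data.frame(Emp = c("A", "A", "A", "B", "B", "C"),
                  Color = c("Red", "Green", "Green", "Orange", "Yellow", "Brown"),
                  stringsAsFactors = FALSE)

res <- dat %>%
  # Number each distinct Color by first appearance within its Emp.
  group_by(Emp) %>%
  mutate(r = match(Color, unique(Color))) %>%
  # Count repeated rows within each Emp and colour number.
  group_by(Emp, r) %>%
  mutate(ID = sprintf("%s.%02d.%03d", Emp, r, row_number())) %>%
  ungroup() %>%
  # Drop the helper column.
  select(-r)

print(res)
```

On the example data this should reproduce the IDs requested in the question, from A.01.001 through C.01.001.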