Assign unique ID to distinct values within Group with dplyr

Views: 79 Posted: 2017/7/13 20:47:26 Tags: r, dplyr


Problem description


Problem: I need to make a unique ID field for data that has two levels of grouping. In the example code here, it is Emp and Color. The ID needs to be structured as:

Emp + unique number of each Color + sequential number for duplicated Colors.

These values are separated by periods.
Example data:

dat <- data.frame(Emp = c("A","A","A","B","B","C"), 
              Color = c("Red","Green","Green","Orange","Yellow","Brown"),
              stringsAsFactors = FALSE)

The ID is supposed to appear as this:

ID <- c("A.01.001", "A.02.001", "A.02.002", "B.01.001", "B.02.001", "C.01.001")

> ID
[1] "A.01.001" "A.02.001" "A.02.002" "B.01.001" "B.02.001" "C.01.001"

The three-character suffix that numbers the duplicates can be produced with:

library(dplyr)
library(stringr)

group_by(dat, Emp, Color) %>%
  mutate(suffix = str_pad(row_number(), width = 3, side = "left", pad = "0"))

But I am unable to assign sequential numbers to each distinct Color within each Emp group.

I prefer a dplyr solution, but any method would be appreciated.

Solution

Using data.table and sprintf:

library(data.table)
setDT(dat)[, ID := sprintf('%s.%02d.%03d', 
                           Emp, rleid(Color), rowid(rleid(Color))), 
           by = Emp]

This gives:

> dat
   Emp  Color       ID
1:   A    Red A.01.001
2:   A  Green A.02.001
3:   A  Green A.02.002
4:   B Orange B.01.001
5:   B Yellow B.02.001
6:   C  Brown C.01.001

How this works:

You convert dat to a data.table with setDT().
You group by Emp.
Within each group, rleid(Color) numbers each run of consecutive identical Color values, and rowid() numbers the rows within each run.
You create the ID variable with the sprintf function, which pastes several vectors together according to a specified format.
The use of := means that the data.table is updated by reference.
%s indicates that a string is used for the first part (which is Emp). %02d and %03d indicate that a number is padded with leading zeros to two or three digits when needed. The dots in between are taken literally and are therefore included in the resulting string.
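
To make the steps above concrete, here is a minimal sketch that applies rleid(), rowid() and sprintf() to a single group by hand. The vector name color and the literal "A" are illustrative assumptions, not part of the original answer.

```r
library(data.table)

# Colors of one Emp group ("A" in the example data)
color <- c("Red", "Green", "Green")

# rleid() assigns one integer per run of consecutive identical values
run_id <- rleid(color)

# rowid() counts the occurrences of each value of run_id
seq_in_run <- rowid(run_id)

# %s = string, %02d / %03d = integers zero-padded to 2 / 3 digits
ids <- sprintf('%s.%02d.%03d', "A", run_id, seq_in_run)
print(ids)
```

Both helper functions return integers, which is why they can be passed directly to the %d placeholders.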

Addressing the comment of @jsta: if identical values in the Color column are not adjacent (so the same color appears in more than one run within an Emp group), you can use:

setDT(dat)[, r := as.integer(factor(Color, levels = unique(Color))), by = Emp
           ][, ID := sprintf('%s.%02d.%03d', 
                             Emp, r, rowid(r)), 
             by = Emp][, r:= NULL]

This will also maintain the order in which the Color column is presented. Instead of as.integer(factor(Color, levels = unique(Color))) you can also use match(Color, unique(Color)) as shown by akrun.
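
The difference between the two numbering schemes only shows up when a color reappears after a different one. A small sketch, using a made-up vector that is not in the original data:

```r
library(data.table)

# Hypothetical group in which "Red" appears in two separate runs
color <- c("Red", "Green", "Red")

# rleid() numbers runs, so the second "Red" gets a new number
by_run <- rleid(color)

# match() against unique() numbers distinct values in order of first appearance
by_value <- match(color, unique(color))

print(by_run)
print(by_value)
```

With rleid() the second "Red" would get its own color number, whereas match() reuses the number of the first "Red", which is what the requested ID format needs.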

Applying the above to a slightly larger dataset to illustrate:

dat2 <- rbindlist(list(dat,dat))
dat2[, r := match(Color, unique(Color)), by = Emp
     ][, ID := sprintf('%s.%02d.%03d', 
                     Emp, r, rowid(r)), 
     by = Emp]

gets you:

> dat2
    Emp  Color r       ID
 1:   A    Red 1 A.01.001
 2:   A  Green 2 A.02.001
 3:   A  Green 2 A.02.002
 4:   B Orange 1 B.01.001
 5:   B Yellow 2 B.02.001
 6:   C  Brown 1 C.01.001
 7:   A    Red 1 A.01.002
 8:   A  Green 2 A.02.003
 9:   A  Green 2 A.02.004
10:   B Orange 1 B.01.002
11:   B Yellow 2 B.02.002
12:   C  Brown 1 C.01.002
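
Since the question asked for dplyr, the same logic can probably be expressed there as well. The following is a sketch and not part of the original answer; the names res and r are assumptions chosen for illustration.

```r
library(dplyr)

dat <- data.frame(Emp = c("A", "A", "A", "B", "B", "C"),
                  Color = c("Red", "Green", "Green", "Orange", "Yellow", "Brown"),
                  stringsAsFactors = FALSE)

res <- dat %>%
  # number the distinct colors within each Emp, in order of first appearance
  group_by(Emp) %>%
  mutate(r = match(Color, unique(Color))) %>%
  # number the rows within each Emp/color combination and build the ID
  group_by(Emp, r) %>%
  mutate(ID = sprintf('%s.%02d.%03d', Emp, r, row_number())) %>%
  ungroup() %>%
  select(-r)

print(res$ID)
```

The second group_by() replaces the first grouping, so row_number() restarts for every color within an Emp, mirroring rowid(r) in the data.table version.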
