Using dplyr to create vector of unique combinations of values for a given group
Question
I have a dataset where each row contains an event identifier and columns contain information on an invitee and an organizer. Multiple rows will have the same event identifier. I want to aggregate over the event identifier, generating a list of unique invitees and organizers.
Let's say I have the following dataset:
test <- data.frame(id = stringi::stri_rand_strings(100, 1, '[A-Z]'), invitee_id = floor(runif(100, min=0, max=500)), organizer_id = floor(runif(100, min=0, max=500)))
I want to group_by the 'id' variable, and create a new column that is a comma-delimited vector of all the unique values of invitee_id and organizer_id. The end result for the first row may look like:
> final_df
id invitee_id organizer_id unique_vals
1 L 481 396 (481, 396, 300, 100, 200)
where final_df has been collapsed to one row per id.
I have tried something like:
final_df <- test %>%
group_by(id) %>%
distinct(invitee_id, .keep_all=TRUE)
The end goal is an adjacency matrix where rows and columns are the IDs of attendees and the values represent the number of shared events.
A clearer example:
Suppose I have this test data:
> test
id invitee_id organizer_id
1 A 478 444
2 A 226 346
3 A 338 320
4 A 286 497
5 B 478 327
6 B 226 354
7 B 123 272
8 C 226 297
9 C 338 144
10 C 477 73
I'm trying to group_by id and aggregate across invitee and organizers like so:
> final_df
id invitee_id_merged organizer_id_merged grouped_values
1 A c(478, 226, 338) c(444, 346, 320) c(478, 226, 338, 444, 346, 320)
The end goal is an adjacency matrix whose rows and columns are the unique IDs of all invitees and organizers. The value at a given (row, column) should be the number of events in which those two individuals met. So the first row would look like this:
> final_matrix
invitee_or_organizer
478 226 338 286 123 477 ...
478 2
226 1
338 1
286 1
123 0
477 0
...
Recommended answer
After grouping by 'id', we can summarise to concatenate all the unique elements of both columns:
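library(dplyr)
# One comma-separated string of unique values per column, per id.
# Note: funs() is deprecated as of dplyr 0.8.0 and gives a warning in newer versions.
test %>%
  group_by(id) %>%
  summarise_all(funs(toString(unique(.))))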
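Note that summarise_all returns one string per column. If the goal is a single comma-delimited string covering both columns, as in the unique_vals column of the question, a minimal sketch (assuming the small test data from the clearer example; the column name unique_vals is taken from the question, the object name combined is made up here) could be:

```r
library(dplyr)

# Test data copied from the "clearer example" in the question
test <- data.frame(
  id = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  invitee_id = c(478, 226, 338, 286, 478, 226, 123, 226, 338, 477),
  organizer_id = c(444, 346, 320, 497, 327, 354, 272, 297, 144, 73)
)

# Pool both ID columns within each group, drop duplicates,
# and collapse the result into one comma-separated string per id
combined <- test %>%
  group_by(id) %>%
  summarise(unique_vals = toString(unique(c(invitee_id, organizer_id))))

combined
```

Here c() concatenates the two columns within each group before unique() is applied, so an ID that appears as both invitee and organizer is listed only once.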
Another option is to store the unique elements as a list:
library(tidyverse)
test %>%
group_by(id) %>%
summarise_all(funs(merged = list(unique(.)))) %>%
mutate(grouped_values = map2(invitee_id_merged, organizer_id_merged, c))
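Since funs() is deprecated in newer dplyr versions, the same list-column idea can probably be written with across(). This is only a sketch, assuming dplyr >= 1.0.0 and the test data from the clearer example; the object name merged is made up for illustration:

```r
library(dplyr)
library(purrr)

# Test data copied from the "clearer example" in the question
test <- data.frame(
  id = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  invitee_id = c(478, 226, 338, 286, 478, 226, 123, 226, 338, 477),
  organizer_id = c(444, 346, 320, 497, 327, 354, 272, 297, 144, 73)
)

merged <- test %>%
  group_by(id) %>%
  # one list element of unique values per column and group,
  # named invitee_id_merged and organizer_id_merged
  summarise(across(everything(), ~ list(unique(.x)), .names = "{.col}_merged")) %>%
  # concatenate the two list elements row by row
  mutate(grouped_values = map2(invitee_id_merged, organizer_id_merged, c))

merged$grouped_values[[1]]
```

Each cell of grouped_values is then an ordinary numeric vector, which is easier to process further than a comma-separated string.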
Also, based on the description, if the desired end result is a frequency count in an adjacency-style table:
test %>%
count(invitee_id, organizer_id) %>%
spread(organizer_id, n, fill = 0)
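In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). An equivalent sketch, assuming that tidyr version and the test data from the clearer example (the object name wide is made up for illustration), might look like this:

```r
library(dplyr)
library(tidyr)

# Test data copied from the "clearer example" in the question
test <- data.frame(
  id = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  invitee_id = c(478, 226, 338, 286, 478, 226, 123, 226, 338, 477),
  organizer_id = c(444, 346, 320, 497, 327, 354, 272, 297, 144, 73)
)

wide <- test %>%
  # number of rows for each invitee/organizer pair
  count(invitee_id, organizer_id) %>%
  # one column per organizer, 0 where a pair never occurs
  pivot_wider(names_from = organizer_id, values_from = n, values_fill = 0)

wide
```

This gives one row per invitee and one column per organizer, so it counts invitee/organizer pairs rather than co-attendance between arbitrary pairs of people.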
Update
Based on the edit in the OP's post,
crossprod(table(rep(test$id, 2), unlist(test[-1])))
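To see why this works, the one-liner can be unpacked step by step. The sketch below assumes the test data from the clearer example; the intermediate names ids, people, incidence and adj are made up for illustration:

```r
# Test data copied from the "clearer example" in the question
test <- data.frame(
  id = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
  invitee_id = c(478, 226, 338, 286, 478, 226, 123, 226, 338, 477),
  organizer_id = c(444, 346, 320, 497, 327, 354, 272, 297, 144, 73)
)

# Stack invitees and organizers into one vector of people and
# repeat the event ids so both vectors line up element by element
ids <- rep(test$id, 2)
people <- unlist(test[-1])

# Event-by-person incidence table: how often each person appears in each event
incidence <- table(ids, people)

# t(incidence) %*% incidence: for each pair of people, sum over events
# the product of their appearance counts
adj <- crossprod(incidence)

adj["478", "226"]  # both appear in events A and B
adj["478", "477"]  # never appear in the same event
```

The diagonal holds the number of events each person appears in, so set it to zero with diag(adj) <- 0 if self-counts are not wanted. If a person can appear more than once within the same event, the products are inflated; in that case it may help to cap the incidence table at 1 before calling crossprod().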