Finding duplicate observations of selected variables in a tibble
Problem description
I have a rather large tibble (called df.tbl, with ~26k rows and 22 columns) and I want to find the "twins" of each object, i.e. every row that has the same values in columns 2:7 (date:Pos).
If I use:
inner_join(df.tbl, df.tbl[i, ], by = c("date", "forge", "serNum", "PinMain", "PinMainNumber", "Pos"))
with i being the row I want to check for "twins", everything works as expected and returns a 2 x 22 tibble. I can extend this to every row using:
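x <- vector("list", nrow(df.tbl))  # one slot per row
for (i in seq_len(nrow(df.tbl))) {
  # rows of df.tbl that share all six key columns with row i
  x[[i]] <- as_vector(inner_join(df.tbl,
                                 df.tbl[i, ],
                                 by = c("date",
                                        "forge",
                                        "serNum",
                                        "PinMain",
                                        "PinMainNumber",
                                        "Pos")) %>%
                        select(rowNum.x))
}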
to create a list containing, for each object (row), the row numbers of all its twins.
However I try, I cannot get map to produce a similar result:
twins <- map(df.tbl, ~ inner_join(df.tbl,
                                  .,
                                  by = c("date",
                                         "forge",
                                         "serNum",
                                         "PinMain",
                                         "PinMainNumber",
                                         "Pos")) %>%
               select(rowNum.x) )
What I get is the following error:
Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "c('double', 'numeric')"
How would I go about converting the for loop into an equivalent using map?
My original data looks like this:
>head(df.tbl, 3)
# A tibble: 3 x 22
rowNum date forge serNum PinMain PinMainNumber Pos FrontBack flow Sharped SV OP max min mean
<dbl> <date> <chr> <fct> <fct> <fct> <fct> <fct> <chr> <fct> <fct> <chr> <dbl> <dbl> <dbl>
1 1 2017-10-18 NA 179 Pin 1 W F NA 3 36237 235 77.7 55.3 64.7
2 2 2017-10-18 NA 179 Pin 2 W F NA 3 36237 235 77.5 52.1 67.4
3 3 2017-10-18 NA 179 Pin 3 W F NA 3 36237 235 79.5 58.6 69.0
# ... with 7 more variables: median <dbl>, sd <dbl>, Round2 <dbl>, Round4 <dbl>, OrigData <list>, dataSize <int>,
# fileName <chr>
and I would like a list with the same length as nrow(df.tbl), looking like this:
> twins
[[1]]
[1] 1 7
[[2]]
[1] 2 8
[[3]]
[1] 3 9
Almost all objects have one twin / duplicate (as above), but a few have two or even three duplicates (as defined above, i.e. columns 2:7 are the same).
Recommended answer
A bit late to the party, but you can do it much more neatly with nest().
tbl.df1 <- df.tbl %>% group_by(date, forge, serNum, PinMain, PinMainNumber, Pos) %>% nest(rowNum)
The twins will be in the list of tibbles created by nest.
tbl.df1$data
# [[1]]
# A tibble: 2 x 1
# rowNum
# <dbl>
# 1 1
# 2 7
#[[2]]
# A tibble: 2 x 1
# rowNum
# <dbl>
# 1 2
# 2 8
# etc
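For the literal question (a map equivalent of the for loop), here is a minimal sketch. It assumes the error arises because map() applied to a tibble iterates over its columns, so inner_join() is handed a bare numeric vector instead of a one-row tibble; mapping over row indices avoids that. The toy tibble and the names toy, keys and twins_grouped are hypothetical stand-ins for illustration, and the second variant assumes dplyr 1.0 or later for across().

```r
library(dplyr)
library(purrr)

# Hypothetical toy stand-in for df.tbl: rowNum plus the six key columns only.
toy <- tibble(
  rowNum        = 1:6,
  date          = as.Date("2017-10-18"),
  forge         = NA_character_,
  serNum        = "179",
  PinMain       = "Pin",
  PinMainNumber = c("1", "2", "3", "1", "2", "1"),
  Pos           = "W"
)
keys <- c("date", "forge", "serNum", "PinMain", "PinMainNumber", "Pos")

# Variant 1: map over row indices (not over the tibble, which would
# iterate over columns). Each element holds the row numbers of all rows
# sharing the key columns with row i, including row i itself.
twins <- map(seq_len(nrow(toy)), function(i) {
  inner_join(toy, toy[i, ], by = keys) %>% pull(rowNum.x)
})

# Variant 2: no join at all. Group by the keys once and give every row
# the full vector of row numbers in its group.
twins_grouped <- toy %>%
  group_by(across(all_of(keys))) %>%
  mutate(twins = list(rowNum)) %>%
  pull(twins)
```

Both variants return a list with one element per row. On a tibble of roughly 26k rows the grouped variant is likely to be the faster of the two, since it avoids one join per row, though that has not been measured here.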