Subset a data frame based on column entry (or rank)

Question

I have a data.frame as simple as this one:

id group idu  value
1  1     1_1  34
2  1     2_1  23
3  1     3_1  67
4  2     4_2  6
5  2     5_2  24
6  2     6_2  45
1  3     1_3  34
2  3     2_3  67
3  3     3_3  76

From it I want to retrieve a subset containing the first entry of each group; something like:

id group idu value
1  1     1_1 34
4  2     4_2 6
1  3     1_3 34

id is not unique, so the approach should not rely on it.

Can I avoid a loop?

dput() of the data:

structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L), group = c(1L, 
1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), idu = structure(c(1L, 3L, 5L, 
7L, 8L, 9L, 2L, 4L, 6L), .Label = c("1_1", "1_3", "2_1", "2_3", 
"3_1", "3_3", "4_2", "5_2", "6_2"), class = "factor"), value = c(34L, 
23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L)), .Names = c("id", "group", 
"idu", "value"), class = "data.frame", row.names = c(NA, -9L))

Recommended answer

Using Gavin's million-row df:

DF3 <- data.frame(id = sample(1000, 1000000, replace = TRUE),
                  group = factor(rep(1:1000, each = 1000)),
                  value = runif(1000000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))

I think the fastest way is to reorder the data frame by group and then use duplicated() to keep only the first row of each group:

system.time({
  DF4 <- DF3[order(DF3$group), ]
  out2 <- DF4[!duplicated(DF4$group), ]
})
# user  system elapsed 
# 0.335   0.107   0.441
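
To make the effect of duplicated() concrete, here is a minimal sketch that applies the same idea to the nine-row example from the question. The names df and first_rows are illustrative, and idu is stored as a character vector rather than a factor purely for brevity. Because the example rows are already grouped, the order() step is left out; it only affects the row order of the result, not which rows are kept.

```r
# Example data from the question (idu kept as character for brevity;
# the original dput() stores it as a factor).
df <- data.frame(
  id    = c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L),
  group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  idu   = c("1_1", "2_1", "3_1", "4_2", "5_2", "6_2", "1_3", "2_3", "3_3"),
  value = c(34L, 23L, 67L, 6L, 24L, 45L, 34L, 67L, 76L),
  stringsAsFactors = FALSE
)

# duplicated() returns TRUE for every repeat of a group value after its
# first occurrence, so negating it selects the first row of each group.
first_rows <- df[!duplicated(df$group), ]
print(first_rows)
```

The result keeps rows 1, 4 and 7, which matches the subset requested in the question, and it never looks at the non-unique id column.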

This compares to 7 seconds for Gavin's fastest lapply + split method on my computer.

Generally, when working with data frames, the fastest approach is to generate all the row indices first and then do a single subset, rather than subsetting the data frame once per group.
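
As a rough illustration of that index-first pattern, the sketch below picks the lowest-value row of each group, which is one way to read the "rank" variant mentioned in the title. The toy data frame and the names ord, keep and lowest are assumptions made for this example; they are not part of the original answer.

```r
# Toy data reusing the group and value columns from the question.
toy <- data.frame(
  group = rep(1:3, each = 3),
  value = c(34, 23, 67, 6, 24, 45, 34, 67, 76)
)

# Step 1: build the complete index vector using plain vectors only.
ord  <- order(toy$group, toy$value)        # sort by group, then by value
keep <- ord[!duplicated(toy$group[ord])]   # first (lowest-value) row per group

# Step 2: subset the data frame exactly once.
lowest <- toy[keep, ]
print(lowest)
```

All the per-group work happens on integer and numeric vectors; the data frame itself is indexed only once, at the end.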
