Fast subsetting in R
Problem description
I've got a data frame dat of size 30000 x 50. I also have a separate list that contains pointers to groupings of rows from this data frame, e.g.,
rows <- list(c("34", "36", "39"), c("45", "46"))
This says that the data frame rows with rownames (not numeric row indices, but character rownames(dat)) "34", "36", "39" constitute one grouping, and "45", "46" constitute another grouping.
Now I want to pull out the groupings from the dataframe into a parallel list, but my code (below) is really, really slow. How can I speed it up?
> system.time(lapply(rows, function(r) {dat[r, ]}))
user system elapsed
246.09 0.01 247.23
That's on a very fast computer, R 2.14.1 x64.
One of the main issues is the matching of row names: the default in [.data.frame is partial matching of row names, and you probably don't want that, so you're better off with match. To speed it up even further you can use fmatch from the fastmatch package if you want. This is a minor modification with some speedup:
# naive
> system.time(res1 <- lapply(rows,function(r) dat[r,]))
user system elapsed
69.207 5.545 74.787
# match
> rn <- rownames(dat)
> system.time(res1 <- lapply(rows,function(r) dat[match(r,rn),]))
user system elapsed
36.810 10.003 47.082
# fastmatch
> library(fastmatch)
> rn <- rownames(dat)
> system.time(res1 <- lapply(rows,function(r) dat[fmatch(r,rn),]))
user system elapsed
19.145 3.012 22.226
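To check that replacing the character index with match does not change the result, here is a minimal sketch on synthetic data. The data frame, its size, and the groupings are made up for illustration and are much smaller than the 30000 x 50 case in the question; the fmatch part only runs if the fastmatch package happens to be installed.

```r
# Synthetic stand-in for the question's data (sizes and columns are assumptions).
set.seed(1)
n <- 1000
dat <- data.frame(x = rnorm(n), y = runif(n))
rownames(dat) <- as.character(seq_len(n) + 100)  # character row names, not positions

# Hypothetical groupings: 50 groups of 1 to 5 distinct row names each.
rows <- lapply(1:50, function(i) sample(rownames(dat), sample(1:5, 1)))

rn <- rownames(dat)

# Naive: a character index goes through the row-name matching in [.data.frame.
res_naive <- lapply(rows, function(r) dat[r, ])

# Exact matching: translate names to integer positions first, then index.
res_match <- lapply(rows, function(r) dat[match(r, rn), ])

# Same idea with fmatch, guarded so the sketch still runs without fastmatch.
if (requireNamespace("fastmatch", quietly = TRUE)) {
  res_fmatch <- lapply(rows, function(r) dat[fastmatch::fmatch(r, rn), ])
}

# Compare the two base-R results.
same <- isTRUE(all.equal(res_naive, res_match))
```

Note that match returns NA for a name that is not present, so a misspelled row name produces a row of NAs rather than an error.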
You can get a further speed-up by not using [ (it is slow for data frames) and splitting the data frame instead (using split), provided the groups in rows are non-overlapping and cover all rows (so that each row can be mapped to exactly one entry in rows).
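The mapping from each row to its entry in rows could be built roughly like this. It is only a sketch under the stated assumption that the groups are disjoint and cover every row; the tiny data frame and the third group are invented for the example.

```r
# Small made-up data frame with character row names.
set.seed(1)
dat <- data.frame(x = rnorm(6), y = runif(6))
rownames(dat) <- c("34", "36", "39", "45", "46", "50")

# Hypothetical groupings: disjoint, and together they cover every row of dat.
rows <- list(c("34", "36", "39"), c("45", "46"), "50")

# For each row name, record the index of the group it belongs to.
grp_of_name <- rep(seq_along(rows), sapply(rows, length))
names(grp_of_name) <- unlist(rows)

# Line the group ids up with the row order of dat, then split once.
g <- grp_of_name[rownames(dat)]
res_split <- split(dat, g)
```

The result is a list named by group index ("1", "2", ...). Rows inside each piece come out in the order they have in dat, which is not necessarily the order listed in rows.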
Depending on your actual data, you may be better off with matrices, which have much faster subsetting operators since they are native.
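A minimal sketch of the matrix variant, assuming all columns are numeric (the data here is invented). With mixed column types as.matrix would coerce everything to character, so this only makes sense when the columns share a type.

```r
# Small made-up data frame with numeric columns and character row names.
set.seed(1)
dat <- data.frame(x = rnorm(5), y = runif(5))
rownames(dat) <- c("34", "36", "39", "45", "46")
rows <- list(c("34", "36", "39"), c("45", "46"))

# as.matrix() keeps the row names; all-numeric columns give a numeric matrix.
m <- as.matrix(dat)

# drop = FALSE keeps a one-row group as a matrix rather than a plain vector.
res_mat <- lapply(rows, function(r) m[r, , drop = FALSE])
```

Each element of res_mat is a matrix, so convert back with as.data.frame if later code expects data frames.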