Faster way to subset on rows of a data frame in R?


Question

I have been using these two methods interchangeably to subset data from a data frame in R.

Method 1

subset_df <- df[which(df$age > 5), ]

Method 2

subset_df <- subset(df, age > 5)

I have two questions about them.

1. Which one is faster, considering that I have very large data?
2. The post "Subsetting data frames in R" suggests that there is in fact a difference between the two methods above: one of them handles NA accurately. Which one is safe to use, then?

Recommended answer

The question asks for a faster way to subset rows of a data frame. The fastest way is with data.table.

set.seed(1)  # for reproducible example
# 1 million rows - big enough?
df <- data.frame(age=sample(1:65,1e6,replace=TRUE),x=rnorm(1e6),y=rpois(1e6,25))

library(microbenchmark)
microbenchmark(result<-df[which(df$age>5),],
               result<-subset(df, age>5), 
               result<-df[df$age>5,],
               times=10)
# Unit: milliseconds
#                               expr       min        lq    median       uq      max neval
#  result <- df[which(df$age > 5), ]  77.01055  80.62678  81.43786 133.7753 145.4756    10
#      result <- subset(df, age > 5) 190.89829 193.04221 197.49973 203.7571 263.7738    10
#         result <- df[df$age > 5, ] 169.85649 171.02084 176.47480 185.9394 191.2803    10

library(data.table)
DT <- as.data.table(df)     # data.table
microbenchmark(DT[age > 5],times=10)
# Unit: milliseconds
#         expr      min       lq  median       uq      max neval
#  DT[age > 5] 29.49726 29.93907 30.1813 30.67168 32.81204    10

So in this simple case, comparing median times, data.table is a little more than twice as fast as which(...), and more than 6 times faster than subset(...).
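On the second question (NA handling), the benchmark data above contains no missing values, so it does not show the difference. The following is a minimal sketch on a small made-up data frame (the values are illustrative assumptions, not from the original post) of how the two methods from the question, plus plain logical indexing, behave when age contains an NA:

```r
# Hypothetical toy data: one missing age in row 3
df <- data.frame(age = c(3, 10, NA, 7), x = c("a", "b", "c", "d"))

# Method 1: which() returns only the positions that are TRUE, so the NA is dropped
m1 <- df[which(df$age > 5), ]

# Method 2: subset() treats an NA in the condition as FALSE, so the NA is dropped too
m2 <- subset(df, age > 5)

# Plain logical indexing keeps the NA position and produces a row filled with NA
m3 <- df[df$age > 5, ]

nrow(m1)  # 2
nrow(m2)  # 2
nrow(m3)  # 3, including one all-NA row
```

In this sketch both methods from the question drop the row with the missing age, while df[df$age > 5, ] returns an extra row of NA values. If rows with a missing age need to be kept or handled separately, an explicit is.na(df$age) check makes that intent visible.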
