Using lapply instead of a for loop


Problem description

I have the following huge data frame:

V1  V2  V3  V4
A   E   R   12
A   R   T   18
A   T   Y   44
A   Y   U   11
B   E   R   22
B   R   T   53
B   T   Y   11
B   Y   U   153 

What I am trying to do is get the outlier value from V4 for each pair of (V1, V2).

This is easily handled with two for loops over the unique values of V1 and V2: subset the data on each iteration, take the V4 vector of that subset, and get the outlier using any function from the outliers package. The problem is speed.

I have never used lapply. Maybe someone can show me how to do this efficiently using lapply instead of the for loops.
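
For reference, the literal lapply version of the nested loops might look like the sketch below. The data are made up (the sample table above has only one row per pair), and `farthest_from_mean` is a hypothetical stand-in for whichever function from the outliers package you prefer; it simply returns the value farthest from the group mean.

```r
# Made-up data: several V4 values per (V1, V2) pair
toy <- data.frame(
  V1 = rep(c("A", "B"), each = 6),
  V2 = rep(rep(c("E", "R"), each = 3), times = 2),
  V4 = c(12, 18, 44,  11, 13, 30,  22, 53, 21,  11, 153, 10)
)

# Hypothetical helper: the value farthest from the group mean
farthest_from_mean <- function(v) v[which.max(abs(v - mean(v)))]

# split() cuts V4 into one vector per (V1, V2) pair;
# drop = TRUE skips pairs that never occur in the data
groups <- split(toy$V4, list(toy$V1, toy$V2), drop = TRUE)

# lapply() applies the helper to every group, replacing both for loops
result <- lapply(groups, farthest_from_mean)
```

`result` is a list named after the pairs ("A.E", "B.E", and so on). This removes the repeated subsetting, but on millions of rows a grouped data.table call is likely to be faster still.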

Recommended answer

Here is a data.table solution:

For about 4.4 million rows (4,394,000), with 676 groups and 6,500 records per group, it takes just over 2 seconds (including data generation).

library(outliers)
library(data.table)

# Fake data generation and coercion to data.table:
# 26^3 = 17,576 letter combinations, stacked 250 times = 4,394,000 rows
d <- as.data.table(expand.grid(x=LETTERS, y=LETTERS, z=LETTERS))
d <- do.call(rbind, replicate(250, d, FALSE))

# Add random "value" column and a column to keep track of row numbers
d[, c('value', 'row'):=list(rnorm(nrow(d)), seq_len(nrow(d)))]

# > d
#          x y z      value     row
#       1: A A A -1.1712284       1
#       2: B A A  0.1818000       2
#       3: C A A -1.3959594       3
#       4: D A A -0.4778956       4
#       5: E A A -2.0426768       5
#      ---                         
# 4393996: V Z Z  0.4024398 4393996
# 4393997: W Z Z  0.9891237 4393997
# 4393998: X Z Z  1.2066572 4393998
# 4393999: Y Z Z  2.3023321 4393999
# 4394000: Z Z Z -0.8343059 4394000

# For each group (combination of x and y), perform the outlier test
outliers <- d[, chisq.out.test(value), list(x, y)]

# Add the row numbers for min and max numbers of each group
outliers <- merge(outliers, 
                  d[, list(min.ind=row[which.min(value)], 
                           max.ind=row[which.max(value)]), list(x, y)], 
                  by=c('x', 'y'))

# Create a new outlier column. If the p.value is >= 0.05, set outlier = NA;
# otherwise (p.value < 0.05), if the "alternative" column contains "lowest",
# set outlier = min.ind, else max.ind.
outliers[, outlier:=ifelse(p.value < 0.05, 
                  ifelse(grepl('lowest', alternative), min.ind, max.ind), 
                  NA)]

The output looks like this:

# > outliers
#      x y statistic                                  alternative      p.value                       method
#   1: A A  13.69290 highest value 3.70310786094858 is an outlier 2.152665e-04 chi-squared test for outlier
#   2: A B  11.99842 lowest value -3.47397308041372 is an outlier 5.324581e-04 chi-squared test for outlier
#   3: A C  12.41749 highest value 3.49833131757565 is an outlier 4.253310e-04 chi-squared test for outlier
#   4: A D  16.18416 lowest value -4.00696031141966 is an outlier 5.747273e-05 chi-squared test for outlier
#   5: A E  12.32196 lowest value -3.56650649267448 is an outlier 4.476613e-04 chi-squared test for outlier
#  ---                                                                                                     
# 672: Z V  11.66230 lowest value -3.43256736243089 is an outlier 6.377944e-04 chi-squared test for outlier
# 673: Z W  14.11816 highest value 3.75476979294983 is an outlier 1.716780e-04 chi-squared test for outlier
# 674: Z X  15.63605 highest value 3.93390421620766 is an outlier 7.677674e-05 chi-squared test for outlier
# 675: Z Y  17.05664 lowest value -4.12928000349912 is an outlier 3.628127e-05 chi-squared test for outlier
# 676: Z Z  14.44709 lowest value -3.82794835873449 is an outlier 1.441520e-04 chi-squared test for outlier
#      data.name min.ind max.ind outlier
#   1:     value 3609165 1191113 1191113
#   2:     value  105483 3476019  105483
#   3:     value 4153397 1375713 1375713
#   4:     value 3406443 2539135 3406443
#   5:     value   25117 2004445   25117
#  ---                                  
# 672:     value 1871740 2551796 1871740
# 673:     value 1003782 2158390 2158390
# 674:     value 1555424 1492556 1492556
# 675:     value 2071914 1344538 2071914
# 676:     value 2281500  426556 2281500

A bit fiddly, perhaps, but hey, it got us there in the end.
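
If the merge step feels like the fiddly part, one possible variation is to compute the row index inside the same grouped call. The sketch below uses a small made-up table (`small`) rather than the 4.4 million rows above, and it assumes that chisq.out.test() tests the value farthest from the group mean, which matches the "highest"/"lowest" values reported in the output shown earlier.

```r
library(data.table)
library(outliers)

set.seed(42)  # made-up data, only for illustration
small <- data.table(
  x = rep(c("A", "B"), each = 200),
  y = rep(c("E", "R"), times = 200),
  value = rnorm(400)
)
small[, row := .I]  # keep track of the original row numbers

# One grouped call: run the test and look up the candidate row together
res <- small[, {
  test <- chisq.out.test(value)
  # Assumption: the tested value is the one farthest from the group mean
  idx <- which.max(abs(value - mean(value)))
  list(
    p.value = test$p.value,
    # Report the row only when the test is significant at the 5% level
    outlier = if (test$p.value < 0.05) row[idx] else NA_integer_
  )
}, by = .(x, y)]
```

This returns one row per (x, y) group with the p-value and the row number of the flagged value (or NA), without the separate merge and nested ifelse.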
