lapply instead of for loop
Question
I have the following huge data frame:
V1 V2 V3 V4
A E R 12
A R T 18
A T Y 44
A Y U 11
B E R 22
B R T 53
B T Y 11
B Y U 153
What I'm trying to do is get the outlier value from V4 for each pair of (V1, V2).
This is easily handled with two for loops over the unique values of V1 and V2: subset the data on each iteration, take the V4 vector of that subset, and get the outlier using any function from the outliers package. The problem is speed.
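For reference, the loop-based approach described above might look roughly like the sketch below. The data frame `df`, the helper `farthest_from_mean`, and the rule it applies (the value farthest from the group mean, standing in for a function from the outliers package) are all made up for illustration and use base R only.

```r
# Hypothetical toy data: several V4 values per (V1, V2) pair
df <- data.frame(
  V1 = rep(c("A", "B"), each = 6),
  V2 = rep(c("E", "R"), times = 2, each = 3),
  V4 = c(12, 13, 40, 18, 19, 2, 22, 21, 90, 53, 50, 5)
)

# Hypothetical stand-in for an outlier function:
# returns the value with the largest absolute deviation from the mean
farthest_from_mean <- function(v) v[which.max(abs(v - mean(v)))]

results <- data.frame()
for (a in unique(df$V1)) {
  for (b in unique(df$V2)) {
    # The whole data frame is scanned again on every iteration
    sub <- subset(df, V1 == a & V2 == b)
    if (nrow(sub) == 0) next
    results <- rbind(results,
                     data.frame(V1 = a, V2 = b,
                                outlier = farthest_from_mean(sub$V4)))
  }
}
```

Each pass rescans every row and grows `results` with `rbind`, which is likely where most of the time goes on a large data frame.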
I have never used lapply. Maybe someone can guide me on how to do this efficiently using lapply instead of the for loops.
Recommended answer
Here is a data.table solution:
For close to 4.5 million rows (676 groups of 6,500 records each), it takes just over 2 seconds, including data generation.
library(outliers)
library(data.table)
# Fake data generation and coercion to data.table
d <- as.data.table(expand.grid(x=LETTERS, y=LETTERS, z=LETTERS))
d <- do.call(rbind, replicate(250, d, FALSE))
# Add random "value" column and a column to keep track of row numbers
d[, c('value', 'row'):=list(rnorm(nrow(d)), seq_len(nrow(d)))]
# > d
#          x y z      value     row
#       1: A A A -1.1712284       1
#       2: B A A  0.1818000       2
#       3: C A A -1.3959594       3
#       4: D A A -0.4778956       4
#       5: E A A -2.0426768       5
#      ---
# 4393996: V Z Z  0.4024398 4393996
# 4393997: W Z Z  0.9891237 4393997
# 4393998: X Z Z  1.2066572 4393998
# 4393999: Y Z Z  2.3023321 4393999
# 4394000: Z Z Z -0.8343059 4394000
# For each group (combination of x and y), perform the outlier test
outliers <- d[, chisq.out.test(value), list(x, y)]
# Add the row numbers for min and max numbers of each group
outliers <- merge(outliers,
                  d[, list(min.ind=row[which.min(value)],
                           max.ind=row[which.max(value)]), list(x, y)],
                  by=c('x', 'y'))
# Create a new outlier column. If the p.value is >= 0.05, set outlier = NA;
# otherwise (p.value < 0.05), if the "alternative" column contains "lowest",
# set outlier = min.ind, else max.ind.
outliers[, outlier:=ifelse(p.value < 0.05,
                           ifelse(grepl('lowest', alternative), min.ind, max.ind),
                           NA)]
The output looks like this:
# > outliers
# x y statistic alternative p.value method
# 1: A A 13.69290 highest value 3.70310786094858 is an outlier 2.152665e-04 chi-squared test for outlier
# 2: A B 11.99842 lowest value -3.47397308041372 is an outlier 5.324581e-04 chi-squared test for outlier
# 3: A C 12.41749 highest value 3.49833131757565 is an outlier 4.253310e-04 chi-squared test for outlier
# 4: A D 16.18416 lowest value -4.00696031141966 is an outlier 5.747273e-05 chi-squared test for outlier
# 5: A E 12.32196 lowest value -3.56650649267448 is an outlier 4.476613e-04 chi-squared test for outlier
# ---
# 672: Z V 11.66230 lowest value -3.43256736243089 is an outlier 6.377944e-04 chi-squared test for outlier
# 673: Z W 14.11816 highest value 3.75476979294983 is an outlier 1.716780e-04 chi-squared test for outlier
# 674: Z X 15.63605 highest value 3.93390421620766 is an outlier 7.677674e-05 chi-squared test for outlier
# 675: Z Y 17.05664 lowest value -4.12928000349912 is an outlier 3.628127e-05 chi-squared test for outlier
# 676: Z Z 14.44709 lowest value -3.82794835873449 is an outlier 1.441520e-04 chi-squared test for outlier
# data.name min.ind max.ind outlier
# 1: value 3609165 1191113 1191113
# 2: value 105483 3476019 105483
# 3: value 4153397 1375713 1375713
# 4: value 3406443 2539135 3406443
# 5: value 25117 2004445 25117
# ---
# 672: value 1871740 2551796 1871740
# 673: value 1003782 2158390 2158390
# 674: value 1555424 1492556 1492556
# 675: value 2071914 1344538 2071914
# 676: value 2281500 426556 2281500
A bit fiddly, perhaps, but hey, it got us there in the end.
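Since the question asked about lapply specifically, here is a minimal base-R sketch of that route, for comparison. It makes a few assumptions: the toy data frame `df` and the helper `farthest_from_mean` are made up for illustration, and the helper's rule (the value farthest from the group mean) is only a stand-in for whichever function from the outliers package you would actually use.

```r
# Hypothetical toy data: several V4 values per (V1, V2) pair
df <- data.frame(
  V1 = rep(c("A", "B"), each = 6),
  V2 = rep(c("E", "R"), times = 2, each = 3),
  V4 = c(12, 13, 40, 18, 19, 2, 22, 21, 90, 53, 50, 5)
)

# Hypothetical stand-in for an outlier function:
# returns the value with the largest absolute deviation from the mean
farthest_from_mean <- function(v) v[which.max(abs(v - mean(v)))]

# Split V4 once into one vector per (V1, V2) combination;
# drop = TRUE discards combinations that never occur in the data
groups <- split(df$V4, list(df$V1, df$V2), drop = TRUE)

# Apply the function to each group; the result is a list named "V1.V2"
outlier_by_group <- lapply(groups, farthest_from_mean)

# Flatten to a named numeric vector for easier inspection
outlier_vec <- unlist(outlier_by_group)
```

Note that lapply is not inherently faster than a for loop. Any gain here most likely comes from splitting the data once instead of calling subset on the full data frame for every pair, so for millions of rows the data.table approach above is probably still the better choice.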