Running "apply" command on a very large data frame
Problem description
I have a tibble in R with dimensions 15,000,000 x 140. Size-wise, it's about 6 GB.
I want to check whether any of columns 11-40 in a given row starts with one of the prefixes in a specific list. The result should be a vector of 1s and 0s that is 15,000,000 long.
I can do this using the following:
subResult <- apply(rawData[,11:40], c(1,2), function(x){substring(x,1,3) %in% c("295", "296", "297", "298", "299")})
result <- apply(subResult, 1, sum)
The problem is that this is far too slow: it would take over a day just for the first line.
Is there a faster way to do this, perhaps directly through dplyr or data.table?
Thanks!
Here's a sample of the data, trimmed to just columns 11-40.
> head(rawData)
# A tibble: 6 x 30
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 39402 39451 3fv3i 19593 fk20 14p4 59304 329fj2 NA NA NA NA NA
2 39422 f203ff vmio2 vo2493 19149 59833 13404 394034 43920 349304 59302 1934 34834
3 3432f32 fe493 43943 H2344 53049 V602 3124 K148 K13 NA NA NA NA
# ... with 17 more variables: X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>,
# X18 <chr>, X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>, X23 <chr>,
# X24 <chr>, X25 <chr>, X26 <chr>, X27 <chr>, X28 <chr>, X29 <chr>, X30 <chr>
Recommended answer
Based on the description, this can be done with tidyverse:
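library(tidyverse)
prefixes <- c("295", "296", "297", "298", "299")
rawData %>%
  select(11:40) %>% # select the columns of interest
  # convert each column to logical; across() replaces the deprecated mutate_all(funs())
  mutate(across(everything(), ~ substring(.x, 1, 3) %in% prefixes)) %>%
  reduce(`+`) %>% # element-wise sum across columns, i.e. the per-row count
  mutate(rawData, newcol = .) # add the counts to the original data as a new column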
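The column-wise idea can be checked on a small example. The following is only a sketch on made-up data: the tibble `toy` and the names `prefixes`, `counts`, and `flag` are assumptions for illustration, not part of the original post. It also shows one way to turn the per-row counts into the 1/0 indicator the question asks for.

```r
library(dplyr)
library(purrr)

# Toy stand-in for columns 11-40 of rawData (values are made up)
toy <- tibble(
  X1 = c("29512", "39422", NA),
  X2 = c("29601", "f203ff", "29900"),
  X3 = c(NA, "vmio2", "K13")
)
prefixes <- c("295", "296", "297", "298", "299")

# One vectorized pass per column, then an element-wise sum across columns
counts <- toy %>%
  mutate(across(everything(), ~ substring(.x, 1, 3) %in% prefixes)) %>%
  reduce(`+`)

# Collapse the per-row counts to a 1/0 indicator
flag <- as.integer(counts > 0)
```

Each column is processed with a single vectorized call, so R evaluates the prefix test 30 times in total rather than once per cell, which is where the speedup over the cell-by-cell `apply` comes from.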
Or with data.table: convert the 'data.frame' to a 'data.table' with setDT(rawData), specify the columns of interest in .SDcols, loop through those columns, convert each one to logical using the OP's condition, Reduce by taking the sum of each row, and assign (:=) the result to 'newCol'.
library(data.table)
setDT(rawData)[, newCol := Reduce('+', lapply(.SD, function(x)
substring(x, 1, 3) %chin% c("295", "296", "297", "298", "299"))),
.SDcols = 11:40]
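If only the 1/0 indicator from the question is needed rather than the per-row count, one possible variation is to combine the logical columns with `|` instead of `+`. This is a sketch on made-up data, not the answer's own code; `toy`, `prefixes`, and `flag` are illustrative names.

```r
library(data.table)

# Toy stand-in for the columns of interest (values are made up)
toy <- data.table(
  X1 = c("29512", "39422", NA),
  X2 = c("29601", "f203ff", "29900"),
  X3 = c(NA, "vmio2", "K13")
)
prefixes <- c("295", "296", "297", "298", "299")

# Test each column in one vectorized call, OR the results together row-wise,
# and store the outcome as 1/0 in a new column by reference
toy[, flag := as.integer(Reduce(`|`, lapply(.SD, function(x)
  substring(x, 1, 3) %chin% prefixes))),
  .SDcols = c("X1", "X2", "X3")]
```

Because `:=` modifies the table by reference, no copy of the data is made, which should matter at 6 GB.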