Running "apply" command on a very large data frame


Question

I have a tibble in R with dimensions 15,000,000 x 140. Size-wise, it's about 6 GB.

I want to check whether any of columns 11-40 in a given row starts with one of a specific list of prefixes. I want to get back a vector of 1s and 0s that is 15,000,000 long.

I can do this using the following:

subResult <- apply(rawData[,11:40], c(1,2), function(x){substring(x,1,3) %in% c("295", "296", "297", "298", "299")})

result <- apply(subResult, 1, sum)

The problem is that this is way too slow: it would take over a day just to run the first line.

Is there any way to do this faster, perhaps directly through dplyr or data.table?

Thanks!

Here's a sample of the data, trimmed to just columns 11-40.

> head(rawData)
 # A tibble: 6 x 30                                                                                                                                                                               
   X1    X2    X3    X4    X5    X6    X7    X8    X9    X10   X11   X12   X13
   <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 39402 39451 3fv3i 19593 fk20 14p4  59304  329fj2 NA    NA    NA    NA    NA
 2 39422 f203ff vmio2  vo2493  19149 59833 13404 394034 43920  349304   59302 1934 34834
 3 3432f32 fe493  43943 H2344 53049  V602  3124  K148 K13  NA    NA    NA    NA
 # ... with 17 more variables: X14 <chr>, X15 <chr>, X16 <chr>, X17 <chr>,                                                                                                                         
 #   X18 <chr>, X19 <chr>, X20 <chr>, X21 <chr>, X22 <chr>, X23 <chr>,                                                                                                                             
 #   X24 <chr>, X25 <chr>, X26 <chr>, X27 <chr>, X28 <chr>, X29 <chr>, X30 <chr> 

Recommended answer

Based on the description, this can be done with the tidyverse:

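library(tidyverse)
rawData %>%
   select(11:40) %>% # select the columns of interest
   # convert each column to a logical vector
   # (funs() is deprecated in current dplyr, so across() is used instead)
   mutate(across(everything(), ~ substring(.x, 1, 3) %in% c("295", "296", "297", "298", "299"))) %>%
   reduce(`+`) %>% # add the columns together to get the row-wise count of matches
   mutate(rawData, newcol = .) # assign the count as a new column of the original data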
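
As a quick sanity check of the column-wise idea, here is a minimal base R sketch on made-up data. The `toy` data frame and its values are invented for illustration and are not the OP's data. It runs one vectorized `substring()` call per column instead of one call per cell, then compares the result with the `apply()` code from the question.

```r
# Toy data invented for illustration; NA mimics the missing codes in the sample.
toy <- data.frame(
  X1 = c("29590", "39402", "3432f"),
  X2 = c("2961",  "29900", NA),
  X3 = c(NA,      "fk20",  "K148"),
  stringsAsFactors = FALSE
)
prefixes <- c("295", "296", "297", "298", "299")

# Column-wise: one vectorized substring() call per column, then add the
# resulting logical vectors together to get a per-row count of matches.
hits <- lapply(toy, function(x) substring(x, 1, 3) %in% prefixes)
counts <- Reduce(`+`, hits)

# Cell-by-cell version from the question, for comparison.
subResult <- apply(toy, c(1, 2), function(x) substring(x, 1, 3) %in% prefixes)
slow <- apply(subResult, 1, sum)

print(counts)  # 2 1 0
print(all(slow == counts))
```

On 15,000,000 rows the column-wise version makes 30 function calls in total, whereas the cell-by-cell version makes one call per cell, which is likely where most of the time goes.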


Or with data.table: convert the 'data.frame' to a 'data.table' by reference (setDT(rawData)), specify the columns of interest in .SDcols, loop over those columns with lapply, convert each to a logical vector using the OP's condition, Reduce with '+' to get the sum for each row, and assign (:=) the result to 'newCol'

library(data.table)
setDT(rawData)[, newCol := Reduce('+', lapply(.SD, function(x) 
      substring(x, 1, 3) %chin% c("295", "296", "297", "298", "299"))), 
     .SDcols = 11:40]
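
Note that both snippets return the number of matching columns per row, while the question asks for a vector of 1s and 0s. A minimal base R sketch of two ways to get that flag follows. The `newCol` values and the `toy` data frame here are made up for illustration; with the real data, `newCol` would be the column created above.

```r
# Made-up per-row match counts, standing in for the newCol column created above.
newCol <- c(2L, 0L, 1L, 0L)

# Option 1: collapse the count to a 0/1 flag.
flag_from_count <- as.integer(newCol > 0)

# Option 2: build the flag directly by OR-ing the logical columns
# instead of summing them (toy data invented for illustration).
toy <- data.frame(
  a = c("2951", "1234",  NA),
  b = c("abc",  "29900", NA),
  stringsAsFactors = FALSE
)
prefixes <- c("295", "296", "297", "298", "299")
hits <- lapply(toy, function(x) substring(x, 1, 3) %in% prefixes)
flag_direct <- as.integer(Reduce(`|`, hits))

print(flag_from_count)  # 1 0 1 0
print(flag_direct)      # 1 1 0
```

Because `%in%` returns FALSE rather than NA for missing values, rows that contain only NA codes come out as 0 in both options.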
