Method to operate on each row of data.table without using apply function


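Question

I wrote a simple function below:

mcs <- function(v) { ifelse(sum((diff(sort(v)) > 6) > 0), NA, sd(v)) }

It is supposed to sort a vector and then check whether any difference between consecutive sorted values is greater than 6. If any difference is greater than 6 it returns NA; otherwise it returns the standard deviation.

I want to apply this function to all rows of a data table (selecting only certain columns) and then append the return value for each row as a new column entry in the data table.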

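For example, given a data table like this

> dat <- data.table(A = c(1,2,3,4,5), B = c(2,3,4,10,6), C = c(3,4,10,6,8), D = c(3,3,3,3,3))
> dat
   A  B  C D
1: 1  2  3 3
2: 2  3  4 3
3: 3  4 10 3
4: 4 10  6 3
5: 5  6  8 3

I want to generate the output below. (I applied the function to columns 2, 3, and 4 of each row.)

> dat
   A  B  C D        sd
1: 1  2  3 3 0.5773503
2: 2  3  4 3 0.5773503
3: 3  4 10 3 3.7859389
4: 4 10  6 3 3.5118846
5: 5  6  8 3 2.5166115

I learned that a row-wise operation on a data table can be done with the following:

> dat[, sd := apply(.SD, 1, mcs), .SDcols = c(2, 3, 4)]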

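This method works, except that it is too slow. I have to do this on several large data tables, and I wrote a script to do so. However, it only works on the smaller data tables. For a table with ~300,000 rows it completes in a few seconds, but when I try it on a table with ~800 million rows, my program never completes. I tried waiting two hours, but I think R broke, because the console just froze. I've run the script several times, and it always completes the first few smaller tables correctly (I had the program write the tables to file to check), but when it reaches the big table it never finishes. I'm running this on a computing cluster, so I really don't think it is a hardware limitation. It is probably poor code.

I'm assuming the bottleneck is the looping done inside apply, but I don't know how to make it faster. I'm fairly new to R, so I'm not sure how to optimize the code. I've seen a lot of posts on the Internet about vectorizing, and I'm thinking that if I could apply my function to every row simultaneously it would be much faster, but I don't know how to do that. Please help.

Edit
Sorry, I made an error when copying my mcs function. I have updated it.

Edit 2
For those who are interested, I ended up splitting the table in half and operating on each half separately, which worked for me.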

Answer


If you really need speed, as always it's best to move to C++ using Rcpp, which gives us a solution that's over 100x faster.

Data

I made some different example data to test this on, with 1000 rows instead of 5:

library(data.table)

set.seed(123)
dat <- data.table(A = rnorm(1e3, sd=4), B = rnorm(1e3, sd=4), C = rnorm(1e3, sd=4),
                  D = rnorm(1e3, sd=4), E = rnorm(1e3, sd=4))

Solution

I used the following C++ code to do the same thing as your function, but now the looping is done in C++ instead of R through apply which saves considerable time:

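#include <Rcpp.h>
#include <algorithm>  // std::sort

using namespace Rcpp;

// [[Rcpp::export]]
NumericVector mcs2(DataFrame x) {
    int n = x.nrows();
    int m = x.size();
    // Copy the data frame columns into a numeric matrix for row access
    NumericMatrix mat(n, m);
    for ( int j = 0; j < m; ++j ) {
        mat(_, j) = NumericVector(x[j]);
    }
    NumericVector result(n);
    for ( int i = 0; i < n; ++i ) {
        // Copy row i so that sorting does not alter the matrix
        NumericVector tmp = mat(i, _);
        std::sort(tmp.begin(), tmp.end());
        bool do_sd = true;
        // NA as soon as two neighbouring sorted values differ by more than 6
        for ( int j = 1; j < m; ++j ) {
            if ( tmp[j] - tmp[j-1] > 6.0 ) {
                result[i] = NA_REAL;
                do_sd = false;
                break;
            }
        }
        // Otherwise the standard deviation of the row (Rcpp sugar sd())
        if ( do_sd ) {
            result[i] = sd(tmp);
        }
    }
    return result;
}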
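
To check the per-row rule outside of R, the same logic can be sketched in plain C++ with only the standard library. This is a minimal sketch, not part of the code above: the name row_mcs and the max_gap parameter are hypothetical, and NaN stands in for R's NA.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper: the per-row rule in plain C++, without Rcpp.
// Returns NaN (standing in for R's NA) when two neighbouring sorted values
// differ by more than max_gap, otherwise the sample standard deviation.
double row_mcs(std::vector<double> row, double max_gap = 6.0) {
    assert(max_gap >= 0.0);  // a negative threshold would flag every row
    const double na = std::numeric_limits<double>::quiet_NaN();
    if (row.size() < 2) return na;  // R's sd() is NA for fewer than two values

    std::sort(row.begin(), row.end());  // row is a copy, caller's data untouched
    for (std::size_t j = 1; j < row.size(); ++j) {
        if (row[j] - row[j - 1] > max_gap) return na;
    }

    double mean = 0.0;
    for (double v : row) mean += v;
    mean /= row.size();

    double ss = 0.0;
    for (double v : row) ss += (v - mean) * (v - mean);
    return std::sqrt(ss / (row.size() - 1));  // n - 1 denominator, as in R's sd()
}
```

Calling row_mcs({2.0, 3.0, 3.0}) should agree with the 0.5773503 shown for the first row of the question's expected output, and a row such as {1.0, 2.0, 10.0} yields NaN because of the gap of 8.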

We can make sure it's returning the same values:

# mcs1 is assumed to be the question's mcs function under another name
all.equal(apply(dat[, 2:4], 1, mcs1), mcs2(dat[,2:4]))

[1] TRUE

Now let's benchmark:

library(rbenchmark)  # provides benchmark()

benchmark(mcs1 = dat[, sd:=apply(.SD, 1, mcs1), .SDcols=(c(2,3,4))],
          mcs2 = dat[, sd:=mcs2(.SD), .SDcols=(c(2,3,4))],
          order = 'relative',
          columns = c('test', 'elapsed', 'relative', 'user.self'))


  test elapsed relative user.self
2 mcs2    0.19    1.000     0.183
1 mcs1   21.34  112.316    20.044
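
The benchmark uses 1000 rows. For tables on the scale mentioned in the question, memory may matter as well, since mcs2 first copies the selected columns into a matrix, and the question's follow-up mentions splitting the table in half. The sketch below illustrates one possible variation in plain C++ without Rcpp: it reads each row straight from the column vectors into a small reusable buffer and walks the rows in chunks. The names Columns, mcs_rows and mcs_chunked are hypothetical, NaN stands in for NA, and none of this has been benchmarked.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical sketch: each column is its own vector (column-major layout,
// like a data.table), and rows are visited in blocks of chunk_rows.
using Columns = std::vector<std::vector<double>>;

// Fill out[begin..end) using one small reusable buffer instead of a matrix copy.
void mcs_rows(const Columns& cols, std::size_t begin, std::size_t end,
              std::vector<double>& out, double max_gap = 6.0) {
    assert(out.size() >= end);  // the output must already have room for the range
    const std::size_t m = cols.size();
    std::vector<double> buf(m);
    for (std::size_t i = begin; i < end; ++i) {
        for (std::size_t j = 0; j < m; ++j) buf[j] = cols[j][i];
        std::sort(buf.begin(), buf.end());

        bool gap_found = false;
        for (std::size_t j = 1; j < m; ++j) {
            if (buf[j] - buf[j - 1] > max_gap) { gap_found = true; break; }
        }
        if (gap_found || m < 2) {
            out[i] = std::numeric_limits<double>::quiet_NaN();  // stands in for NA
            continue;
        }
        double mean = 0.0;
        for (double v : buf) mean += v;
        mean /= m;
        double ss = 0.0;
        for (double v : buf) ss += (v - mean) * (v - mean);
        out[i] = std::sqrt(ss / (m - 1));  // sample standard deviation
    }
}

// Process the whole table chunk by chunk. Each chunk is independent of the
// others, so chunks could also be run one at a time or by separate workers.
std::vector<double> mcs_chunked(const Columns& cols, std::size_t chunk_rows) {
    const std::size_t n = cols.empty() ? 0 : cols[0].size();
    std::vector<double> out(n);
    if (chunk_rows == 0) chunk_rows = n;  // treat 0 as "a single chunk"
    for (std::size_t begin = 0; begin < n; begin += chunk_rows) {
        mcs_rows(cols, begin, std::min(begin + chunk_rows, n), out);
    }
    return out;
}
```

With the question's columns B, C and D as input, mcs_chunked(cols, 2) processes the five rows in chunks of 2, 2 and 1, and the results should match the sd column in the question's expected output.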

How to compile this code

As an introduction to using C++ code through Rcpp, I'd suggest this chapter of Hadley Wickham's Advanced R. If you intend on doing anything further with Rcpp I'd strongly recommend you also read the official documentation and vignettes, but Wickham's book is probably a little more beginner friendly to use as a starting point. For your purposes, you just need to get Rcpp up and running so that you can compile the code above.

For this code to work for you, you'll need the Rcpp package if you don't already have it. You can get the package by running

install.packages("Rcpp")

from R. Note you'll also need a compiler; if you're on a Debian-based Linux system such as Ubuntu you can run

sudo apt install r-base-dev

from the terminal. If you are on Mac or Windows, check here for some instructions on getting this set up, or in the Wickham chapter linked above.

Once you have Rcpp installed, save the C++ code above into a file. Let's say for our example the file is named "SOanswer.cpp". Then you can make its mcs2() function available from R by putting the following two lines in your R script:

library(Rcpp)
sourceCpp("SOanswer.cpp") # assuming the file is in your working directory

That's it! Now your R script can call mcs2() and run much faster. If you want to learn more about Rcpp, besides the Wickham chapter above, I'd check out the reference manual and the vignettes available here, this page from RStudio (which also has tons of links, some of which are linked to here), and you can also find some really useful stuff looking around the Rcpp gallery.
