通过内部函数调用提高循环性能 [英] Improving loop performance with function call inside

查看:75
本文介绍了通过内部函数调用提高循环性能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

library(plyr);
library(sqldf);
library(data.table)
library(stringi);
library(RODBC);

dbhandle <- odbcDriverConnect('driver={SQL Server};server=.;database=TEST_DB;trusted_connection=true')
res <- sqlQuery(dbhandle, 'Select Company_ID,
       AsOfDate,
       CashFlow FROM dbo.Accounts')

resdatatable = as.data.table(res)

odbcCloseAll();


sppv <- function(i, n) {
    return((1 + i / 100) ^ (-n))
}


npv <- function(x, i) {
    npv = c()
    for (k in 1:length(i)) {
        pvs = x * sppv(i[k], 1:length(x))
        npv = c(npv, sum(pvs))
    }
    return(npv)
}


xirr <- function(cashflow, dates) {
    if (length(cashflow) != length(dates)) {
        stop("length(cashflow) != length(dates)")
    }

    cashflow_adj <- c(cashflow[1])
    for (i in 1:(length(cashflow) - 1)) {
        d1 <- as.Date(dates[i], "%d-%m-%Y", origin = "1970-01-01")
        d2 <- as.Date(dates[i + 1], "%d-%m-%Y", origin = "1970-01-01")

        # There are no checks about the monotone values of dates
        # put a check in here if the interval is negative

        interval <- as.integer(d2 - d1)

        if (length(interval) > 0 && !is.na(interval)) {
            cashflow_adj <- c(cashflow_adj, rep(0, interval - 1), cashflow[i + 1])
        }
   }

    left = -10
    right = 10
    epsilon = 1e-8
    while (abs(right - left) > 2 * epsilon) {
        midpoint = (right + left) / 2
        if (npv(cashflow_adj, left) * npv(cashflow_adj, midpoint) > 0) {
            left = midpoint
        } else {
            right = midpoint
        }
    }


    irr = (right + left) / 2 / 100
    irr <- irr * 365
    # Annualized yield (return) reflecting compounding effect of daily returns
    irr <- (1 + irr / 365) ^ 365 - 1

    irr
}




groupedCompanyNames <- unique(as.character(resdatatable$Company_ID));




groupedDatesPerCompany <- split(resdatatable$AsOfDate, resdatatable$Company_ID);




groupedCashFlowsPerCompany <- split(resdatatable$CashFlow, resdatatable$Company_ID);


resultsDataFrame <- data.table(Company_ID = character(length(groupedCompanyNames)), XIRR = numeric(length(groupedCompanyNames)));



datalist = result <- vector("list", length(groupedCompanyNames));



for (i in groupedCompanyNames) {


    datesForCompany <- groupedDatesPerCompany[i];
    dates <- datesForCompany[[i]];



    cashFlowsForCompany <- groupedCashFlowsPerCompany[i];
    cashFlows <- cashFlowsForCompany[[i]];


    xirrResult <- tryCatch(xirr(cashFlows, dates),
                           error = function(e) {

                              0
                           });

    newRow <- data.frame(Company_ID = i, XIRR = format(round(xirrResult, 2), nsmall = 2));
    datalist[[i]] <- newRow;

}

resultsDataFrame <- data.table::rbindlist(datalist)
finalDataFrame <- as.data.frame(resultsDataFrame);

print(finalDataFrame);

为了提供上下文,我尝试执行以下操作:

So to provide context, I am trying to do the following:


  1. 使用RODBC连接从数据库中获取数据

  2. 获取唯一的公司名称

  3. 拆分每个公司的现金流量和日期

  4. 初始化具有已知行数的数据表,这样就不需要
    来逐步增长。

  5. 通过唯一的公司名称和调用函数循环,可在列表
    的公司现金流量和日期中获得xirr。

  6. 在每行中添加公司名称和到新数据表的XIRR值。

  7. 使用rbindlist。

  1. Get data out of the database using an RODBC connection
  2. Get the unique company names
  3. Split the cashflows and dates per company
  4. Initialize a data table with a known number of rows so that it doesn't need to incrementally grow.
  5. Loop through the unique company names and call function get xirr on the list of cashflows and dates for the company.
  6. Add each row with the company name and the XIRR value to a new datatable.
  7. Use rbindlist.

这里是我正在使用的源数据样本

Company_ID  CashFlow    AsOfDate
3F68D729-D69D-E711-9C98-5065F34B3E7D    368608.0000 2004-11-30 00:00:00.000
3F68D729-D69D-E711-9C98-5065F34B3E7D    366999.0000 2004-12-31 00:00:00.000
3F68D729-D69D-E711-9C98-5065F34B3E7D    326174.0000 2005-01-31 00:00:00.000
3F68D729-D69D-E711-9C98-5065F34B3E7D    345666.0000 2005-02-28 00:00:00.000
3F68D729-D69D-E711-9C98-5065F34B3E7D    -1529180.0000   2005-03-31 00:00:00.000
3F68D729-D69D-E711-9C98-5065F34B3E7D    -65259.0000 2005-04-30 00:00:00.000
3F68D729-D69D-E711-9C98-5065F34B3E7D    514005.0000 2005-05-31 00:00:00.000
3F68D729-D69D-E711-9C98-5065F34B3E7D    512951.0000 2005-06-30 00:00:00.000
9B64D729-D69D-E711-9C98-5065F34B3E7D    -6792.0000  2011-06-30 00:00:00.000
9B64D729-D69D-E711-9C98-5065F34B3E7D    -6792.0000  2011-07-31 00:00:00.000
9B64D729-D69D-E711-9C98-5065F34B3E7D    -6572.0000  2011-08-31 00:00:00.000
9B64D729-D69D-E711-9C98-5065F34B3E7D    -6792.0000  2011-09-30 00:00:00.000
9B64D729-D69D-E711-9C98-5065F34B3E7D    -6572.0000  2011-10-31 00:00:00.000
9B64D729-D69D-E711-9C98-5065F34B3E7D    -6792.0000  2011-11-30 00:00:00.000
9B64D729-D69D-E711-9C98-5065F34B3E7D    -6791.0000  2011-12-31 00:00:00.000
9B64D729-D69D-E711-9C98-5065F34B3E7D    -187375.0000    2012-01-31 00:00:00.000
9B64D729-D69D-E711-9C98-5065F34B3E7D    -215902.0000    2012-02-29 00:00:00.000
9B64D729-D69D-E711-9C98-5065F34B3E7D    -6572.0000  2012-03-31 00:00:00.000
9B64D729-D69D-E711-9C98-5065F34B3E7D    -217409.0000    2012-04-30 00:00:00.000
9B64D729-D69D-E711-9C98-5065F34B3E7D    -191830.0000    2012-05-31 00:00:00.000

我是R的新手,大约有2000个独特的公司名称,平均有50个日期,每个现金流组合= 100000条记录,循环需要大约28秒的处理时间。

I'm new to R - and with circa 2000 unique company names an on average 50 date, cashflow combinations each = 100000 records the loop takes about 28 secs to process.

我看过使用asParallel库并使用了foreach,但这似乎对速度没有任何影响。如果我退出对xirr函数的调用,则循环将立即处理并结束。

I've looked at using the asParallel library and used foreach but that didn't seem to make any difference to the speed. If I take out the calling of the function xirr then the loop is processed and finished instantly.

xirr需要异常处理,因为有时无法计算xirr值

The xirr needs the exception handling as sometimes its not possible to calculate an xirr value iteratively.

我知道循环并不是R中的最佳实践-关于如何向量化以获得更好性能的任何建议?

I know that looping is not really best practice in R - any suggestions on how to vectorise this for better performance?

推荐答案

为了提高性能,我使用了doParallel库。

In order improve the performance, I used the doParallel library.

library(doParallel)
cl <- makeCluster(detectCores() - 1, type = 'PSOCK')
registerDoParallel(cl)

而不是for循环,我将逻辑放入foreach

And instead of the for loop, I put the logic into a foreach

resultsDataFrame <- foreach(n = 1:length(groupedCompanyNames), .combine = rbind) %dopar% {


    company_id <- groupedCompanyNames[n];
    datesForCompany <- groupedDatesPerCompany[n];
    dates <- unsplit(datesForCompany, company_id);


    cashFlowsForCompany <- groupedCashFlowsPerCompany[n];
    cashFlows <- unsplit(cashFlowsForCompany, company_id);

    #now calculate the xirr for the values
    xirrResult <- tryCatch(xirr(cashFlows, dates),
    error = function(e) {

    0
    });



    data.frame(Company_ID = company_id, XIRR = format(round(xirrResult, 2), nsmall = 2));
}

registerDoSEQ();

当我将完整的数据集(4000家公司)及其日期和现金流量运入其中时。原始循环总计40万行,耗时约10分钟。通过foreach循环并利用机器中的额外内核,该操作花费了60秒。

When I ran my full data set into it (4000 companies) with their dates and cashflows. A total of 400000 rows the original loop took around 10 minutes. With the foreach loop and utilising the extra cores in the machine, the operation took 60 seconds.

我希望有人可能会建议在顶部进一步提高性能其中,但我认为这是一个很好的改进。

I hope that someone will maybe be able to suggest a further performance spike on top of this but I think that is a good improvement.

这篇关于通过内部函数调用提高循环性能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆