R: Percentile calculations on subsets of data


Question

I have a data set with the following columns: rscore, gvkey, sic2, year, and cdom. I want to calculate percentile ranks based on the summed rscores of each gvkey, for every temporal span (~1500 of them), and then calculate the percentile rank of each gvkey within a given temporal span and sic2.

Calculating the percentiles across all temporal spans is fairly quick. Once I add the sic2 percentile ranks, however, it becomes fairly slow, since we are likely looking at about 65,000 subsets in total. Is there a way to speed this process up?

The data for one temporal span looks like the following:

gvkey  sic2  cdom  rscoreSum  pct
1187   10    USA    8.00E-02  0.942268617
1265   10    USA   -1.98E-01  0.142334654
1266   10    USA    4.97E-02  0.88565478
1464   10    USA   -1.56E-02  0.445748247
1484   10    USA    1.40E-01  0.979807985
1856   10    USA   -2.23E-02  0.398252565
1867   10    USA    4.69E-02  0.8791019
2047   10    USA   -5.00E-02  0.286701209
2099   10    USA   -1.78E-02  0.430915371
2127   10    USA   -4.24E-02  0.309255308
2187   10    USA    5.07E-02  0.893020421

The code to calculate the industry ranks is below, and it is fairly straightforward.

library(plyr)

# Generate percentile ranks within each 2-digit SIC industry
dout <- ddply(dfSum, .(sic2), function(x) {
  data.frame(gvkey = x$gvkey,
             indPct = rank(x$rscoreSum) / nrow(x))
})

# Merge the industry percentile ranks with the market percentile ranks.
# Both inputs carry sic2, so merge() returns sic2.x and sic2.y;
# rename column 2 (sic2.x) back to sic2.
dfSum <- merge(dfSum, dout, by = "gvkey")
names(dfSum)[2] <- 'sic2'

Any suggestions to speed up the process would be appreciated!

Recommended answer

You might try the data.table package, which is built for fast operations on relatively large datasets like yours. For example, my machine has no problem working through this:

library(data.table)

# Create a dataset like yours, but bigger
n.rows <- 2e6
n.sic2 <- 1e4
dfSum <- data.frame(gvkey=seq_len(n.rows),
                    sic2=sample.int(n.sic2, n.rows, replace=TRUE),
                    cdom="USA",
                    rscoreSum=rnorm(n.rows))

# Now make your dataset into a data.table
dfSum <- data.table(dfSum)

# Calculate the percentiles
# Note that there is no need to re-assign the result
dfSum[, indPct:=rank(rscoreSum)/length(rscoreSum), by="sic2"]

The plyr equivalent takes a while.
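The question also involves many temporal spans, so the same grouped assignment might be extended to group by span and industry at once rather than looping over spans. The following is only a sketch on made-up data: the column name `span` and the toy values are assumptions, not part of the original data set.

```r
library(data.table)

# Hypothetical toy data: 3 time spans x 20 firms, 5 firms per industry
set.seed(1)
dt <- CJ(span = 1:3, gvkey = 1:20)          # `span` is an assumed column name
dt[, sic2 := (gvkey - 1L) %/% 5L + 10L]     # 4 made-up industry codes (10..13)
dt[, rscoreSum := rnorm(.N)]

# Market-wide percentile rank within each time span
dt[, pct := rank(rscoreSum) / .N, by = span]

# Industry percentile rank within each (time span, sic2) subset
dt[, indPct := rank(rscoreSum) / .N, by = .(span, sic2)]
```

Grouping on both columns in a single call lets data.table handle all subsets in one pass, and `.N` is the group size, so it plays the role of `length(rscoreSum)` above.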

If you like the plyr syntax (I do), you may also be interested in the dplyr package, which is billed as "the next generation of plyr" and supports faster data stores in the backend.
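For comparison, the same per-industry percentile could be written with dplyr roughly as follows. This is a minimal sketch on a small made-up data frame (the values are illustrative assumptions), not a benchmark.

```r
library(dplyr)

# Hypothetical toy data: 12 firms in 3 made-up industries
set.seed(1)
dfSum <- data.frame(gvkey = 1:12,
                    sic2 = rep(c(10, 20, 30), each = 4),
                    rscoreSum = rnorm(12))

# Percentile rank of each firm within its industry
dfSum <- dfSum %>%
  group_by(sic2) %>%
  mutate(indPct = rank(rscoreSum) / n()) %>%
  ungroup()
```

Here `n()` returns the size of the current group, so the expression matches `rank(x$rscoreSum) / nrow(x)` from the question. dplyr also provides `percent_rank()`, but it rescales ranks to the range [0, 1] and so gives slightly different values.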
