Efficient conditional summing by multiple conditions in R


Question

I'm struggling to find an efficient solution for the following problem:

I have a large, already processed data frame with around 8 columns and 80,000 rows that includes several data types. I want to create a new data frame that contains the sum of one column wherever conditions on the large data frame are met.

Imagine the head of the original data frame looks like this. The column $years.raw indicates that the company measured data for x years.

> cbind(company.raw,years.raw,source,amount.inkg)
     company.raw years.raw source      amount.inkg
[1,] "C1"        "1"       "Ink"       "5"        
[2,] "C1"        "1"       "Recycling" "2"        
[3,] "C2"        "1"       "Coffee"    "10"       
[4,] "C2"        "1"       "Combusted" "15"       
[5,] "C2"        "2"       "Printer"   "14"       
[6,] "C2"        "2"       "Tea"       "18"   

What I need to do now is create a new data frame that sums the values of column $amount.inkg for every company and every year, based on certain string elements. I saved the string elements in the three vectors below. The string elements occur in the column $source of the original data frame.

> vector1 <- c("Tea","Coffee")
> vector2 <- c("Ink","Printer")
> vector3 <- c("Recycling","Combusted")

The preferred data frame would then look like this:

Company  Year  amount.vector1  amount.vector2  amount.vector3
C1       1     0               5               2
C2       1     10              0               15
C2       2     18              14              0

The general approach for $amount.vector1 would be: sum the values of column $amount.inkg for every company and every year where the value in column $source of the original data frame matches one of the string elements of vector1. The same applies to $amount.vector2 and $amount.vector3, using their respective elements.

If no values are available, a 0 should be inserted instead of an NA. This needs to be done for the whole raw data frame, which includes around 250 companies, each with data for some of the years 1:8 (the coverage differs quite a lot).

Edit: For the data frame, I need one row per company per year.

C1 Year 1  
C1 Year 2
C1 Year 3
C2 Year 1
C2 Year 2

I tried to write a function that combines these conditions, but I failed. I'm quite new to R and don't know how to link these conditions and apply them to the whole data frame.

Recommended answer

Your data is in 'long form' (multiple rows per company, source, year, ...).

You want to aggregate amount.inkg over each company and year, for multiple values of source. Specifically, you want to aggregate with conditionals on the 'source' field.

Again, please give us a reproducible example. (Thanks josilber). This is a four-liner with either Split-Apply-Combine (ddply) or logical indexing:

df = data.frame(company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
                years.raw = c(1, 1, 1, 1, 2, 2),
                source = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
                amount.inkg = c(5, 2, 10, 15, 14, 18))

# OPTION 1. Split-Apply-Combine: ddply(...summarize) with a conditional on the data
library(plyr) # consider dplyr if performance on a large data frame becomes an issue
ddply(df, .(company.raw,years.raw), summarize,
    amount.vector1=sum(amount.inkg[source %in% c('Tea','Coffee')]),
    amount.vector2=sum(amount.inkg[source %in% c('Ink','Printer')]),
    amount.vector3=sum(amount.inkg[source %in% c('Recycling','Combusted')])
)


# OPTION 2. sum with logical indexing on the df:
# (This is from before you modified the question to one-row-per-company-and-per-year)
df$amount.vector1 <- sum( df[(df$source %in% c('Tea','Coffee')),]$amount.inkg )
# josilber clarifies you want one-row-per-company
...
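The logical-indexing idea can also be carried through to one row per company and year using base R only. The following is a minimal sketch, not the answerer's own code: it assumes the example df from above and the three vectors from the question, and the names groups and result are purely illustrative.

```r
# Example data, copied from the answer above
df <- data.frame(company.raw = c("C1", "C1", "C2", "C2", "C2", "C2"),
                 years.raw = c(1, 1, 1, 1, 2, 2),
                 source = c("Ink", "Recycling", "Coffee", "Combusted", "Printer", "Tea"),
                 amount.inkg = c(5, 2, 10, 15, 14, 18))

# The three vectors from the question, collected in a named list
# (the list name `groups` is an assumption made for this sketch)
groups <- list(amount.vector1 = c("Tea", "Coffee"),
               amount.vector2 = c("Ink", "Printer"),
               amount.vector3 = c("Recycling", "Combusted"))

# For each vector, add a helper column that holds the amount when the row's
# source belongs to that vector and 0 otherwise
for (nm in names(groups)) {
  df[[nm]] <- ifelse(df$source %in% groups[[nm]], df$amount.inkg, 0)
}

# Sum the helper columns within every company/year combination present in df
result <- aggregate(cbind(amount.vector1, amount.vector2, amount.vector3) ~
                      company.raw + years.raw, data = df, FUN = sum)

# Sort by company, then year, so the row order is explicit
result <- result[order(result$company.raw, result$years.raw), ]
print(result)
```

Because the helper columns contain 0 rather than NA for non-matching rows, combinations with no matching source come out as 0. Only company/year pairs that actually occur in df get a row.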

Option 3. You could also use aggregate() with subset(...), although aggregate for a sum is overkill.

aggregate(amount.inkg ~ company.raw + years.raw, FUN = sum,
          data = subset(df, source %in% c('Tea','Coffee')))

The by argument to aggregate (or the right-hand side of its formula interface) defines the groups, here company and year; the selecting and subsetting by criteria is done by subset(...).

Note: %in% is defined in terms of match() and is itself vectorized, so it generally scales well. For a short list of values you can equivalently write explicit boolean comparisons: (source=='Tea' | source=='Coffee'). If performance on a large data frame becomes a concern, benchmark both forms on your own data.
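The two ways of writing the condition can be compared on a small vector. This is a sketch with made-up values (src, via_in and via_or are hypothetical names). Both forms return one logical value per element; the visible difference is how a missing source is treated.

```r
# Made-up source values, including one missing entry
src <- c("Tea", "Ink", NA, "Coffee")

# Membership test: an NA in src simply does not match, giving FALSE
via_in <- src %in% c("Tea", "Coffee")

# Explicit comparisons: an NA in src propagates as NA
via_or <- src == "Tea" | src == "Coffee"

print(via_in)
print(via_or)
```

When the result is used for subsetting, an NA in the index produces an NA element, so the %in% form is usually the safer choice if source can contain missing values.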

As for avoiding NA sums when a subset is empty: sum(c()) is 0, so empty groups already yield 0 and you don't need to worry about that. If NAs do appear (for example from NA values in amount.inkg), either use na.omit or sum(..., na.rm = TRUE), or apply ifelse(is.na(x),0,x) to the final result.
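The behaviour of sum() on empty and NA-containing subsets can be checked directly. This is a small sketch with made-up values; amount, src and the result names are hypothetical.

```r
# Made-up amounts, one of them missing
amount <- c(5, 2, NA)
src <- c("Ink", "Recycling", "Tea")

# No row matches "Coffee", so this sums an empty vector
empty_sum <- sum(amount[src %in% "Coffee"])

# The only matching amount is NA, so the plain sum is NA
na_sum <- sum(amount[src %in% "Tea"])

# na.rm = TRUE drops the NA before summing
safe_sum <- sum(amount[src %in% "Tea"], na.rm = TRUE)

# Replacing NA by 0 in a finished result vector
x <- c(3, NA, 7)
cleaned <- ifelse(is.na(x), 0, x)

print(c(empty_sum = empty_sum, na_sum = na_sum, safe_sum = safe_sum))
print(cleaned)
```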
