Merge data frames whilst summing common columns in R

Question

The difference is that they knew the columns that would be conflicting, whereas I need a generic method that won't know in advance which columns conflict.

Example:

TABLE1
Date             Time    ColumnA    ColumnB
01/01/2013      08:00      10         30
01/01/2013      08:30      15         25
01/01/2013      09:00      20         20
02/01/2013      08:00      25         15
02/01/2013      08:30      30         10
02/01/2013      09:00      35         5

TABLE2
Date           ColumnA    ColumnB    ColumnC
01/01/2013      100        300         1
02/01/2013      200        400         2

Table 2 only has dates, so each of its rows applies to every row in Table 1 that matches the date, regardless of time.

I would like the merge to sum the conflicting columns into one. The result should look like this:

TABLE3
Date             Time    ColumnA    ColumnB    ColumnC
01/01/2013      08:00      110         330        1
01/01/2013      08:30      115         325        1
01/01/2013      09:00      120         320        1
02/01/2013      08:00      225         415        2
02/01/2013      08:30      230         410        2
02/01/2013      09:00      235         405        2

At the moment my standard merge just creates duplicate columns of "ColumnA.x", "ColumnA.y", "ColumnB.x", "ColumnB.y".
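
To make the problem concrete, here is a minimal sketch that rebuilds the two example tables as data frames and runs a plain `merge()` on `Date`. The object names `table1` and `table2` are assumptions chosen for illustration; the values are taken from the tables above.

```r
# Rebuild the example tables (the names table1/table2 are illustrative assumptions)
table1 <- data.frame(
  Date    = rep(c("01/01/2013", "02/01/2013"), each = 3),
  Time    = rep(c("08:00", "08:30", "09:00"), times = 2),
  ColumnA = c(10, 15, 20, 25, 30, 35),
  ColumnB = c(30, 25, 20, 15, 10, 5),
  stringsAsFactors = FALSE
)
table2 <- data.frame(
  Date    = c("01/01/2013", "02/01/2013"),
  ColumnA = c(100, 200),
  ColumnB = c(300, 400),
  ColumnC = c(1, 2),
  stringsAsFactors = FALSE
)

# A plain merge on Date keeps both copies of every shared column and
# tells them apart with the default ".x" and ".y" suffixes
merged <- merge(table1, table2, by = "Date")
print(names(merged))
```

The shared columns are duplicated instead of summed, which is the behaviour the question wants to avoid.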

Many thanks for your help.

Recommended answer

If I understand correctly, you want a flexible method that does not require knowing which columns exist in each table aside from the columns you want to merge by and the columns you want to preserve. This may not be the most elegant solution, but here is an example function to suit your exact needs:

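merge_Sum <- function(.df1, .df2, .id_Columns, .match_Columns){
    # One key per row, built from the match columns (works for one or several)
    df1_Keys <- do.call(paste, c(unname(as.list(.df1[.match_Columns])), sep = "\r"))
    df2_Keys <- do.call(paste, c(unname(as.list(.df2[.match_Columns])), sep = "\r"))
    # Row of .df2 matching each row of .df1; NA where there is no match
    df2_Rows <- match(df1_Keys, df2_Keys)
    # Empty data frame with the union of both column sets and the rows of .df1
    merged_Columns <- unique(c(names(.df1), names(.df2)))
    merged_df1 <- data.frame(matrix(nrow = nrow(.df1), ncol = length(merged_Columns)))
    names(merged_df1) <- merged_Columns
    for (column in merged_Columns){
        if (column %in% .id_Columns || !column %in% names(.df2)){
            # Identifier columns and columns only in .df1: copy unchanged
            merged_df1[, column] <- .df1[, column]
        } else if (!column %in% names(.df1)){
            # Columns only in .df2: look up the matching row
            merged_df1[, column] <- .df2[df2_Rows, column]
        } else {
            # Columns in both tables: sum, counting a missing match as 0
            df2_Values <- .df2[df2_Rows, column]
            df2_Values[is.na(df2_Values)] <- 0
            merged_df1[, column] <- .df1[, column] + df2_Values
        }
    }
    return(merged_df1)
}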

This function assumes you have a table '.df1' that is a master of sorts, and you want to merge data from a second table '.df2' that has rows that match one or more of the rows in '.df1'. The columns to preserve from the master table '.df1' are accepted as a vector '.id_Columns', and the columns that provide the match for merging the two tables are accepted as a vector '.match_Columns'.

For your example, it would work like this:

merge_Sum(table1, table2, c("Date","Time"), "Date")

#   Date       Time  ColumnA ColumnB ColumnC
# 1 01/01/2013 08:00     110     330       1
# 2 01/01/2013 08:30     115     325       1
# 3 01/01/2013 09:00     120     320       1
# 4 02/01/2013 08:00     225     415       2
# 5 02/01/2013 08:30     230     410       2
# 6 02/01/2013 09:00     235     405       2

In plain language, this function first finds the set of unique column names across both tables and makes an empty data frame, with one row per row of the master table '.df1', to later hold the merged data. Then, for the '.id_Columns', the data is copied from '.df1' into the new merged data frame. For the other columns, any data that exists in '.df1' is added to any matching data in '.df2', where the rows in '.df2' are matched based on the '.match_Columns'. Columns that exist only in '.df2' are looked up through the same match and copied across.
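
The row matching described here relies on base R's `match()`, which returns, for each element of its first argument, the position of the first equal element in its second argument. A small sketch with made-up vectors (the names `master_dates`, `lookup_dates` and `lookup_values` are hypothetical) shows the lookup in isolation:

```r
# Hypothetical data: four master rows, two lookup rows
master_dates  <- c("01/01/2013", "01/01/2013", "02/01/2013", "03/01/2013")
lookup_dates  <- c("01/01/2013", "02/01/2013")
lookup_values <- c(100, 200)

# Position in lookup_dates of each master date; NA where nothing matches
rows <- match(master_dates, lookup_dates)

# Indexing by those positions repeats each lookup value once per master row
matched <- lookup_values[rows]

# Unmatched rows come back as NA; treat them as 0 before summing
matched[is.na(matched)] <- 0
print(matched)
```

Because the lookup is driven by the master vector, the result always has one value per master row, which is why the merged table keeps the shape of '.df1'.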

There is probably some package out there that does something similar, but most of them require knowledge of all the existing columns and how to treat them. As I said before, this is not the most elegant solution, but it is flexible and accurate.
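
As one possible alternative that stays within base R, the sketch below lets `merge()` do the join and then collapses each `.x`/`.y` pair into a single summed column. This is a different technique from the function above, not the answer's method. The name `merge_then_sum` and its arguments are hypothetical, and the sketch assumes that no original column name already ends in `.x` or `.y` and that the shared columns are numeric.

```r
merge_then_sum <- function(df1, df2, by) {
  # Columns present in both tables, other than the join keys
  common <- setdiff(intersect(names(df1), names(df2)), by)
  # Left join: keep every row of df1 even when df2 has no match
  out <- merge(df1, df2, by = by, all.x = TRUE)
  for (col in common) {
    x <- out[[paste0(col, ".x")]]
    y <- out[[paste0(col, ".y")]]
    y[is.na(y)] <- 0                  # a missing match contributes nothing
    out[[col]] <- x + y               # summed column under the original name
    out[[paste0(col, ".x")]] <- NULL  # drop the suffixed copies
    out[[paste0(col, ".y")]] <- NULL
  }
  out
}

# Small illustrative tables using values from the question
table1 <- data.frame(Date = c("01/01/2013", "01/01/2013", "02/01/2013"),
                     Time = c("08:00", "08:30", "08:00"),
                     ColumnA = c(10, 15, 25), ColumnB = c(30, 25, 15))
table2 <- data.frame(Date = c("01/01/2013", "02/01/2013"),
                     ColumnA = c(100, 200), ColumnB = c(300, 400),
                     ColumnC = c(1, 2))
result <- merge_then_sum(table1, table2, by = "Date")
print(result)
```

Unlike the function above, `merge()` sorts the rows by the key columns, and the summed columns end up after the unshared ones, so the layout of the result can differ from that of the master table.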

Update: The original function assumed a many-to-one relationship between table1 and table2, and the OP requested the allowance of a many-to-none relationship, also. The code has been updated with a slightly less efficient but 100% more flexible logic.
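
To illustrate the many-to-none case, the sketch below runs the same logic on a master table that contains a date missing from the second table. The function is repeated in a lightly condensed form (single match column, lookup computed once) so that the block runs on its own; the third date and its values are made up for illustration.

```r
# Same logic as the answer's function, condensed for a single match column
merge_Sum <- function(.df1, .df2, .id_Columns, .match_Columns){
    merged_Columns <- unique(c(names(.df1), names(.df2)))
    merged_df1 <- data.frame(matrix(nrow = nrow(.df1), ncol = length(merged_Columns)))
    names(merged_df1) <- merged_Columns
    # Row of .df2 matching each row of .df1; NA where there is no match
    df2_Rows <- match(.df1[, .match_Columns], .df2[, .match_Columns])
    for (column in merged_Columns){
        if (column %in% .id_Columns || !column %in% names(.df2)){
            merged_df1[, column] <- .df1[, column]
        } else if (!column %in% names(.df1)){
            merged_df1[, column] <- .df2[df2_Rows, column]
        } else {
            df2_Values <- .df2[df2_Rows, column]
            df2_Values[is.na(df2_Values)] <- 0
            merged_df1[, column] <- .df1[, column] + df2_Values
        }
    }
    return(merged_df1)
}

# "03/01/2013" is a made-up date with no counterpart in table2
table1 <- data.frame(Date = c("01/01/2013", "02/01/2013", "03/01/2013"),
                     Time = c("08:00", "08:00", "08:00"),
                     ColumnA = c(10, 25, 40), ColumnB = c(30, 15, 0))
table2 <- data.frame(Date = c("01/01/2013", "02/01/2013"),
                     ColumnA = c(100, 200), ColumnB = c(300, 400),
                     ColumnC = c(1, 2))
result <- merge_Sum(table1, table2, c("Date", "Time"), "Date")
print(result)
```

For the unmatched row, the shared columns keep their values from the master table (the missing match counts as 0), while `ColumnC`, which exists only in the second table, is left as `NA`.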
