如何识别列是数据集中其他列的总和 [英] How to identify columns that are sums of other columns in a dataset

查看:58
本文介绍了如何识别列是数据集中其他列的总和的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我想编写一个函数(最好使用R语言,但欢迎使用其他语言),该函数将识别数据集中列之间的关系(仅限于加/减).此方法的实际应用是在大型多列财务数据集上运行它,其中某些列是其他列的小计-并标识这些小计.

I would like to write a function (preferably in R, but other languages are welcome), which would identify relationships between columns (limited to additions/substractions) in a dataset. A practical application of this would be to run it on large multi-column financial datasets, where some of the columns are subtotals of other columns - and identify such subtotals.

理想情况下,我希望允许一些小的差异-例如考虑到四舍五入的问题,导致列的总和不完全是100%.

Ideally, I would like to allow for small discrepancies - e.g. to allow for rounding issues leading to columns not adding up exactly 100%.

我发现了以下

I found the following question which includes a solution involving matrices and ranks, but I am not sure if there is any way to incorporate the ability to handle the noise in data arising from the rounding issues.

例如:

d = data.frame(a=c(10.12, 20.02, 30.08, 20.19), b=c(12.12, 20.45, 20.52, 16.72), c=c(11, 123.25, 20.67, 20.78))
d$d = d$a + d$b
d$e = d$d + d$c
> d
      a     b      c     d      e
1 10.12 12.12  11.00 22.24  33.24
2 20.02 20.45 123.25 40.47 163.72
3 30.08 20.52  20.67 50.60  71.27
4 20.19 16.72  20.78 36.91  57.69

magic_function(d)
[1] "d$d = d$a + d$b"
[2] "d$e = d$d + d$c" # or "d$e = d$a + d$b + d$c" (first option preferred)

在将噪声引入方程式之前,链接问题中的解决方案效果很好.例如d$d[[4]] = d$d[[4]] + 0.01-然后它不再起作用.因此,我的问题是:

The solution in the linked question works well until I introduce noise into equation. e.g. d$d[[4]] = d$d[[4]] + 0.01 - then it no longer works at all. My question thus is:

  1. 是否还有其他方法可以识别之间的关系 列(尤其是仅限于简单列) 加/减)
  2. 是否有任何方法可以解决 数据质量问题不完善,还是我需要建立一些附加功能 功能(例如,在通过排名识别算法运行数据之前先对数据进行四舍五入).
  1. Are there any other methods to identifying relationships between columns (especially if they are restricted to simple addition/subtraction)
  2. Are any of the methods able to address the imperfect data quality issue or do I need to build some additional functionality for it (e.g. round the data before running it through the rank identification algorithm).

推荐答案

如果您只允许将和用于连续列,并且仅用于先前的值,则此方法的计算量可能很容易处理10-20列.此过程将检查该列是否等于先前连续列的总和,并留有一些误差:

If you allow the sums to only be for consecutive columns, and for previous values only, the computational effort for this is probably tractable for 10-20 columns. This procedure checks to see if the column is equal to the sum of previous consecutive columns, with some allowance for error:

d <- data.frame(a=c(10.12, 20.02, 30.08, 20.19),
                b=c(12.12, 20.45, 20.52, 16.72),
                c=c(11, 123.25, 20.67, 20.78));
d$d <- round(d$a + d$b + runif(4,0,0.04),2);
d$e <- round(d$d + d$c + runif(4,0,0.04),2);

## Assumptions:
## * sum columns relate to previous values only
## * sum columns relate to consecutive columns

sumColumns <- NULL;
allowedError <- 0.05;
for(col in 3:ncol(d)){
    for(subStart in 1:(col-2)){
        for(subEnd in (subStart+1):(col-1)){
            if(all(abs(d[,col] - rowSums(d[,subStart:subEnd, drop=FALSE])) <
                   allowedError)){
                cat(sprintf("Column %d is a sum of columns %d-%d\n",
                            col, subStart, subEnd));
                sumColumns[col] <- TRUE;
            }
        }
    }
}

输出:

Column 4 is a sum of columns 1-2
Column 5 is a sum of columns 3-4

可以对此进行修改,以允许连续列以及任意数量的求和列,同时保持易处理性(假设求和列的数量保持较低).这种修改并非完全无关紧要,而是留给读者练习.

This could be modified to allow for consecutive columns together with any number of sum columns while maintaining tractability (assuming the number of sum columns were kept low). That modification is not completely trivial, and is left as an exercise to the reader.

这篇关于如何识别列是数据集中其他列的总和的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆