Count occurrences of a variable having two given values corresponding to one value of another variable


Question

As can be seen in the picture, I have one column with order numbers and one column with material numbers.

I want to find how often each pair of materials occurs in the same order.

The problem is that I have 30,000 rows of order numbers and 700 unique material numbers. Is this even possible?

I was wondering whether it would be easier to make a matrix with the 700 material numbers as both rows and columns, and count the number of occurrences in each cell.

The first picture was not a good example, so I uploaded a second picture with random material numbers. For each pair (for example 10-11, which I highlighted), I want to count how many orders the two materials appear in together. As can be seen, 10 and 11 appear together in 3 different orders.

Recommended answer

The optimal solution in terms of memory would be one row per pair, which comes to 700 * 699 / 2 = 244,650 rows. However, the problem is still relatively small, and the simplicity of manipulating a 700 * 700 matrix is probably worth more than the 700 * 701 / 2 = 245,350 cells you would save, which works out to about 240 kB at one byte per cell. The footprint could be even smaller if the matrix is sparse (i.e. most pairs of materials are never ordered together) and you use an appropriate data structure.
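The figures above can be checked with a few lines of R. This is only a sanity check of the arithmetic; the variable names are illustrative, and the one-byte-per-cell figure is the answer's assumption (a default numeric matrix in R stores 8 bytes per cell).

```r
n <- 700                                         # number of unique materials

unordered_pairs <- choose(n, 2)                  # n * (n - 1) / 2 = 244650 distinct pairs
full_matrix     <- n * n                         # 490000 cells in the full n x n matrix
cells_saved     <- full_matrix - unordered_pairs # 245350, i.e. n * (n + 1) / 2

# Size of the saving at one byte per cell (the answer's assumption): about 239.6
kb_saved <- cells_saved / 1024
```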

Here is the code:

First we want to create a data frame with as many rows and columns as there are materials. Matrices are easier to create, so we create a matrix and convert it to a data frame afterwards.

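# unique material identifiers, as character strings
all_materials <- levels(as.factor(X$Materials))
number_materials <- length(all_materials)
# square table of zeros, one row and one column per material
Pairs <- as.data.frame(matrix(data = 0, nrow = number_materials, ncol = number_materials))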

(Here, X is your dataset.)

We then set the row names and column names so that rows and columns can be accessed directly by material identifier, since the identifiers are apparently not necessarily numbered from 1 to 700.

colnames(Pairs) <- all_materials
rownames(Pairs) <- all_materials
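As a small illustration of why the names matter, the sketch below uses three made-up material identifiers (not taken from the question's data). A character index is matched against the row and column names, whereas a numeric index is treated as a position, so numeric material numbers should be converted with as.character() before being used as indices.

```r
# Hypothetical material identifiers, deliberately not numbered 1..n
all_materials <- c("10", "11", "25")
Pairs <- as.data.frame(matrix(data = 0, nrow = 3, ncol = 3))
colnames(Pairs) <- all_materials
rownames(Pairs) <- all_materials

# Character indices are looked up by name: row "10", column "11"
Pairs["10", "11"] <- Pairs["10", "11"] + 1

# Numeric indices are positions: row 1, column 2 is that same cell
same_cell <- Pairs[1, 2]
```

With numeric indices, Pairs[10, 11] would instead ask for the tenth row and eleventh column, which do not exist in this 3 x 3 table.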

We then loop over the dataset:

for (order in levels(as.factor(X$Order.number))) {
  # materials in this order, as character so they index Pairs by name (not by position),
  # de-duplicated and sorted so that each pair always lands in the same cell
  materials_for_order <- sort(unique(as.character(X[X$Order.number == order, "Materials"])))
  if (length(materials_for_order) > 1) {
    # every possible pair from the materials list, one pair per column
    all_pairs_in_order <- combn(x = materials_for_order, m = 2)
    # increment the cell at the row and column corresponding to each pair
    for (i in seq_len(ncol(all_pairs_in_order))) {
      first <- all_pairs_in_order[1, i]
      second <- all_pairs_in_order[2, i]
      Pairs[first, second] <- Pairs[first, second] + 1
    }
  }
}

At the end of the loop, the Pairs table should contain everything you need.
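As an alternative to the explicit loop (not part of the original answer), the same counts can be sketched with a cross-product of an orders-by-materials incidence matrix. The toy data below is made up for illustration and only borrows the column names Order.number and Materials from the code above. Note that the incidence matrix is dense, with one cell per order and material combination, so its memory use should be checked before running this on a large dataset.

```r
# Toy data, invented for illustration: 4 orders, materials 10, 11 and 12
X <- data.frame(
  Order.number = c(1, 1, 2, 2, 2, 3, 3, 4),
  Materials    = c(10, 11, 10, 11, 12, 10, 11, 12)
)

# Orders x materials incidence matrix: 1 if the material appears in the order, else 0
incidence <- 1 * (unclass(table(X$Order.number, X$Materials)) > 0)

# t(incidence) %*% incidence: cell [a, b] is the number of orders containing both a and b
co_occurrence <- crossprod(incidence)

# Materials 10 and 11 appear together in orders 1, 2 and 3
pair_count <- co_occurrence["10", "11"]
```

The result is symmetric, and its diagonal holds the number of orders each material appears in, so only the cells above (or below) the diagonal are needed to read off the pair counts.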
