Association rules in R: removing redundant rules (arules)


Question

Suppose we have 3 rules:

[1] {A,B,D} -> {C}

[2] {A,B} -> {C}

[3] Whatever it is

Rule [2] is a subset of rule [1] (rule [1] contains all the items in rule [2]), so rule [1] should be eliminated: it is too specific, and its information is already included in rule [2].

I searched the internet, and everyone seems to use the following code to remove redundant rules:

subset.matrix <- is.subset(rules.sorted, rules.sorted)
subset.matrix[lower.tri(subset.matrix, diag=T)] <- NA
redundant <- colSums(subset.matrix, na.rm=T) >= 1
which(redundant)
rules.pruned <- rules.sorted[!redundant]

I don't understand how this code works.

After line 2 of the code, subset.matrix becomes:

      [,1] [,2] [,3]
[1,]   NA    1    0
[2,]   NA   NA    0
[3,]   NA   NA   NA

The cells in the lower triangle are set to NA, and since rule [2] is a subset of rule [1], the corresponding cell is set to 1. So I have 2 questions:

  1. Why do we have to set the lower triangle to NA? If we do, how can we check whether rule [2] is a subset of rule [3]? (That cell has been set to NA.)

  2. In our case, rule [1] should be the one that is eliminated, but this code eliminates rule [2] instead. (The first cell in column 2 is 1, so according to line 3 of the code the column sum of column 2 is >= 1, and rule [2] is treated as redundant.)

Any help would be appreciated!

Recommended answer

For your code to work you need an interest measure (confidence or lift), and rules.sorted needs to be sorted by either confidence or lift. In any case, the code is horribly inefficient, since is.subset() creates a matrix of size n^2, where n is the number of rules. Also, is.subset() for rules merges the rhs and lhs of each rule, which is not correct. So don't worry too much about the implementation details.
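To make the role of the ordering concrete, here is a minimal base-R sketch that rebuilds the matrix by hand instead of calling is.subset(). The toy rules and the matrix m are hypothetical, and the sketch assumes that entry [i, j] means "all items of rule i occur in rule j", which is how I read the is.subset() documentation. With the more general rule ranked first (as it would be after sorting when it scores at least as well), it is the column of the more specific rule that gets flagged.

```r
# Toy rules as item sets, with LHS and RHS merged as the snippet effectively does.
# The order stands in for a sort by lift or confidence (assumed, not computed).
rules <- list(
  general  = c("A", "B", "C"),       # {A,B} -> {C}
  specific = c("A", "B", "D", "C"),  # {A,B,D} -> {C}
  other    = c("E", "F")             # some unrelated rule
)

# m[i, j] is TRUE when every item of rule i also occurs in rule j.
n <- length(rules)
m <- matrix(FALSE, n, n, dimnames = list(names(rules), names(rules)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    m[i, j] <- all(rules[[i]] %in% rules[[j]])
  }
}

# Drop the diagonal and every comparison against a lower-ranked rule,
# so each column only "sees" the rules ranked above it.
m[lower.tri(m, diag = TRUE)] <- NA

# A rule is flagged when some higher-ranked rule is contained in it.
redundant <- colSums(m, na.rm = TRUE) >= 1
print(redundant)
```

Under these assumptions only the rule named specific is flagged. The lower triangle is discarded because a rule should only be judged against rules that rank above it; if the rules are not sorted first, the flags carry no such meaning.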

A more efficient way to do this is now implemented as the function is.redundant() in package arules (available in version 1.4-2). This explanation comes from the manual page:

A rule is redundant if a more general rule with the same or a higher confidence exists. That is, a more specific rule is redundant if it is only equally or even less predictive than a more general rule. A rule is more general if it has the same RHS but one or more items removed from the LHS. Formally, a rule X -> Y is redundant if

for some X' subset X, conf(X' -> Y) >= conf(X -> Y).

This is equivalent to a negative or zero improvement as defined by Bayardo et al. (2000). In this implementation, measures other than confidence, e.g. improvement of lift, can be used as well.
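As a rough illustration of that definition, the following base-R sketch applies it to three made-up rules that all share the RHS {C}. The confidences are invented, and the check only compares against rules that are present in the list, whereas the definition ranges over every subset X' of X, so treat it as a simplification rather than what is.redundant() does internally.

```r
# Hypothetical rules with the same RHS {C}; lhs[[k]] and conf[k] belong together.
lhs  <- list(c("A", "B"), c("A", "B", "D"), c("A", "D"))
conf <- c(0.80, 0.75, 0.60)  # invented confidences

# TRUE when x is a proper subset of y (item sets without duplicates).
proper_subset <- function(x, y) all(x %in% y) && length(x) < length(y)

# Rule i is redundant if some rule j has a strictly smaller LHS
# and a confidence that is at least as high.
idx <- seq_along(lhs)
redundant <- vapply(idx, function(i) {
  any(vapply(idx, function(j) {
    proper_subset(lhs[[j]], lhs[[i]]) && conf[j] >= conf[i]
  }, logical(1)))
}, logical(1))

print(redundant)
```

Here only {A,B,D} -> {C} is flagged, because the more general {A,B} -> {C} has a confidence at least as high.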

See the examples in ?is.redundant.
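For a quick idea of how this might look in practice, here is a short sketch that assumes arules 1.4-2 or later is installed. The Adult data set ships with arules; the support and confidence thresholds are arbitrary choices for illustration, not recommendations.

```r
library(arules)

data("Adult")

# Mine some rules; the thresholds are arbitrary and only for illustration.
rules <- apriori(Adult,
                 parameter = list(supp = 0.5, conf = 0.9, target = "rules"))

# One logical value per rule; TRUE marks a rule that is redundant
# with respect to a more general rule, judged by confidence.
redundant <- is.redundant(rules, measure = "confidence")

# Keep only the non-redundant rules.
rules.pruned <- rules[!redundant]

length(rules)
length(rules.pruned)
```

Unlike the is.subset() snippet, this does not depend on the rules being sorted beforehand.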
