R - For loop for apriori Algorithm


Problem description

Today, a question about a for loop wrapped around the apriori data mining algorithm. I am analysing the results of the apriori algorithm but, as you already know, its two main parameters (confidence and support) have to be set beforehand, without knowing the results. This means you sometimes have to try different combinations of parameters to reach a satisfying result. I decided to set up a for loop in R, aiming for a result of this type:

vector  s  c
x1      y1 z1
x2      y1 z2
x3      y1 z3
x4      y2 z1
x5      y2 z2
x6      y2 z3
...
xn      yn zn

where the vector x holds the number of rules created, s is the support parameter (0<=s<=1), and c is the confidence parameter (0<=c<=1). This means that, for each support value combined with each confidence level I want, I get the number of rules created, all stored in a nice three-column data frame.

Naturally, I started by looking for the solution myself. I thought the two parameters should be a pair of sequences so, having no idea how to write a for loop over two sequences, I started from one of my old questions:

For loop with decimals and storing the results in a vector

I tried to make a simple for loop with only one "moving" parameter, with the second fixed. First of all I created some fake data, useful because it is very small.

# here the data
id <- c("1","1","1","2","2","2","3","3","3")
obj <- c("a", "b", "j", "a", "g","c", "a","k","c")
df <- data.frame(id,obj)

Then a conversion, to make the data digestible for the apriori function of the arules package:

# here the rewritten data
library(arules)
transactions <- as(split(df$obj, df$id), "transactions")
inspect(transactions)
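
For reference, the intermediate step of that conversion can be inspected on its own. The following is a minimal base-R sketch (no arules needed) of what split() produces from the fake data above: one character vector of items per id, which is the list shape that gets coerced to "transactions". The name baskets is only illustrative.

```r
# Same fake data as above; stringsAsFactors = FALSE keeps items as plain strings
id  <- c("1", "1", "1", "2", "2", "2", "3", "3", "3")
obj <- c("a", "b", "j", "a", "g", "c", "a", "k", "c")
df  <- data.frame(id, obj, stringsAsFactors = FALSE)

# One list element per transaction id, each holding that transaction's items
baskets <- split(df$obj, df$id)
str(baskets)
```

Each list element corresponds to one transaction, so the data set has only three transactions.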

And last, the function with only one moving parameter, the support:

test <- function(x, y1, y2, y3, z){

# the sequence for the support
  s <- seq(y1, y2, by = y3)

# empty vector, one slot per support value
  my_vector <- numeric(length(s))

# for loop with moving support (in the seq) and fixed confidence
  for(i in seq_along(s)){
# this is a small trick to have the row of the rules, do not know if it is perfect
    my_vector[i] <- nrow(data.frame(
      labels(lhs(apriori(x, parameter = list(supp = s[i], conf = z))))))
  }

# put the result in a data frame
  data <- data.frame(vector = as.numeric(my_vector), s = as.numeric(s))
  return(data)
}

And here is the first application, with some results:

# the function applied
test(transactions, 0.01, 0.1, 0.01, 0.1)

# the result (apriori also prints its own output, omitted here)
   vector    s
1      31 0.01
2      31 0.02
3      31 0.03
4      31 0.04
5      31 0.05
6      31 0.06
7      31 0.07
8      31 0.08
9      31 0.09
10     31 0.10

If you run these:

apriori(transactions,parameter=list(supp = 0.01, conf = 0.1))
apriori(transactions,parameter=list(supp = 0.1, conf = 0.1))

the results are consistent.

Now the difficult part (for me). I would also like the confidence parameter to vary. I studied this a bit:

Including multiple conditions in a for loop

But I hit a big limitation: I cannot see how to apply it. I could vary the first parameter and, for each of its values, "move" the second. In this case, if the support varies between 0.01 and 0.1 in steps of 0.01, and so does the confidence, the result should be a vector of 100 rows.

Also, I have a technical difficulty: I am not able to implement what I just described. I know this procedure could be a bit heavy on the machine, but I would like one that is usable.

I would appreciate some help, and thanks in advance for your time.

Recommended answer

With dplyr.
First, create a grid of parameters.
Then build a model for each combination of parameters and store it in a list-column (useful for further computations).
Then use the length() function on each model, which seems to do exactly what you want with your "small trick":

grid <- expand.grid(support = seq(0.01, 0.1, 0.01),
                    confidence = seq(0.01, 0.1, 0.01))
library(dplyr)
res <- 
  grid %>% 
  group_by(support, confidence) %>% 
  do(model = apriori(
    transactions,
    parameter = list(support = .$support, confidence = .$confidence)
  )) %>% 
  mutate(n_rules = length(model)) %>%
  ungroup()

# # A tibble: 100 × 4
#    support confidence       model n_rules
#      <dbl>      <dbl>      <list>   <int>
# 1     0.01       0.01 <S4: rules>      31
# 2     0.01       0.02 <S4: rules>      31
# 3     0.01       0.03 <S4: rules>      31
# 4     0.01       0.04 <S4: rules>      31
# 5     0.01       0.05 <S4: rules>      31
# 6     0.01       0.06 <S4: rules>      31
# 7     0.01       0.07 <S4: rules>      31
# 8     0.01       0.08 <S4: rules>      31
# 9     0.01       0.09 <S4: rules>      31
# 10    0.01       0.10 <S4: rules>      31
# # ... with 90 more rows
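
A note on why every row shows the same count: with only three transactions, any itemset that occurs at all has support of at least 1/3, and any rule built from such itemsets has confidence of at least (1/3)/1 = 1/3. Thresholds between 0.01 and 0.1 therefore cannot filter anything out. The base-R sketch below checks the support side of this on the fake data; the variable names are illustrative and not part of the original post.

```r
# Same fake data as in the question
id  <- c("1", "1", "1", "2", "2", "2", "3", "3", "3")
obj <- c("a", "b", "j", "a", "g", "c", "a", "k", "c")

n_transactions <- length(unique(id))

# Count each item at most once per transaction, then convert counts to supports
pairs        <- unique(data.frame(id, obj, stringsAsFactors = FALSE))
item_support <- table(pairs$obj) / n_transactions
print(round(item_support, 4))

# Smallest support any occurring itemset can have in this data set
min_possible_support <- 1 / n_transactions

# TRUE: every threshold in the grid lies below the smallest possible support
all(seq(0.01, 0.1, 0.01) < min_possible_support)
```

To see the rule count change on this data, the grid would presumably need to reach above 1/3, for instance seq(0.1, 1, 0.1).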

You may want to re-use each model. Since they're all stored in your resulting data frame, that should be more convenient.
To examine a single model, you could do, for instance:

summary(res$model[res$confidence == 0.03 & res$support == 0.04][[1]])

# set of 31 rules
# 
# rule length distribution (lhs + rhs):sizes
#  1  2  3 
#  6 16  9 
# 
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#   1.000   2.000   2.000   2.097   3.000   3.000 
# 
# summary of quality measures:
#     support         confidence          lift      
#  Min.   :0.3333   Min.   :0.3333   Min.   :1.000  
#  1st Qu.:0.3333   1st Qu.:0.4167   1st Qu.:1.000  
#  Median :0.3333   Median :1.0000   Median :1.000  
#  Mean   :0.3871   Mean   :0.7419   Mean   :1.387  
#  3rd Qu.:0.3333   3rd Qu.:1.0000   3rd Qu.:1.500  
#  Max.   :1.0000   Max.   :1.0000   Max.   :3.000  
# 
# mining info:
#          data ntransactions support confidence
#  transactions             3    0.04       0.03
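
If only the three-column table from the question is needed (number of rules, support, confidence) and dplyr is not available, the same sweep can be sketched in base R. This is an alternative under stated assumptions, not the method of the answer above: sweep_params and count_rules are hypothetical helper names, and the arules part only runs when that package is installed. The control = list(verbose = FALSE) argument silences the progress output of apriori().

```r
# Call count_fn(support, confidence) once for every pair of parameter values
sweep_params <- function(supports, confidences, count_fn) {
  grid <- expand.grid(support = supports, confidence = confidences)
  grid$n_rules <- mapply(count_fn, grid$support, grid$confidence)
  grid
}

if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)

  # Same fake data and conversion as in the question
  id  <- c("1", "1", "1", "2", "2", "2", "3", "3", "3")
  obj <- c("a", "b", "j", "a", "g", "c", "a", "k", "c")
  transactions <- as(split(obj, id), "transactions")

  # Number of rules mined for one (support, confidence) pair
  count_rules <- function(s, conf) {
    length(apriori(transactions,
                   parameter = list(supp = s, conf = conf),
                   control = list(verbose = FALSE)))
  }

  res_base <- sweep_params(seq(0.01, 0.1, 0.01),
                           seq(0.01, 0.1, 0.01),
                           count_rules)
  print(head(res_base))
}
```

Unlike the list-column approach, this keeps only the counts and discards the fitted rule sets, which uses less memory but means a model has to be refitted before it can be inspected.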
