Testing rules generated by the rpart package

Question

I want to test programmatically a rule generated from a tree. In a tree, the path between the root and a leaf (terminal node) can be interpreted as a rule.

In R, we can use the rpart package and do the following (in this post I use the iris data set, for example purposes only):

library(rpart)
model <- rpart(Species ~ ., data=iris)

With these two lines I get a tree named model, an object of class rpart (described under rpart.object in the rpart documentation, page 21). This object holds a lot of information and supports a variety of methods. In particular, it has a frame component (accessed in the standard way: model$frame) (idem), and there is the function path.rpart (rpart documentation, page 7), which gives the path from the root node to the node of interest (the nodes argument of the function).

The row.names of the frame contain the node numbers of the tree. The var column gives the split variable at each node, yval the fitted value, and yval2 the class probabilities and other information.

> model$frame
           var   n  wt dev yval complexity ncompete nsurrogate     yval2.1     yval2.2     yval2.3     yval2.4     yval2.5     yval2.6     yval2.7
1 Petal.Length 150 150 100    1       0.50        3          3  1.00000000 50.00000000 50.00000000 50.00000000  0.33333333  0.33333333  0.33333333
2       <leaf>  50  50   0    1       0.01        0          0  1.00000000 50.00000000  0.00000000  0.00000000  1.00000000  0.00000000  0.00000000
3  Petal.Width 100 100  50    2       0.44        3          3  2.00000000  0.00000000 50.00000000 50.00000000  0.00000000  0.50000000  0.50000000
6       <leaf>  54  54   5    2       0.00        0          0  2.00000000  0.00000000 49.00000000  5.00000000  0.00000000  0.90740741  0.09259259
7       <leaf>  46  46   1    3       0.01        0          0  3.00000000  0.00000000  1.00000000 45.00000000  0.00000000  0.02173913  0.97826087

Only the rows marked as <leaf> in the var column are terminal nodes (leaves). In this case they are nodes 2, 6 and 7.
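The leaf nodes can also be picked out of the frame in code rather than by eye. This is a minimal sketch; the names is.leaf and leaf.nodes are illustrative and not part of rpart.

```r
library(rpart)

model <- rpart(Species ~ ., data = iris)

# Rows of model$frame whose split variable is "<leaf>" are terminal nodes;
# the row names of the frame hold the node numbers.
is.leaf <- model$frame$var == "<leaf>"
leaf.nodes <- as.integer(row.names(model$frame)[is.leaf])
print(leaf.nodes)
```

For the tree shown above this yields the node numbers 2, 6 and 7.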

As mentioned above, you can use the path.rpart function to extract a rule (this technique is used in the rattle package and in the article Sharma Credit Score).

Additionally, the model keeps the levels of the predicted value in

predicted.levels <- attr(model, "ylevels")

The values in the yval column of the model$frame data frame index into these levels.

For the leaf with node number 7 (row number 5), the predicted value is

> predicted.levels[model$frame[5, ]$yval]
[1] "virginica"
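The same lookup can be done by node number instead of row position, which avoids counting rows in the frame. A small sketch, assuming the default tree fitted on iris; the variables node and predicted are illustrative names:

```r
library(rpart)

model <- rpart(Species ~ ., data = iris)
predicted.levels <- attr(model, "ylevels")

node <- 7
# Index the frame by row name (the node number as a string),
# not by row position.
predicted <- predicted.levels[model$frame[as.character(node), "yval"]]
print(predicted)
```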

And the rule is

> rule <- path.rpart(model, nodes = 7)

 node number: 7 
   root
   Petal.Length>=2.45
   Petal.Width>=1.75

So the rule can be read as

If Petal.Length >= 2.45 AND Petal.Width >= 1.75 THEN Species = Virginica

I know that I can test how many true positives I have for this rule (on a testing data set; I will use the iris data set again) by subsetting the new data set as follows:

> hits <- subset(iris, Petal.Length >= 2.45 & Petal.Width >= 1.75)

and then calculating the confusion matrix

> table(hits$Species, hits$Species == "virginica")

             FALSE TRUE
  setosa         0    0
  versicolor     1    0
  virginica      0   45

(Note: I used the same iris data set for testing.)
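The counts in that table can be reduced to a couple of summary numbers for the rule. The sketch below computes the share of matched rows that really are virginica and the share of all virginica rows that the rule captures; the names covered, correct, precision and recall are illustrative choices, not something taken from rpart.

```r
# Rows of the test data that satisfy the rule for node 7
hits <- subset(iris, Petal.Length >= 2.45 & Petal.Width >= 1.75)

covered <- nrow(hits)                        # rows matched by the rule
correct <- sum(hits$Species == "virginica")  # true positives

precision <- correct / covered                          # 45 of 46 matched rows
recall <- correct / sum(iris$Species == "virginica")    # 45 of 50 virginica rows
print(c(precision = precision, recall = recall))
```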

How could I evaluate the rule programmatically? I can extract the conditions from the rule as follows:

> unlist(rule, use.names = FALSE)[-1]
[1] "Petal.Length>=2.45" "Petal.Width>=1.75" 

But how can I continue from here? I cannot use the subset function with these strings.

Thanks in advance

Note: This question was heavily edited for clarity.

Recommended answer

I was able to solve this in the following way.

DISCLAIMER: Obviously there must be better ways of solving this, but this hack works and does what I want... (I am not very proud of it; it is hackish, but it works.)

OK, let's start. Basically, the idea is to use the sqldf package.

If you check the question, the last piece of code puts every piece of the path through the tree into a list. So I will start from there:

        library(sqldf)
        library(stringr)

        # Transform to a character vector, dropping the leading "root" element
        rule.v <- unlist(rule, use.names=FALSE)[-1]
        # Replace the dots in variable names with underscores; sqldf doesn't handle dots in names
        rule.v <- str_replace_all(rule.v, pattern="([a-zA-Z])\\.([a-zA-Z])", replacement="\\1_\\2")
        # Turn each "=" that follows a letter or digit into " in ('" (">=" is left untouched)
        rule.v <- str_replace_all(rule.v, pattern="([a-zA-Z0-9])=", replacement="\\1 in ('")
        # Wrap the elements in each list of values in single quotes
        # The last element can't be closed this way (any ideas?)
        rule.v <- str_replace_all(rule.v, pattern=",", replacement="','")

        # Close the last element with an apostrophe and a ")"
        for (i in which(!is.na(str_extract(pattern="in", string=rule.v)))) {
          rule.v[i] <- paste(append(rule.v[i], "')"), collapse="")
        }

        # Collapse the whole vector into one string joined by " AND "
        rule.v <- paste(rule.v, collapse = " AND ")

        # Generate the query
        # Use any metric that you can get from the data frame
        query <- paste("SELECT Species, count(Species) FROM iris WHERE ", rule.v, " group by Species", sep="")

        # For debug only...
        print(query)

        # Execute and print the results
        print(sqldf(query))

And that's all!

I warned you, it was hackish...
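For rules made only of numeric conditions, one possible alternative that stays in base R is to parse the extracted strings as an R expression and evaluate it against the data frame. This is only a sketch under assumptions: it relies on the conditions being valid R as printed (true for strings such as "Petal.Length>=2.45", but not for categorical splits, which path.rpart prints as "var=a,b"), and the thresholds are the printed, possibly rounded, values rather than the exact split points. The names conditions, rule.expr and matches are illustrative.

```r
library(rpart)

model <- rpart(Species ~ ., data = iris)

# print.it = FALSE suppresses the console listing; the result is a list
# with one character vector per requested node.
rule <- path.rpart(model, nodes = 7, print.it = FALSE)
conditions <- unlist(rule, use.names = FALSE)[-1]  # drop "root"

# Join the conditions into a single R expression and evaluate it with
# the columns of the data frame in scope.
rule.expr <- parse(text = paste(conditions, collapse = " & "))
matches <- eval(rule.expr, envir = iris)

hits <- iris[matches, ]
print(table(hits$Species))
```

Evaluating parsed text is only reasonable here because the strings come from the fitted model itself; it should not be applied to untrusted input.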

Hopefully this helps someone else...

Thanks for all the help and suggestions!
