R:具有随机分割的递归树算法 [英] R: recursive tree algorithm with a random split
问题描述
我对编写递归二叉树算法感兴趣。给定以下数据,我已经在其中对协变量 x
I'm interested in writing up a recursive binary tree algorithm. Given the following data where I've already sorted the covariate x
mydata <- data.frame(x = c(10, 20, 25, 35), y = c(-10.5, 6.5, 7.5, -7.5))
> mydata
x y
1 10 -10.5
2 20 6.5
3 25 7.5
4 35 -7.5
假设我的最后一棵树看起来像这样:
Suppose my final tree looks something like this:
[-10.5, 6.5, 7.5, -7.5]
/ \
[-10.5] [6.5, 7.5, -7.5]
/ \
[6.5, 7.5] [ -7.5]
我希望函数的最终输出返回包含所有节点的列表:
I want the final output of my function to return a list that contains all of the nodes:
> final_tree
[[1]]
[[1]][[1]]
x y
1 10 -10.5
2 20 6.5
3 25 7.5
4 35 -7.5
[[2]]
[[2]][[1]]
x y
1 10 -10.5
[[2]][[2]]
x y
1 20 6.5
2 25 7.5
3 35 -7.5
[[3]]
[[3]][[1]]
NULL
[[3]][[2]]
NULL
[[3]][[3]]
x y
1 20 6.5
2 25 7.5
[[3]][[4]]
x y
1 35 -7.5
我正在使用 best_split_ind
在每个节点上随机拆分树。如果 best_split_ind = 1
,则意味着 node_parent
中的第一个实例将以 node_left
,其余部分以 node_right
结尾。如果 best_split_ind = 3
,则这意味着 node_parent
中的前三个实例将以<$ c $结尾c> node_left ,剩下的最后是 node_right
。
I am splitting my tree at every node with a random split by using best_split_ind
. If best_split_ind = 1
, then that means the 1st instance in the node_parent
will end up in the node_left
, and the rest end up in node_right
. If best_split_ind = 3
, then that means the first three instances in the node_parent
will end up in the node_left
, and the rest end up in node_right
.
这就是我的意思到目前为止:
Here's what I have so far:
# Initialize empty tree
create_empty_tree <- function(max_height) sapply(1:max_height, function(k) replicate(2**(k-1),c()))
# Create empty tree with max_height = 3
tree_struc <- create_empty_tree(max_height = 3)
grow_tree <- function(node_parent, max_height, tree_struc, height){
# Sort x
sorted_x <- sort(node_parent$x)
# Determine best split
best_split_ind <- sample(1:(nrow(node_parent) - 1), 1)
# Assign instances to left or right nodes
group <- ifelse(node_parent$x <= node_parent$x[best_split_ind], "left", "right")
node_left <- node_parent[which(group == "left"), ]
node_right <- node_parent[which(group == "right"), ]
# Recursive call on left and right nodes
if(height < max_height){
tree_struc[[height]] <- node_parent
tree_struc[[height + 1]][[1]] <- grow_tree(node_parent = node_left, max_height = max_height, tree_struc = tree_struc, height = height + 1)
tree_struc[[height + 1]][[2]] <- grow_tree(node_parent = node_right, max_height = max_height, tree_struc = tree_struc, height = height + 1)
}
return(tree_struc)
}
grow_tree(node_parent = mydata, max_height = 3, tree_struc = tree_struc, height = 1)
结果树不正确。我认为这与我在左侧和右侧子节点上递归调用函数的方式有关。有人可以指出我正确的方向吗?
The resulting tree is not correct. I think it has to do with how I recursively called the function on the left and right child nodes. Can anyone point me in the right direction?
推荐答案
我可能误会了您,但是您可以通过使用以下方法在这里简化很多两个互相递归调用的函数。
I may have misunderstood you, but you can simplify quite a bit here by using two functions that call each other recursively. There's no need to set up an initial container.
第一个函数是我们甚至不需要手动调用的函数,但将从我们的<$内部调用c $ c> grow_tree 函数。它只是检查它是否尚未达到最大树深以及是否有足够的元素要拆分。如果是这样,它将在其内容上调用 grow_tree
。否则,它返回其内容不变:
The first function is one that we don't even need to call manually, but will be called from inside our grow_tree
function. It simply checks that it has not reached the maximum tree depth and that there are enough elements left to split. If so, it calls grow_tree
on its contents. Otherwise, it returns its contents unchanged:
conditional_split <- function(df, depth, max_depth)
{
if(nrow(df) == 1 | depth == max_depth) return(df)
else grow_tree(df, depth + 1, max_depth)
}
我们的主函数可以安全地拆分给定的数据帧并递归调用 conditional_split
和 lapply
:
Our main function can then safely split the given data frame and recursively call conditional_split
with lapply
:
grow_tree <- function(df, depth = 1, max_depth = 3)
{
break_at <- sample(nrow(df) - 1, 1)
branched <- list(left = df[1:break_at,], right = df[-seq(break_at),])
lapply(branched, conditional_split, depth, max_depth)
}
我认为这符合您的需求:
I think this does what you're looking for:
grow_tree(mydata, max_depth = 3)
#> $left
#> x y
#> 1 10 -10.5
#>
#> $right
#> $right$left
#> $right$left$left
#> x y
#> 2 20 6.5
#>
#> $right$left$right
#> x y
#> 3 25 7.5
#>
#>
#> $right$right
#> x y
#> 4 35 -7.5
您可以像以下那样轻松地更改最大树深度:
And you can change the maximum tree depth as easily as:
grow_tree(mydata, max_depth = 2)
#> $left
#> $left$left
#> x y
#> 1 10 -10.5
#>
#> $left$right
#> x y
#> 2 20 6.5
#> 3 25 7.5
#>
#>
#> $right
#> x y
#> 4 35 -7.5
这篇关于R:具有随机分割的递归树算法的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!