R Rolling Random Forest for Variable Selection

Question

I have a daily OHLC dataset of the Euro Stoxx 50 index since 2008, which looks like this:

              Open    High     Low   Close Volume Adjusted
2008-01-02 4393.53 4411.59 4330.73 4339.23      0  4339.23
2008-01-03 4335.91 4344.36 4312.34 4333.42      0  4333.42
2008-01-04 4331.25 4343.46 4253.69 4270.53      0  4270.53
2008-01-07 4268.43 4294.45 4257.22 4283.37      0  4283.37
2008-01-08 4292.40 4330.56 4292.40 4295.23      0  4295.23
2008-01-09 4285.34 4285.34 4246.92 4258.32      0  4258.32

I've computed several technical rules using the TTR package, which gives me a larger dataset like this:

               RSI2     RSI3     RSI4     RSI5    RSI10    RSI20     SMA5    SMA20    SMA60     EMA5    EMA20    EMA60      atr      SMI
2009-01-07 97.964071 92.62210 87.21605 82.40040 66.95642 55.19221 19720.64 18655.29 17758.68 2556.777 2556.777 2556.777 82.06602 27.52145
2009-01-08 43.766573 58.62387 62.97794 64.03382 60.23197 52.99739 19756.44 18666.60 17754.07 2566.499 2566.499 2566.499 80.33416 29.12141
2009-01-09 27.182247 44.97072 52.29336 55.50633 56.74068 51.80171 19776.92 18674.31 17750.34 2523.372 2523.372 2523.372 78.65886 29.37878
2009-01-12 13.371347 30.46561 39.97055 45.24210 52.16207 50.17764 19788.02 18683.05 17748.76 2524.466 2524.466 2524.466 78.58966 28.17871
2009-01-13  6.141462 19.52298 29.30404 35.68593 47.25383 48.32987 19772.25 18693.01 17749.35 2488.165 2488.165 2488.165 76.08326 25.34705
2009-01-14  2.712386 11.97834 20.69541 27.26891 42.10718 46.23469 19747.87 18694.16 17742.88 2449.353 2449.353 2449.353 75.42231 20.65686

I would like to know, for each quarter, which technical rules are the most significant. I've decided to use the Random Forest-RI algorithm implemented in the randomForest package, compute the Breiman importance measure (via the importance function), and select the technical rules whose variable importance is greater than the mean over the quarterly sample. Eventually, I would like to get the reduced dataset of technical rules for the whole period in order to compute statistics and so on.

Given that the number of significant technical rules can vary over time, the dimensions of the array that contains the most significant technical rules are not the same from one quarter to another. As a consequence, I can't put all my values in a single object.

Is there a convenient way to store all my quarterly datasets?

Thanks.

Recommended answer

Use a data frame or an xts object. They both cope well with varying numbers of columns. In your case, as all your data columns are numeric, you can use an xts object.

You said "rolling" in your title. Did you mean you want to analyze overlapping 90-day periods? E.g. 2008-01-02 to 2008-04-02, then 2008-01-03 to 2008-04-03, and so on? If so, rollapply(data, width=90, FUN) can be used. If you want to deal with quarters one at a time, use quarters <- split(data, 'quarters') and then (as that gives you a list of xts objects) lapply(quarters, FUN).
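The two options above can be sketched as follows. This is a minimal illustration on synthetic data, not the asker's real table: the column names, the random values, and the placeholder functions passed as FUN are assumptions made for the example.

```r
library(xts)  # also attaches zoo, which provides rollapply() and coredata()

# Synthetic stand-in for the indicator table: two columns, one row per day of 2008.
set.seed(1)
dates <- seq(as.Date("2008-01-01"), as.Date("2008-12-31"), by = "day")
data  <- xts(matrix(rnorm(2 * length(dates)), ncol = 2,
                    dimnames = list(NULL, c("RSI2", "SMA5"))),
             order.by = dates)

# Non-overlapping quarters: split() returns a plain list of xts objects.
quarters <- split(data, "quarters")

# Placeholder FUN: just report how many rows each quarter holds.
rows_per_quarter <- sapply(quarters, nrow)

# Overlapping windows of 90 observations. by.column = FALSE hands the whole
# window (all columns) to FUN at once instead of one column at a time.
rolling_mean <- rollapply(data, width = 90,
                          FUN = function(w) mean(coredata(w)[, "RSI2"]),
                          by.column = FALSE, align = "right")
```

Note that width counts observations, not calendar days, so on trading-day data a 90-row window spans more than 90 calendar days.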

I think your issue with using a single data structure was that SMA5 is available from 2008-01-08, but SMA200 is not available until almost the end of the year, meaning that for the first three quarters the SMA200 column will contain nothing but NAs. This is fine. Keep the NAs and deal with them just before you pass the data to randomForest.

In FUN you can remove the columns that contain any NA like this (where xq is an xts object containing data for just one quarter):

xq <- xq[, !apply(is.na(xq), 2, any)]
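As a quick check of what that filter does, here is a small sketch on a toy numeric matrix (the values are made up; the same column indexing applies to an xts object). It uses drop = FALSE, which is an addition not in the original line, so the result stays two-dimensional even if only one column survives.

```r
# Toy quarter: SMA200 is entirely NA, the other two columns are complete.
xq <- cbind(SMA5   = c(1, 2, 3),
            SMA200 = c(NA, NA, NA),
            RSI2   = c(50, 60, 70))

# TRUE for every column that contains no NA at all.
keep <- !apply(is.na(xq), 2, any)

# Keep only the complete columns.
xq_clean <- xq[, keep, drop = FALSE]
colnames(xq_clean)
```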

UPDATE: After re-reading your question and your follow-up question, I think the above answers a question you didn't ask! I thought the issue was having NAs in your TTR table, and that randomForest does not like them.

On reflection, I think your actual question was "randomForest gives me a varying number of good indicators from its analysis of each quarter; how do I deal with that?" The answer is a ragged data structure: a list, with one list entry per quarter. The list entry itself can be anything, even an xts object, but in this case a simple character vector of indicator names seems perfect. This is shown nicely in Zach's answer to your other question.
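A minimal sketch of that list-per-quarter idea, under stated assumptions: the data is synthetic, the target column ret is hypothetical (the question does not say what the forest is predicting), and select_rules is an illustrative helper name, not something from the original posts. It uses the permutation importance returned by importance(fit, type = 1) and keeps the indicators scoring above the quarterly mean.

```r
library(xts)
library(randomForest)

set.seed(42)
dates <- seq(as.Date("2009-01-01"), as.Date("2009-06-30"), by = "day")
n <- length(dates)

# Synthetic indicators; the names only mimic the question's table.
ind <- matrix(rnorm(n * 4), ncol = 4,
              dimnames = list(NULL, c("RSI2", "RSI5", "SMA5", "atr")))

# Hypothetical target driven by RSI2 plus a little noise (an assumption).
ret  <- 2 * ind[, "RSI2"] + rnorm(n, sd = 0.1)
data <- xts(cbind(ind, ret = ret), order.by = dates)

# Illustrative helper: names of the indicators whose importance exceeds
# the mean importance within one quarter.
select_rules <- function(xq) {
  xq  <- xq[, !apply(is.na(xq), 2, any)]        # drop columns containing NA
  m   <- coredata(xq)
  x   <- m[, colnames(m) != "ret", drop = FALSE]
  fit <- randomForest(x = x, y = m[, "ret"], importance = TRUE)
  imp <- importance(fit, type = 1)[, 1]         # permutation importance
  names(imp)[imp > mean(imp)]
}

# One list entry per quarter; entries may have different lengths.
quarters <- split(data, "quarters")
selected <- lapply(quarters, select_rules)
names(selected) <- sapply(quarters, function(q) format(as.yearqtr(end(q))))
```

Because each entry is just a character vector, the reduced dataset for a given quarter can be recovered later by indexing that quarter's xts object with the stored names.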
