R - caret createDataPartition returns more samples than expected


Question

I'm trying to split the iris dataset into a training set and a test set. I used createDataPartition() like this:

library(caret)
createDataPartition(iris$Species, p=0.1)
# [1]  12  22  26  41  42  57  63  79  89  93 114 117 134 137 142

createDataPartition(iris$Sepal.Length, p=0.1)
# [1]   1  27  44  46  54  68  72  77  83  84  93  99 104 109 117 132 134

I understand the first result: a vector of 0.1 * 150 = 15 elements (150 is the number of samples in the dataset). I expected the second call to return a vector of the same length, but it returns 17 elements instead of 15.

Any ideas as to why I get these results?

Recommended answer

Sepal.Length is a numeric feature; from the online documentation:

For numeric y, the sample is split into groups sections based on percentiles and sampling is done within these subgroups. For createDataPartition, the number of percentiles is set via the groups argument.

groups: for numeric y, the number of breaks in the quantiles

with default value:

groups = min(5, length(y))

Here is what happens in your case:

Since you do not specify groups, it takes the value min(5, 150) = 5 breaks. In that case, these breaks coincide with the natural quartiles, i.e. the minimum, the 1st quartile, the median, the 3rd quartile, and the maximum, which you can see from the summary:

> summary(iris$Sepal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.300   5.100   5.800   5.843   6.400   7.900 
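
The same breaks can be computed directly rather than read off the summary. The following is a minimal base-R sketch; it assumes (based on the documentation quoted above, not on a guarantee about caret's internals) that the breaks are the quantiles at equally spaced probabilities between 0 and 1:

```r
# Assumption: breaks are quantiles at equally spaced probabilities.
y <- iris$Sepal.Length
groups <- min(5, length(y))   # the default value quoted above
breaks <- quantile(y, probs = seq(0, 1, length.out = groups))
print(breaks)
```

Five breaks delimit four intervals, which is why the counts that follow are split into four bins.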

For numeric features, the function samples a fraction p = 0.1 from each one of the (4) intervals defined by the above breaks (quantiles); let's see how many samples fall in each interval:

l1 = length(which(iris$Sepal.Length >= 4.3 & iris$Sepal.Length <= 5.1)) # 41
l2 = length(which(iris$Sepal.Length > 5.1 & iris$Sepal.Length <= 5.8))  # 39
l3 = length(which(iris$Sepal.Length > 5.8 & iris$Sepal.Length <= 6.4))  # 35
l4 = length(which(iris$Sepal.Length > 6.4 & iris$Sepal.Length <= 7.9))  # 35
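
As a cross-check, the same four counts can be obtained in one step with cut() and table(). This is only a sketch in base R; the break values are typed in from the summary output above, and include.lowest = TRUE is used so that the minimum (4.3) lands in the first interval, mirroring the `>=` in l1:

```r
y <- iris$Sepal.Length
breaks <- c(4.3, 5.1, 5.8, 6.4, 7.9)   # min, quartiles, max from summary()
# Intervals are right-closed; include.lowest = TRUE also closes the first one on the left.
bins <- cut(y, breaks = breaks, include.lowest = TRUE)
counts <- table(bins)
print(counts)   # counts per interval: 41, 39, 35, 35
```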

Exactly how many samples will be returned from each interval? Here is the catch: according to line # 140 of the source code, it will be the ceiling of the product of the number of samples in the interval and your p. Let's see what this gives in your case for p = 0.1:

p = 0.1
ceiling(l1*p) + ceiling(l2*p) + ceiling(l3*p) + ceiling(l4*p)
# 17

Bingo! :)
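
The same arithmetic also seems to explain why the Species call returns exactly 15: each of the three classes has 50 rows, and ceiling(50 * 0.1) = 5 per class. The sketch below wraps the rule in a small helper. Note that expected_partition_size is a hypothetical function written for illustration, not part of caret, and it assumes that every group (a class for factors, a quantile bin for numeric outcomes) contributes ceiling(n_group * p) samples:

```r
# Hypothetical helper (not part of caret): predicts how many indices a
# stratified split returns if each group contributes ceiling(n_group * p).
expected_partition_size <- function(y, p, groups = min(5, length(y))) {
  if (is.numeric(y)) {
    # Assumption: numeric outcomes are binned at equally spaced quantiles.
    breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups)))
    y <- cut(y, breaks = breaks, include.lowest = TRUE)
  }
  sum(ceiling(table(y) * p))
}

size_species <- expected_partition_size(iris$Species, p = 0.1)
size_sepal <- expected_partition_size(iris$Sepal.Length, p = 0.1)
print(c(species = size_species, sepal_length = size_sepal))
```

Under these assumptions the helper should reproduce the 15 and 17 observed in the question. Because of the rounding up, the returned sample can only be larger than length(y) * p, never smaller.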
