R - caret createDataPartition returns more samples than expected
Question
I'm trying to split the iris dataset into a training set and a test set. I used createDataPartition() like this:
library(caret)
createDataPartition(iris$Species, p=0.1)
# [1] 12 22 26 41 42 57 63 79 89 93 114 117 134 137 142
createDataPartition(iris$Sepal.Length, p=0.1)
# [1] 1 27 44 46 54 68 72 77 83 84 93 99 104 109 117 132 134
I understand the first result: it is a vector of 0.1 * 150 = 15 elements (150 is the number of samples in the dataset). I expected a vector of the same length from the second call, but I am getting 17 elements instead of 15.
Any ideas as to why I get these results?
Recommended answer
Sepal.Length is a numeric feature; from the online documentation:
For numeric y, the sample is split into groups sections based on percentiles and sampling is done within these subgroups. For createDataPartition, the number of percentiles is set via the groups argument.
groups: for numeric y, the number of breaks in the quantiles
with the default value groups = min(5, length(y)).
Here is what happens in your case:
Since you do not specify groups, it takes a value of min(5, 150) = 5 breaks; in this case, these breaks coincide with the natural quartiles, i.e. the minimum, the 1st quartile, the median, the 3rd quartile, and the maximum, which you can see from the summary:
> summary(iris$Sepal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.300   5.100   5.800   5.843   6.400   7.900
For numeric features, the function takes a proportion p = 0.1 of the samples from each one of the (4) intervals defined by the above breaks (quantiles); let's see how many samples we have per such interval:
l1 = length(which(iris$Sepal.Length >= 4.3 & iris$Sepal.Length <= 5.1)) # 41
l2 = length(which(iris$Sepal.Length > 5.1 & iris$Sepal.Length <= 5.8)) # 39
l3 = length(which(iris$Sepal.Length > 5.8 & iris$Sepal.Length <= 6.4)) # 35
l4 = length(which(iris$Sepal.Length > 6.4 & iris$Sepal.Length <= 7.9)) # 35
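The same counts can be obtained without hard-coding the break values. The following is a minimal sketch of the binning described above, assuming the breaks are the quantiles at 5 equally spaced probabilities and that the lowest value is included in the first interval; it uses only base R and is not taken from caret's source. The names y, breaks, bins and counts are illustrative.

```r
# Sketch of the binning step, using base R only (assumed to mirror
# the description above; not copied from caret).
y <- iris$Sepal.Length

# 5 breaks at probabilities 0, 0.25, 0.5, 0.75, 1
breaks <- unique(quantile(y, probs = seq(0, 1, length.out = 5)))

# Assign every observation to one of the 4 intervals;
# include.lowest = TRUE keeps the minimum (4.3) in the first interval
bins <- cut(y, breaks, include.lowest = TRUE)

counts <- table(bins)
print(counts)  # 41, 39, 35 and 35 observations per interval
```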
Exactly how many samples will be returned from each interval? Here is the catch: according to line # 140 of the source code, it will be the ceiling of the product of the number of samples in the interval and your p; let's see what this gives in your case for p = 0.1:
p = 0.1  # the proportion passed to createDataPartition
ceiling(l1*p) + ceiling(l2*p) + ceiling(l3*p) + ceiling(l4*p)
# 17
Bingo! :)
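To put the whole procedure together, here is a rough base-R sketch of stratified sampling with a per-group ceiling. It assumes the behaviour described in this answer (quartile bins for a numeric y, ceiling(n * p) draws per group) and is not caret's actual implementation; sample_bin, picked and per_class are hypothetical names. The last lines apply the same per-group rule to the factor iris$Species, on the assumption that each class is treated as its own group, which is consistent with the 15 indices seen in the question.

```r
set.seed(42)  # arbitrary seed, only so the drawn indices are reproducible
p <- 0.1
y <- iris$Sepal.Length

# Assumed binning: quartile breaks, as described in the answer
bins <- cut(y, unique(quantile(y, probs = seq(0, 1, length.out = 5))),
            include.lowest = TRUE)

# Draw ceiling(n * p) row indices from each bin (hypothetical helper)
sample_bin <- function(idx) sample(idx, size = ceiling(length(idx) * p))
picked <- unlist(lapply(split(seq_along(y), bins), sample_bin),
                 use.names = FALSE)
length(picked)  # 5 + 4 + 4 + 4 = 17

# Same rule applied per class of the factor: 3 classes of 50 rows each
per_class <- ceiling(as.vector(table(iris$Species)) * p)
sum(per_class)  # 5 + 5 + 5 = 15
```

Because the rounding is applied per group rather than once to the total, the result can exceed ceiling(length(y) * p) by up to one sample for every group after the first.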