R - caret createDataPartition returns more samples than expected

Question

I'm trying to split the iris dataset into a training set and a test set. I used createDataPartition() like this:

library(caret)
createDataPartition(iris$Species, p=0.1)
# [1]  12  22  26  41  42  57  63  79  89  93 114 117 134 137 142

createDataPartition(iris$Sepal.Length, p=0.1)
# [1]   1  27  44  46  54  68  72  77  83  84  93  99 104 109 117 132 134

I understand the first result: a vector of 0.1 * 150 = 15 elements (150 is the number of samples in the dataset). I expected a vector of the same length from the second call, but I get 17 elements instead of 15.

Any ideas as to why I get these results?

Recommended answer

Sepal.Length is a numeric feature; from the online documentation:


For numeric y, the sample is split into groups sections based on percentiles and sampling is done within these subgroups. For createDataPartition, the number of percentiles is set via the groups argument.

groups: for numeric y, the number of breaks in the quantiles

with a default value of:

groups = min(5, length(y))

Here is what happens in your case:

Since you do not specify groups, it takes the value min(5, 150) = 5 breaks. In this case the breaks coincide with the quartiles, i.e. the minimum, the 1st quartile, the median, the 3rd quartile, and the maximum, which you can see from the summary:

> summary(iris$Sepal.Length)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.300   5.100   5.800   5.843   6.400   7.900 
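As a minimal sketch of where these breaks come from, the snippet below reproduces them in base R. It assumes the breaks are quantiles taken at evenly spaced probabilities, one per group; that is an assumption about the internals for illustration, not a quote from the caret source.

```r
# Assumption: the breaks are quantiles at `groups` evenly spaced probabilities.
y <- iris$Sepal.Length
groups <- min(5, length(y))               # default value of the groups argument
probs <- seq(0, 1, length.out = groups)   # 0, 0.25, 0.5, 0.75, 1
breaks <- quantile(y, probs = probs)
print(breaks)
```

The printed values should match the Min., 1st Qu., Median, 3rd Qu. and Max. columns of the summary above.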

For numeric features, the function takes a fraction p = 0.1 of the samples from each of the 4 intervals defined by the above breaks (quantiles); let's see how many samples fall in each interval:

l1 = length(which(iris$Sepal.Length >= 4.3 & iris$Sepal.Length <= 5.1)) # 41
l2 = length(which(iris$Sepal.Length > 5.1 & iris$Sepal.Length <= 5.8))  # 39
l3 = length(which(iris$Sepal.Length > 5.8 & iris$Sepal.Length <= 6.4))  # 35
l4 = length(which(iris$Sepal.Length > 6.4 & iris$Sepal.Length <= 7.9))  # 35
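The four counts can also be obtained in one step with base R's cut() and table(). This sketch assumes intervals closed on the right with the minimum included in the first one, which mirrors the comparisons written out above.

```r
y <- iris$Sepal.Length
breaks <- quantile(y, probs = seq(0, 1, length.out = 5))
# include.lowest = TRUE keeps the minimum (4.3) inside the first interval
bins <- cut(y, breaks = unique(breaks), include.lowest = TRUE)
counts <- table(bins)
print(counts)
```

The table should show 41, 39, 35 and 35 samples, the same values as l1 to l4.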

Exactly how many samples will be returned from each interval? Here is the catch: according to line # 140 of the source code, it is the ceiling of the product of the number of samples in the interval and your p. Let's see what this gives in your case for p = 0.1:

p = 0.1
ceiling(l1*p) + ceiling(l2*p) + ceiling(l3*p) + ceiling(l4*p)
# 17

Bingo! :)
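The reasoning above can be wrapped into a small function that predicts the size of the returned vector for both calls in the question. Note that expected_partition_size is a hypothetical helper written for illustration and is not part of caret. It assumes that numeric y is binned on quantile breaks, that a factor y is stratified by its levels, and that each stratum contributes ceiling(n * p) rows.

```r
# Hypothetical helper (not part of caret): predicts how many indices a
# stratified split returns when each stratum contributes ceiling(n * p) rows.
expected_partition_size <- function(y, p, groups = min(5, length(y))) {
  if (is.numeric(y)) {
    # Assumption: numeric y is binned on quantile breaks before sampling
    breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups)))
    y <- cut(y, breaks = breaks, include.lowest = TRUE)
  }
  counts <- table(y)
  counts <- counts[counts > 0]   # ignore empty strata
  sum(ceiling(counts * p))
}

size_species <- expected_partition_size(iris$Species, p = 0.1)       # 3 classes of 50 rows
size_sepal   <- expected_partition_size(iris$Sepal.Length, p = 0.1)  # 4 quantile bins
print(c(size_species, size_sepal))
```

For Species each of the three classes gives ceiling(50 * 0.1) = 5, so the total is exactly 15. For Sepal.Length the four bins give 5 + 4 + 4 + 4 = 17, so the extra two samples come purely from rounding up within each bin.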
