How does the createDataPartition function from the caret package split data?


Question

From the documentation:


For bootstrap samples, simple random sampling is used.

For other data splitting, the random sampling is done within the levels of y when y is a factor in an attempt to balance the class distributions within the splits.

For numeric y, the sample is split into groups sections based on percentiles and sampling is done within these subgroups.

For createDataPartition, the number of percentiles is set via the groups argument.

I don't understand why this "balance" thing is needed. I think I understand it superficially, but any additional insight would be really helpful.

Recommended answer

It means that if you have a data set ds with 10000 rows

set.seed(42)
ds <- data.frame(values = runif(10000))

with 2 "classes" that have an unequal distribution (9000 vs 1000)

ds$class <- c(rep(1, 9000), rep(2, 1000))
ds$class <- as.factor(ds$class)
table(ds$class)
#    1    2 
# 9000 1000 

you can create a sample that tries to maintain the ratio / "balance" of the factor classes.

library(caret)  # provides createDataPartition()
dpart <- createDataPartition(ds$class, p = 0.1, list = FALSE)
dsDP <- ds[dpart, ]
table(dsDP$class)
#   1   2 
# 900 100 
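
To see why the balancing matters, it may help to compare a plain random sample with a sample drawn separately within each class. The sketch below uses base R only and is not caret's actual implementation; the names srs_idx, idx_by_class and strat_idx are made up for illustration.

```r
# Rebuild the example data from the answer (base R only, no caret needed).
set.seed(42)
ds <- data.frame(values = runif(10000))
ds$class <- factor(c(rep(1, 9000), rep(2, 1000)))

# Simple random sample of 10% of the rows: the class counts are left to chance,
# so the 9:1 ratio is only matched approximately.
srs_idx <- sample(nrow(ds), size = round(0.1 * nrow(ds)))
table(ds$class[srs_idx])

# Stratified sample: draw 10% of the row indices within each class level.
idx_by_class <- split(seq_len(nrow(ds)), ds$class)
strat_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.1 * length(idx)))
}), use.names = FALSE)

# The class ratio of the full data is reproduced exactly: 900 vs 100.
table(ds$class[strat_idx])
```

With a large data set the simple random sample usually lands close to 900 / 100 anyway. The difference becomes important when a class is rare or the data set is small, because a plain random split can then leave a class badly underrepresented (or missing) in the training or test set.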

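For a numeric y, the documentation quoted in the question describes the same idea applied to percentile groups. A rough base R sketch of that idea follows, assuming a made-up skewed outcome y and 5 groups; n_groups stands in for the groups argument, and the code only approximates the behaviour described, it is not taken from caret.

```r
set.seed(42)
y <- rexp(10000)   # a skewed numeric outcome, invented for this illustration
n_groups <- 5      # plays the role of the `groups` argument

# Cut y into bins at its percentiles, so each bin holds about the same number of rows.
breaks <- quantile(y, probs = seq(0, 1, length.out = n_groups + 1))
bins <- cut(y, breaks = breaks, include.lowest = TRUE)

# Sample 10% of the row indices within each percentile bin.
idx_by_bin <- split(seq_along(y), bins)
part_idx <- unlist(lapply(idx_by_bin, function(idx) {
  sample(idx, size = round(0.1 * length(idx)))
}), use.names = FALSE)

# Every bin contributes about the same number of rows, so the sample
# covers the whole range of y, including the sparse upper tail.
table(bins[part_idx])
summary(y[part_idx])
```

The effect is that the distribution of y in the sampled rows resembles the distribution in the full data, which is the numeric counterpart of keeping the class ratio for a factor.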