How does the createDataPartition function from the caret package split data?
Question
From the documentation:
For bootstrap samples, simple random sampling is used.
For other data splitting, the random sampling is done within the levels of y when y is a factor in an attempt to balance the class distributions within the splits.
For numeric y, the sample is split into groups based on percentiles and sampling is done within these subgroups.
For createDataPartition, the number of percentiles is set via the groups argument.
I don't understand why this "balance" thing is needed. I think I understand it superficially, but any additional insight would be really helpful.
Recommended answer
It means that if you have a data set ds with 10000 rows
set.seed(42)
ds <- data.frame(values = runif(10000))
with two "classes" that are unevenly distributed (9000 vs 1000)
ds$class <- c(rep(1, 9000), rep(2, 1000))
ds$class <- as.factor(ds$class)
table(ds$class)
#    1    2
# 9000 1000
you can create a sample that tries to maintain the ratio ("balance") of the factor classes:
library(caret)  # provides createDataPartition
dpart <- createDataPartition(ds$class, p = 0.1, list = FALSE)
dsDP <- ds[dpart, ]
table(dsDP$class)
#   1   2
# 900 100
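As for why the balance matters: with simple random sampling the class counts in the sample drift from draw to draw, and a rare class can end up under-represented in one of the splits. Sampling within each level of the factor keeps the proportions fixed. The following is a minimal base-R sketch of that idea, not caret's actual implementation; stratified_sample is a hypothetical helper written only for this illustration.

```r
# Base-R sketch: simple random sampling vs. sampling within each class.
# This only illustrates the idea; it is NOT how caret is implemented.
set.seed(42)

# Same layout as above: 9000 rows of class 1, 1000 rows of class 2
ds <- data.frame(values = runif(10000))
ds$class <- factor(c(rep(1, 9000), rep(2, 1000)))

p <- 0.1

# Simple random sampling: class counts vary from draw to draw
simple_idx <- sample(nrow(ds), size = round(p * nrow(ds)))
simple_counts <- table(ds$class[simple_idx])

# Stratified sampling: draw a fraction p inside each level of the factor
# (stratified_sample is a hypothetical helper, not a caret function)
stratified_sample <- function(y, p) {
  idx_by_level <- split(seq_along(y), y)
  picked <- lapply(idx_by_level, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

strat_idx <- stratified_sample(ds$class, p)
strat_counts <- table(ds$class[strat_idx])

print(simple_counts)  # near 900 / 100, but not guaranteed to match
print(strat_counts)   # 900 / 100 by construction
```

With a 9:1 split and 10000 rows the difference is small, but with fewer rows or a rarer class the simple random sample can deviate noticeably from the original ratio, which is what the within-level sampling is meant to avoid.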