Grouping rows from an R dataframe together when randomly assigning to training/testing datasets

Views: 129 | Published: 2020/10/11 20:09:01 | Tags: r, sampling, cross-validation


Problem description

I have a dataframe that consists of blocks of X rows, each block corresponding to a single individual (where X can be different for each individual). I'd like to randomly distribute these individuals into train, test and validation samples, but so far I haven't been able to get the syntax right to ensure that all of a user's X rows always end up in the same subsample.

For example, the data can be simplified to look like:

user    feature1     feature2
 1        "A"           "B"
 1        "L"           "L"
 1        "Q"           "B"
 1        "D"           "M"
 1        "D"           "M"
 1        "P"           "E"
 2        "A"           "B"
 2        "R"           "P"
 2        "A"           "F"
 3        "X"           "U"
...       ...           ...

If I then randomly assign the users to a train, test or validation set, all of the rows for a given user (the user number is unique) should be in the same set and stay grouped together. For example, if user 1 were in the training set, the format would still be:

user    feature1     feature2
 1        "A"           "B"
 1        "L"           "L"
 1        "Q"           "B"
 1        "D"           "M"
 1        "D"           "M"
 1        "P"           "E"

As a bonus, I'd love to know whether the solution could be extended to do k-fold cross-validation, but so far I haven't even figured out this simpler first step.

Thanks in advance.

Recommended answer

We can first create an index that assigns each user to one of the sets. I chose weights of .6 for train, .4 for test and .1 for validation (the order matches the names assigned in the last line of the code), but you can choose the ratio you need with the prob= argument of sample. These weights sum to 1.1, and sample rescales them to sum to 1, so the expected shares are roughly 55%, 36% and 9%. Because each user is drawn independently, the realized set sizes will vary around those values. Then we split the data frame by user. Lastly, we rbind the users' pieces according to the index we created. We can then call all_dfs[['train']] and so on:

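# Draw one set label (1 = train, 2 = test, 3 = validation) per unique user
indx <- sample(1:3, length(unique(df$user)), replace=TRUE, prob=c(.6,.4,.1))
# Split the data frame into a list with one piece per user
s <- split(df, df$user)
# Stack the pieces of all users that share a label into one data frame per set
all_dfs <- lapply(1:3, function(x) do.call(rbind, s[indx==x]))
names(all_dfs) <- c('train', 'test', 'validation')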
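
The answer above does not cover the k-fold part of the question. One possible way to extend the same idea is to assign fold labels at the user level instead of set labels. The following is only a sketch under stated assumptions: the toy data, the choice k = 3, and the names fold_of_user, fold and folds are all made up for illustration and do not come from the original answer.

```r
# Toy data in the same shape as the question (values are illustrative only)
set.seed(42)  # makes the random fold assignment reproducible
df <- data.frame(
  user     = c(1, 1, 1, 2, 2, 3, 4, 4, 5, 6),
  feature1 = c("A", "L", "Q", "A", "R", "X", "D", "P", "A", "R"),
  feature2 = c("B", "L", "B", "B", "P", "U", "M", "E", "F", "P")
)

k <- 3  # number of folds (hypothetical choice)

# One fold label per unique user: recycle 1..k across the users, then shuffle
users <- unique(df$user)
fold_of_user <- sample(rep(seq_len(k), length.out = length(users)))

# Give every row the fold of its user, so a user's rows never split across folds
df$fold <- fold_of_user[match(df$user, users)]

# For fold i, hold out that fold's users for testing and train on the rest
folds <- lapply(seq_len(k), function(i) {
  list(train = df[df$fold != i, ], test = df[df$fold == i, ])
})
```

With this layout, folds[[i]]$train and folds[[i]]$test give the two data frames for the i-th round. Recycling the labels before shuffling keeps the number of users per fold as even as possible, although the number of rows per fold can still differ when users have different numbers of rows.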
