Partitioning a data set in R based on multiple classes of observations

Views: 83 | Published: 2020/7/4 0:39:01 | Tags: r random partitioning

Question

I'm trying to partition a data set that I have in R, 2/3 for training and 1/3 for testing. I have one classification variable, and seven numerical variables. Each observation is classified as either A, B, C, or D.

For simplicity's sake, let's say that the classification variable, cl, is A for the first 100 observations, B for observations 101 to 200, C till 300, and D till 400. I'm trying to get a partition that has 2/3 of the observations for each of A, B, C, and D (as opposed to simply getting 2/3 of the observations for the entire data set since it will likely not have equal amounts of each classification).

When I try to sample from a subset of the data, such as sample(subset(data, cl=='A')), the columns are reordered instead of the rows.
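The column reordering follows from how R stores data frames: a data frame is a list of columns, so `sample()` permutes the list elements (the columns) rather than the rows. The sketch below illustrates this on a small made-up data frame (`df`, `v1`, and `v2` are hypothetical names, not from the question) and shows row-wise shuffling by sampling row indices instead.

```r
set.seed(1)  # assumption: fixed seed only so the illustration is repeatable

# small hypothetical data frame: one class column, two numeric columns
df <- data.frame(
    cl = rep(c("A", "B"), each = 3),
    v1 = 1:6,
    v2 = 7:12
)

# sample() sees a list of 3 columns and permutes those,
# leaving the row order untouched
shuffled_cols <- sample(df)

# to shuffle rows, sample the row indices and use them to subset
shuffled_rows <- df[sample(nrow(df)), ]

# same idea restricted to one class: pick 2 random rows of class A
a_rows <- which(df$cl == "A")
a_sample <- df[a_rows[sample.int(length(a_rows), 2)], ]
```

Indexing `a_rows` with `sample.int()` rather than calling `sample(a_rows, 2)` directly avoids a known quirk: when given a single number `n`, `sample()` draws from `1:n` instead of from that one value.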

To summarize, my goal is to have 67 random observations from each of A, B, C, and D as my training data, and store the remaining 33 observations for each of A, B, C, and D as testing data. I have found a very similar question to mine, but it did not factor in multiple variables.

Recommended answer

This may be longer, but I think it's more intuitive, and it can be done in base R ;)

# create the data frame you've described
x <-
    data.frame(
        cl = 
            c( 
                rep( 'A' , 100 ) ,
                rep( 'B' , 100 ) ,
                rep( 'C' , 100 ) ,
                rep( 'D' , 100 ) 
            ) ,

        othernum1 = rnorm( 400 ) ,
        othernum2 = rnorm( 400 ) ,
        othernum3 = rnorm( 400 ) ,
        othernum4 = rnorm( 400 ) ,
        othernum5 = rnorm( 400 ) ,
        othernum6 = rnorm( 400 ) ,
        othernum7 = rnorm( 400 ) 
    )

# sample 67 training rows within classification groups
training.rows <-
    tapply( 
        # numeric vector containing the numbers
        # 1 to nrow( x )
        seq_len( nrow( x ) ) , 

        # break the sample function out by
        # the classification variable
        x$cl , 

        # use the sample function within
        # each classification variable group
        sample , 

        # send the size = 67 parameter
        # through to the sample() function
        size = 67 
    )

# convert your list back to a numeric vector
tr <- unlist( training.rows )

# split your original data frame into two:

# all the records sampled as training rows
training.df <- x[ tr , ]

# all other records (NOT sampled as training rows)
testing.df <- x[ -tr , ]
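
The answer above hard-codes 67 rows per class, which matches the 100-per-class example. The question also mentions that the classes may not be equally sized, so here is a possible variant that takes roughly 2/3 of each class instead. It is only a sketch: the data frame `y`, its class sizes, the column `num1`, the seed, and the use of `round()` for the per-class count are all assumptions made for illustration.

```r
set.seed(42)  # assumption: fixed seed so the split is reproducible

# hypothetical data with unequal class sizes (100, 60, 30 and 9 rows)
y <- data.frame(
    cl   = rep(c("A", "B", "C", "D"), times = c(100, 60, 30, 9)),
    num1 = rnorm(199)
)

# row indices grouped by class
idx.by.class <- split(seq_len(nrow(y)), y$cl)

# within each class, draw about 2/3 of that class's row indices
train.idx <- unlist(lapply(idx.by.class, function(idx) {
    n.train <- round(length(idx) * 2 / 3)
    # index into idx explicitly; sample(idx, ...) would draw from
    # 1:idx if a class happened to contain a single row
    idx[sample.int(length(idx), n.train)]
}))

# sampled rows go to training, everything else to testing
train.df <- y[train.idx, ]
test.df  <- y[-train.idx, ]

# per-class counts, to confirm the split is stratified
table(train.df$cl)
table(test.df$cl)
```

Comparing the two `table()` outputs against `table(y$cl)` is a quick way to check that every class contributes about two thirds of its rows to the training set and that no row appears in both sets.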
