Split unique values into separate columns for multiple columns


Problem description

Each of my data's columns will be rescaled and put into bins from 0 to 100. The bin columns will be used as features for a model. In order to test each bin separately, I'd like to split each bin column into separate columns, one for each of its values. Each new column will hold either a 0 or a 1, depending on whether the value in the cell matches the column's bin. From something like this:

row values
  1     10
  2     20
  3     30
  4     40
  5     10
  6     30
  7     40

To this:

row values_10 values_20 values_30 values_40
  1         1         0         0         0
  2         0         1         0         0
  3         0         0         1         0
  4         0         0         0         1
  5         1         0         0         0
  6         0         0         1         0
  7         0         0         0         1

This brute force approach does the job, but there must be a better (non-loop) way:

values <- c( 10,20,30,40,10,30,40)
dat <- data.frame(values)

columnNames <- unique(dat$values)

for( n in 1:length(columnNames) )
{
    dat[as.character(columnNames[n])]  <- 0
}

columnNames2 <- colnames(dat)

for( c in 2:ncol(dat))
{
    hdr <- columnNames2[c]

    for( r in 1:nrow(dat))
    {
        if( dat$values[r]==as.integer(hdr) )
            dat[r,c]=1
    }
}

Many thanks!

EDIT

These are all great answers, thank you everyone. The final object, whether a matrix, table, or data.table, will contain only the separate bin columns (no source columns). How can the solutions below be used for 2000+ source columns?

EDIT2

Based on the answers to my follow-up question, below are implementations of each method for anyone coming to this question in the future.

# read in some data with multiple columns

df_in  <- read.table(text="row val1 val2
                  1     10     100
                  2     20     200
                  3     30     300
                  4     40     400
                  5     10     100
                  6     30     300
                  7     40     400", header=TRUE, stringsAsFactors=FALSE)

#   @Zelazny7 's method using a matrix

df_in$row <- NULL

col_names <- names(df_in)

for( c in 1:length(col_names)){

    uniq <- unlist(unique(df_in[col_names[c]]))

    m <- matrix(0, nrow(df_in), length(uniq), 
                dimnames = list(NULL, paste0(col_names[c], "_", uniq)))

    for (i in seq_along(df_in[[col_names[c]]])) {
        k <- match(df_in[[col_names[c]]][i], uniq, 0)
        m[i,k] <- 1
    }

    if( c==1 )
        df_out <- m
    else
        df_out <- cbind(df_out,m)
}


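#   @P Lapointe 's method using 'table'
#   (needs the 'row' column, so re-read df_in first if it was dropped above)

col_names <- names(df_in)

for( c in 2:length(col_names)){

    m <- table(df_in$row,df_in[[col_names[c]]])    
    # table() orders its columns by sorted value, so sort the labels to match
    uniq <- sort(unique(df_in[[col_names[c]]]))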
    newNames <- toString(paste0(col_names[c],'_',uniq))

    if( c==2 ){
        df_out <- m
        hdrs <- newNames
    }
    else{
        df_out <- cbind(df_out,m)
        hdrs <- paste(hdrs,newNames,sep=", ")
    }
}

colnames(df_out) <- unlist(strsplit(hdrs, split=", "))


#   @bdemarest 's method using 'data.table'
#   read in data first

library(data.table)

df_in = fread("row val1 val2
            1     10     100
            2     20     200
            3     30     300
            4     40     400
            5     10     100
            6     30     300
            7     40     400")

df_in$count = 1L

col_names <- names(df_in)

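# ':' binds tighter than '-', so the parentheses are needed:
# loop over columns 2..(n-1), skipping 'row' (first) and 'count' (last)
for( c in 2:(length(col_names)-1)){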

    m = dcast(df_in, paste( 'row', '~', col_names[c]), value.var="count", fill=0L)

    # dcast() orders the new columns by sorted value, so sort the labels to match
    uniq <- sort(unique(df_in[[col_names[c]]]))
    newNames <- toString(paste0(col_names[c],'_',uniq))

    m$row <- NULL

    if( c==2 ){
        df_out <- m
        hdrs <- newNames
    }
    else if( c>2 ){
        df_out <- cbind(df_out,m)
        hdrs <- paste(hdrs,newNames,sep=", ")
    }
}

colnames(df_out) <- unlist(strsplit(hdrs, split=", "))

All answers were appropriate and usable, so the best answer was awarded to the quickest initial response. Thanks again for your help!

Recommended answer

I do this quite often. This is the method I use to create dummy variables. It is very fast.

## reading in your example data
df <- read.table(file = "clipboard", header=TRUE)
df$row <- NULL

uniq <- unique(df$values)
m <- matrix(0, nrow(df), length(uniq), dimnames = list(NULL, paste0("column_", uniq)))

for (i in seq_along(df$values)) {
  k <- match(df$values[i], uniq, 0)
  m[i,k] <- 1
}

Result:

> m
     column_10 column_20 column_30 column_40
[1,]         1         0         0         0
[2,]         0         1         0         0
[3,]         0         0         1         0
[4,]         0         0         0         1
[5,]         1         0         0         0
[6,]         0         0         1         0
[7,]         0         0         0         1
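
As a quick sanity check on a result like this, each row should contain exactly one 1, and the position of that 1 should map back to the original value. The sketch below rebuilds the example from a literal vector instead of the clipboard (an assumption made so it runs anywhere); `decoded` is an illustrative name, not part of the answer.

```r
# Rebuild the example without relying on the clipboard
values <- c(10, 20, 30, 40, 10, 30, 40)
uniq <- unique(values)
m <- matrix(0, length(values), length(uniq),
            dimnames = list(NULL, paste0("column_", uniq)))
for (i in seq_along(values)) {
  m[i, match(values[i], uniq)] <- 1
}

# Every row should hold exactly one 1
stopifnot(all(rowSums(m) == 1))

# The column holding the 1 in each row identifies the original value
decoded <- uniq[max.col(m, ties.method = "first")]
stopifnot(identical(decoded, values))
```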

Another variant that avoids the loop by indexing the matrix with a matrix:

m[cbind(seq.int(nrow(m)), match(df$values, uniq))] <- 1
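
For the follow-up about many source columns, the same matrix-indexing idea can be wrapped in a function and applied column by column. This is only a sketch under a few assumptions: `make_dummies` is a hypothetical helper name, the input holds only the bin columns (no `row` column), and the output columns are named `<source>_<value>` as in the question.

```r
# make_dummies() is a hypothetical helper, not part of the original answer.
# It builds one 0/1 block per source column and binds the blocks side by side.
make_dummies <- function(df) {
  blocks <- lapply(names(df), function(col) {
    x <- df[[col]]
    uniq <- unique(x)
    m <- matrix(0L, nrow(df), length(uniq),
                dimnames = list(NULL, paste0(col, "_", uniq)))
    # One (row, column) index pair per observation sets that cell to 1
    m[cbind(seq_len(nrow(df)), match(x, uniq))] <- 1L
    m
  })
  do.call(cbind, blocks)
}

# Example data matching the question's EDIT2, without the 'row' column
df_in <- data.frame(val1 = c(10, 20, 30, 40, 10, 30, 40),
                    val2 = c(100, 200, 300, 400, 100, 300, 400))
df_out <- make_dummies(df_in)
```

The only explicit iteration left is the `lapply` over source columns. The result is a dense matrix, so with 2000+ source columns and up to 101 bins each it may become large enough that memory is worth checking.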
