Split unique values into separate columns for multiple columns
Question
Each of my data's columns will be rescaled and put into bins from 0 to 100. The bin columns will be used as features for a model. In order to test each bin separately, I'd like to split each bin column into separate columns, one for each of its values. Each new column will hold either a 0 or a 1, depending on whether the value in the cell matches that column's bin. From something like this:
row values
1 10
2 20
3 30
4 40
5 10
6 30
7 40
To this:
row values_10 values_20 values_30 values_40
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
5 1 0 0 0
6 0 0 1 0
7 0 0 0 1
This brute-force approach does the job, but there must be a better (non-loop) way:
values <- c( 10,20,30,40,10,30,40)
dat <- data.frame(values)
columnNames <- unique(dat$values)
for( n in 1:length(columnNames) )
{
dat[as.character(columnNames[n])] <- 0
}
columnNames2 <- colnames(dat)
for( c in 2:ncol(dat))
{
hdr <- columnNames2[c]
for( r in 1:nrow(dat))
{
if( dat$values[r]==as.integer(hdr) )
dat[r,c]=1
}
}
Many thanks!
EDIT
These are all great answers, thank you everyone. The final object, whether a matrix, table, or data.table, will contain only the separate bin columns (no source columns). How can the solutions below be used for 2000+ source columns?
EDIT2
Based on the answers to my follow-up question, below are implementations of each method for anyone coming to this question in the future.
# read in some data with multiple columns
df_in <- read.table(text="row val1 val2
1 10 100
2 20 200
3 30 300
4 40 400
5 10 100
6 30 300
7 40 400", header=TRUE, stringsAsFactors=FALSE)
# @Zelazny7 's method using a matrix
# use the value columns only; df_in$row is left in place because the 'table' method below needs it
col_names <- setdiff(names(df_in), "row")
for( c in 1:length(col_names)){
uniq <- unlist(unique(df_in[col_names[c]]))
m <- matrix(0, nrow(df_in), length(uniq),
dimnames = list(NULL, paste0(col_names[c], "_", uniq)))
for (i in seq_along(df_in[[col_names[c]]])) {
k <- match(df_in[[col_names[c]]][i], uniq, 0)
m[i,k] <- 1
}
if( c==1 )
df_out <- m
else
df_out <- cbind(df_out,m)
}
# @P Lapointe 's method using 'table'
col_names <- names(df_in)
for( c in 2:length(col_names)){
m <- table(df_in$row,df_in[[col_names[c]]])
# table() orders its columns by sorted value, so sort the names to match
uniq <- sort(unique(df_in[[col_names[c]]]))
newNames <- toString(paste0(col_names[c],'_',uniq))
if( c==2 ){
df_out <- m
hdrs <- newNames
}
else{
df_out <- cbind(df_out,m)
hdrs <- paste(hdrs,newNames,sep=", ")
}
}
colnames(df_out) <- unlist(strsplit(hdrs, split=", "))
# @bdemarest 's method using 'data.table'
# read in data first
library(data.table)
df_in = fread("row val1 val2
1 10 100
2 20 200
3 30 300
4 40 400
5 10 100
6 30 300
7 40 400")
df_in$count = 1L
col_names <- names(df_in)
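# parentheses matter: 2:length(col_names)-1 evaluates to 1:(length(col_names)-1)
for( c in 2:(length(col_names)-1)){
m = dcast(df_in, paste( 'row', '~', col_names[c]), value.var="count", fill=0L)
# dcast() orders the new columns by sorted value, so sort the names to match
uniq <- sort(unique(df_in[[col_names[c]]]))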
newNames <- toString(paste0(col_names[c],'_',uniq))
m$row <- NULL
if( c==2 ){
df_out <- m
hdrs <- newNames
}
else if( c>2 ){
df_out <- cbind(df_out,m)
hdrs <- paste(hdrs,newNames,sep=", ")
}
}
colnames(df_out) <- unlist(strsplit(hdrs, split=", "))
All answers were appropriate and usable, so the best answer was awarded to the quickest initial response. Thanks again for your help!
Accepted answer
I do this quite often. This is the method I use to create dummy variables. It is very fast.
## reading in your example data
df <- read.table(file = "clipboard", header=TRUE)
df$row <- NULL
uniq <- unique(df$values)
m <- matrix(0, nrow(df), length(uniq), dimnames = list(NULL, paste0("column_", uniq)))
for (i in seq_along(df$values)) {
k <- match(df$values[i], uniq, 0)
m[i,k] <- 1
}
Result:
> m
column_10 column_20 column_30 column_40
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1
[5,] 1 0 0 0
[6,] 0 0 1 0
[7,] 0 0 0 1
Another variant that avoids the loop by indexing the matrix with a matrix:
m[cbind(seq.int(nrow(m)), match(df$values, uniq))] <- 1
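For the follow-up about many source columns, the matrix-indexing variant can be wrapped in a small function and applied to each column in turn. The following is only a sketch under assumptions: `one_hot` is a hypothetical helper name, not part of the answer above, and the `val1`/`val2` data simply mirrors the example in the question's EDIT2.

```r
# Hypothetical helper (not from the original answer): build the 0/1 matrix
# for one vector using matrix indexing instead of a row loop.
one_hot <- function(x, prefix) {
  uniq <- unique(x)
  m <- matrix(0L, nrow = length(x), ncol = length(uniq),
              dimnames = list(NULL, paste0(prefix, "_", uniq)))
  # each (row, column) pair in the two-column index matrix is set to 1
  m[cbind(seq_along(x), match(x, uniq))] <- 1L
  m
}

# example data mirroring the question's EDIT2 (source columns only)
df_in <- data.frame(val1 = c(10, 20, 30, 40, 10, 30, 40),
                    val2 = c(100, 200, 300, 400, 100, 300, 400))

# encode every source column, then bind the pieces side by side
pieces <- lapply(names(df_in), function(nm) one_hot(df_in[[nm]], nm))
df_out <- do.call(cbind, pieces)
```

Here `df_out` is a 7 x 8 integer matrix holding only the bin columns. Columns appear in order of first appearance of each value; wrap `unique(x)` in `sort()` if sorted column order is preferred.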