Extract information from a data frame
Question
I have a data frame like the one below:
df<- structure(list(s1 = structure(1:3, .Label = c("3-4", "4-1", "5-4"
), class = "factor"), s2 = structure(1:3, .Label = c("2-4", "3-15",
"7-16"), class = "factor")), .Names = c("s1", "s2"), row.names = c(NA,
-3L), class = "data.frame")
It looks like this:
> df
# s1 s2
#1 3-4 2-4
#2 4-1 3-15
#3 5-4 7-16
What I want to do is first find the values that match after the hyphen. For example, here 4 appears in the first row of s1, the first row of s2, and the third row of s1.
- The second column indicates how many times each value was found
- The third column shows how many of those occurrences come from the first column of df
- The fourth column shows how many of those occurrences come from the second column of df
- The fifth column lists the strings (the part before the hyphen) that come from the first column
- The sixth column lists the strings that come from the second column
The output should look like this:
Value repeated s1N s2N ss1 ss2
4 3 2 1 3,5 2
1 1 1 - 4 -
15 1 - 1 - 3
16 1 - 1 - 7
Recommended answer
This is a surprisingly tough problem. It helps to break it down into several logical steps:
## 1: split into (val,ss) pairs, and capture ci (column index) association
res <- setNames(do.call(rbind,lapply(seq_along(df),function(ci)
do.call(rbind,lapply(strsplit(as.character(df[[ci]]),'-'),function(x)
data.frame(x[2L],x[1L],ci,stringsAsFactors=F)
))
)),c('val','ss','ci'));
res;
## val ss ci
## 1 4 3 1
## 2 1 4 1
## 3 4 5 1
## 4 4 2 2
## 5 15 3 2
## 6 16 7 2
## 2: aggregate ss (joining on comma) by (val,ci), and capture record count as n
res <- do.call(rbind,by(res,res[c('val','ci')],function(x)
data.frame(val=x$val[1L],ci=x$ci[1L],n=nrow(x),ss=paste(x$ss,collapse=','),stringsAsFactors=F)
));
res;
## val ci n ss
## 1 1 1 1 4
## 2 4 1 2 3,5
## 3 15 2 1 3
## 4 16 2 1 7
## 5 4 2 1 2
## 3: reshape to wide format
res <- reshape(res,idvar='val',timevar='ci',dir='w');
res;
## val n.1 ss.1 n.2 ss.2
## 1 1 1 4 NA <NA>
## 2 4 2 3,5 1 2
## 3 15 NA <NA> 1 3
## 4 16 NA <NA> 1 7
## 4: add repeated column; can be calculated by summing all n.* columns
## note: leveraging psum() from <http://stackoverflow.com/questions/12139431/add-variables-whilst-ignoring-nas-using-transform-function>
psum <- function(...,na.rm=F) { x <- list(...); rowSums(matrix(unlist(x),ncol=length(x)),na.rm=na.rm); };
res$repeated <- do.call(psum,c(res[grep('^n\\.[0-9]+$',names(res))],na.rm=T));
res;
## val n.1 ss.1 n.2 ss.2 repeated
## 1 1 1 4 NA <NA> 1
## 2 4 2 3,5 1 2 3
## 3 15 NA <NA> 1 3 1
## 4 16 NA <NA> 1 7 1
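As a possible simplification of step 4 (a sketch, not part of the original answer): because the n.* columns are all numeric, base R's rowSums() can be applied to that column subset directly, without a helper function. The data frame below is rebuilt by hand from the step 3 output so that the snippet runs on its own; the name n.cols is only illustrative.

```r
# Wide-format result from step 3, rebuilt by hand so this snippet is self-contained.
res <- data.frame(
  val  = c("1", "4", "15", "16"),
  n.1  = c(1L, 2L, NA, NA),
  ss.1 = c("4", "3,5", NA, NA),
  n.2  = c(NA, 1L, 1L, 1L),
  ss.2 = c(NA, "2", "3", "7"),
  stringsAsFactors = FALSE
)

# Pick out the count columns by name (n.1, n.2, ...).
n.cols <- grep("^n\\.[0-9]+$", names(res))

# Sum across the count columns row by row, treating NA as zero.
res$repeated <- rowSums(res[n.cols], na.rm = TRUE)
```

This should give the same repeated column as the psum() version, and it still works if df has more than two columns, because the count columns are selected by pattern.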
With regard to the NAs, you can fix them up afterward if you want. However, I would advise that the proper type of the n.* columns is integer, since they represent counts, so using '-' (as in your sample output) to represent empty cells is inappropriate. I would suggest zero instead. The dash is fine for the ss.* columns, since they are strings. Here's how you can do this:
n.cis <- grep('^n\\.[0-9]+$',names(res));
ss.cis <- grep('^ss\\.[0-9]+$',names(res));
res[n.cis][is.na(res[n.cis])] <- 0L;
res[ss.cis][is.na(res[ss.cis])] <- '-';
res;
## val n.1 ss.1 n.2 ss.2 repeated
## 1 1 1 4 0 - 1
## 2 4 2 3,5 1 2 3
## 3 15 0 - 1 3 1
## 4 16 0 - 1 7 1
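For comparison, here is a minimal sketch of a more direct route to the column layout requested in the question (Value, repeated, s1N, s2N, ss1, ss2). It is a different technique from the by()/reshape() pipeline above: it builds a long table once and then counts and joins per value with vapply(). The names long, count_in, join_in and out are illustrative, and the sketch assumes every cell contains exactly one hyphen.

```r
# Sample data from the question (factor columns, as in the original structure()).
df <- data.frame(
  s1 = c("3-4", "4-1", "5-4"),
  s2 = c("2-4", "3-15", "7-16"),
  stringsAsFactors = TRUE
)

# Long format: one row per cell, split into the parts before and after the hyphen.
cells <- unlist(lapply(df, as.character), use.names = FALSE)
parts <- strsplit(cells, "-", fixed = TRUE)
long <- data.frame(
  val = sapply(parts, `[`, 2L),           # part after the hyphen
  ss  = sapply(parts, `[`, 1L),           # part before the hyphen
  col = rep(names(df), each = nrow(df)),  # source column of each cell
  stringsAsFactors = FALSE
)

# Number of times value v occurs in the given source column.
count_in <- function(v, column) sum(long$val == v & long$col == column)

# Comma-joined prefixes for value v in the given source column, or "-" if none.
join_in <- function(v, column) {
  hits <- long$ss[long$val == v & long$col == column]
  if (length(hits) == 0L) "-" else paste(hits, collapse = ",")
}

# Distinct values, in order of first appearance.
vals <- unique(long$val)
s1N <- vapply(vals, count_in, integer(1), column = "s1", USE.NAMES = FALSE)
s2N <- vapply(vals, count_in, integer(1), column = "s2", USE.NAMES = FALSE)

out <- data.frame(
  Value    = vals,
  repeated = s1N + s2N,
  s1N      = s1N,
  s2N      = s2N,
  ss1      = vapply(vals, join_in, character(1), column = "s1", USE.NAMES = FALSE),
  ss2      = vapply(vals, join_in, character(1), column = "s2", USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
out
```

The counts stay integers with zero for "not found", in line with the advice above, while the string columns use '-' for empty cells. Unlike the pipeline above, this version hardcodes the two column names s1 and s2, so it would need adapting for a data frame with more columns.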