Extract information from a data frame


Question

I have a data frame like the one below:

df<- structure(list(s1 = structure(1:3, .Label = c("3-4", "4-1", "5-4"
    ), class = "factor"), s2 = structure(1:3, .Label = c("2-4", "3-15", 
    "7-16"), class = "factor")), .Names = c("s1", "s2"), row.names = c(NA, 
    -3L), class = "data.frame")

It looks like this:

> df
#   s1   s2
#1 3-4  2-4
#2 4-1 3-15
#3 5-4 7-16

What I want to do is first find the values after the hyphen that are the same across cells. For example, here 4 appears after the hyphen in the first row of s1, the first row of s2, and the third row of s1.


-The second column indicates how many times each value was found

-The third column shows how many of those occurrences come from the first column of df

-The fourth column shows how many of them come from the second column of df

-The fifth column lists the strings (the parts before the hyphen) that come from the first column

-The sixth column lists the strings that come from the second column

The output should look like this:

Value    repeated     s1N   s2N   ss1    ss2
4           3         2      1    3,5     2
1           1         1      -     4      -
15          1         -      1     -      3
16          1         -      1     -      7


Answer

This is a surprisingly tough problem. It helps to break it down into several logical steps:

## 1: split into (val,ss) pairs, and capture ci (column index) association
res <- setNames(do.call(rbind,lapply(seq_along(df),function(ci)
    do.call(rbind,lapply(strsplit(as.character(df[[ci]]),'-'),function(x)
        data.frame(x[2L],x[1L],ci,stringsAsFactors=FALSE)
    ))
)),c('val','ss','ci'));
res;
##   val ss ci
## 1   4  3  1
## 2   1  4  1
## 3   4  5  1
## 4   4  2  2
## 5  15  3  2
## 6  16  7  2
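
As a cross-check on step 1, here is a minimal sketch of the same split done without the nested do.call(rbind, ...) calls. It is an alternative, not the answer's own code. It assumes every cell contains exactly one hyphen, and the names cells, parts and pairs are made up for illustration. It should produce the same six (val, ss, ci) rows.

```r
# Same data as in the question, with plain character columns for brevity
df <- data.frame(s1 = c("3-4", "4-1", "5-4"),
                 s2 = c("2-4", "3-15", "7-16"),
                 stringsAsFactors = FALSE)

# Stack all cells into one vector and remember each cell's source column
cells <- unlist(lapply(df, as.character), use.names = FALSE)
ci    <- rep(seq_along(df), each = nrow(df))

# Split every "ss-val" string once; assumes exactly two pieces per cell
parts <- do.call(rbind, strsplit(cells, "-", fixed = TRUE))

pairs <- data.frame(val = parts[, 2], ss = parts[, 1], ci = ci,
                    stringsAsFactors = FALSE)
pairs
```

Stacking the columns first means strsplit() runs once over all cells, and the column index is rebuilt with rep() instead of being captured inside a closure.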

## 2: aggregate ss (joining on comma) by (val,ci), and capture record count as n
res <- do.call(rbind,by(res,res[c('val','ci')],function(x)
    data.frame(val=x$val[1L],ci=x$ci[1L],n=nrow(x),ss=paste(x$ss,collapse=','),stringsAsFactors=FALSE)
));
res;
##   val ci n  ss
## 1   1  1 1   4
## 2   4  1 2 3,5
## 3  15  2 1   3
## 4  16  2 1   7
## 5   4  2 1   2

## 3: reshape to wide format
res <- reshape(res,idvar='val',timevar='ci',direction='wide');
res;
##   val n.1 ss.1 n.2 ss.2
## 1   1   1    4  NA <NA>
## 2   4   2  3,5   1    2
## 3  15  NA <NA>   1    3
## 4  16  NA <NA>   1    7

## 4: add repeated column; can be calculated by summing all n.* columns
## note: leveraging psum() from <http://stackoverflow.com/questions/12139431/add-variables-whilst-ignoring-nas-using-transform-function>
psum <- function(...,na.rm=FALSE) { x <- list(...); rowSums(matrix(unlist(x),ncol=length(x)),na.rm=na.rm); };
res$repeated <- do.call(psum,c(res[grep('^n\\.[0-9]+$',names(res))],na.rm=TRUE));
res;
##   val n.1 ss.1 n.2 ss.2 repeated
## 1   1   1    4  NA <NA>        1
## 2   4   2  3,5   1    2        3
## 3  15  NA <NA>   1    3        1
## 4  16  NA <NA>   1    7        1
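
To see why na.rm matters in step 4, here is a small sketch that calls psum() on two hand-written count vectors shaped like n.1 and n.2. The vectors n1 and n2 and the result names are illustrative only.

```r
# psum() as defined in the answer: element-wise sum across several vectors
psum <- function(..., na.rm = FALSE) {
  x <- list(...)
  rowSums(matrix(unlist(x), ncol = length(x)), na.rm = na.rm)
}

n1 <- c(1L, 2L, NA, NA)   # counts from column 1, like n.1
n2 <- c(NA, 1L, 1L, 1L)   # counts from column 2, like n.2

total_strict  <- psum(n1, n2)                # NA wherever either input is NA
total_lenient <- psum(n1, n2, na.rm = TRUE)  # missing counts are skipped
```

Without na.rm = TRUE only the second element gets a total, because every other row has an NA in one of the two inputs. With it, the missing counts contribute nothing and the totals are 1, 3, 1, 1.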

With regard to the NAs, you can fix them up afterward if you want. However, the n.* columns represent counts, so their proper type is integer, and using '-' (as in your sample output) to represent empty cells is inappropriate there. I would suggest zero instead. The dash is fine for the ss.* columns, since they are strings. Here's how you can do this:

n.cis <- grep('^n\\.[0-9]+$',names(res));
ss.cis <- grep('^ss\\.[0-9]+$',names(res));
res[n.cis][is.na(res[n.cis])] <- 0L;
res[ss.cis][is.na(res[ss.cis])] <- '-';
res;
##   val n.1 ss.1 n.2 ss.2 repeated
## 1   1   1    4   0    -        1
## 2   4   2  3,5   1    2        3
## 3  15   0    -   1    3        1
## 4  16   0    -   1    7        1
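
If the exact layout from the question is wanted, one possible final step is to reorder and rename the columns. This is a sketch, not part of the original answer. It rebuilds res by hand from the printout above so that it runs on its own, and it assumes the second count column was meant to be called s2N.

```r
# res as printed at the end of the answer, rebuilt by hand for this example
res <- data.frame(val = c("1", "4", "15", "16"),
                  n.1 = c(1L, 2L, 0L, 0L),
                  ss.1 = c("4", "3,5", "-", "-"),
                  n.2 = c(0L, 1L, 1L, 1L),
                  ss.2 = c("-", "2", "3", "7"),
                  repeated = c(1, 3, 1, 1),
                  stringsAsFactors = FALSE)

# Reorder: value, total count, per-column counts, per-column strings
out <- res[c("val", "repeated", "n.1", "n.2", "ss.1", "ss.2")]
names(out) <- c("Value", "repeated", "s1N", "s2N", "ss1", "ss2")

# Optional: most frequent value first (ties keep their current order)
out <- out[order(-out$repeated), ]
rownames(out) <- NULL
out
```

Sorting by descending repeated puts 4 in the first row, which matches the row order of the sample output in the question.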
