Column name of last non-NA row per row; using tidyverse solution?

Problem description

Brief Dataset description: I have survey data generated from Qualtrics, which I've imported into R as a tibble. Each column corresponds to a survey question, and I've preserved the original column order (to correspond with the order of the questions in the survey).

Problem in plain language: Due to normal participant attrition, not all participants completed all of the questions in the survey. I want to know how far each participant got in the survey, and the last question they each answered before stopping.

Problem statement in R: I want to generate (using tidyverse):

  • 1) A new column (lastq) that lists, for each row (i.e. for each participant), the name of the last non-NA column (i.e. the name of the last question they completed).
  • 2) A second new column (lastqnum) that lists the column number of the column named in lastq

Sample dataframe df

library(tidyverse)

df <- tibble(
  year = c(2015, 2015, 2016, 2016),
  grade = c(1, NA, 1, NA),
  height = c("short", "tall", NA, NA),
  gender = c(NA, "m", NA, "f")
)

Original df

  # A tibble: 4 x 4
   year grade height gender
  <dbl> <dbl>  <chr>  <chr>
1  2015     1  short   <NA>
2  2015    NA   tall      m
3  2016     1   <NA>   <NA>
4  2016    NA   <NA>      f

Desired final df

   # A tibble: 4 x 6
   year grade height gender  lastq lastqnum
  <dbl> <dbl>  <chr>  <chr>  <chr>    <dbl>
1  2015     1  short   <NA> height        3
2  2015    NA   tall      m gender        4
3  2016     1   <NA>   <NA>  grade        2
4  2016    NA   <NA>      f gender        4


There are some other related questions, but I can't find any answers that focus on extracting the column names (rather than the values themselves) from a tibble of mixed variable classes (rather than all numeric) using a tidyverse solution.

What I've been trying - I know there's something I'm missing here... :

  • ds %>% map(which(!is.na(.)))
  • ds %>% map(tail(!is.na(.), 2))
  • ds %>% rowwise() %>% mutate(last = which(!is.na(ds)))

?


Thank you so much for your help!

Solution

Write a function that solves the problem, following James' suggestion but a little more robust (it handles the case when all answers are NA):

f0 = function(df) {
    idx = ifelse(is.na(df), 0L, col(df))
    apply(idx, 1, max)
}
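
As a quick check of f0 (a sketch, not part of the original answer), the snippet below rebuilds the sample data as a base data.frame and appends a fifth row with no answers at all. The extra row and the name df_demo are illustrative assumptions.

```r
# f0 as defined above: 0 marks an NA cell, otherwise the cell's column index
f0 = function(df) {
    idx = ifelse(is.na(df), 0L, col(df))
    apply(idx, 1, max)
}

# Sample data plus a hypothetical all-NA fifth row (illustration only)
df_demo = data.frame(
    year   = c(2015, 2015, 2016, 2016, NA),
    grade  = c(1, NA, 1, NA, NA),
    height = c("short", "tall", NA, NA, NA),
    gender = c(NA, "m", NA, "f", NA),
    stringsAsFactors = FALSE
)

f0(df_demo)
# rows 1-4 give 3, 4, 2, 4; the all-NA row gives 0
```

By contrast, a naive max(which(!is.na(x))) applied to an all-NA row would call max() on an empty vector, which returns -Inf with a warning.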

The L makes the 0 an integer, rather than numeric. For a speed improvement (when there are many rows), use the matrixStats package

f1 = function(df) {
    idx = ifelse(is.na(df), 0L, col(df))
    matrixStats::rowMaxs(idx, na.rm=TRUE)
}
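
A small consistency check, assuming the matrixStats package is installed: f1 should return the same indices as f0, and because the fill value is 0L the index matrix stays integer. The input df_demo is made up for illustration.

```r
f0 = function(df) {
    idx = ifelse(is.na(df), 0L, col(df))
    apply(idx, 1, max)
}
f1 = function(df) {
    idx = ifelse(is.na(df), 0L, col(df))
    matrixStats::rowMaxs(idx, na.rm=TRUE)
}

# Hypothetical input: three rows, the last one entirely NA
df_demo = data.frame(a = c(1, 2, NA), b = c(NA, 5, NA), c = c("x", NA, NA),
                     stringsAsFactors = FALSE)

# 0L keeps the index matrix integer; a plain 0 would promote it to double
typeof(ifelse(is.na(df_demo), 0L, col(df_demo)))  # "integer"
typeof(ifelse(is.na(df_demo), 0, col(df_demo)))   # "double"

if (requireNamespace("matrixStats", quietly = TRUE)) {
    # Both implementations should agree row by row
    stopifnot(all(f0(df_demo) == f1(df_demo)))
}
```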

Follow markus' suggestion to use this in a dplyr context

mutate(df, lastqnum = f1(df), lastq = c(NA, names(df))[lastqnum + 1])
df %>% mutate(lastqnum = f1(.), lastq = c(NA, names(.))[lastqnum + 1])

or, without dplyr, just do

lastqnum = f1(df)
cbind(df, lastq=c(NA, names(df))[lastqnum + 1], lastqnum)
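
Both variants lean on the same lookup: NA is prepended to the column names and 1 is added to the index, so an index of 0 (no answers) lands on the NA instead of being silently dropped. A minimal base R sketch, with the index vector typed in by hand for illustration:

```r
# Column names of the sample data
nms = c("year", "grade", "height", "gender")

# Hand-written indices: the four sample rows plus a hypothetical all-NA row (0)
lastqnum = c(3L, 4L, 2L, 4L, 0L)

# Indexing with 0 drops the element, so the result is one entry short
length(nms[lastqnum])               # 4

# Shifting by one makes 0 select the leading NA instead
lastq = c(NA, nms)[lastqnum + 1]
lastq                               # "height" "gender" "grade" "gender" NA
```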

Edited after acceptance: I guess the tidy approach would be first to tidy the data into long form

df1 = cbind(gather(df), id = as.vector(row(df)), event = as.vector(col(df)))

and then to group and summarize

group_by(df1, id) %>%
    summarize(lastq = tail(event[!is.na(value)], 1), lastqname = key[lastq])

This doesn't handle the case when there are no answers.
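
One possible way to cover the no-answer case in the long-format approach is sketched below. It is not part of the original answer: it assumes dplyr and a tidyr version that provides pivot_longer() with the values_transform argument (used here in place of gather()), and the all-NA fifth row and the names df_demo, long and res are hypothetical.

```r
library(dplyr)
library(tidyr)

# Sample data plus a hypothetical all-NA fifth row (illustration only)
df_demo = tibble(
    year   = c(2015, 2015, 2016, 2016, NA),
    grade  = c(1, NA, 1, NA, NA),
    height = c("short", "tall", NA, NA, NA),
    gender = c(NA, "m", NA, "f", NA)
)

long = df_demo %>%
    mutate(id = row_number()) %>%
    # columns have mixed types, so convert the values to character
    pivot_longer(-id, names_to = "key", values_to = "value",
                 values_transform = list(value = as.character)) %>%
    group_by(id) %>%
    # position of each question within the participant's row
    mutate(event = row_number())

res = long %>%
    summarize(
        # the leading 0L keeps max() defined when nothing was answered
        lastqnum = max(c(0L, event[!is.na(value)])),
        lastq = if (lastqnum > 0L) key[lastqnum] else NA_character_
    )
res
# lastqnum: 3 4 2 4 0; lastq: height, gender, grade, gender, NA
```

The design mirrors f0: a 0 stands in for "no answer", and the name lookup is skipped for that value.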
