Get the value of the last non-empty column for each row


Problem description


Take this sample data:

df <- data.frame(a_1 = c("Apple", "Grapes", "Melon", "Peach"),
                 a_2 = c("Nuts", "Kiwi", "Lime", "Honey"),
                 a_3 = c("Plum", "Apple", NA, NA),
                 a_4 = c("Cucumber", NA, NA, NA))

     a_1   a_2   a_3      a_4
1  Apple  Nuts  Plum Cucumber
2 Grapes  Kiwi Apple     <NA>
3  Melon  Lime  <NA>     <NA>
4  Peach Honey  <NA>     <NA>

Basically, I want to run grep on the last non-NA column of each row. So the x in grep("pattern", x) should be:

Cucumber
Apple
Lime
Honey

I have an integer vector that tells me, for each row, which a_N is the last one:

numcol <- rowSums(!is.na(df[,grep("(^a_)\\d", colnames(df))])) 

So far I have tried something like this in combination with ave(), apply() and dplyr:

grepl("pattern",df[,sprintf("a_%i",numcol)])

However, I can't quite make it work. Keep in mind that my dataset is very large, so I was hoping for a vectorized solution, or maybe dplyr. Help would be greatly appreciated.

Edit: Thanks, that is a really good solution. My thinking was too complicated. (The regex is due to my more specific data.)

Solution

There's no need for regex here. Just use apply + tail + na.omit:

> apply(df, 1, function(x) tail(na.omit(x), 1))
[1] "Cucumber" "Apple"    "Lime"     "Honey" 
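
Since the question asks for something vectorized, here is a minimal base R sketch that avoids the per-row function call. It is not part of the original answer: it swaps in max.col() with matrix indexing, and the names m, last_col, last_value and hits, as well as the pattern "^A", are made up for illustration. It assumes all columns are character (or can safely be coerced to character by as.matrix()).

```r
# Sample data from the question (stringsAsFactors = FALSE keeps strings as
# character on R versions before 4.0).
df <- data.frame(
  a_1 = c("Apple", "Grapes", "Melon", "Peach"),
  a_2 = c("Nuts", "Kiwi", "Lime", "Honey"),
  a_3 = c("Plum", "Apple", NA, NA),
  a_4 = c("Cucumber", NA, NA, NA),
  stringsAsFactors = FALSE
)

m <- as.matrix(df)

# For each row, the index of the last TRUE in !is.na(m),
# i.e. the position of the last non-NA column.
last_col <- max.col(!is.na(m), ties.method = "last")

# Pick one cell per row with a two-column (row, column) index matrix.
last_value <- m[cbind(seq_len(nrow(m)), last_col)]
last_value
# "Cucumber" "Apple" "Lime" "Honey"

# The pattern match then runs once over the whole vector.
# "^A" is only a placeholder pattern.
hits <- grepl("^A", last_value)
hits
# FALSE TRUE FALSE FALSE
```

Unlike rowSums(!is.na(...)), this also finds the right column when a row has NA gaps between filled columns. A row that is entirely NA simply yields NA.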


I don't know how this compares in terms of speed, but you can also use a combination of "data.table" and "reshape2", like this:

library(data.table)
library(reshape2)
na.omit(melt(as.data.table(df, keep.rownames = TRUE), 
             id.vars = "rn"))[, value[.N], by = rn]
#    rn       V1
# 1:  1 Cucumber
# 2:  2    Apple
# 3:  3     Lime
# 4:  4    Honey

Or, even better:

melt(as.data.table(df, keep.rownames = TRUE), 
     id.vars = "rn", na.rm = TRUE)[, value[.N], by = rn]
#    rn       V1
# 1:  1 Cucumber
# 2:  2    Apple
# 3:  3     Lime
# 4:  4    Honey

This would be much faster. On an 800k-row dataset, apply took about 50 seconds, while the data.table approach took about 2.5 seconds.
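
To check the timing on your own machine, something along these lines could be used. This is only a sketch on synthetic data: the row count n, the helper make_col, the fruit names and the NA proportions are all assumptions, not the dataset behind the numbers above. The data.table part runs only if the package is installed.

```r
set.seed(1)
n <- 5e4  # hypothetical size; increase it to approach the 800k rows mentioned above
fruits <- c("Apple", "Lime", "Honey", "Kiwi")

# Helper (made up for this sketch): a character column with a share of NAs.
make_col <- function(p_na) {
  ifelse(runif(n) < p_na, NA_character_, sample(fruits, n, replace = TRUE))
}

big <- data.frame(
  a_1 = sample(fruits, n, replace = TRUE),  # never NA, so every row has a value
  a_2 = make_col(0.2),
  a_3 = make_col(0.5),
  a_4 = make_col(0.8),
  stringsAsFactors = FALSE
)

# Row-wise approach from the answer.
t_apply <- system.time(
  res_apply <- apply(big, 1, function(x) tail(na.omit(x), 1))
)
print(t_apply)

# data.table approach from the answer, timed only when the package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  t_dt <- system.time(
    res_dt <- melt(as.data.table(big, keep.rownames = TRUE),
                   id.vars = "rn", na.rm = TRUE)[, value[.N], by = rn]
  )
  print(t_dt)
  # Both approaches should return the same values in the same row order.
  stopifnot(identical(unname(res_apply), res_dt$V1))
}
```

Timings depend heavily on the machine, the R version and the shape of the data, so treat any single run as a rough indication only.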
