Combining columns in a dataframe each with partial information


Problem description

I have a large data set which used different coding schemes for the same variables over different time periods. The coding in each time period is represented as a column with values during the year it was active and NA everywhere else.

I was able to "combine" them by using nested ifelse commands together with dplyr's mutate [see edit below], but I am running into a problem using ifelse to do something slightly different. I want to code a new variable based on whether ANY of the previous variables meets a condition. But for some reason, the ifelse construct below does not work.

MWE:

library("dplyr")
library("magrittr")
df <- data.frame(id = 1:12, year = c(rep(1995, 5), rep(1996, 5), rep(1997, 2)), varA = c("A","C","A","C","B",rep(NA,7)), varB = c(rep(NA,5),"B","A","C","A","B",rep(NA,2)))
df %>% mutate(varC = ifelse(varA == "C" | varB == "C", "C", "D"))

Output:

> df
   id year varA varB varC
1   1 1995    A <NA> <NA>
2   2 1995    C <NA>    C
3   3 1995    A <NA> <NA>
4   4 1995    C <NA>    C
5   5 1995    B <NA> <NA>
6   6 1996 <NA>    B <NA>
7   7 1996 <NA>    A <NA>
8   8 1996 <NA>    C    C
9   9 1996 <NA>    A <NA>
10 10 1996 <NA>    B <NA>
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>

If I don't use the | operator and test against only varA, the results come out as expected, but only for the years in which varA is not NA.

Output:

> df %<>% mutate(varC = ifelse(varA == "C", "C", "D"))
> df
   id year varA varB varC
1   1 1995    A <NA>    D
2   2 1995    C <NA>    C
3   3 1995    A <NA>    D
4   4 1995    C <NA>    C
5   5 1995    B <NA>    D
6   6 1996 <NA>    B <NA>
7   7 1996 <NA>    A <NA>
8   8 1996 <NA>    C <NA>
9   9 1996 <NA>    A <NA>
10 10 1996 <NA>    B <NA>
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>

Desired output:

> df
   id year varA varB varC
1   1 1995    A <NA>    D
2   2 1995    C <NA>    C
3   3 1995    A <NA>    D
4   4 1995    C <NA>    C
5   5 1995    B <NA>    D
6   6 1996 <NA>    B    D
7   7 1996 <NA>    A    D
8   8 1996 <NA>    C    C
9   9 1996 <NA>    A    D
10 10 1996 <NA>    B    D
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>

How do I get what I'm looking for?

To make this question more applicable to a wider audience, and to learn from this situation, it would be great to have an explanation of what is happening in the comparison using | that causes it not to work as expected. Thanks in advance!

EDIT: This is what I meant by successfully combining them with nested ifelses

> df %>% mutate(varC = ifelse(year == 1995, as.character(varA), 
+                             ifelse(year == 1996, as.character(varB), NA)))
   id year varA varB varC
1   1 1995    A <NA>    A
2   2 1995    C <NA>    C
3   3 1995    A <NA>    A
4   4 1995    C <NA>    C
5   5 1995    B <NA>    B
6   6 1996 <NA>    B    B
7   7 1996 <NA>    A    A
8   8 1996 <NA>    C    C
9   9 1996 <NA>    A    A
10 10 1996 <NA>    B    B
11 11 1997 <NA> <NA> <NA>
12 12 1997 <NA> <NA> <NA>

Solution

R has this annoying tendency where the logical value of a condition that involves NA is just NA, rather than TRUE or FALSE. For example, NA > 0 is NA rather than FALSE.

Combined with TRUE, NA behaves the way FALSE would under | but stays unknown under &: TRUE | NA is TRUE, while TRUE & NA is NA.

Interestingly, with FALSE it is the mirror image: FALSE | NA is NA, while FALSE & NA is FALSE.

In effect, NA is like an "unknown" logical value between TRUE and FALSE: the result is NA only when it depends on the unknown value. For example, NA | TRUE | FALSE is TRUE.
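These rules can be checked directly at the console. The following is a minimal sketch in base R; the object names (vals, or_table, and_table, na_cmp) are made up for illustration.

```r
# Three-valued logic in R: NA means "unknown".
# The result is NA only when it depends on the unknown value.
vals <- c(TRUE, FALSE, NA)

# Truth tables for | and & over every pair of values
or_table  <- outer(vals, vals, "|")
and_table <- outer(vals, vals, "&")
dimnames(or_table)  <- list(c("TRUE", "FALSE", "NA"), c("TRUE", "FALSE", "NA"))
dimnames(and_table) <- dimnames(or_table)

print(or_table)
print(and_table)

# A comparison against NA is itself NA, which is what trips up ifelse()
na_cmp <- NA > 0
print(na_cmp)
```

Rows are the left operand and columns the right operand, so the NA column of or_table shows what x | NA gives for each x.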

So here's a way to hack this:

ifelse((varA == 'C' & !is.na(varA)) | (varB == 'C' & !is.na(varB)), 'C', 'D')

How do we interpret this? Take the left side of the OR. If varA is NA, then varA == 'C' is NA and !is.na(varA) is FALSE, so we have NA & FALSE. Since FALSE & anything is FALSE, the & forces the whole thing to be FALSE. Otherwise, if varA is not NA but it's not 'C', you'll have FALSE & TRUE, which gives FALSE as you want. Otherwise, if it's 'C', both sides are TRUE. The same goes for the right side of the OR.
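To see the three cases side by side, here is a small sketch that evaluates each piece of the left-hand side on a toy vector. The vector and the intermediate names (cmp, present, guarded) are illustrative, not part of the question's data.

```r
# Hypothetical column in the same shape as varA: values plus NA
varA <- c("A", "C", NA)

cmp     <- varA == "C"     # comparison alone: NA where varA is NA
present <- !is.na(varA)    # FALSE exactly where varA is NA
guarded <- cmp & present   # NA & FALSE collapses to FALSE

print(data.frame(varA, cmp, present, guarded))
```

The guarded column contains no NA, which is what lets the | with the varB side return a definite TRUE or FALSE.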

When using a condition that involves x, but x can be NA, I like to use ((condition for x)&!is.na(x)) to completely rule out the NA output and force the TRUE or FALSE values in the situations I want.
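If this pattern comes up often, it can be wrapped in a small function. The sketch below assumes a helper named known_true, which is hypothetical and not part of base R. It also shows %in%, which returns FALSE rather than NA for missing values, so it can stand in for the guarded equality test.

```r
# Hypothetical helper (not a base R function): TRUE only where the condition
# is known to be TRUE; NA and FALSE both become FALSE.
known_true <- function(cond) !is.na(cond) & cond

x <- c("C", "B", NA)

guarded_eq <- known_true(x == "C")   # same as (x == "C") & !is.na(x)
print(guarded_eq)

# For plain equality tests, %in% never returns NA, so it gives the same result
in_result <- x %in% "C"
print(in_result)
```

The helper works for any condition, while %in% only covers equality against a set of values.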

EDIT: I just remembered that you want an NA output if they're both NA. This doesn't end up doing it, so that's my bad. Unless you're okay with a 'D' output when they're both NA.

EDIT2: This should output the NAs as you want:

ifelse(is.na(varA) & is.na(varB), NA,
       ifelse((varA == 'C' & !is.na(varA)) | (varB == 'C' & !is.na(varB)), 'C', 'D'))
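For completeness, here is a sketch that applies the EDIT2 expression to the data from the question. It uses base R (with() and $ assignment) instead of mutate so that it runs without extra packages; the same ifelse expression should work unchanged inside mutate. Two small departures from the original are assumptions of this sketch: stringsAsFactors = FALSE, so the columns are character on any R version, and NA_character_ in place of NA to make the result type explicit.

```r
# Rebuild the question's data as character columns
df <- data.frame(
  id   = 1:12,
  year = c(rep(1995, 5), rep(1996, 5), rep(1997, 2)),
  varA = c("A", "C", "A", "C", "B", rep(NA, 7)),
  varB = c(rep(NA, 5), "B", "A", "C", "A", "B", rep(NA, 2)),
  stringsAsFactors = FALSE
)

# NA when both columns are missing; otherwise "C" if either column is "C", else "D"
df$varC <- with(df, ifelse(
  is.na(varA) & is.na(varB),
  NA_character_,
  ifelse((varA == "C" & !is.na(varA)) | (varB == "C" & !is.na(varB)), "C", "D")
))

print(df)
```

The printed varC column should match the desired output in the question: D, C, D, C, D for 1995, D, D, C, D, D for 1996, and NA for both 1997 rows.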
