R: counting (for each row) how many times an OR condition on several columns is satisfied
Problem description
My question is similar to this one, with one difference. In the initial question, I was trying to count (for each row) how many columns satisfied a condition. I would like to do something similar, except that the condition combines several columns with an OR. My real data has many columns, so ideally I'd like to reference the columns using a regular expression.
I have the following data:
colnames <- c(paste("col",rep(LETTERS[1:2],each=4),rep(1:4,2),sep=""),c("meh","muh"))
df <- as.data.frame(matrix(sample(c("Yes","No"),200,replace=TRUE),ncol=10))
names(df) <- colnames
df
colA1 colA2 colA3 colA4 colB1 colB2 colB3 colB4 meh muh
1 No Yes No No No Yes Yes No Yes Yes
2 No Yes Yes Yes Yes No Yes No No No
3 No No No Yes No No No No Yes No
4 Yes No Yes Yes Yes Yes Yes Yes No Yes
5 Yes No Yes No No No No Yes No Yes
6 Yes No No No Yes Yes No No No No
7 Yes No No No Yes Yes Yes No Yes No
8 Yes No Yes No Yes Yes No Yes Yes No
9 No Yes No No No Yes Yes No No No
10 Yes Yes No No Yes No Yes No Yes No
11 No Yes No No Yes No Yes Yes No No
12 No Yes Yes Yes No No Yes No No No
13 No No Yes Yes No Yes Yes Yes Yes No
14 Yes Yes No No No No Yes No No Yes
15 Yes No Yes Yes No Yes No Yes No No
16 No Yes Yes No No No Yes No No No
17 Yes No No No No Yes Yes Yes No Yes
18 Yes No Yes Yes No No No No No Yes
19 No No No No No Yes No No No Yes
20 No Yes No No Yes Yes Yes No No No
I would like to create a new column Nb that records, for each row, 1 if at least one of colA2, colA3, colA4 is "Yes", plus 1 if at least one of colB2, colB3, colB4 is "Yes" (so Nb is 0, 1, or 2 here).
If there were no OR condition within each group of columns [colA2, colA3, colA4], and I were simply counting the columns that satisfy the condition, I could have used something like:
df$Nb <- rowSums(df[, grep("^col[A-B][2-4]", names(df))] == "Yes")
I would like to use a regex to reference the columns if possible, since in my real data the letters and numbers go further than B and 5, respectively.
Thank you!
Answer
You could adapt your rowSums approach to just the group of columns in each of your OR conditions, then add > 0 to make it "at least one." Thus, "at least one of the A values is Yes" would look like:
rowSums(df[, grep("^colA[2-4]", names(df))] == "Yes") > 0
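A small worked example may make the `> 0` step concrete. The toy data frame below is hypothetical (it is not the data from the question), and `drop = FALSE` is an addition here: it keeps the subset a data frame even if the pattern happens to match only one column, in which case `rowSums` would otherwise fail on a plain vector.

```r
# Hypothetical toy data, not the data from the question
toy <- data.frame(
  colA1 = c("Yes", "Yes", "No"),
  colA2 = c("No",  "Yes", "No"),
  colA3 = c("No",  "Yes", "No"),
  colA4 = c("Yes", "No",  "No"),
  stringsAsFactors = FALSE
)

# Positions of colA2..colA4 (colA1 is deliberately excluded by the pattern)
a_cols <- grep("^colA[2-4]$", names(toy))

# Number of "Yes" values per row across those columns: 1, 2, 0
n_yes <- rowSums(toy[, a_cols, drop = FALSE] == "Yes")

# TRUE when at least one of those columns is "Yes": TRUE, TRUE, FALSE
any_yes <- n_yes > 0

print(n_yes)
print(any_yes)
```

The first row has a "Yes" only in colA4, so its count is 1; the second row has two, but both rows collapse to a single TRUE once `> 0` is applied.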
Then you can combine them using +, which treats each TRUE as 1 and each FALSE as 0:
(rowSums(df[, grep("^colA[2-4]", names(df))] == "Yes") > 0) +
(rowSums(df[, grep("^colB[2-4]", names(df))] == "Yes") > 0)
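If the letters go well beyond A and B, writing one `rowSums` term per letter gets tedious. The following is a minimal base R sketch of one way to generalize it, assuming column names of the form `col<LETTER><NUMBER>` as in the question; the helper name `count_groups_with_yes` and the toy data are hypothetical, not part of the original answer.

```r
# Hypothetical helper: for each letter group, flag rows where at least one
# column numbered 2 or higher equals `value`, then add the flags per row.
count_groups_with_yes <- function(df, value = "Yes") {
  # Columns named like colA1, colB12, ...
  cols <- grep("^col[A-Z][0-9]+$", names(df), value = TRUE)
  # Keep only columns whose number is greater than 1
  num <- as.integer(sub("^col[A-Z]", "", cols))
  cols <- cols[num > 1]
  # Group the remaining column names by their letter
  grp <- sub("^col([A-Z])[0-9]+$", "\\1", cols)
  groups <- split(cols, grp)
  # One logical vector per letter: "at least one column equals value"
  flags <- lapply(groups, function(g) {
    rowSums(df[, g, drop = FALSE] == value) > 0
  })
  # Add the logical vectors element-wise (TRUE counts as 1)
  as.integer(Reduce("+", flags))
}

# Hypothetical toy data with three letter groups and an unrelated column
toy <- data.frame(
  colA1 = c("Yes", "No",  "No"),
  colA2 = c("No",  "Yes", "No"),
  colA3 = c("No",  "No",  "No"),
  colB1 = c("Yes", "Yes", "Yes"),
  colB2 = c("Yes", "No",  "No"),
  colB3 = c("No",  "No",  "No"),
  colC2 = c("No",  "Yes", "No"),
  meh   = c("Yes", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

toy$Nb <- count_groups_with_yes(toy)
print(toy$Nb)
```

Columns numbered 1 and the unrelated `meh` column are ignored, so the third row scores 0 even though `colB1` and `meh` are "Yes".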
Incidentally, you would have an easier time answering questions like these if your data were in a tidy format: that is, if each column were a separate variable. Right now it looks like you're storing attributes of your data (A, B, 1-4) as parts of your column names, which is the reason operations like "using columns with the value 'A'" are very awkward. What if you instead rearranged your data, using the dplyr and tidyr packages, as:
library(dplyr)
library(tidyr)
df$index <- 1:nrow(df)
newdf <- df %>% gather(key, value, colA1:colB4) %>%
separate(key, c("col", "letter", "number"), c(-3, -2)) %>%
mutate(number = as.numeric(number))
This rearranges your data as (note that I gave each of your rows its own "index" variable):
meh muh index col letter number value
1 Yes No 1 col A 1 Yes
2 Yes No 2 col A 1 Yes
3 No No 3 col A 1 Yes
4 Yes No 4 col A 1 No
5 Yes Yes 5 col A 1 No
6 Yes Yes 6 col A 1 Yes
You can then group, summarize, filter and manipulate these observations more naturally. For example, you seem to want to drop the columns with the number 1: rather than needing a regular expression, you could simply do newdf %>% filter(number > 1).
Here's how you would perform the kind of OR operation you're describing:
hasyes <- newdf %>% group_by(index, letter) %>% filter(number > 1) %>%
summarize(hasyes = any(value == "Yes"))
For each of your original row+letter combinations, you now have a logical value for whether Yes appears:
index letter hasyes
1 1 A TRUE
2 1 B TRUE
3 2 A TRUE
4 2 B TRUE
5 3 A FALSE
6 3 B TRUE
One more summarizing operation gets this into the form you want:
result <- hasyes %>% group_by(index) %>% summarize(yeses = sum(hasyes))
What's important about this solution is that it works just as easily for any number of letters (for example, if they run from A to Z instead of just A and B).
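In more recent tidyr releases, `gather()` is superseded by `pivot_longer()`, which can split the column names with a regular expression in the same step. The sketch below is one possible equivalent of the pipeline above, assuming dplyr and tidyr versions 1.0.0 or later; the toy data frame is hypothetical and the column-name pattern is an assumption based on the names in the question.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with three letter groups and an unrelated column
toy <- data.frame(
  colA1 = c("Yes", "No",  "No"),
  colA2 = c("No",  "Yes", "No"),
  colA3 = c("No",  "No",  "No"),
  colB1 = c("Yes", "Yes", "Yes"),
  colB2 = c("Yes", "No",  "No"),
  colB3 = c("No",  "No",  "No"),
  colC2 = c("No",  "Yes", "No"),
  meh   = c("Yes", "Yes", "Yes"),
  stringsAsFactors = FALSE
)
toy$index <- seq_len(nrow(toy))

result <- toy %>%
  # Reshape to long form, splitting e.g. "colA2" into letter "A" and number "2"
  pivot_longer(
    cols = matches("^col[A-Z][0-9]+$"),
    names_to = c("letter", "number"),
    names_pattern = "^col([A-Z])([0-9]+)$",
    values_to = "value"
  ) %>%
  mutate(number = as.integer(number)) %>%
  # Drop the columns numbered 1
  filter(number > 1) %>%
  # One logical per row+letter: does "Yes" appear at least once?
  group_by(index, letter) %>%
  summarize(hasyes = any(value == "Yes"), .groups = "drop") %>%
  # Count the letters with at least one "Yes" for each original row
  group_by(index) %>%
  summarize(yeses = sum(hasyes))

print(result)
```

The `yeses` column can then be joined back to the original data by `index` if a new column on `df` is what you need.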