R: counting (for each row) how many times an OR condition on several columns is satisfied


Problem description


My question is similar to this one, with one difference. In the initial question, I was trying to count (for each row) how many columns satisfied a condition. I would like to do something similar, except that the condition now combines several columns with OR. My real data has many columns, so ideally I'd like to reference the columns using a regular expression.

I have the following data:

colnames <- c(paste("col",rep(LETTERS[1:2],each=4),rep(1:4,2),sep=""),c("meh","muh"))
df <- as.data.frame(matrix(sample(c("Yes","No"),200,replace=TRUE),ncol=10))
names(df) <- colnames
df
   colA1 colA2 colA3 colA4 colB1 colB2 colB3 colB4 meh muh
1     No   Yes    No    No    No   Yes   Yes    No Yes Yes
2     No   Yes   Yes   Yes   Yes    No   Yes    No  No  No
3     No    No    No   Yes    No    No    No    No Yes  No
4    Yes    No   Yes   Yes   Yes   Yes   Yes   Yes  No Yes
5    Yes    No   Yes    No    No    No    No   Yes  No Yes
6    Yes    No    No    No   Yes   Yes    No    No  No  No
7    Yes    No    No    No   Yes   Yes   Yes    No Yes  No
8    Yes    No   Yes    No   Yes   Yes    No   Yes Yes  No
9     No   Yes    No    No    No   Yes   Yes    No  No  No
10   Yes   Yes    No    No   Yes    No   Yes    No Yes  No
11    No   Yes    No    No   Yes    No   Yes   Yes  No  No
12    No   Yes   Yes   Yes    No    No   Yes    No  No  No
13    No    No   Yes   Yes    No   Yes   Yes   Yes Yes  No
14   Yes   Yes    No    No    No    No   Yes    No  No Yes
15   Yes    No   Yes   Yes    No   Yes    No   Yes  No  No
16    No   Yes   Yes    No    No    No   Yes    No  No  No
17   Yes    No    No    No    No   Yes   Yes   Yes  No Yes
18   Yes    No   Yes   Yes    No    No    No    No  No Yes
19    No    No    No    No    No   Yes    No    No  No Yes
20    No   Yes    No    No   Yes   Yes   Yes    No  No  No

I would like to create a new column Nb that records, for each row, how many of the following conditions hold: at least one of colA2, colA3, colA4 is "Yes"; at least one of colB2, colB3, colB4 is "Yes". Nb is therefore 0, 1 or 2 for this example.

If there were no OR condition within each group of columns (for example [colA2, colA3, colA4]) and I were simply counting the columns that satisfy the condition, I could have used something like:

df$Nb <- rowSums(df[, grep("^col[A-B][2-4]", names(df))] == "Yes")

I would like to use a regex to reference the columns if possible, because in my real data the letters and numbers go further than B and 5 respectively.

Thank you!

Solution

You could adapt your rowSums approach to just the groups of columns in each of your OR conditions, then add > 0 to make it "at least one." Thus, "at least one of the A values is Yes" would look like:

rowSums(df[, grep("^colA[2-4]", names(df))] == "Yes") > 0

Then you can combine them using +:

(rowSums(df[, grep("^colA[2-4]", names(df))] == "Yes") > 0) +
 (rowSums(df[, grep("^colB[2-4]", names(df))] == "Yes") > 0)
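If you have many letters, writing one rowSums term per letter does not scale. The following is a minimal base R sketch of the same idea applied to every letter group at once. It assumes the column names follow the pattern col, then one or more capital letters, then a number, and it uses a made-up data frame with three letters and a fixed seed; the helper variables (cols, letter, number, keep, has_yes) are names introduced here for illustration only.

```r
# Hypothetical example data: letters A-C, numbers 1-4, plus two unrelated columns
set.seed(1)
colnames <- c(paste0("col", rep(LETTERS[1:3], each = 4), rep(1:4, 3)), "meh", "muh")
df <- as.data.frame(matrix(sample(c("Yes", "No"), 20 * 14, replace = TRUE), ncol = 14))
names(df) <- colnames

# Columns that follow the assumed naming pattern
cols <- grep("^col[A-Z]+[0-9]+$", names(df), value = TRUE)

# Split each name into its letter part and its number part
letter <- sub("^col([A-Z]+)[0-9]+$", "\\1", cols)
number <- as.integer(sub("^col[A-Z]+([0-9]+)$", "\\1", cols))

# Only numbers 2 and above take part in the OR conditions
keep <- number >= 2

# One logical column per letter: does the row contain at least one "Yes"?
has_yes <- do.call(cbind, lapply(split(cols[keep], letter[keep]), function(group) {
  rowSums(df[, group, drop = FALSE] == "Yes") > 0
}))

# Number of letter groups with at least one "Yes", for each row
df$Nb <- rowSums(has_yes)
```

Parsing the number out of the name, rather than matching it with [2-4], keeps the sketch usable when the numbers run past 9.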


Incidentally, you would have an easier time answering questions like these if your data were in a tidy format: that is, if each column were a separate variable. Right now it looks like you're storing attributes of your data (A, B, 1-4) as parts of your column names, which is the reason operations like "using columns with the value 'A'" are very awkward. What if you instead rearranged your data, using the dplyr and tidyr packages, as:

library(dplyr)
library(tidyr)
df$index <- 1:nrow(df)
newdf <- df %>% gather(key, value, colA1:colB4) %>%
    separate(key, c("col", "letter", "number"), c(-3, -2)) %>%
    mutate(number = as.numeric(number))

This rearranges your data as (note that I gave each of your rows its own "index" variable):

  meh muh index col letter number value
1 Yes  No     1 col      A      1   Yes
2 Yes  No     2 col      A      1   Yes
3  No  No     3 col      A      1   Yes
4 Yes  No     4 col      A      1    No
5 Yes Yes     5 col      A      1    No
6 Yes Yes     6 col      A      1   Yes

You can then group, summarize, filter and manipulate these observations more naturally. For example, you seem to want to drop the columns with the number 1: rather than needing a regular expression, you could simply do newdf %>% filter(number > 1).

Here's how you would perform the kind of OR operation you're describing:

hasyes <- newdf %>% group_by(index, letter) %>% filter(number > 1) %>%
              summarize(hasyes = any(value == "Yes"))

For each of your original row+letter combinations, you now have a logical value for whether Yes appears:

  index letter hasyes
1     1      A   TRUE
2     1      B   TRUE
3     2      A   TRUE
4     2      B   TRUE
5     3      A  FALSE
6     3      B   TRUE

One more summarizing operation gets this into the form you want:

result <- hasyes %>% group_by(index) %>% summarize(yeses = sum(hasyes))

What's important about this solution is that it works just as easily for any number of letters (for example, A through Z instead of just A and B).
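In current versions of tidyr, gather is superseded by pivot_longer, which can also split the column names in the same step. The following is a sketch of the same pipeline written that way, assuming dplyr 1.0 or later and tidyr 1.0 or later are installed. The data frame is a small made-up example with a fixed seed, and the regular expression assumes names of the form col, one capital letter, then a number.

```r
library(dplyr)
library(tidyr)

# Hypothetical example data: letters A-B, numbers 1-4, ten rows
set.seed(1)
df <- as.data.frame(matrix(sample(c("Yes", "No"), 80, replace = TRUE), ncol = 8))
names(df) <- paste0("col", rep(LETTERS[1:2], each = 4), rep(1:4, 2))
df$index <- seq_len(nrow(df))

result <- df %>%
  # Reshape to one row per (index, letter, number), splitting the names by regex
  pivot_longer(
    cols = matches("^col[A-Z][0-9]+$"),
    names_to = c("letter", "number"),
    names_pattern = "^col([A-Z])([0-9]+)$",
    values_to = "value"
  ) %>%
  mutate(number = as.integer(number)) %>%
  # Drop the columns numbered 1, which are not part of the OR conditions
  filter(number > 1) %>%
  # TRUE if the row has at least one "Yes" within the letter group
  group_by(index, letter) %>%
  summarise(hasyes = any(value == "Yes"), .groups = "drop") %>%
  # Count the letter groups that contain a "Yes", for each original row
  group_by(index) %>%
  summarise(yeses = sum(hasyes))
```

The result has one row per index, and yeses should agree with the rowSums version above when both are applied to the same data.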
