子集数据以仅包含名称与条件匹配的列 [英] Subset data to contain only columns whose names match a condition

查看:45
本文介绍了子集数据以仅包含名称与条件匹配的列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

有没有办法根据以特定字符串开头的列名对数据进行子集化?我有一些列类似于 ABC_1 ABC_2 ABC_3 和一些类似于 XYZ_1, XYZ_2,XYZ_3 假设.

Is there a way for me to subset data based on column names starting with a particular string? I have some columns which are like ABC_1 ABC_2 ABC_3 and some like XYZ_1, XYZ_2,XYZ_3 let's say.

如何仅根据包含上述文本部分的列(比方说,ABCXYZ)来子集我的 df?我可以使用索引,但列在数据中过于分散,而且硬编码过多.

How can I subset my df based only on columns containing the above portions of text (lets say, ABC or XYZ)? I can use indices, but the columns are too scattered in data and it becomes too much of hard coding.

另外,我只想包含这些列中的每一列的行,其中任何值都是 >0 所以如果上面的 6 列有一个 1 在行中,它会切入我的最终数据框.

Also, I want to only include rows from each of these columns where any of their value is >0 so if either of the 6 columns above has a 1 in the row, it makes a cut into my final data frame.

推荐答案

在您的 data.frame 的名称上尝试 grepl.grepl 将正则表达式与目标匹配,如果找到匹配则返回 TRUE,否则返回 FALSE.该函数是矢量化的,因此您可以传递字符串向量进行匹配,您将获得返回的布尔值向量.

Try grepl on the names of your data.frame. grepl matches a regular expression to a target and returns TRUE if a match is found and FALSE otherwise. The function is vectorised so you can pass a vector of strings to match and you will get a vector of boolean values returned.

#  Data
df <- data.frame( ABC_1 = runif(3),
            ABC_2 = runif(3),
            XYZ_1 = runif(3),
            XYZ_2 = runif(3) )

#      ABC_1     ABC_2     XYZ_1     XYZ_2
#1 0.3792645 0.3614199 0.9793573 0.7139381
#2 0.1313246 0.9746691 0.7276705 0.0126057
#3 0.7282680 0.6518444 0.9531389 0.9673290

#  Use grepl
df[ , grepl( "ABC" , names( df ) ) ]
#      ABC_1     ABC_2
#1 0.3792645 0.3614199
#2 0.1313246 0.9746691
#3 0.7282680 0.6518444

#  grepl returns logical vector like this which is what we use to subset columns
grepl( "ABC" , names( df ) )
#[1]  TRUE  TRUE FALSE FALSE

要回答第二部分,我会制作子集 data.frame,然后制作一个向量来索引要保留的行(逻辑向量),就像这样...

To answer the second part, I'd make the subset data.frame and then make a vector that indexes the rows to keep (a logical vector) like this...

set.seed(1)
df <- data.frame( ABC_1 = sample(0:1,3,repl = TRUE),
            ABC_2 = sample(0:1,3,repl = TRUE),
            XYZ_1 = sample(0:1,3,repl = TRUE),
            XYZ_2 = sample(0:1,3,repl = TRUE) )

# We will want to discard the second row because 'all' ABC values are 0:
#  ABC_1 ABC_2 XYZ_1 XYZ_2
#1     0     1     1     0
#2     0     0     1     0
#3     1     1     1     0


df1 <- df[ , grepl( "ABC" , names( df ) ) ]

ind <- apply( df1 , 1 , function(x) any( x > 0 ) )

df1[ ind , ]
#  ABC_1 ABC_2
#1     0     1
#3     1     1

这篇关于子集数据以仅包含名称与条件匹配的列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆