Subset a dataframe using a logical vector with $
Question
I'm having trouble understanding both the reason for use and the behavior of the $ symbol when subsetting a data.frame in R. The following example was presented in a beginner's class I'm taking (there is no live professor, so I can't ask there):
temp_mat <- matrix(1:9, nrow=3)
colnames(temp_mat) <- c('a', 'b', 'c')
temp_df <- data.frame(temp_mat)
Calling temp_df outputs:
a b c
1 1 4 7
2 2 5 8
3 3 6 9
The example given in the course is then:
temp_df[temp_df$c < 10]
which outputs:
a b c
1 1 4 7
2 2 5 8
3 3 6 9
Reason for use question: The course indicates that $ is used for partial matching, and that x$y is an exact substitute for x[["y", exact=FALSE]]. Why would we want to use a partial matching operator here? Do we use it because we know for sure that temp_df has no other column similar to "c" that could be picked up by mistake? Additionally, how is a partial match measured? By a minimum percentage of matching characters, or something else? There appears to be a getElement function that would be much more appropriate when working with datasets that have unknown or similar column names (e.g. Home Phone versus Cell Phone: would these be seen as a valid partial match?)
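For reference on the partial-matching part of the question, here is a minimal sketch of how $ behaves on a data frame. The data frame df and its column names are hypothetical, chosen only for illustration. It suggests that matching is by unique prefix of the column name, not by a percentage of shared characters, and that an ambiguous prefix matches nothing.

```r
# Hypothetical data frame: two columns share the prefix "home".
df <- data.frame(
  home_phone   = c("555-0100", "555-0101"),
  home_address = c("1 Main St", "2 Oak Ave"),
  cell_phone   = c("555-0200", "555-0201"),
  stringsAsFactors = FALSE
)

by_prefix  <- df$cell                     # unique prefix: returns cell_phone
ambiguous  <- df$home                     # prefix of two columns: NULL
not_prefix <- df$phone                    # "phone" is not a prefix: NULL

exact_default <- df[["cell"]]             # [[ ]] is exact by default: NULL
partial_ok    <- df[["cell", exact = FALSE]]  # opt in to partial matching
strict        <- getElement(df, "cell")   # getElement matches exactly: NULL
```

Under this behavior, Home Phone and Cell Phone would not be confused with each other, since neither name is a prefix of the other. Setting options(warnPartialMatchDollar = TRUE) makes R warn whenever $ resolves a name by partial matching.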
Behavior question: the above example, temp_df[temp_df$c < 10], appears to say "return the subset of elements from temp_df where column c is less than 10", and because all column c elements meet the criterion, the entire dataframe is returned. My interpretation is obviously wrong, because temp_df[temp_df$c < 9] returns:
a b
1 1 4
2 2 5
3 3 6
Although the row 1 and row 2 elements in column c do meet the criterion of being less than 9, the entire column is omitted. My question is therefore twofold: what is that logical vector actually saying and doing? And how would I write my interpretation, "return the subset of elements from temp_df where column c is less than 9", so that it returns:
a b c
1 1 4 7
2 2 5 8
In my mind, elements 1 and 2 (rows 1 and 2) meet the criterion, since their column c values are less than 9, and thus should be returned.
Recommended answer
Try breaking the operation down step by step.
temp_df$c < 9
gives a vector as follows:
[1] TRUE TRUE FALSE
When you pass this vector in the manner you have shown:
temp_df[c(TRUE, TRUE, FALSE)]
it operates on columns.
Think of a data.frame as a list, with the column names as keys and the column contents as vector values. With a single index and no comma, the operation keeps the TRUE keys (i.e. columns) and drops the FALSE ones.
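The list analogy can be checked directly. The sketch below rebuilds the question's temp_df and applies the same logical vector to the data frame and to a plain list made from it; the variable names keep, cols_kept and list_kept are introduced here only for illustration. Note that a vector with one value per row only fits as a column selector here because the example happens to have three rows and three columns.

```r
# Rebuild the data frame from the question.
temp_mat <- matrix(1:9, nrow = 3)
colnames(temp_mat) <- c("a", "b", "c")
temp_df <- data.frame(temp_mat)

# Step 1: the comparison yields one logical value per row of column c.
keep <- temp_df$c < 9

# Step 2: with a single index, the vector is applied to columns.
cols_kept <- temp_df[keep]

# The same selection on a plain list keeps the same elements.
list_kept <- as.list(temp_df)[keep]

print(keep)
print(names(cols_kept))
print(names(list_kept))
```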
A comma marks the vector as a row index. The first two rows are retained and the last one is dropped. Thus, temp_df[c(TRUE, TRUE, FALSE), ] gives:
a b c
1 1 4 7
2 2 5 8
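Putting the pieces together, a minimal sketch of the row filter the question asks for might look as follows. It uses the question's temp_df; the names rows_kept and via_subset are illustrative, and subset() is shown only as one common alternative from base R, not as the course's method.

```r
# Rebuild the data frame from the question.
temp_mat <- matrix(1:9, nrow = 3)
colnames(temp_mat) <- c("a", "b", "c")
temp_df <- data.frame(temp_mat)

# Logical vector before the comma: keep the rows where c < 9,
# and keep every column (nothing after the comma).
rows_kept <- temp_df[temp_df$c < 9, ]

# subset() evaluates the condition inside the data frame,
# so column c can be referred to by its bare name.
via_subset <- subset(temp_df, c < 9)

print(rows_kept)
```

Both forms keep rows 1 and 2 and all three columns.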