Subset a dataframe using a logical vector with $


Question

I'm having trouble understanding both the reason for use and behavior of the $ symbol in subsetting a data.frame in R. The following example was presented in a beginner's class I'm taking (not with a live professor so can't ask there):

temp_mat <- matrix(1:9, nrow=3)
colnames(temp_mat) <- c('a', 'b', 'c')
temp_df <- data.frame(temp_mat)

Calling temp_df then prints:

  a b c
1 1 4 7
2 2 5 8
3 3 6 9

The example given in the course is then:

temp_df[temp_df$c < 10]

which outputs:

  a b c
1 1 4 7
2 2 5 8
3 3 6 9

Reason for use question: The course indicates that $ is used for partial matching, and that x$y is an exact substitute for x[["y", exact=FALSE]]. Why would we want to use a partial matching operator here? Do we use it because we know for sure that in our temp_df there is no other column similar to "c" that could be mistakenly picked up? Additionally, how is partial match measured? A minimum % of characters matching or something? It appears there is a getElement function that would be much more appropriate if working with datasets with unknown or similar column names (e.g. Home Phone versus Cell Phone, would these be seen as a valid partial match?)

Behavior question: it appears the above example temp_df[temp_df$c < 10] is saying "return the subset of elements from temp_df where column c is less than 10" and because all column c elements meet the criteria, the entire dataframe is returned. My interpretation is obviously wrong because temp_df[temp_df$c < 9] returns:

  a b
1 1 4
2 2 5
3 3 6

Although the row 1 and 2 elements in column c do meet the criteria of being less than 9, the entire column is omitted. My question then becomes twofold: what is that logical vector actually saying/doing? And how would I write my interpretation of "return the subset of elements from temp_df where column c is less than 9" and have it return:

  a b c
1 1 4 7
2 2 5 8

Because in my mind, elements 1 and 2 (rows 1 and 2) met that criteria as their column c values are less than 9 and thus should be returned.

Recommended answer

Try breaking the operation down step by step.

temp_df$c < 9

gives a vector as follows:

[1]  TRUE  TRUE FALSE
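On the partial-matching part of the question, the `$` in that expression only extracts column c before the comparison runs. The sketch below illustrates how base R matches names with `$` on a plain data.frame: it accepts a unique prefix of a column name, not some percentage of matching characters. The `phones` and `ambiguous` data frames and their column names are made up for illustration and are not from the course.

```r
# Hypothetical data frames, used only to illustrate name matching.
phones <- data.frame(home_phone = c("555-0100", "555-0101"),
                     cell_phone = c("555-0200", "555-0201"))

# A unique prefix is enough for `$` to find the column.
phones$home                        # same values as phones$home_phone

# `[[` matches names exactly by default, so the same prefix finds nothing.
phones[["home"]]                   # NULL
phones[["home", exact = FALSE]]    # partial matching switched on explicitly

# "phone" is not a prefix of either column name, so nothing matches.
phones$phone                       # NULL

# A prefix that fits two columns is ambiguous, so neither is picked.
ambiguous <- data.frame(phone_home = 1, phone_cell = 2)
ambiguous$phone                    # NULL
```

So a "Home Phone" column would not be picked up by asking for a "Cell Phone" one. If silent prefix matches are a worry, `options(warnPartialMatchDollar = TRUE)` asks R to warn about them, and `[[` with its default exact matching avoids them altogether. Either way, in the example above `temp_df$c` names the column exactly, and the comparison yields the logical vector shown.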

When you pass this vector the way you have shown, temp_df[c(TRUE, TRUE, FALSE)] indexes columns rather than rows, because there is no comma inside the brackets.

Think about a data.frame as a list, with column names as the keys and the column contents as vector values. The operation preserves the TRUE keys (i.e. columns) and drops the FALSE.
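The list view can be checked directly. This is a minimal sketch that rebuilds the question's temp_df; the names `keep`, `cols_by_flag`, `cols_by_name` and `as_list` are introduced here for illustration only.

```r
# Rebuild the data frame from the question.
temp_mat <- matrix(1:9, nrow = 3)
colnames(temp_mat) <- c("a", "b", "c")
temp_df <- data.frame(temp_mat)

# One logical value per ROW of column c.
keep <- temp_df$c < 9

# Without a comma, single brackets index the list of columns,
# so the flags are applied to columns a, b and c in that order.
cols_by_flag <- temp_df[keep]
cols_by_name <- temp_df[c("a", "b")]
names(cols_by_flag)

# The same indexing on an ordinary list behaves the same way.
as_list <- as.list(temp_df)
names(as_list[keep])
```

Note that the vector has one entry per row; it lines up with the columns here only because the data frame happens to be 3 by 3.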

A comma after the vector marks it as a row index. The first two rows are retained and the last one is dropped. Thus, temp_df[c(TRUE, TRUE, FALSE), ] gives:

  a b c
1 1 4 7
2 2 5 8
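In practice the condition is usually written straight into the row position instead of typing out the logical vector. A short sketch of that, again rebuilding temp_df from the question (`rows_kept` is a name introduced here for illustration):

```r
# Rebuild the data frame from the question.
temp_mat <- matrix(1:9, nrow = 3)
colnames(temp_mat) <- c("a", "b", "c")
temp_df <- data.frame(temp_mat)

# Condition before the comma filters rows; nothing after the comma
# means "keep every column".
rows_kept <- temp_df[temp_df$c < 9, ]
rows_kept

# subset() expresses the same row filter with column names used directly.
subset(temp_df, c < 9)

# Both positions can be used at once: filter rows, then pick columns.
temp_df[temp_df$c < 9, c("a", "c")]
```

This is the reading the question was after: "return the rows of temp_df where column c is less than 9".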
