Why does subsetting a column from a data frame vs. a tibble give different results
This is a 'why' question and not a 'How to' question.
I have a tibble as the result of a dplyr aggregation:
> str(urls)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 144 obs. of 4 variables:
$ BRAND : chr "Bobbi Brown" "Calvin Klein" "Chanel" "Clarins" ...
$ WEBSITE : chr "http://www.bobbibrowncosmetics.com/" "http://www.calvinklein.com/shop/en/ck" "http://www.chanel.com/en_US/" "http://www.clarinsusa.com/" ...
$ domain : chr "bobbibrowncosmetics.com/" "calvinklein.com/shop/en/ck" "chanel.com/en_US/" "clarinsusa.com/" ...
$ final_domain: chr "bobbibrowncosmetics.com/" "calvinklein.com/shop/en/ck" "chanel.com/en_US/" "clarinsusa.com/" ...
When I try to extract the column final_domain as a character vector here's what happens:
> length(as.character(urls[ ,4]))
[1] 1
When I instead coerce it to a data frame first and then do the same thing, I get what I actually want:
> length(as.character(as.data.frame(urls)[ ,4]))
[1] 144
The str of the tibble vs. the data frame looks the same, but the output differs. I'm wondering why?
The underlying reason is that subsetting a tbl and a data frame produces different results when only one column is selected.
- By default, [.data.frame will drop the dimensions if the result has only 1 column, similar to how matrix subsetting works. So the result is a vector.
- [.tbl_df will never drop dimensions like this; it always returns a tbl.
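The difference can be reproduced on a toy table. This is a minimal sketch, assuming the tibble package is installed; the objects df and tb and their contents are made up for illustration and are not the data from the question.

```r
library(tibble)

# Hypothetical two-column table, built once as a data frame and once as a tibble
df <- data.frame(brand  = c("A", "B", "C"),
                 domain = c("a.com", "b.com", "c.com"),
                 stringsAsFactors = FALSE)
tb <- as_tibble(df)

# [.data.frame drops to a plain vector when a single column is selected
class(df[, 2])
length(df[, 2])    # one entry per row

# [.tbl_df keeps the result as a one-column tibble
class(tb[, 2])
length(tb[, 2])    # a tibble is list-like, so this counts columns

# drop = FALSE makes the data frame keep its dimensions as well
class(df[, 2, drop = FALSE])
```

Passing drop = FALSE explicitly is the usual way to get predictable behaviour from a plain data frame when the number of selected columns is not known in advance.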
In turn, as.character ignores the class of a tbl, treating it as a plain list. And as.character called on a list acts like deparse: the character representation it returns is R code that can be parsed and executed to reproduce the list.
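The deparse-like behaviour can be seen with base R alone. In this sketch the list x is a hypothetical stand-in for a one-column tbl, and the values are invented:

```r
# A one-element list holding a character vector, standing in for a one-column tbl
x <- list(final_domain = c("a.com/", "b.com/", "c.com/"))

out <- as.character(x)
length(out)    # one string per list element, so 1 here
out            # the string is deparsed R code for the vector

# Parsing and evaluating that string recovers the original column
identical(eval(parse(text = out)), x$final_domain)
```

This is why the length reported in the question is 1 rather than 144: the whole column is collapsed into a single string.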
The tbl behaviour is arguably the right thing to do in most circumstances, because dropping dimensions can easily lead to bugs: subsetting a data frame usually results in another data frame, but sometimes it doesn't. In this specific case it doesn't do what you want.
If you want to extract a column from a tbl as a vector, you can use list-style indexing: urls[[4]] or urls$final_domain.
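A short sketch of list-style extraction, again assuming the tibble package is installed. The urls object here is a hypothetical two-row stand-in, not the 144-row table from the question, so the column sits at position 2 rather than 4:

```r
library(tibble)

# Hypothetical stand-in for the urls tibble
urls <- tibble(BRAND        = c("A", "B"),
               final_domain = c("a.com/", "b.com/"))

v1 <- urls[[2]]                # list-style indexing by position
v2 <- urls$final_domain        # list-style indexing by name
v3 <- urls[["final_domain"]]   # same, with the name as a string

identical(v1, v2)
length(v1)    # one entry per row
```

All three forms return the bare character vector, so as.character and length then behave as they would on a column pulled from a plain data frame.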