Why does subsetting a column from a data frame vs. a tibble give different results?


Question

This is a 'why' question, not a 'how to' question.

I have a tibble as the result of a dplyr aggregation:

> str(urls)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   144 obs. of  4 variables:
 $ BRAND       : chr  "Bobbi Brown" "Calvin Klein" "Chanel" "Clarins" ...
 $ WEBSITE     : chr  "http://www.bobbibrowncosmetics.com/" "http://www.calvinklein.com/shop/en/ck" "http://www.chanel.com/en_US/" "http://www.clarinsusa.com/" ...
 $ domain      : chr  "bobbibrowncosmetics.com/" "calvinklein.com/shop/en/ck" "chanel.com/en_US/" "clarinsusa.com/" ...
 $ final_domain: chr  "bobbibrowncosmetics.com/" "calvinklein.com/shop/en/ck" "chanel.com/en_US/" "clarinsusa.com/" ...

When I try to extract the column final_domain as a character vector, here's what happens:

> length(as.character(urls[ ,4]))
[1] 1

When I instead coerce to a data frame first and then do the same thing, I get what I actually want:

> length(as.character(as.data.frame(urls)[ ,4]))
[1] 144

The str output of the tibble and of the data frame looks the same, but the results differ. Why?

Answer

The underlying reason is that subsetting a tbl and a data frame produces different results when only one column is selected.

  • By default, [.data.frame will drop the dimensions if the result has only 1 column, similar to how matrix subsetting works. So the result is a vector.
  • [.tbl_df will never drop dimensions like this; it always returns a tbl.
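
The difference between the two methods can be seen on a small example. This is only a sketch: `df` and `tb` are hypothetical three-row stand-ins for `urls`, and it assumes the tibble package is installed.

```r
library(tibble)  # assumption: the tibble package is installed

# Hypothetical stand-in for `urls`, reduced to two columns and three rows
df <- data.frame(
  BRAND = c("Bobbi Brown", "Calvin Klein", "Chanel"),
  final_domain = c("bobbibrowncosmetics.com/",
                   "calvinklein.com/shop/en/ck",
                   "chanel.com/en_US/"),
  stringsAsFactors = FALSE
)
tb <- as_tibble(df)

from_df <- df[, 2]  # [.data.frame drops the dimensions: a character vector
from_tb <- tb[, 2]  # [.tbl_df keeps them: a one-column tibble

is.character(from_df)        # TRUE
length(from_df)              # 3, one element per row
inherits(from_tb, "tbl_df")  # TRUE
length(from_tb)              # 1, because a data frame's length is its column count

# A base data frame also keeps its dimensions when asked explicitly
kept <- df[, 2, drop = FALSE]
is.data.frame(kept)          # TRUE
```

The `length(from_tb)` result already hints at where the 1 in the question comes from: the object is still a list with a single column in it.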

In turn, as.character ignores the class of a tbl, treating it as a plain list. And as.character called on a list acts like deparse: the character representation it returns is R code that can be parsed and executed to reproduce the list.
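
The deparse-like behaviour can be reproduced in base R with a plain list, which is roughly how as.character sees a one-column tbl. The list `one_col` below is a hypothetical stand-in, not the original data.

```r
# A plain list holding one character vector, standing in for a one-column tbl
one_col <- list(final_domain = c("a.com/", "b.com/", "c.com/"))

deparsed <- as.character(one_col)
length(deparsed)  # 1: one string per list element, not one per row
print(deparsed)   # a single string containing R code for the whole vector

# Parsing and evaluating that string recovers the original vector
recovered <- eval(parse(text = deparsed))
identical(recovered, one_col$final_domain)
```

So the length-1 result in the question is one long string describing all 144 values, not the values themselves.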

The tbl behaviour is arguably the right thing to do in most circumstances, because dropping dimensions can easily lead to bugs: subsetting a data frame usually results in another data frame, but sometimes it doesn't. In this specific case, though, keeping the dimensions is not what you want.

If you want to extract a column from a tbl as a vector, you can use list-style indexing: urls[[4]] or urls$final_domain.
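
A minimal sketch of both forms, again using a hypothetical three-row tibble `tb` in place of `urls` and assuming the tibble package is installed:

```r
library(tibble)  # assumption: the tibble package is installed

# Hypothetical stand-in for `urls`
tb <- tibble(
  BRAND = c("Bobbi Brown", "Calvin Klein", "Chanel"),
  final_domain = c("bobbibrowncosmetics.com/",
                   "calvinklein.com/shop/en/ck",
                   "chanel.com/en_US/")
)

by_position <- tb[[2]]           # list-style indexing by position
by_name     <- tb$final_domain   # list-style indexing by name

is.character(by_position)        # TRUE
length(by_position)              # 3, one element per row
identical(by_position, by_name)  # TRUE
```

Both forms behave the same way on a base data frame, so code written like this does not depend on which class it receives.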
