Why does subsetting a column from a data frame vs. a tibble give different results?

Question

This is a 'why' question and not a 'How to' question.

I have a tibble that is the result of a dplyr aggregation:

> str(urls)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   144 obs. of  4 variables:
 $ BRAND       : chr  "Bobbi Brown" "Calvin Klein" "Chanel" "Clarins" ...
 $ WEBSITE     : chr  "http://www.bobbibrowncosmetics.com/" "http://www.calvinklein.com/shop/en/ck" "http://www.chanel.com/en_US/" "http://www.clarinsusa.com/" ...
 $ domain      : chr  "bobbibrowncosmetics.com/" "calvinklein.com/shop/en/ck" "chanel.com/en_US/" "clarinsusa.com/" ...
 $ final_domain: chr  "bobbibrowncosmetics.com/" "calvinklein.com/shop/en/ck" "chanel.com/en_US/" "clarinsusa.com/" ...

When I try to extract the column final_domain as a character vector, here's what happens:

> length(as.character(urls[ ,4]))
[1] 1

When I instead coerce the tibble to a data frame first, I get what I actually want:

> length(as.character(as.data.frame(urls)[ ,4]))
[1] 144

The str() output of the tibble and of the data frame looks the same, yet the two results differ. Why?

Recommended answer

The underlying reason is that subsetting a tbl and a data frame produces different results when only one column is selected.

  • By default, [.data.frame will drop the dimensions if the result has only 1 column, similar to how matrix subsetting works. So the result is a vector.
  • [.tbl_df will never drop dimensions like this; it always returns a tbl.
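The difference between the two bullets can be reproduced in a small sketch. The data frame `df` below is a hypothetical three-row stand-in for `urls`, not the asker's data. In base R, `drop = FALSE` gives the same shape-preserving result that the answer attributes to `[.tbl_df`, and the tibble part runs only if the tibble package happens to be installed.

```r
# Hypothetical miniature of the `urls` data from the question.
df <- data.frame(
  BRAND = c("Bobbi Brown", "Chanel", "Clarins"),
  final_domain = c("bobbibrowncosmetics.com/", "chanel.com/en_US/", "clarinsusa.com/"),
  stringsAsFactors = FALSE
)

# Default `[.data.frame`: a single selected column is dropped to a vector.
dropped <- df[, 2]
is.character(dropped)  # TRUE
length(dropped)        # 3, one element per row

# With drop = FALSE the result stays a one-column data frame.
kept <- df[, 2, drop = FALSE]
is.data.frame(kept)    # TRUE
length(kept)           # 1, the number of columns

# If tibble is available, the same call without drop = FALSE keeps the tibble class.
if (requireNamespace("tibble", quietly = TRUE)) {
  tb <- tibble::as_tibble(df)
  print(class(tb[, 2]))
}
```

The length of 1 reported in the question is therefore a count of columns, not of rows.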

In turn, as.character ignores the class of a tbl, treating it as a plain list. And as.character called on a list acts like deparse: the character representation it returns is R code that can be parsed and executed to reproduce the list.
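A minimal sketch of that deparse-like behaviour, using a plain list as a stand-in for a one-column tbl (the name `one_col` and its values are made up for illustration):

```r
# A one-element list holding a character vector, like a one-column tbl
# once its class is ignored.
one_col <- list(final_domain = c("chanel.com/en_US/", "clarinsusa.com/"))

# as.character() works element by element, so a single list element
# (the whole column) becomes a single string of R code.
as_text <- as.character(one_col)
length(as_text)  # 1
print(as_text)

# Parsing and evaluating that string rebuilds the original column.
rebuilt <- eval(parse(text = as_text))
identical(rebuilt, one_col$final_domain)  # TRUE
```

This is why `length(as.character(urls[ ,4]))` is 1 in the question: the 144 values are all inside one string.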

The tbl behaviour is arguably the right thing to do in most circumstances, because dropping dimensions can easily lead to bugs: subsetting a data frame usually results in another data frame, but sometimes it doesn't. In this specific case it doesn't do what you want.

If you want to extract a column from a tbl as a vector, you can use list-style indexing: urls[[4]] or urls$final_domain.
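As a rough illustration of list-style indexing, the sketch below uses a hypothetical two-column data frame `urls_demo` in place of `urls`. `[[` and `$` behave the same way on a data frame and on a tibble, so base R is enough to show the idea.

```r
# Hypothetical stand-in for `urls`; the column position is 2 here, not 4.
urls_demo <- data.frame(
  BRAND = c("Bobbi Brown", "Chanel", "Clarins"),
  final_domain = c("bobbibrowncosmetics.com/", "chanel.com/en_US/", "clarinsusa.com/"),
  stringsAsFactors = FALSE
)

# List-style indexing always returns the column itself as a vector.
by_position <- urls_demo[[2]]
by_name     <- urls_demo$final_domain

identical(by_position, by_name)    # TRUE
length(as.character(by_position))  # 3, one string per row
```

Extracting by name is usually the safer of the two, since it keeps working if the column order changes.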
