Long and wide data – when to use what?

Question

I'm in the process of compiling data from different data sets into one data set for analysis. I'll be doing data exploration, trying different things to find out what regularities may be hidden in the data, so I don't currently have a specific method in mind. Now I'm wondering if I should compile my data into long or wide format.

Which format should I use, and why?

I understand that data can be reshaped from long to wide or vice versa, but the mere existence of this functionality implies that the need to reshape sometimes arises and this need in turn implies that a specific format might be better suited for a certain task. So when do I need which format, and why?

I'm not asking about performance.

Recommended answer

Hadley Wickham's Tidy Data paper, and the tidyr package that is his (latest) implementation of its principles, are a great place to start.

The rough answer to the question is that data, during processing, should always be long, and should only be widened for display purposes. Be cautious with this, though, as here "long" refers more to "tidy", rather than the pure long form.
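
A minimal sketch of that workflow, assuming R's built-in ChickWeight data and base R only (the names `cw`, `means` and `display` are made up for illustration): summarise while the data is still long, then widen only the finished summary so it is easy to read.

```r
# ChickWeight ships with R: one row per chick per measurement time (long form).
cw <- as.data.frame(ChickWeight)

# Processing step, done on the long data: mean weight per diet and time.
means <- aggregate(weight ~ Diet + Time, data = cw, FUN = mean)

# Display step: widen the summary so each time point becomes its own column.
display <- reshape(means, idvar = "Diet", timevar = "Time", direction = "wide")
print(display, digits = 4)
```

The long `means` table is the one you would keep working with; `display` exists only to be looked at.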

Example

Take, for example, the mtcars dataset. This is already in tidy form, in that each row represents a single observation. So "lengthening" it, to get something like this

        model type   value
1 AMC Javelin  mpg  15.200
2 AMC Javelin  cyl   8.000
3 AMC Javelin disp 304.000
4 AMC Javelin   hp 150.000
5 AMC Javelin drat   3.150
6 AMC Javelin   wt   3.435

is counterproductive; mpg and cyl are not comparable in any meaningful way.
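
For reference, a table of that shape can be produced along these lines. This is only a sketch in base R (an assumption on my part, chosen so it runs without extra packages; tidyr's `pivot_longer()` should give the same shape), and the column names `model`, `type` and `value` simply follow the output above.

```r
# Stack every mtcars column into one "value" column, tagged by car and variable.
long <- data.frame(
  model = rep(rownames(mtcars), times = ncol(mtcars)),
  type  = rep(names(mtcars), each = nrow(mtcars)),
  value = unlist(mtcars, use.names = FALSE)
)

# Sort by car so that all measurements of one car sit together.
long <- long[order(long$model), ]
rownames(long) <- NULL
head(long)

# The single "value" column now mixes miles per gallon, cylinder counts,
# weights and so on, so a summary such as mean(long$value) means nothing.
```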

Taking the ChickWeight dataset (which is in long form) and transforming it to wide by time

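library(tidyr)  # provides spread() and re-exports the %>% pipe
# One row per chick, one column per measurement time (wide by Time).
# (spread() is superseded by pivot_wider() in tidyr >= 1.0.0)
ChickWeight %>% spread(Time, weight)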
   Chick Diet  0  2  4  6   8  10  12  14  16  18  20  21
1     18    1 39 35 NA NA  NA  NA  NA  NA  NA  NA  NA  NA
2     16    1 41 45 49 51  57  51  54  NA  NA  NA  NA  NA
3     15    1 41 49 56 64  68  68  67  68  NA  NA  NA  NA
4     13    1 41 48 53 60  65  67  71  70  71  81  91  96
5      9    1 42 51 59 68  85  96  90  92  93 100 100  98
6     20    1 41 47 54 58  65  73  77  89  98 107 115 117
7     10    1 41 44 52 63  74  81  89  96 101 112 120 124
8      8    1 42 50 61 71  84  93 110 116 126 134 125  NA
9     17    1 42 51 61 72  83  89  98 103 113 123 133 142
10    19    1 43 48 55 62  65  71  82  88 106 120 144 157
11     4    1 42 49 56 67  74  87 102 108 136 154 160 157
12     6    1 41 49 59 74  97 124 141 148 155 160 160 157
13    11    1 43 51 63 84 112 139 168 177 182 184 181 175
...

gives a visualization that may be useful, but for data analysis purposes it is very inconvenient, because computing things like growth rate becomes cumbersome.
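
By contrast, in the original long form the growth rate is a short grouped calculation. A rough sketch in base R, assuming "growth rate" means weight gained per day between consecutive measurements of the same chick (the helper `step` and the column `growth_rate` are names made up here):

```r
cw <- as.data.frame(ChickWeight)
cw <- cw[order(cw$Chick, cw$Time), ]  # time order within each chick

# Change since the previous measurement (NA for each chick's first row).
step <- function(x) c(NA, diff(x))

# Weight change divided by elapsed days, computed separately per chick.
cw$growth_rate <- ave(cw$weight, cw$Chick, FUN = step) /
                  ave(cw$Time,   cw$Chick, FUN = step)

head(cw[, c("Chick", "Time", "weight", "growth_rate")])
```

Doing the same on the wide table would mean subtracting neighbouring columns pair by pair and carrying the time gaps around separately.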
