Methodology of high-dimensional data structuring in R vs. MATLAB


Problem description

What is the right way to structure multivariate data with categorical labels accumulated over repeated trials for exploratory analysis in R? I don't want to slip back to MATLAB.

I like R's analysis functions and syntax (and stunning plots) much better than MATLAB's, and have been working hard to refactor my stuff over. However, I keep getting hung up on the way data is organized in my work.

It's typical for me to work with multivariate time series repeated over many trials, stored in a big multidimensional array (in effect a rank-3 tensor) of SERIES x SAMPLES x TRIALS. This lends itself to some nice linear algebra stuff occasionally, but is clumsy when it comes to another variable, namely CLASS. Typically class labels are stored in another vector of dimension 1 x TRIALS.
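
For concreteness, here is a minimal sketch of that layout in R (the sizes and label names are hypothetical, purely for illustration):

n.series <- 4; n.samples <- 100; n.trials <- 30   # hypothetical sizes

# SERIES x SAMPLES x TRIALS array of (fake) signal data
data.array <- array(rnorm(n.series * n.samples * n.trials),
                    dim = c(n.series, n.samples, n.trials))

# class labels live in a separate vector of length TRIALS
class.labels <- factor(sample(c("A", "B"), n.trials, replace = TRUE))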

When it comes to analysis I basically plot as little as possible, because it takes so much work to get together a really good plot that teaches you a lot about the data in MATLAB. (I'm not the only one who feels this way).

In R I've been sticking as close as I can to the MATLAB structure, but things get annoyingly complex when trying to keep the class labeling separate; I'd have to keep passing the labels into functions even though I'm only using their attributes. So what I've done is separate the array into a list of arrays by CLASS. This adds complexity to all of my apply() functions, but seems to be worth it in terms of keeping things consistent (and bugs out).
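
Continuing the hypothetical objects above, a sketch of that split and the kind of apply() step it implies:

# one sub-array per class, indexing the TRIALS dimension
data.by.class <- lapply(levels(class.labels), function(cl)
  data.array[, , class.labels == cl, drop = FALSE])
names(data.by.class) <- levels(class.labels)

# e.g. per-series means within class "A", collapsing samples and trials
apply(data.by.class[["A"]], 1, mean)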

On the other hand, R just doesn't seem to be friendly with tensors/multidimensional arrays. Just to work with them, you need to grab the abind library. Documentation on multivariate analysis, like this example, seems to operate under the assumption that you have a huge 2-D table of data points (a data frame, like some long medieval scroll), and doesn't mention how to get 'there' from where I am.
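
For reference, abind() provides the rank-3 analogue of rbind()/cbind(); a small sketch (the arrays and sizes are made up):

require(abind)

a1 <- array(1, dim = c(4, 100, 10))   # 10 trials of one condition
a2 <- array(2, dim = c(4, 100, 20))   # 20 trials of another

combined <- abind(a1, a2, along = 3)  # bind along the TRIALS dimension
dim(combined)
#[1]   4 100  30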

Once I get to plotting and classifying the processed data, it's not such a big problem, since by then I've worked my way down to data frame-friendly structures with shapes like TRIALSxFEATURES (melt has helped a lot with this). On the other hand, if I want to quickly generate a scatterplot matrix or latticist histogram set for the exploratory phase (i.e. statistical moments, separation, in/between-class variance, histograms, etc.), I have to stop and figure out how I'm going to apply() these huge multidimensional arrays into something those libraries understand.
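
A hedged sketch of that step, using reshape2::melt on the hypothetical array from above to get a ggplot/lattice-friendly long table:

require(reshape2)

# one row per (series, sample, trial) cell of the array
long <- melt(data.array, varnames = c("series", "sample", "trial"),
             value.name = "value")

# attach the class labels by trial index
long$class <- class.labels[long$trial]
head(long)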

If I keep pounding around in the jungle coming up with ad-hoc solutions for this, I'm either never going to get better or I'll end up with my own weird wizardly ways of doing it that don't make sense to anybody.

So what's the right way to structure multivariate data with categorical labels accumulated over repeated trials for exploratory analysis in R? Please, I don't want to slip back to MATLAB.

Bonus: I tend to repeat these analyses over identical data structures for multiple subjects. Is there a better general way than wrapping the code chunks into for loops?
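
One common alternative to explicit for loops is to keep the per-subject structures in a named list and lapply() the whole analysis over it; a sketch with hypothetical names:

# hypothetical: one entry per subject, each holding its array and labels
subjects <- list(s01 = list(data = data.array, labels = class.labels),
                 s02 = list(data = data.array, labels = class.labels))

# run the same analysis for every subject; results come back as a named list
results <- lapply(subjects, function(s)
  apply(s$data, c(1, 3), mean))   # e.g. per-series, per-trial means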

Recommended answer

Maybe dplyr::tbl_cube?

Working on from @BrodieG's excellent answer, I think that you may find it useful to look at the new functionality available from dplyr::tbl_cube. This is essentially a multidimensional object that you can easily create from a list of arrays (as you're currently using), and it has some really good functions for subsetting, filtering and summarizing which (importantly, I think) are used consistently across the "cube" view and the "tabular" view of the data.

require(dplyr)

Caveats:

It's an early release: all the issues that come along with that
It's advisable with this version to unload plyr when loading dplyr

Loading your array into a cube

Here's an example using arr as defined in the other answer:
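
Since that answer isn't reproduced here, here is an assumed reconstruction of arr that matches the cube output below (the actual runif() values will of course differ):

arr <- array(runif(3 * 2 * 4), dim = c(3, 2, 4),
             dimnames = list(ser = paste("ser", 1:3),
                             smp = paste("smp", 1:2),
                             tr  = paste("tr", 1:4)))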

# using arr from previous example
# we can convert it simply into a tbl_cube
arr.cube <- as.tbl_cube(arr)

arr.cube  
#Source: local array [24 x 3]  
#D: ser [chr, 3]  
#D: smp [chr, 2]  
#D: tr [chr, 4]  
#M: arr [dbl[3,2,4]]

So note that D means Dimensions and M Measures, and you can have as many as you like of each.
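
As far as I can tell, you can also build a cube with several measures directly via the tbl_cube() constructor, which takes the dimensions and the measures as plain lists; a hedged sketch with a second, made-up measure:

arr2 <- arr * 2   # hypothetical second measure, same 3 x 2 x 4 shape

cube2 <- tbl_cube(dimensions = list(ser = paste("ser", 1:3),
                                    smp = paste("smp", 1:2),
                                    tr  = paste("tr", 1:4)),
                  measures   = list(arr = arr, arr.x2 = arr2))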

You can easily make the data tabular by returning it as a data.frame (which you can simply convert to a data.table if you need the functionality and performance benefits later)

head(as.data.frame(arr.cube))
#    ser   smp   tr       arr
#1 ser 1 smp 1 tr 1 0.6656456
#2 ser 2 smp 1 tr 1 0.6181301
#3 ser 3 smp 1 tr 1 0.7335676
#4 ser 1 smp 2 tr 1 0.9444435
#5 ser 2 smp 2 tr 1 0.8977054
#6 ser 3 smp 2 tr 1 0.9361929

Subsetting

You could obviously flatten all data for every operation, but that has many implications for performance and utility. I think the real benefit of this package is that you can "pre-mine" the cube for the data that you require before converting it into a tabular format that is ggplot-friendly, e.g. simple filtering to return only series 1:

arr.cube.filtered <- filter(arr.cube, ser == "ser 1")
as.data.frame(arr.cube.filtered)
#    ser   smp   tr       arr
#1 ser 1 smp 1 tr 1 0.6656456
#2 ser 1 smp 2 tr 1 0.9444435
#3 ser 1 smp 1 tr 2 0.4331116
#4 ser 1 smp 2 tr 2 0.3916376
#5 ser 1 smp 1 tr 3 0.4669228
#6 ser 1 smp 2 tr 3 0.8942300
#7 ser 1 smp 1 tr 4 0.2054326
#8 ser 1 smp 2 tr 4 0.1006973

tbl_cube currently works with the dplyr functions summarise(), select(), group_by() and filter(). Usefully you can chain these together with the %.% operator.
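
For instance, assuming summarise() aggregates over the dimensions you leave out of group_by() (as in the nasa example below), a chained summary of arr.cube might look like:

trial.means <- group_by(arr.cube, tr) %.%   # group by trial
  summarise(arr.mean = mean(arr))           # average over ser and smp

as.data.frame(trial.means)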

For the rest of the examples, I'm going to use the inbuilt nasa tbl_cube object, which has a bunch of meteorological data (and demonstrates multiple dimensions and measures):

nasa
#Source: local array [41,472 x 4]
#D: lat [dbl, 24]
#D: long [dbl, 24]
#D: month [int, 12]
#D: year [int, 6]
#M: cloudhigh [dbl[24,24,12,6]]
#M: cloudlow [dbl[24,24,12,6]]
#M: cloudmid [dbl[24,24,12,6]]
#M: ozone [dbl[24,24,12,6]]
#M: pressure [dbl[24,24,12,6]]
#M: surftemp [dbl[24,24,12,6]]
#M: temperature [dbl[24,24,12,6]]

So here is an example showing how easy it is to pull back a subset of modified data from the cube, and then flatten it so that it's appropriate for plotting:

plot_data <- as.data.frame(            # as.data.frame so we can see the data
  filter(nasa, long < (-70)) %.%       # filter long < (-70) (arbitrary!)
    group_by(lat, long) %.%            # group by lat/long combo
    summarise(p.max = max(pressure),   # create summary measures for each group
              o.avg = mean(ozone),
              c.all = (cloudhigh + cloudlow + cloudmid) / 3)
)

head(plot_data)

#       lat   long p.max    o.avg    c.all
#1 36.20000 -113.8   975 310.7778 22.66667
#2 33.70435 -113.8   975 307.0833 21.33333
#3 31.20870 -113.8   990 300.3056 19.50000
#4 28.71304 -113.8  1000 290.3056 16.00000
#5 26.21739 -113.8  1000 282.4167 14.66667
#6 23.72174 -113.8  1000 275.6111 15.83333

Consistent notation for n-d and 2-d data structures

Sadly the mutate() function isn't yet implemented for tbl_cube, but it looks like that will just be a matter of (not much) time. You can use it (and all the other functions that work on the cube) on the tabular result, though, with exactly the same notation. For example:

plot_data.mod <- filter(plot_data, lat > 25) %.%   # filter out lat <= 25
  mutate(arb.meas = o.avg / p.max)                 # make a new column

head(plot_data.mod)

#       lat      long p.max    o.avg    c.all  arb.meas
#1 36.20000 -113.8000   975 310.7778 22.66667 0.3187464
#2 33.70435 -113.8000   975 307.0833 21.33333 0.3149573
#3 31.20870 -113.8000   990 300.3056 19.50000 0.3033389
#4 28.71304 -113.8000  1000 290.3056 16.00000 0.2903056
#5 26.21739 -113.8000  1000 282.4167 14.66667 0.2824167
#6 36.20000 -111.2957   930 313.9722 20.66667 0.3376045

Plotting - as an example of R functionality that 'likes' flat data

You can then plot with ggplot(), taking advantage of the flattened data:

require(ggplot2)

# plot as you like:
ggplot(plot_data.mod) +
  geom_point(aes(lat, long, size = c.all, color = c.all, shape = cut(p.max, 6))) +
  facet_grid(lat ~ long) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

I'm not going to expand on the use of data.table here, as it's covered well in the previous answer. Obviously there are many good reasons to use data.table; for any situation here, you can get one with a simple conversion of the data.frame:

require(data.table)
data.table(as.data.frame(your_cube_name))

Working dynamically with your cube

Another thing I think is great is the ability to add measures (slices / scenarios / shifts, whatever you want to call them) to your cube. I think this will fit well with the method of analysis described in the question. Here's a simple example with arr.cube - adding an additional measure which is itself an (admittedly simple) function of the previous measure. You access/update measures through the syntax yourcube$mets[$...]

head(as.data.frame(arr.cube))

#    ser   smp   tr       arr
#1 ser 1 smp 1 tr 1 0.6656456
#2 ser 2 smp 1 tr 1 0.6181301
#3 ser 3 smp 1 tr 1 0.7335676
#4 ser 1 smp 2 tr 1 0.9444435
#5 ser 2 smp 2 tr 1 0.8977054
#6 ser 3 smp 2 tr 1 0.9361929

arr.cube$mets$arr.bump <- arr.cube$mets$arr * 1.1  # arbitrary modification!

head(as.data.frame(arr.cube))

#    ser   smp   tr       arr  arr.bump
#1 ser 1 smp 1 tr 1 0.6656456 0.7322102
#2 ser 2 smp 1 tr 1 0.6181301 0.6799431
#3 ser 3 smp 1 tr 1 0.7335676 0.8069244
#4 ser 1 smp 2 tr 1 0.9444435 1.0388878
#5 ser 2 smp 2 tr 1 0.8977054 0.9874759
#6 ser 3 smp 2 tr 1 0.9361929 1.0298122

Dimensions - or not ...

I've played a little with trying to dynamically add entirely new dimensions (effectively scaling up an existing cube with additional dimensions and cloning or modifying the original data using yourcube$dims[$...]) but have found the behaviour to be a little inconsistent. Probably best to avoid this anyway, and structure your cube first before manipulating it. Will keep you posted if I get anywhere.

Obviously one of the main issues with having interpreter access to a multidimensional database is the potential to accidentally bugger it with an ill-timed keystroke. So I guess just persist early and often:

tempfilename <- gsub("[ :-]", "", paste0("DBX", (Sys.time()), ".cub"))
# save:
save(arr.cube, file = tempfilename)
# load:
load(file = tempfilename)

Hope that helps!
