Limiting size of hierarchical data for reproducible example


Problem description

I am trying to come up with a reproducible example (RE) for this question: Data frame columns lost after merging. To qualify as having an RE, the question lacks only reproducible data. However, when I tried the pretty much standard approach of dput(head(myDataObj)), the output was a 14 MB file. The problem is that my data object is a list of data frames, so the head() limit doesn't appear to apply recursively.

I haven't found any options for dput() or head() that would let me control data size recursively for complex objects. Unless I am wrong about that, what other approaches to creating a minimal RE dataset would you recommend in this situation?

Recommended answer

Along the lines of @MrFlick's comment about using lapply, you can use any of the apply family of functions to run head or sample, depending on your needs, to reduce the size of the data both for REs and for testing (I've found that working with subsets or subsamples of large data sets is preferable for debugging and even charting).

Note that head and tail return the first or last elements of a structure. Sometimes these don't vary enough for RE purposes, and they are certainly not random, which is where sample may be more useful.
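For the flat case from the question (a list of data frames), the lapply idea might look like the sketch below. The `dfs` list and its columns are made-up illustration data, not the asker's actual object.

```r
# Hypothetical stand-in for a list of data frames (not the asker's data).
dfs <- list(
  first  = data.frame(id = 1:100, grp = rep(c("a", "b"), 50)),
  second = data.frame(id = 1:5, score = c(2.5, 3.1, 4.7, 1.2, 9.9))
)

# First rows of every data frame; n is passed through lapply's "..." to head.
heads <- lapply(dfs, head, n = 3)

# Random rows instead, capped at the number of rows available.
set.seed(1)
samples <- lapply(dfs, function(d) {
  d[sample.int(nrow(d), size = min(nrow(d), 3)), , drop = FALSE]
})

# Text representation suitable for pasting into a question.
dput(heads)
```

The head version keeps the original row order, while the sampled version is more likely to include varied values.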

Suppose we have a hierarchical tree structure (list of lists of...) and we want to subset each "leaf" while preserving the structure and labels in the tree.

x <- list( 
    a=1:10, 
    b=list( ba=1:10, bb=1:10 ), 
    c=list( ca=list( caa=1:10, cab=letters[1:10], cac="hello" ), cb=toupper( letters[1:10] ) ) )

NOTE: In the following, I can't actually tell the difference between using how="replace" and how="list".
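For what it's worth, the two settings do seem to diverge once rapply's classes argument is restricted. According to the rapply documentation, leaves whose class is not listed are left unchanged with how="replace" but are replaced by deflt (NULL by default) with how="list". A small sketch, using a reduced list made up for illustration:

```r
x <- list(a = 1:10, b = list(ba = letters[1:10], bb = 1:10))

# Apply head only to integer leaves; character leaves do not match `classes`.
kept    <- rapply(x, head, classes = "integer", how = "replace", n = 2)
dropped <- rapply(x, head, classes = "integer", how = "list",    n = 2)

kept$b$ba     # the original character vector, left untouched
dropped$b$ba  # NULL, the default value of rapply's `deflt` argument
```

With the default classes="ANY" every leaf matches, which would explain why the two look identical in the examples here.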

ALSO NOTE: This won't work well for data.frame leaf nodes.
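The likely reason is that a data frame is itself a list, so rapply descends into its columns instead of treating the whole frame as one leaf. One possible workaround is a small hand-written recursion that stops at data frames. The helper name `shrink_leaves` and the `nested` test data below are hypothetical, introduced only for this sketch.

```r
# Hypothetical helper (not part of the original answer): shrink every leaf
# of a nested list, treating data frames as leaves rather than as lists.
shrink_leaves <- function(obj, n = 2) {
  if (is.data.frame(obj)) {
    head(obj, n)                       # keep the first n rows
  } else if (is.list(obj)) {
    lapply(obj, shrink_leaves, n = n)  # recurse; lapply keeps the names
  } else {
    head(obj, n)                       # keep the first n elements
  }
}

# Made-up nested structure with data frames at different depths.
nested <- list(
  df1 = data.frame(id = 1:10, value = letters[1:10]),
  grp = list(df2 = data.frame(x = 1:10), v = 1:10)
)

small <- shrink_leaves(nested)
str(small)
```

Swapping head for a row-sampling function in the data frame branch would give random rows instead of the first ones.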

# Set seed so the example is reproducible with randomized methods:
set.seed(1)

You can use head directly:

rapply( x, head, how="replace" )

Or pass an anonymous function that modifies the behavior:

# Complete anonymous function
rapply( x, function(y){ head(y,2) }, how="replace" )
# Same behavior, but using the rapply "..." argument to pass the n=2 to head.
rapply( x, head, how="replace", n=2 )

The following returns a randomly ordered sample of each leaf:

# This works because we use minimum in case leaves are shorter
# than the requested maximum length.
rapply( x, function(y){ sample(y, size=min(length(y),2) ) }, how="replace" )

# Less efficient, but maybe easier to read
# (note that head() keeps up to 6 elements by default):
rapply( x, function(y){ head(sample(y)) }, how="replace" )
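One more caveat with sample on leaves: per its documentation, when x is a single number greater than or equal to 1, sample(x) draws from 1:x rather than from x itself. A leaf holding a single number could therefore come back with a different value. A minimal sketch of an index-based alternative follows; the name `safe_sample` is hypothetical.

```r
# Hypothetical helper: sample by position so that a length-one numeric
# leaf is returned as-is instead of being expanded to 1:x.
safe_sample <- function(y, n = 2) {
  y[sample.int(length(y), size = min(length(y), n))]
}

set.seed(1)
leaf <- 10
sample(leaf, size = 1)  # draws from 1:10, not from the single value 10
safe_sample(leaf)       # always returns 10

# Usable with rapply in the same way as the examples above.
x <- list(a = 1:10, b = list(ba = 10, bb = "hello"))
rapply(x, safe_sample, how = "replace")
```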







# XXX: The following does NOT work, because `sample` with a `size`
# greater than the length of the item being sampled fails
# (when sampling without replacement):
rapply( x, function(y){ sample(y, size=2) }, how="replace" )
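To tie this back to the original goal, the reduced structure can be passed to dput to get text small enough to paste into a question. A sketch using the example list from above and an arbitrary cut-off of n=2:

```r
x <- list(
    a=1:10,
    b=list( ba=1:10, bb=1:10 ),
    c=list( ca=list( caa=1:10, cab=letters[1:10], cac="hello" ), cb=toupper( letters[1:10] ) ) )

# Keep at most two elements per leaf while preserving structure and names.
small <- rapply( x, head, how="replace", n=2 )

# Compare in-memory sizes, then print the text to paste into the question.
object.size(x)
object.size(small)
dput(small)
```

Readers can then rebuild the object by assigning the pasted dput output to a variable.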
