Creating new data frames from a larger data frame using a list

Problem description

I have a data frame that contains multiple data points for a large number of samples. Here is a shortened example with 3 samples each with 3 data points:

Assay       Genotype      Sample 
CCT6-002        G         sam1   
CCT6-007        G         sam1
CCT6-013        C         sam1 
CCT6-002        T         sam2   
CCT6-007        A         sam2
CCT6-013        T         sam2 
CCT6-002        T         sam3   
CCT6-007        A         sam3
CCT6-013        T         sam3 

To do my downstream analysis I would like to subset the data for each sample into an individual data frame. Since this is something that I will be doing with many data sets with changing sample names, I'd like to figure out an automated way of doing this so I don't need to edit my script each time with the list of new samples.

I would like my output to be a data frame for each sample with the same name as the sample. So with the example data above, the result should be 3 data frames with the names sam1, sam2, sam3. Each data frame would have 3 lines with the Assay and genotype data.

I am sorry if this is a very basic question but I'm a newbie and have been working on this for quite a while. Thanks!

Solution

The split command is the easiest way to turn this into a list of data.frame objects split on sample.

myList <- split(mydf, mydf$Sample)

The items can be accessed in the list by numeric indexing (e.g. myList[[1]]) or by the name of the unique item in the variable Sample (e.g. myList$sam1).
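As a minimal sketch, the question's example data can be rebuilt and split as follows. The names mydf and myList come from the answer; the data.frame() call is only an assumption about how the data was loaded (in practice it would probably come from read.table or similar).

```r
# Rebuild the example data from the question (assumed layout: one row per assay per sample)
mydf <- data.frame(
  Assay    = rep(c("CCT6-002", "CCT6-007", "CCT6-013"), times = 3),
  Genotype = c("G", "G", "C", "T", "A", "T", "T", "A", "T"),
  Sample   = rep(c("sam1", "sam2", "sam3"), each = 3),
  stringsAsFactors = FALSE
)

# One data frame per unique value of Sample
myList <- split(mydf, mydf$Sample)

names(myList)   # the list elements are named after the samples
myList[[1]]     # access by position
myList$sam2     # access by name
```

Each element keeps all three columns, including Sample, so nothing is lost by splitting.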

The numeric indexing is obviously handy when you're going through a sequence but you can still use the name for that as well.

# get names of the unique items in Sample
nam <- unique(mydf$Sample)
# as a test, look at the first few rows of each of my data.frames
for (i in nam) print(head(myList[[i]]))
# another way to access the data.frame is the with() statement
for (i in nam) with(myList[[i]], print(Assay[1:2]))

That's not necessarily the most efficient R syntax but hopefully it gets you farther along in actually using your list of data.frame objects.
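If separate objects named sam1, sam2 and sam3 are really wanted, as the question asks, one possible route (not part of the original answer) is base R's list2env(), which copies each list element into an environment under its own name. The sketch below rebuilds the example data and uses a fresh environment, sample_env, which is a hypothetical name chosen so that nothing in the workspace gets overwritten.

```r
# Example data from the question (construction is an assumption)
mydf <- data.frame(
  Assay    = rep(c("CCT6-002", "CCT6-007", "CCT6-013"), times = 3),
  Genotype = c("G", "G", "C", "T", "A", "T", "T", "A", "T"),
  Sample   = rep(c("sam1", "sam2", "sam3"), each = 3),
  stringsAsFactors = FALSE
)
myList <- split(mydf, mydf$Sample)

# Copy every list element into a new environment, one object per sample name
sample_env <- list2env(myList, envir = new.env())

ls(sample_env)          # names of the objects that were created
head(sample_env$sam1)   # the data frame for sam1
```

Passing envir = globalenv() instead would create the objects directly in the workspace, at the cost of silently replacing anything that already has one of those names.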

Now, that gives you what you asked for, but here's some advice on what you asked for: don't do it. Just learn to properly access your data.frame object. You could just as easily skip making the list and loop over all of the unique values of Sample in your code, including saving them out as separate files. The advantage is that you can run lots of nifty vectorized commands on your intact data.frame across Sample that are much harder on the list. Just stick with your nice big data.frame.
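For instance, writing each sample to its own file needs no list at all. This is only a sketch: the output directory (a temporary one here) and the file naming scheme are assumptions, and the data is rebuilt so the snippet runs on its own.

```r
# Example data from the question (construction is an assumption)
mydf <- data.frame(
  Assay    = rep(c("CCT6-002", "CCT6-007", "CCT6-013"), times = 3),
  Genotype = c("G", "G", "C", "T", "A", "T", "T", "A", "T"),
  Sample   = rep(c("sam1", "sam2", "sam3"), each = 3),
  stringsAsFactors = FALSE
)

out_dir <- tempdir()  # hypothetical output location; replace with a real folder

for (s in unique(mydf$Sample)) {
  # Rows for this sample only, keeping the Assay and Genotype columns
  one_sample <- mydf[mydf$Sample == s, c("Assay", "Genotype")]
  # File is named after the sample, e.g. sam1.csv
  write.csv(one_sample, file.path(out_dir, paste0(s, ".csv")), row.names = FALSE)
}
```

Because the sample names are read from the data, the loop needs no editing when a new data set arrives with different names.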

Here are a couple of simple examples. Look at what I did above for just getting the first few lines of each of the separate data.frame objects in the list. Here's something similar just run on the big data.frame.

lapply( unique(mydf$Sample), function(x) print(head( mydf[ mydf$Sample == x,] )) )

How about something more meaningful? Let's say I want a count of each individual Genotype separated by Sample.

table( mydf$Genotype, mydf$Sample)

That's much easier than what you'd have to do with the big list. There are lots of functions like that which you'll want to use on your intact data.frame, such as tapply and aggregate. Even if you wanted to do something that seems like it might be easier with the data.frame broken up, like sorting within each Sample level, it's easier with the data.frame.

mydf[ order(mydf$Sample, mydf$Assay), ]

That will order by Sample and then by Assay nested within Sample.
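The tapply and aggregate functions mentioned above work the same way on the intact data frame. Here is a rough sketch on the example data; the particular summaries (distinct genotypes per sample, rows per Sample/Genotype pair) and the result names n_geno and counts are just illustrations, not something the question asked for.

```r
# Example data from the question (construction is an assumption)
mydf <- data.frame(
  Assay    = rep(c("CCT6-002", "CCT6-007", "CCT6-013"), times = 3),
  Genotype = c("G", "G", "C", "T", "A", "T", "T", "A", "T"),
  Sample   = rep(c("sam1", "sam2", "sam3"), each = 3),
  stringsAsFactors = FALSE
)

# Number of distinct genotypes seen in each sample
n_geno <- tapply(mydf$Genotype, mydf$Sample, function(g) length(unique(g)))
print(n_geno)

# Number of assays for every Sample/Genotype combination, in long format
counts <- aggregate(Assay ~ Sample + Genotype, data = mydf, FUN = length)
print(counts)
```

Both calls group by Sample internally, so there is no need to split the data first.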

When I started with R I thought that splitting up data.frame objects was the way to go and used it a lot. Since I've learned R better I never ever do that. I don't have a single bit of R code, written after my first few weeks with R, that ever splits up a data.frame into a list. I'm not saying you should never do it. I'm just saying that it's relatively rare that you need it or that it's the best idea. You might want to post a query on here about your end goal and get some advice on that.
