How to get multiple matrices from large data sets based on year


Problem description

Before I start, here is a small subset of the data I'm working with. I apologize in advance for it being so large (note that this is only the first 20 rows of an extremely large dataset):

mydata<-structure(list(ParkName = c("SEP", "CSSP", 
                        "SEP", "ONF", "SEP", 
                        "ONF", "SEP", 
                        "CSSP", "ONF", 
                        "SEP", "CSSP", 
                        "PPRSP", "PPRSP", 
                        "SEP", "ONF", 
                        "PPRSP", "ONF", 
                        "SEP", "SEP", 
                        "ONF"), 
           Year = c(2001, 2005, 1998,2011, 1991, 1991, 1991, 1991, 1991, 1992, 1992, 1992, 1992, 1992,
                                          1992, 1992, 1992, 1993, 1994, 1994), 
           LatinName = c("Mola mola", "Clarias batrachus", "Lithobates catesbeianus", "Rana catesbeiana", "Rana catesbeiana", 
                         "Rana yellowis", "Rana catesbeiana", "Solenopsis sp1","Rana catesbeiana", "Rana catesbeiana",
                         "Pratensis", "Rana catesbeiana",  "Rana catesbeiana", "sp2", "Orchidaceae",
                         "Rana catesbeiana","Formica", "Rana catesbeiana", "Rana catesbeiana", "sp2"), 
           NumTotal = c(1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 100, 2, 1, 2)),
      .Names = c("ParkName", "Year", "LatinName", "NumTotal"),
      row.names = c(NA, -20L), class = c("tbl_df", "tbl",  "data.frame"))

This dataset represents the abundance of different species in different parks over many years. What I essentially want to do is get a species x park matrix for every year in which data was recorded, and then use the 'vegan' package to calculate diversity indices for each park for each year. Obviously this is not a balanced dataset, as not every park recorded species abundance for every year. I've realized that to do this I need to run loops: I would need a list of parks per year, and a list of species and their abundance per park per year, in order to create these matrices. I'm not the greatest when it comes to writing loops, and this task is confusing me. For example, I created a separate vector of the unique years in the dataset. I then created an empty list called "parkbyyear" to fill with the parks recorded in each year from the main data frame:

year<-as.vector(unique(data[,3]))
parkbyyear<-NULL

for (i in 1:year) {
  parkbyyear[i]<- mydata[mydata$ParkName[year == "i"]
}

The loop fails to run. Any help would be appreciated.

Recommended answer

Simply use by to slice the data frame by the needed factor(s) and run an operation on each slice, such as returning a vector:

parkbyyear_list <- by(mydata, mydata$Year, FUN=function(df) df$ParkName)

parkbyyear_list
# mydata$Year: 1991
# [1] "SEP"  "ONF"  "SEP"  "CSSP" "ONF" 
# ---------------------------------------------------------------------------
# mydata$Year: 1992
# [1] "SEP"   "CSSP"  "PPRSP" "PPRSP" "SEP"   "ONF"   "PPRSP" "ONF"  
# --------------------------------------------------------------------------- 
# mydata$Year: 1993
# [1] "SEP"
# ---------------------------------------------------------------------------
# mydata$Year: 1994
# [1] "SEP" "ONF"
# ---------------------------------------------------------------------------
# mydata$Year: 1998
# [1] "SEP"
# ---------------------------------------------------------------------------
# mydata$Year: 2001
# [1] "SEP"
# ---------------------------------------------------------------------------
# mydata$Year: 2005
# [1] "CSSP"
# ---------------------------------------------------------------------------
# mydata$Year: 2011
# [1] "ONF"
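
For comparison, here is a rough sketch of how the loop from the question could be repaired, in case an explicit loop is still preferred over by. It assumes the years are in the Year column (not column 3), iterates over positions with seq_along rather than 1:year, and compares against the value years[i] instead of the string "i". The small data frame is a stand-in with the same column names, not the full dataset.

```r
# Small stand-in for the question's data (same column names, fewer rows).
mydata <- data.frame(
  ParkName = c("SEP", "ONF", "SEP", "CSSP", "ONF", "SEP"),
  Year     = c(1991, 1991, 1991, 1991, 1991, 1993),
  stringsAsFactors = FALSE
)

# Unique years, taken from the Year column by name.
years <- sort(unique(mydata$Year))

# Pre-allocate a list with one slot per year and label the slots.
parkbyyear <- vector("list", length(years))
names(parkbyyear) <- years

for (i in seq_along(years)) {
  # Keep the park names of the rows whose Year equals the i-th year.
  parkbyyear[[i]] <- mydata$ParkName[mydata$Year == years[i]]
}

parkbyyear[["1991"]]
```

The result is an ordinary named list, so parkbyyear[["1991"]] returns the parks recorded in 1991, with repeats if a park has several rows; wrap the right-hand side in unique() if each park should appear only once.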

For a list of data frames subsetted by Year, simply use split (or by again):

dfList <- split(mydata, mydata$Year)
# dfList <- by(mydata, mydata$Year, FUN=function(df) df)   # SIMILAR CALL

dfList

# $`1991`
#   ParkName Year        LatinName NumTotal
# 5      SEP 1991 Rana catesbeiana        2
# 6      ONF 1991    Rana yellowis        1
# 7      SEP 1991 Rana catesbeiana        1
# 8     CSSP 1991   Solenopsis sp1        1
# 9      ONF 1991 Rana catesbeiana        1

# $`1992`
#    ParkName Year        LatinName NumTotal
# 10      SEP 1992 Rana catesbeiana        1
# 11     CSSP 1992        Pratensis        1
# 12    PPRSP 1992 Rana catesbeiana        1
# 13    PPRSP 1992 Rana catesbeiana        1
# 14      SEP 1992              sp2        1
# 15      ONF 1992      Orchidaceae        1
# 16    PPRSP 1992 Rana catesbeiana        1
# 17      ONF 1992          Formica      100
# 
# $`1993`
#    ParkName Year        LatinName NumTotal
# 18      SEP 1993 Rana catesbeiana        2
# 
# $`1994`
#    ParkName Year        LatinName NumTotal
# 19      SEP 1994 Rana catesbeiana        1
# 20      ONF 1994              sp2        2
# 
# $`1998`
#   ParkName Year               LatinName NumTotal
# 3      SEP 1998 Lithobates catesbeianus        1
# 
# $`2001`
#   ParkName Year LatinName NumTotal
# 1      SEP 2001 Mola mola        1
# 
# $`2005`
#   ParkName Year         LatinName NumTotal
# 2     CSSP 2005 Clarias batrachus        1
# 
# $`2011`
#   ParkName Year        LatinName NumTotal
# 4      ONF 2011 Rana catesbeiana        1
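
From the split list, one possible way to reach the per-year matrices the question asks for is to cross-tabulate each yearly data frame. The sketch below is only an illustration under some assumptions: it puts parks in rows and species in columns (the sites-by-species layout that vegan's functions expect), sums NumTotal when a park and species pair appears more than once in a year, and uses a small stand-in data frame. The helper shannon_by_park is hypothetical and written in base R so the example runs without vegan.

```r
# Small stand-in for the question's data (same column names, fewer rows).
mydata <- data.frame(
  ParkName  = c("SEP", "ONF", "SEP", "CSSP", "ONF", "SEP"),
  Year      = c(1991, 1991, 1991, 1991, 1991, 1993),
  LatinName = c("Rana catesbeiana", "Rana yellowis", "Rana catesbeiana",
                "Solenopsis sp1", "Rana catesbeiana", "Rana catesbeiana"),
  NumTotal  = c(2, 1, 1, 1, 1, 2),
  stringsAsFactors = FALSE
)

dfList <- split(mydata, mydata$Year)

# One park-by-species abundance matrix per year.
matList <- lapply(dfList, function(df) {
  m <- tapply(df$NumTotal,
              list(Park = df$ParkName, Species = df$LatinName),
              sum)
  m[is.na(m)] <- 0   # combinations that were never recorded count as 0
  m
})

# Hypothetical helper: Shannon index per row (park), natural log.
shannon_by_park <- function(m) {
  apply(m, 1, function(x) {
    p <- x[x > 0] / sum(x)
    -sum(p * log(p))
  })
}

# One named vector of diversity values per year.
divList <- lapply(matList, shannon_by_park)

matList[["1991"]]
divList[["1991"]]
```

With vegan installed, replacing shannon_by_park with function(m) vegan::diversity(m, index = "shannon") should give the same values for these matrices, and other indices can be requested through the index argument.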
