How to show frequencies in discrete categories with range data in R?

Problem description

I'm trying to sort out a bunch of data that I have about dinosaurs and their age ranges. So far, my data consists of a column of names, and then two columns of maximum and minimum dates in millions of years in the past, as you can see here:

GENUS           ma_max  ma_min  ma_mid    
Abydosaurus     109     94.3    101.65    
Achelousaurus   84.9    70.6    77.75    
Acheroraptor    70.6    66.043  68.3215    

Geological time is split into different periods (such as the Jurassic and Cretaceous), and these are subdivided into stages. These stages have specific age ranges, and I have made a data frame to display them:

Stage          ma_max ma_min ma_mid
Hettangian      201.6  197.0 199.30
Sinemurian      197.0  190.0 193.50
Pliensbachian   190.0  183.0 186.50
Toarcian        183.0  176.0 179.50
Aalenian        176.0  172.0 174.00
Bajocian        172.0  168.0 170.00
Bathonian       168.0  165.0 166.50
Callovian       165.0  161.0 163.00
Oxfordian       161.0  156.0 158.50
Kimmeridgian    156.0  151.0 153.50
Tithonian       151.0  145.5 148.25
Berriasian      145.5  140.0 142.75
Valanginian     140.0  136.0 138.00
Hauterivian     136.0  130.0 133.00
Barremian       130.0  125.0 127.50
Aptian          125.0  112.0 118.50
Albian          112.0   99.6 105.80
Cenomanian      99.6   93.5  96.55
Turonian        93.5   89.3  91.40
Coniacian       89.3   85.8  87.55
Santonian       85.8   83.5  84.65
Campanian       83.5   70.6  77.05
Maastrichtian   70.6   66.5  68.05

I'm trying to find out how many genera are in each stage. The problem is the range: for example, a genus can have a range that spans 3 or more stages, and I want each of those stages to record the presence of that genus. Is there any simple way to do this? I thought about using 'shingle' from the lattice package, as suggested in a similar discussion on here, but I'm very new to R and not sure whether it can be used when the data are ranges.

Recommended answer

Assuming your data frames are called genus and stage, first create a list that contains, for each Stage, the names of the genera that lived during that Stage. Then we'll add that to the stage data frame and also add another column that counts the number of genera living during each Stage.

In the code below, sapply takes each value of Stage in turn and tests what values of GENUS fall within that Stage's time range by comparing the Stage's ma_max and ma_min with the ma_max and ma_min for each GENUS.

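# List of genera that lived during each Stage.
# Ages are in Ma, so larger values are older. A genus overlaps a Stage when
# the Stage starts no later than the genus ends and ends no earlier than the
# genus starts. This also catches a genus that lies entirely inside a Stage.
stages.genus = sapply(stage$Stage, function(x) {
  s = stage$Stage == x
  genus$GENUS[stage$ma_max[s] >= genus$ma_min &
                stage$ma_min[s] <= genus$ma_max]
}, simplify = FALSE)   # always return a list, one element per Stage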
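The comparison between a Stage and a genus comes down to testing whether two ranges overlap. The sketch below isolates that test for a few pairs of ranges. The helper name `ranges_overlap` and the 69 to 67 Ma genus are illustrative assumptions that do not appear in the question or the original answer. One case worth checking with any such test is a genus whose range lies entirely inside a single Stage, where neither Stage boundary falls inside the genus range.

```r
# Hypothetical helper (an assumption, not part of the original answer):
# TRUE where the range [min_a, max_a] shares at least one point with
# the range [min_b, max_b]. Works element-wise on vectors.
ranges_overlap <- function(max_a, min_a, max_b, min_b) {
  max_a >= min_b & min_a <= max_b
}

# Maastrichtian stage from the question: 70.6 to 66.5 Ma
stage_max <- 70.6
stage_min <- 66.5

# Acheroraptor (70.6 to 66.043 Ma) extends past the end of the stage
partial <- ranges_overlap(stage_max, stage_min, 70.6, 66.043)

# A made-up genus (69 to 67 Ma) that lies entirely inside the stage
contained <- ranges_overlap(stage_max, stage_min, 69, 67)

# Abydosaurus (109 to 94.3 Ma) ended long before the stage began
disjoint <- ranges_overlap(stage_max, stage_min, 109, 94.3)

print(c(partial = partial, contained = contained, disjoint = disjoint))
```

Because the comparisons use `>=` and `<=`, a genus whose range only touches a Stage boundary is counted as present in that Stage. Switching to `>` and `<` would exclude such boundary-only contacts, if that is preferred.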

For each element of stages.genus, paste together all values of GENUS that apply to that Stage, separated by a comma, giving us a vector containing the genera that go with each value of Stage. Assign that vector as a new column of stage that we'll call genera.

# Add list of genera by stage to the stage data frame
# collapse (not sep) joins the elements of one vector into a single string
stage$genera = sapply(stages.genus, paste, collapse=", ")

To get a count of the number of genera in each Stage, just count the number of genera in each element of stages.genus and assign that to a new column of stage that we'll call Ngenera:

# Add count of genera for each Stage to the stage data frame
# sapply returns a plain integer vector rather than a list column
stage$Ngenera = sapply(stages.genus, length)

Here is the result:

> stage

           Stage ma_max ma_min ma_mid                      genera Ngenera
1     Hettangian  201.6  197.0 199.30                                   0
2     Sinemurian  197.0  190.0 193.50                                   0
...
16        Aptian  125.0  112.0 118.50                                   0
17        Albian  112.0   99.6 105.80                 Abydosaurus       1
18    Cenomanian   99.6   93.5  96.55                 Abydosaurus       1
19      Turonian   93.5   89.3  91.40                                   0
20     Coniacian   89.3   85.8  87.55                                   0
21     Santonian   85.8   83.5  84.65               Achelousaurus       1
22     Campanian   83.5   70.6  77.05 Achelousaurus, Acheroraptor       2
23 Maastrichtian   70.6   66.5  68.05 Achelousaurus, Acheroraptor       2
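As a cross-check on the counts, the same presence information can be computed in one step with outer(), which compares every Stage against every genus at once. This is a sketch of an alternative approach, not the method used in the answer. It rebuilds a subset of the question's data (the last eight stages) so that it runs on its own, and the object names `present` and `Ngenera` are assumptions chosen for illustration.

```r
# Data copied from the question (last eight stages only, for brevity)
genus <- data.frame(
  GENUS  = c("Abydosaurus", "Achelousaurus", "Acheroraptor"),
  ma_max = c(109, 84.9, 70.6),
  ma_min = c(94.3, 70.6, 66.043),
  stringsAsFactors = FALSE
)
stage <- data.frame(
  Stage  = c("Aptian", "Albian", "Cenomanian", "Turonian", "Coniacian",
             "Santonian", "Campanian", "Maastrichtian"),
  ma_max = c(125.0, 112.0, 99.6, 93.5, 89.3, 85.8, 83.5, 70.6),
  ma_min = c(112.0, 99.6, 93.5, 89.3, 85.8, 83.5, 70.6, 66.5),
  stringsAsFactors = FALSE
)

# Stage-by-genus logical matrix: TRUE where the two ranges overlap.
# outer(x, y, ">=")[i, j] is x[i] >= y[j].
present <- outer(stage$ma_max, genus$ma_min, ">=") &
           outer(stage$ma_min, genus$ma_max, "<=")
dimnames(present) <- list(stage$Stage, genus$GENUS)

# The number of genera in each Stage is the row sum of the matrix
Ngenera <- rowSums(present)
print(Ngenera)
```

For these eight stages the counts agree with the Ngenera column shown above (0, 1, 1, 0, 0, 1, 2, 2). The matrix form avoids looping over stages, at the cost of holding a stages-by-genera matrix in memory.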

An additional option is to create a column in stage for each GENUS and set the value to 1 if the GENUS lived during that stage or zero otherwise:

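# One indicator column per genus: 1 if it lived during the Stage, else 0.
# %in% tests exact membership, so a genus name that is a substring of
# another genus name cannot produce a false match.
genus.names = as.character(genus$GENUS)
stage[, genus.names] = lapply(genus.names, function(x) {
  sapply(stages.genus, function(g) as.integer(x %in% g))
})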

Here are the additional columns we just added:

> stage[ , c(1,7:9)]   # Just show the Stage plus the three new GENUS columns

           Stage Abydosaurus Achelousaurus Acheroraptor
1     Hettangian           0             0            0
2     Sinemurian           0             0            0
...
16        Aptian           0             0            0
17        Albian           1             0            0
18    Cenomanian           1             0            0
19      Turonian           0             0            0
20     Coniacian           0             0            0
21     Santonian           0             1            0
22     Campanian           0             1            1
23 Maastrichtian           0             1            1

The last step will also set you up for a nice visualization of genera by stage. For example:

library(reshape2)
library(ggplot2)

# Melt data into long format
stage.m = melt(stage[,c(1:4,7:9)], id.vars=1:4)

# Tile plot where height of each Stage is proportional to how long it lasted
ggplot(stage.m, aes(variable, ma_mid, fill=factor(value))) +
  geom_tile(aes(height=ma_max - ma_min), colour="grey20", lwd=0.2) +
  scale_fill_manual(values=c("white","blue")) +
  scale_y_continuous(breaks=stage$ma_mid, labels=stage$Stage) +
  xlab("Genus") + ylab("Stage") +
  theme_bw(base_size=15) +
  guides(fill="none")   # fill=FALSE is deprecated in recent ggplot2 versions

The previous code can also be modified to use time ranges from both the stage and genus data frames if you want the blue coloring to cover only the time-range when each GENUS lived, rather than the full range of each Stage in which they lived.
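One possible way to do that is to clip each genus range to each Stage it overlaps, which gives the exact interval to colour. The following is only a sketch under assumptions: it uses a three-stage, two-genus subset of the question's data, and the names `idx`, `clipped`, `top` and `bottom` are made up for illustration.

```r
# Subset of the data from the question
genus <- data.frame(
  GENUS  = c("Achelousaurus", "Acheroraptor"),
  ma_max = c(84.9, 70.6),
  ma_min = c(70.6, 66.043),
  stringsAsFactors = FALSE
)
stage <- data.frame(
  Stage  = c("Santonian", "Campanian", "Maastrichtian"),
  ma_max = c(85.8, 83.5, 70.6),
  ma_min = c(83.5, 70.6, 66.5),
  stringsAsFactors = FALSE
)

# Every Stage/genus combination, as row indices into the two data frames
idx <- expand.grid(s = seq_len(nrow(stage)), g = seq_len(nrow(genus)))

# Clip each genus range to the Stage: the shared interval runs from the
# younger of the two start dates to the older of the two end dates
clipped <- data.frame(
  Stage  = stage$Stage[idx$s],
  GENUS  = genus$GENUS[idx$g],
  top    = pmin(stage$ma_max[idx$s], genus$ma_max[idx$g]),
  bottom = pmax(stage$ma_min[idx$s], genus$ma_min[idx$g]),
  stringsAsFactors = FALSE
)

# Keep only pairs with a real overlap. Zero-length overlaps (ranges that
# merely touch at a boundary) are dropped because they would draw nothing.
clipped <- clipped[clipped$top > clipped$bottom, ]
print(clipped)
```

Each remaining row could then be drawn as a rectangle, for example with ggplot2's geom_rect using `bottom` and `top` as ymin and ymax, in place of the full-height tiles. Note that the strict `>` here differs from the inclusive comparison used for the counts, so a genus that only touches a Stage boundary is counted in that Stage but gets no coloured area.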
