How to change a specific categorical variable based on the overall sequence of the data

Problem description

I collected data on plant development, or phenology (coded using a categorical variable 'Code'), every five days along a transect divided into 78 consecutive segments. Each species is surveyed in every segment of the transect.

My study repeats a historical study from 100 years ago, and I kept the original phenology coding scheme without considering how I would analyze the data after the summer!

The problem I did not consider when collecting the data is that the codes follow a sequence in which some codes occur both early and late in the summer. Specifically, the codes are:

b1 = single flower
b2 = sparse flowers (two or three)
b3 = flowers common (more than three)
b4 = flowering ended

Based on the methodology of the original study, the sequence of codes collected over the summer for any flowering plant will go something like b1, b2, b3, b2, b1, b4. Note that we visit the transect every five days and a code may be repeated on consecutive visits, e.g. b1, b1, b2, b2, b2, b2, b3, b3, b3, b2, b2, b1, b4.

I would like to re-code the 'b1' and 'b2' codes as follows (see example and sample data):

1. If 'b1' occurs before 'b2' or 'b3', it should be 'b1a'; if it occurs after 'b2' or 'b3', it should be 'b1b'. Note that sometimes there is no 'b2' or 'b3' in the sequence of observations.

2. If 'b2' occurs before 'b3', it should be 'b2a'; if it occurs after 'b3', it should be 'b2b'. If there is no 'b3', 'b2' should be 'b2a'. Note that after the last occurrence of 'b3' there may be multiple observations of 'b2' (see example and sample data).

3. 'b1' and 'b2' might occur without any observation of 'b3'. In this case, they would be coded as 'b1a' and 'b2a'.

Here is the data:

Date    Segment Species Code
01-Jun-17   1   A   b1
06-Jun-17   1   A   b1
10-Jun-17   1   A   b2
14-Jun-17   1   A   b2
19-Jun-17   1   A   b2
23-Jun-17   1   A   b3
28-Jun-17   1   A   b3
03-Jul-17   1   A   b2
08-Jul-17   1   A   b2
14-Jul-17   1   A   b1
19-Jul-17   1   A   b4
23-Jul-17   1   A   b4

The result should look like this:

Date    Segment Species Code
01-Jun-17   1   A   b1a
06-Jun-17   1   A   b1a
10-Jun-17   1   A   b2a
14-Jun-17   1   A   b2a
19-Jun-17   1   A   b2a
23-Jun-17   1   A   b3
28-Jun-17   1   A   b3
03-Jul-17   1   A   b2b
08-Jul-17   1   A   b2b
14-Jul-17   1   A   b1b
19-Jul-17   1   A   b4
23-Jul-17   1   A   b4

Here is the sample data:

Test.Data<- structure(list(Date = structure(c(17318, 17323, 17327, 17331, 
17336, 17340, 17345, 17350, 17355, 17361, 17366, 17318, 17323, 
17327, 17331, 17336, 17340, 17345, 17350, 17355, 17361, 17366, 
17370, 17375, 17318, 17323, 17327, 17331, 17336, 17340, 17345, 
17350, 17355, 17361, 17366, 17318, 17323, 17327, 17331, 17336, 
17340, 17345, 17350, 17355, 17361, 17366, 17370, 17375, 17355, 
17361, 17366, 17370, 17375, 17350, 17355, 17361, 17366, 17370
), class = "Date"), Segment = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 1, 1, 1, 1, 1), Species = c("A", "A", "A", "A", "A", "A", 
"A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", "B", 
"B", "B", "B", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C"
), Code = c("b1", "b1", "b2", "b2", "b2", "b3", "b3", "b2", "b2", 
"b4", "b4", "b1", "b2", "b2", "b2", "b3", "b3", "b3", "b2", "b2", 
"b2", "b1", "b4", "b4", "b1", "b1", "b2", "b2", "b2", "b3", "b3", 
"b2", "b2", "b4", "b4", "b1", "b2", "b2", "b2", "b3", "b3", "b3", 
"b2", "b2", "b2", "b4", "b4", "b4", "b3", "b3", "b2", "b1", "b4", 
"b1", "b1", "b2", "b2", "b4")), .Names = c("Date", "Segment", 
"Species", "Code"), row.names = c(NA, -58L), class = "data.frame")


Recommended answer

Using data.table:

library(data.table)
setDT(Test.Data)
Test.Data[, temp := rleid(Code), by = .(Segment, Species)] #unique ids for the sequence of codes
Test.Data[Code == "b2", Code := paste0(Code, letters[rleid(temp)]), 
  by = .(Segment, Species)] #use the unique ids inside subset
Test.Data[, temp := NULL]
#          Date Segment Species Code
# 1: 2017-06-01       1       A   b1
# 2: 2017-06-06       1       A   b1
# 3: 2017-06-10       1       A  b2a
# 4: 2017-06-14       1       A  b2a
# 5: 2017-06-19       1       A  b2a
# 6: 2017-06-23       1       A   b3
# 7: 2017-06-28       1       A   b3
# 8: 2017-07-03       1       A  b2b
# 9: 2017-07-08       1       A  b2b
#10: 2017-07-14       1       A   b4
#11: 2017-07-19       1       A   b4
#12: 2017-06-01       1       B   b1
#13: 2017-06-06       1       B  b2a
#14: 2017-06-10       1       B  b2a
#15: 2017-06-14       1       B  b2a
#16: 2017-06-19       1       B   b3
#17: 2017-06-23       1       B   b3
#18: 2017-06-28       1       B   b3
#19: 2017-07-03       1       B  b2b
#20: 2017-07-08       1       B  b2b
#21: 2017-07-14       1       B  b2b
# ... (remaining rows omitted)
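
The answer above recodes only 'b2', and it labels runs of 'b2' by their order of appearance, so a group whose first 'b2' follows a 'b3' would still get 'b2a'. As a minimal sketch of a different technique, not the answerer's method, the three rules from the question can be expressed with cumulative flags in base R. The helper name `recode_phenology`, the small `obs` data frame, and the `NewCode` column are hypothetical and only illustrate the idea; the sketch assumes each group is sorted by date before recoding.

```r
# Hypothetical helper: recode the codes of ONE Segment/Species group.
# Assumes `code` is a character vector already sorted by date.
recode_phenology <- function(code) {
  seen_b3  <- cumsum(code == "b3") > 0              # a b3 has occurred at or before this row
  seen_b23 <- cumsum(code %in% c("b2", "b3")) > 0   # a b2 or b3 has occurred at or before this row
  is_b1 <- code == "b1"
  is_b2 <- code == "b2"
  out <- code
  out[is_b1] <- ifelse(seen_b23[is_b1], "b1b", "b1a")  # rule 1 (and rule 3 when no b2/b3 is seen)
  out[is_b2] <- ifelse(seen_b3[is_b2], "b2b", "b2a")   # rule 2 (and rule 3 when no b3 is seen)
  out
}

# Illustrative data (made up for this sketch): one full sequence, one without b3.
obs <- data.frame(
  Date = as.Date(c("2017-06-01", "2017-06-06", "2017-06-10", "2017-06-23",
                   "2017-07-03", "2017-07-14", "2017-07-19",
                   "2017-06-01", "2017-06-06", "2017-06-10")),
  Segment = c(rep(1, 7), rep(2, 3)),
  Species = c(rep("A", 7), rep("C", 3)),
  Code = c("b1", "b1", "b2", "b3", "b2", "b1", "b4",
           "b1", "b2", "b4"),
  stringsAsFactors = FALSE
)

# Sort so that each group is in date order, then recode group by group.
obs <- obs[order(obs$Segment, obs$Species, obs$Date), ]
obs$NewCode <- ave(obs$Code, obs$Segment, obs$Species, FUN = recode_phenology)
print(obs)
```

With data.table, the same helper could presumably be applied after `setorder(Test.Data, Segment, Species, Date)` using `Test.Data[, Code := recode_phenology(Code), by = .(Segment, Species)]`. Because the flags are cumulative, every 'b2' after the first 'b3' becomes 'b2b' even if 'b3' reappears later, which may or may not match the intended coding for irregular sequences.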
