Tidy up and reshape messy dataset (reshape/gather/unite function)?


Problem description


Following my earlier question: R: reshape/gather function to create dataset ready for multilevel analysis

I discovered it is a bit more complicated. My dataset is actually 'messier' than I hoped. So here's the full story: I have a big dataset of 240 cases. Each row is a case (a breast cancer patient). Somewhere near the end of the dataset (say, from column 417 onwards) I have data from the patients' partners, who also filled in a questionnaire. At the beginning there are demographic variables for both patients and partners, followed by test outcomes for patients only, and then by the partner data.

I want to create a dataset in which I 'split' the patient and partner data but keep them coupled. That is, I want to duplicate the subject ID and create a new column with 1s and 2s (1 for the patient, 2 for the partner). Beyond that, I want to keep my data as it is now, except that some variables can be matched. For example, I now have "date of birth" separately for the patient [pgebdat] and for the partner [prgebdat]. Of course, I could turn this into a single 'gebdat' column with the two birth dates below each other.

This code worked for me for a small subset of my data:

mydf_long <- mydf4 %>%
  unite(bb1:bb50rec, col = `1`, sep = ";") %>%     # combine the patient items into one column named `1`
  unite(pbb1:pbb50recM, col = `2`, sep = ";") %>%  # combine the partner items into one column named `2`
  gather(couple, value, `1`:`2`) %>%               # stack `1` and `2` into long format
  # paste0() has no `sep` argument, so "," is appended to every name (hence the column `bb1,` below)
  separate(value, sep = ";", into = c(paste0("bb", seq(1:104),"", sep = ','))) %>% # split back into the original answers
  arrange(id)

results in:

   id groep_MNC zkhs fbeh    pgebdat couple bb1,
1  3         1    1    1 1955-12-01      1    4
2  3         1    1    1 1955-12-01      2    5
3  5         1    1    1 1943-04-09      1    2
4  5         1    1    1 1943-04-09      2    2

But now the patient's date of birth is also copied into the 'partner' row.

I'm stuck, and don't even quite know what data you would need to be able to answer my question, so please do ask. I'll provide something of an example below:

Example of data

     id groep_MNC zkhs fbeh    pgebdat    p_age   pgesl   prgebdat   pr_age prgesl relpnst
1     3         1    1    1 1955-12-01 42.50000       1       <NA>       NA      2       1
2     5         1    1    1 1943-04-09 55.16667       1 1962-04-18 36.50000      1       2
3     7         1    1    1 1958-04-10 40.25000       1       <NA>       NA      2       1
4    10         1    1    1 1958-04-17 40.25000       1 1957-07-31 41.33333      2       1
5    12         1    1    2 1947-11-01 50.66667       1 1944-06-08 54.58333      2       1

And then, after a couple of hundred variables for patients only, the partner data comes along:

     pbb1 pbb2 pbb3 pbb4 pbb5 pbb6 pbb7 pbb8 pbb9
 1      5    5    5    5    2    5    4    2    3
 2      2    1    4    1    3    4    3    3    4
 3      5    3    4    4    4    3    5    3    4
 4      5    3    5    5    5    5    4    4    4
 5      5    5    5    5    5    4    4    3    4

Note: I didn't create this dataset myself. I'm just here to tidy up the mess :)

Edit: The dataset is in Dutch. pgesl = gender of the patient, prgesl = gender of the partner, etc.

Solution

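With the melt function from the data.table package you can pass several patterns to measure.vars. Each pattern selects one group of columns (here the patient and partner versions of a variable), and each group is melted into its own value column: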

library(data.table)
melt(setDT(df), measure.vars = patterns('_age','gesl','gebdat'),
     value.name = c('age','geslacht','geboortedatum')
     )[, variable := c('patient','partner')[variable]][]

This gives:

    id groep_MNC zkhs fbeh relpnst pbb1 pbb2 variable      age geslacht geboortedatum
 1:  3         1    1    1       1    5    5  patient 42.50000        1    1955-12-01
 2:  5         1    1    1       2    2    1  patient 55.16667        1    1943-04-09
 3:  7         1    1    1       1    5    3  patient 40.25000        1    1958-04-10
 4: 10         1    1    1       1    5    3  patient 40.25000        1    1958-04-17
 5: 12         1    1    2       1    5    5  patient 50.66667        1    1947-11-01
 6:  3         1    1    1       1    5    5  partner       NA        2          <NA>
 7:  5         1    1    1       2    2    1  partner 36.50000        1    1962-04-18
 8:  7         1    1    1       1    5    3  partner       NA        2          <NA>
 9: 10         1    1    1       1    5    3  partner 41.33333        2    1957-07-31
10: 12         1    1    2       1    5    5  partner 54.58333        2    1944-06-08
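
The labelling step relies on the patient column coming before the partner column within each pattern, since melt numbers the matches in column order. A minimal sketch for checking that, assuming the column names of the example data (the `cols` vector and the anchored pattern are illustrative, not part of the original answer):

```r
# Column names as in the example data (assumption: same naming scheme as above)
cols <- c("id", "groep_MNC", "zkhs", "fbeh", "pgebdat", "p_age", "pgesl",
          "prgebdat", "pr_age", "prgesl", "relpnst", "pbb1", "pbb2")

# For each pattern, list the matching columns in their original order
matches <- lapply(c(age = "_age", geslacht = "gesl", geboortedatum = "gebdat"),
                  function(p) grep(p, cols, value = TRUE))
print(matches)

# A stricter, anchored alternative for one of the patterns (illustrative)
strict_gesl <- grep("^pr?gesl$", cols, value = TRUE)
print(strict_gesl)
```

In each group the patient column is listed first, which is why index 1 can be labelled 'patient' and index 2 'partner'. On the full dataset with several hundred columns, a short pattern such as 'gesl' might match more columns than intended, so it is worth running this kind of check before melting.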

Instead of patterns you could also pass measure.vars a list of column indexes or column names.
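
As a rough sketch of the list variant, assuming a cut-down two-row stand-in for the example data (the small `df` here is made up for illustration, and `long` is just a name chosen for the result):

```r
library(data.table)

# Small stand-in for the example data: two couples, one partner with missing values
df <- data.frame(
  id       = c(3L, 5L),
  pgebdat  = c("1955-12-01", "1943-04-09"),
  p_age    = c(42.5, 55.16667),
  pgesl    = c(1L, 1L),
  prgebdat = c(NA, "1962-04-18"),
  pr_age   = c(NA, 36.5),
  prgesl   = c(2L, 1L),
  stringsAsFactors = FALSE
)

# Each list element names one patient/partner pair, patient first
long <- melt(
  as.data.table(df),
  measure.vars = list(c("p_age", "pr_age"),
                      c("pgesl", "prgesl"),
                      c("pgebdat", "prgebdat")),
  value.name = c("age", "geslacht", "geboortedatum")
)

# Replace the index 1/2 with readable labels
long[, variable := c("patient", "partner")[variable]]
print(long)
```

Spelling out the pairs is more verbose than patterns, but it makes the pairing explicit and does not depend on how the column names happen to match a regular expression. Column positions could be given in the same way in place of the names.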

HTH


Used data:

df <- structure(list(id = c(3L, 5L, 7L, 10L, 12L), 
                     groep_MNC = c(1L, 1L, 1L, 1L, 1L),
                     zkhs = c(1L, 1L, 1L, 1L, 1L),
                     fbeh = c(1L, 1L, 1L, 1L, 2L),
                     pgebdat = c("1955-12-01", "1943-04-09", "1958-04-10", "1958-04-17", "1947-11-01"),
                     p_age = c(42.5, 55.16667, 40.25, 40.25, 50.66667),
                     pgesl = c(1L, 1L, 1L, 1L, 1L),
                     prgebdat = c(NA, "1962-04-18", NA, "1957-07-31", "1944-06-08"),  # missing partner birth dates as real NA, not the string "<NA>"
                     pr_age = c(NA, 36.5, NA, 41.33333, 54.58333),
                     prgesl = c(2L, 1L, 2L, 2L, 2L),
                     relpnst = c(1L, 2L, 1L, 1L, 1L),
                     pbb1 = c(5L, 2L, 5L, 5L, 5L),
                     pbb2 = c(5L, 1L, 3L, 3L, 5L)), 
                .Names = c("id", "groep_MNC", "zkhs", "fbeh", "pgebdat", "p_age", "pgesl", "prgebdat", "pr_age", "prgesl", "relpnst", "pbb1", "pbb2"),
                class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
