从R中的分层数据中提取特定数据 [英] Extracting specific data from hierarchical-data in R

查看:342
本文介绍了从R中的分层数据中提取特定数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个由6列组成的数据框.列1到5分别具有离散的名称/值,例如区,年,月,年龄间隔和性别.第六列是该特定组合的死亡计数.

I have a dataframe made up of 6 columns. Columns 1 to 5 each have discrete names/values, such as a district, year, month, age interval and gender. The sixth column is the number of death counts for that specific combination.

               District Gender Year Month Age.Group Total.Deaths
1              Eastern  Female 2003     1        -1            0
2              Eastern  Female 2003     1        -2            2
3              Eastern  Female 2003     1         0            2
4              Eastern  Female 2003     1      01-4            1
5              Eastern  Female 2003     1     05-09            0
6              Eastern  Female 2003     1     10-14            1
7              Eastern  Female 2003     1     15-19            0
8              Eastern  Female 2003     1     20-24            4
9              Eastern  Female 2003     1     25-29            9
10             Eastern  Female 2003     1     30-34            3
11             Eastern  Female 2003     1     35-39            7
12             Eastern  Female 2003     1     40-44            5
13             Eastern  Female 2003     1     45-49            5
14             Eastern  Female 2003     1     50-54            8
15             Eastern  Female 2003     1     55-59            5
16             Eastern  Female 2003     1     60-64            4
17             Eastern  Female 2003     1     65-69            7
18             Eastern  Female 2003     1     70-74            8
19             Eastern  Female 2003     1     75-79            5
20             Eastern  Female 2003     1     80-84           10
21             Eastern  Female 2003     1       85+           11
22             Eastern  Female 2003     2        -1            0
23             Eastern  Female 2003     2        -2            0
24             Eastern  Female 2003     2         0            4
25             Eastern  Female 2003     2      01-4            1
26             Eastern  Female 2003     2     05-09            2
27             Eastern  Female 2003     2     10-14            2
28             Eastern  Female 2003     2     15-19            0

我想从这个大数据框中过滤或提取较小的数据框. 例如,我只希望有四个年龄段.这四个年龄段将分别包含:

I would like to filter, or extract, smaller dataframes from this big dataframe. For example, I would like to only have four age groups. These four age groups will each contain:

Group 0: Consisting of Age.Group -1, -2 and 0.
Group 1-4: Consisting of Age.Group 01-4
Group 5-14: Consisting of Age.Group 05-09 and 10-14
Group 15+: Consisting of Age.Group 15-19 to 85+

Total.Deaths将成为这些组中每个组的总和.

The Total.Deaths will then be the sum for each of these groups.

所以我希望它看起来像这样

So I want it to look like this

               District Gender Year Month Age.Group Total.Deaths
1              Eastern  Female 2003     1         0            4
2              Eastern  Female 2003     1      01-4            1
3              Eastern  Female 2003     1     05-14            1
4              Eastern  Female 2003     1       15+            104
5              Eastern  Female 2003     2         0            4
6              Eastern  Female 2003     2      01-4            1
7              Eastern  Female 2003     2     05-14            4
8              Eastern  Female 2003     2       15+            ...

我有很多数据,已经搜索了几天,但是无法找到有助于实现此目的的功能.

I have a lot of data and have searched for a few days, but unable to find a function to help be do this.

推荐答案

使用car包中的recode之类的方法来重新编码年龄变量可能是一种更合理的方法,特别是因为您方便地获得了当前的年龄变量编码的级别可以很好地作为字符排序.但是对于仅有的几个级别,我经常只是通过创建一个新的age变量来手工对其进行重新编码,而这种方法是仅在R中完成工作"的一种很好的做法:

There may be a pithier way of recoding your age variable using something like recode from the car package, particularly since you've conveniently got your current age variable coded with levels that sort nicely as characters. But for only a few levels, I often just recode them by hand by creating a new age variable, and this method is good practice for just 'getting stuff done' in R:

#Reading your data in from a text file that I made via copy/paste
dat <- read.table("~/Desktop/soEx.txt",sep="",header=TRUE)

#Make sure Age.Group is ordered and init new age variable
dat$Age.Group <- factor(dat$Age.Group,ordered=TRUE)
dat$AgeGroupNew <- rep(NA,nrow(dat))

#The recoding
dat$AgeGroupNew[dat$Age.Group <= "0"] <- "0"
dat$AgeGroupNew[dat$Age.Group == "01-4"] <- "01-4"
dat$AgeGroupNew[dat$Age.Group >= "05-09" & dat$Age.Group <= "10-14" ] <- "05-14"
dat$AgeGroupNew[dat$Age.Group > "10-14" ] <- "15+"

然后我们可以使用ddplysummarise生成摘要:

Then we can generate summaries using ddply and summarise:

datNew <- ddply(dat,.(District,Gender,Year,Month,AgeGroupNew),summarise,
            TotalDeaths = sum(Total.Deaths))

起初我很担心,因为我的死亡人数是91,而不是您指出的104,但是我手工计算,我认为91是对的.也许是错字.

I was worried at first because I got 91 deaths instead of 104 as you indicated, but I counted by hand and 91 is right I think. A typo, perhaps.

这篇关于从R中的分层数据中提取特定数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆