数据表dcast列标题 [英] Data table dcast column headings

查看:83
本文介绍了数据表dcast列标题的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个数据表,形式为

  ID区域INCOME_BAND RESIDENCY_YEARS 
1个软件在5,000以下10-15
2威尔士超过70,000 1-5
3中心15,000-19,999 6-9
4 SE 15,000-19,999 15-19
5北15,000-19,999 10-15
6北15,000-19,999 6-9

创建者

  exp = data.table(
ID = c(1,2,3,4,5,6),
REGION = c( SW, Wales, Center, SE, North, North),
INCOME_BAND = c(?5,000以下, 70,000以上, 15,000-?19,999 ,?15,000-?19,999,?15,000-?19,999,?15,000-?19,999),
RESIDENCY_YEARS = c( 10-15, 1-5, 6- 9, 15-19, 10-15, 6-9)))

我想将其转换为





我已经设法通过dcast执行了大部分工作:

  exp.dcast = dcast(exp,ID〜REGION + INCOME_BAND + RESIDENCY_YEARS,fun = length,
value.var = c('REGION','INCOME_BAND','RESIDENCY_YEARS'))

但是我需要一些帮助来创建合理的列标题。
当前我有


[ ID

REGION.1_Center_?15,000-?19,999_6 -9

REGION.1_North_?15,000-?19,999_10-15

REGION.1_North_?15,000-?19,999_6-9

REGION.1_SE_?15,000-?19,999_15-19 REGION.1_SW_Under
?5,000_10-15 REGION.1_Wales_Over?70,000_1-5

INCOME_BAND.1_Center_?15,000 -?19,999_6-9

INCOME_BAND.1_North_?15,000-?19,999_10-15

INCOME_BAND.1_North_?15,000-?19,999_6-9

INCOME_BAND.1_SE_?15,000-?19,999_15-19

INCOME_BAND.1_SW_Under?5,000_10-15

INCOME_BAND.1_Wales_Over?70,000_1-5

RESIDENCY_YEARS.1_Center_?15,000-?19,999_6-9
RESIDENCY_YEARS.1_North_?15,000-?19,999_10-15
RESIDENCY_YEARS.1_North_?15,000-? 19,999_6-9

RESIDENCY_YEARS.1_SE_?15,000-?19,999_15-19

RESIDENCY_YEARS.1_SW_在?5,000_10-15以下

RESIDENCY_YEARS.1_Wales_Over?70,000_1 -5


我希望列标题为

  ID SW威尔士中心东南部5,000以下70,000以上15,000-19,999 1-5 6-9 10-15 15-19 

有人可以建议吗?

解决方案

这个看似简单的问题并不容易回答。因此,我们将逐步进行。



首先,OP尝试同时重塑多个值列,这会造成不必要的交叉所有可用组合的乘积。



为了以相同方式处理所有值,我们需要 melt()重整前先排所有值列:

  melt(exp,id.vars = ID)[,dcast(.SD ,ID〜值,长度)] 




  ID 1-5 10-15 15-19 6-9?15,000-?19,999 Center North over?70,000 SE SW Under?5,000威尔士
1:1 0 1 0 0 0 0 0 0 0 1 1 0
2:2 1 0 0 0 0 0 0 1 0 0 0 1
3:3 0 0 0 1 1 1 0 0 0 0 0 0
4:4 0 0 1 0 1 0 0 0 1 0 0 0
5:5 0 1 0 0 1 0 1 0 0 0 0 0
6:6 0 0 0 1 1 0 1 0 0 0 0 0


现在,结果有13列而不是19列,并且这些列按要求由各自的值命名。



不幸的是,这些列显示顺序错误,因为它们是按字母顺序排列的。更改顺序有两种方法:



重塑后更改列的顺序



setcolorder()函数对 data.table 的列进行重新排序,例如无需复制:

 #定义列顺序=值的顺序
col_order<-c( North,威尔士,中锋, SW, SE,低于5,000欧元, 15,000--19,999英镑,超过70,000欧元, 1-5, 6-9, 10- 15, 15-19)

melt(exp,id.vars = ID)[,dcast(.SD,ID〜值,长度)] [
#重新排序列
,setcolorder(.SD,c( ID,col_order))]




  ID北威尔士中心西南部SE≤5,000欧元?15,000-?19,999超过70,000欧元1-5 6-9 10-15 15-19 
1 :1 0 0 0 1 0 1 0 0 0 0 1 0
2:2 0 1 0 0 0 0 1 1 0 0 0
3:3 0 0 1 0 0 0 1 0 0 1 0 0
4:4 0 0 0 0 1 0 1 0 0 0 0 1
5:5 1 0 0 0 0 0 0 1 0 0 0 1 0
6:6 1 0 0 0 0 0 1 0 0 1 0 0


现在,所有 REGION 列首先出现,然后依次是 INCOME_BAND RESIDENCY_YEARS 列按指定顺序。



在重塑之前设置因子水平



如果将转换为具有适当排序的因子水平的因子,则 dcast()将使用因子水平排序列:

  melt(exp,id.vars = ID)[,value:= factor(value,col_order )] [
,dcast(.SD,ID〜值,长度)]




  ID北威尔士中心西南部SE≤5,000欧元?15,000-?19,999超过70,000欧元1-5 6-9 10-15 15-19 
1: 1 0 0 0 1 0 1 0 0 0 0 1 0
2:2 0 1 0 0 0 0 0 1 1 0 0 0
3:3 0 0 1 0 0 0 1 0 0 1 0 0
4:4 0 0 0 0 1 0 1 0 0 0 0 1
5: 5 1 0 0 0 0 0 1 0 0 0 1 0
6:6 1 0 0 0 0 0 1 0 0 1 0 0




在重塑之前设置因子水平-惰性版本



如果具有足够的列由 REGION INCOME_BAND RESIDENCY_YEARS 分组使用快捷方式以避免在 col_order 中指定每个值。 forcats 包中的 fct_inorder()函数通过它们在向量中的首次出现来重新排序因子水平:

  melt(exp,id.vars = ID)[,value:= factor(value,col_order)] [
, dcast(.SD,ID〜值,长度)]




  ID SW威尔士中心东南部5,000欧元以下7,000欧元以下15,000-?19,999 10-15 1-5 6-9 15-19 
1:1 1 0 0 0 0 1 0 0 1 0 0 0
2:2 0 1 0 0 0 0 1 0 0 1 0 0
3:3 0 0 1 0 0 0 0 1 0 0 1 0
4:4 0 0 0 1 0 0 0 1 0 0 0 1
5:5 0 0 0 0 1 0 0 1 1 0 0 0
6:6 0 0 0 0 1 0 0 1 0 0 1 0


之所以有效,是因为输出 melt()变量排序:

  melt(exp,id.vars = ID)




  ID变量值
1:1区域SW
2:2区域威尔士
3:3区域中心
4:4区域SE
5:北5个区域
6:北6个区域
7:1个收入低于5,000欧元的收入带
8:2个收入超过70,000欧元的收入带
9:3个收入带15,000欧元-?19,999
10:4 INCOME_BAND?15,000-?19,999
11:5 INCOME_BAND?15,000-?19,999
12:6 INCOME_BAND?15,000-?19,999
13:1 RESIDENCY_YEARS 10-15
14:2个RESIDENCY_YEARS 1-5
15:3个RESIDENCY_YEARS 6-9
16:4个RESIDENCY_YEARS 15-19
17:5 RESIDENCY_YEARS 10-15
18:6 RESIDENCY_YEARS 6-9



I have a data table of the form

ID  REGION  INCOME_BAND RESIDENCY_YEARS
1   SW  Under 5,000 10-15
2   Wales   Over 70,000 1-5
3   Center  15,000-19,999   6-9
4   SE  15,000-19,999   15-19
5   North   15,000-19,999   10-15
6   North   15,000-19,999   6-9

created by

exp = data.table(
  ID = c(1,2,3,4,5,6),
  REGION=c("SW", "Wales", "Center", "SE", "North", "North"),
  INCOME_BAND = c("Under ?5,000", "Over ?70,000", "?15,000-?19,999", "?15,000-?19,999", "?15,000-?19,999","?15,000-?19,999"),
  RESIDENCY_YEARS = c("10-15","1-5","6-9","15-19","10-15", "6-9"))

I would like to transform this to

I've managed to perform the majority of the work with dcast:

exp.dcast = dcast(exp,ID~REGION+INCOME_BAND+RESIDENCY_YEARS, fun=length,
  value.var=c('REGION', 'INCOME_BAND', 'RESIDENCY_YEARS'))

However I need some help creating sensible column headings. Currently I have

["ID"
"REGION.1_Center_?15,000-?19,999_6-9"
"REGION.1_North_?15,000-?19,999_10-15"
"REGION.1_North_?15,000-?19,999_6-9"
"REGION.1_SE_?15,000-?19,999_15-19" "REGION.1_SW_Under ?5,000_10-15" "REGION.1_Wales_Over ?70,000_1-5"
"INCOME_BAND.1_Center_?15,000-?19,999_6-9"
"INCOME_BAND.1_North_?15,000-?19,999_10-15"
"INCOME_BAND.1_North_?15,000-?19,999_6-9"
"INCOME_BAND.1_SE_?15,000-?19,999_15-19"
"INCOME_BAND.1_SW_Under ?5,000_10-15"
"INCOME_BAND.1_Wales_Over ?70,000_1-5"
"RESIDENCY_YEARS.1_Center_?15,000-?19,999_6-9" "RESIDENCY_YEARS.1_North_?15,000-?19,999_10-15" "RESIDENCY_YEARS.1_North_?15,000-?19,999_6-9"
"RESIDENCY_YEARS.1_SE_?15,000-?19,999_15-19"
"RESIDENCY_YEARS.1_SW_Under ?5,000_10-15"
"RESIDENCY_YEARS.1_Wales_Over ?70,000_1-5"

And I would like the column headings to be

ID  SW  Wales   Center  SE  North   Under 5,000 Over 70,000 15,000-19,999   1-5 6-9 10-15   15-19

Could anybody advise?

解决方案

This apparently simple question is not easy to answer. So, we will go forward step-by step.

First, the OP has tried to reshape multiple value columns simultaneously which creates an unwanted cross product of all available combinations.

In order to treat all values in the same way, we need to melt() all value columns first before reshaping:

melt(exp, id.vars = "ID")[, dcast(.SD, ID ~ value, length)]

   ID 1-5 10-15 15-19 6-9 ?15,000-?19,999 Center North Over ?70,000 SE SW Under ?5,000 Wales
1:  1   0     1     0   0               0      0     0            0  0  1            1     0
2:  2   1     0     0   0               0      0     0            1  0  0            0     1
3:  3   0     0     0   1               1      1     0            0  0  0            0     0
4:  4   0     0     1   0               1      0     0            0  1  0            0     0
5:  5   0     1     0   0               1      0     1            0  0  0            0     0
6:  6   0     0     0   1               1      0     1            0  0  0            0     0

Now, the result has 13 columns instead of 19 and the columns are named by the respective value as requested.

Unfortunately, the columns appear in the wrong order because they alphabetically ordered. There are two approaches to change the order:

Change order of columns after reshaping

The setcolorder() function reorders the columns of a data.table in place, e.g. without copying:

# define column order = order of values
col_order <- c("North", "Wales", "Center", "SW", "SE", "Under ?5,000", "?15,000-?19,999", "Over ?70,000", "1-5", "6-9", "10-15", "15-19")

melt(exp, id.vars = "ID")[, dcast(.SD, ID ~ value, length)][
  # reorder columns
  , setcolorder(.SD, c("ID", col_order))]

   ID North Wales Center SW SE Under ?5,000 ?15,000-?19,999 Over ?70,000 1-5 6-9 10-15 15-19
1:  1     0     0      0  1  0            1               0            0   0   0     1     0
2:  2     0     1      0  0  0            0               0            1   1   0     0     0
3:  3     0     0      1  0  0            0               1            0   0   1     0     0
4:  4     0     0      0  0  1            0               1            0   0   0     0     1
5:  5     1     0      0  0  0            0               1            0   0   0     1     0
6:  6     1     0      0  0  0            0               1            0   0   1     0     0

Now, all REGION columns appear first, followed by INCOME_BAND and RESIDENCY_YEARS columns in the specified order.

Set factor levels before reshaping

If value is turned into a factor with appropriately ordered factor levels dcast() will use the factor levels for ordering the columns:

melt(exp, id.vars = "ID")[, value := factor(value, col_order)][
  , dcast(.SD, ID ~ value, length)]

   ID North Wales Center SW SE Under ?5,000 ?15,000-?19,999 Over ?70,000 1-5 6-9 10-15 15-19
1:  1     0     0      0  1  0            1               0            0   0   0     1     0
2:  2     0     1      0  0  0            0               0            1   1   0     0     0
3:  3     0     0      1  0  0            0               1            0   0   1     0     0
4:  4     0     0      0  0  1            0               1            0   0   0     0     1
5:  5     1     0      0  0  0            0               1            0   0   0     1     0
6:  6     1     0      0  0  0            0               1            0   0   1     0     0

Set factor levels before reshaping - lazy version

If it is sufficient to have the columns grouped by REGION, INCOME_BAND, and RESIDENCY_YEARS then we can use a short cut to avoid specifying each value in col_order. The fct_inorder() function from the forcats package reorders factor levels by their first appearance in a vector:

melt(exp, id.vars = "ID")[, value := factor(value, col_order)][
  , dcast(.SD, ID ~ value, length)]

   ID SW Wales Center SE North Under ?5,000 Over ?70,000 ?15,000-?19,999 10-15 1-5 6-9 15-19
1:  1  1     0      0  0     0            1            0               0     1   0   0     0
2:  2  0     1      0  0     0            0            1               0     0   1   0     0
3:  3  0     0      1  0     0            0            0               1     0   0   1     0
4:  4  0     0      0  1     0            0            0               1     0   0   0     1
5:  5  0     0      0  0     1            0            0               1     1   0   0     0
6:  6  0     0      0  0     1            0            0               1     0   0   1     0

This works because the output of melt() is ordered by variable:

melt(exp, id.vars = "ID")

    ID        variable           value
 1:  1          REGION              SW
 2:  2          REGION           Wales
 3:  3          REGION          Center
 4:  4          REGION              SE
 5:  5          REGION           North
 6:  6          REGION           North
 7:  1     INCOME_BAND    Under ?5,000
 8:  2     INCOME_BAND    Over ?70,000
 9:  3     INCOME_BAND ?15,000-?19,999
10:  4     INCOME_BAND ?15,000-?19,999
11:  5     INCOME_BAND ?15,000-?19,999
12:  6     INCOME_BAND ?15,000-?19,999
13:  1 RESIDENCY_YEARS           10-15
14:  2 RESIDENCY_YEARS             1-5
15:  3 RESIDENCY_YEARS             6-9
16:  4 RESIDENCY_YEARS           15-19
17:  5 RESIDENCY_YEARS           10-15
18:  6 RESIDENCY_YEARS             6-9

这篇关于数据表dcast列标题的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆