Reshape a dataframe to long format with multiple sets of measure columns


Problem description

I have an R dataframe that I scraped from the internet using readHTMLTable() in the XML package. The table looks like the following excerpt, with multiple variables/columns for population and year. (Note that the years are not duplicated across columns; each year uniquely identifies one population value.)

        year1   pop1      year2   pop2     year3   pop3     
1                                                        
2       16XX    4675,0    1900    6453,0    1930   9981,2       
3       17XX    4739,3    1901    6553,5    1931   ...      
4       17XX    4834,0    1902    6684,0    1932   
5       180X    4930,0    1903    6818,0    1933        
6       180X    5029,0    1904    6955,0    1934        
7       181X    5129,0    1905    7094,0    1935
8       181X    5231,9    1906    7234,7    1936
9       182X    5297,0    1907    7329,0    1937
10      182X    5362,0    1908    7422,0    1938

I would like to reorganize the data into just two columns, one for year and one for population, so that it looks like the following:

        year    pop     
1                                                        
2       16XX    4675,0
3       17XX    4739,3  
4       17XX    4834,0  
5       180X    4930,0
6       180X    5029,0  
7       181X    5129,0
8       181X    5231,9  
9       182X    5297,0
10      182X    5362,0  
11      1900    6453,0
12      1901    6553,5
13      1902    6684,0
...     ...     ...
21      1930    9981,2
22      ... 

The values from the variables/columns year2 and year3 are appended below year1, as are the corresponding population values.

I have considered the following:

(1) Looping over the population and year columns (n>2) and adding those values as new observations to year1 and pop1 would work, but this seems unnecessarily cumbersome.

(2) I have tried melt as below, but either it cannot handle an id variable that is split across multiple columns, or I am not implementing it correctly.

df.melt <- melt(df, id=c("year1", "year2",...))

(3) Lastly, I considered pulling out each year column as its own vector and appending those vectors together, as here:

year.all <- c(df$year1, df$year2,...)

However, the above returns the following for year.all:

[1]  1  2  3  3  4  4  5  5  6  6  7  8  8  9  9  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24  1  1  2 ...

rather than this:

[1] 16XX 17XX 17XX 180X 180X 181X 181X 182X 182X 1900 1901 1902...

If there is a straightforward way of accomplishing this reorganization, I would love to learn it. Many thanks for the help.

Recommended answer

If the 'year' and 'pop' columns alternate, we can subset with c(TRUE, FALSE) to get columns 1, 3, 5, etc. and with c(FALSE, TRUE) to get columns 2, 4, 6, etc., because the logical index is recycled across all columns. Then we unlist each set of columns and create a new data.frame.

 df2 <- data.frame(year=unlist(df1[c(TRUE, FALSE)]), 
                  pop=unlist(df1[c(FALSE, TRUE)]))
 row.names(df2) <- NULL
 head(df2)
 #   year    pop
 #1            
 #2 16XX 4675,0
 #3 17XX 4739,3
 #4 17XX 4834,0
 #5 180X 4930,0
 #6 180X 5029,0
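
The integer codes returned by attempt (3) in the question are what one would expect if the scraped columns were factors. That is an assumption here, since the question does not show str(df). A minimal sketch under that assumption, using a small made-up data frame `df_fac` (not the asker's data), converts every column to character before stacking with the same recycled logical index:

```r
# Hypothetical stand-in for a scraped table whose columns arrived as factors
df_fac <- data.frame(
  year1 = c("16XX", "17XX", "17XX"),
  pop1  = c("4675,0", "4739,3", "4834,0"),
  year2 = c("1900", "1901", "1902"),
  pop2  = c("6453,0", "6553,5", "6684,0"),
  stringsAsFactors = TRUE
)

# Convert each factor column to its character labels
df_chr <- data.frame(lapply(df_fac, as.character), stringsAsFactors = FALSE)

# Odd-numbered columns hold years, even-numbered columns hold populations
df_long <- data.frame(
  year = unlist(df_chr[c(TRUE, FALSE)], use.names = FALSE),
  pop  = unlist(df_chr[c(FALSE, TRUE)], use.names = FALSE),
  stringsAsFactors = FALSE
)
```

If factors are indeed the cause, passing stringsAsFactors = FALSE when reading the table is a commonly used way to avoid the conversion step.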

Another option is:

library(splitstackshape)
merged.stack(transform(df1, id=1:nrow(df1)), var.stubs=c('year', 'pop'), 
        sep='var.stubs')[order(.time_1), 3:4, with=FALSE]
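
For readers who prefer to avoid an extra package, base R's reshape() can stack several sets of measure columns at once. This is a sketch rather than part of the original answer: the data frame `wide` and the names `set` and `row` are made up for illustration, and all columns are assumed to be character already.

```r
# Hypothetical wide table with two year/pop sets, all character columns
wide <- data.frame(
  year1 = c("16XX", "17XX"), pop1 = c("4675,0", "4739,3"),
  year2 = c("1900", "1901"), pop2 = c("6453,0", "6553,5"),
  stringsAsFactors = FALSE
)

# Collect the column names of each set by prefix
year_cols <- grep("^year", names(wide), value = TRUE)
pop_cols  <- grep("^pop",  names(wide), value = TRUE)

# Stack each set into one column; 'set' records which set a row came from
long <- reshape(wide, direction = "long",
                varying = list(year_cols, pop_cols),
                v.names = c("year", "pop"),
                timevar = "set", idvar = "row")

# Keep only the two columns of interest and reset the row names
long <- long[, c("year", "pop")]
row.names(long) <- NULL
```

Because the varying columns are selected by prefix, the same call should extend to year3/pop3 and beyond, as long as every set has both a year and a pop column.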



Data

df1 <- structure(list(year1 = c("", "16XX", "17XX", "17XX", "180X", 
"180X", "181X", "181X", "182X", "182X"), pop1 = c("", "4675,0", 
"4739,3", "4834,0", "4930,0", "5029,0", "5129,0", "5231,9", "5297,0", 
"5362,0"), year2 = c(NA, 1900L, 1901L, 1902L, 1903L, 1904L, 1905L, 
1906L, 1907L, 1908L), pop2 = c("", "6453,0", "6553,5", "6684,0", 
"6818,0", "6955,0", "7094,0", "7234,7", "7329,0", "7422,0"), 
year3 = c(NA, 1930L, 1931L, 1932L, 1933L, 1934L, 1935L, 1936L, 
1937L, 1938L), pop3 = c("", "9981,2", "", "", "", "", "", 
"", "", "")), .Names = c("year1", "pop1", "year2", "pop2", 
"year3", "pop3"), class = "data.frame", row.names = c(NA, -10L))
