Reshape a dataframe to long format with multiple sets of measure columns
Question
I have an R dataframe that I scraped from the internet using readHTMLTable() in the XML package. The table looks like the following excerpt, with multiple variables/columns for population and year. (Note that the years are not duplicated across columns; each year uniquely identifies one population value.)
year1 pop1 year2 pop2 year3 pop3
1
2 16XX 4675,0 1900 6453,0 1930 9981,2
3 17XX 4739,3 1901 6553,5 1931 ...
4 17XX 4834,0 1902 6684,0 1932
5 180X 4930,0 1903 6818,0 1933
6 180X 5029,0 1904 6955,0 1934
7 181X 5129,0 1905 7094,0 1935
8 181X 5231,9 1906 7234,7 1936
9 182X 5297,0 1907 7329,0 1937
10 182X 5362,0 1908 7422,0 1938
I would like to reorganize the data into just two columns, one for year and one for population, like this:
year pop
1
2 16XX 4675,0
3 17XX 4739,3
4 17XX 4834,0
5 180X 4930,0
6 180X 5029,0
7 181X 5129,0
8 181X 5231,9
9 182X 5297,0
10 182X 5362,0
11 1900 6453,0
12 1901 6553,5
13 1902 6684,0
... ... ...
21 1930 9981,2
22 ...
The values from the variables/columns year2 and year3 are appended below year1, as are the corresponding population values.
I have considered the following:
(1) Looping over the population and year columns (n>2) and adding those values as new observations to year1 and pop1 will work, but this seems unnecessarily cumbersome.
(2) I have tried melt as below, but either it cannot handle the id variable split across multiple columns, or I am not implementing it correctly.
df.melt <- melt(df, id=c("year1", "year2",...))
(3) Lastly, I considered pulling out each year column as its own vector and appending those vectors together, as here:
year.all <- c(df$year1, df$year2,...)
However, the above returns the following for year.all:
[1] 1 2 3 3 4 4 5 5 6 6 7 8 8 9 9 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 1 1 2 ...
rather than this:
[1] 16XX 17XX 17XX 180X 180X 181X 181X 182X 182X 1900 1901 1902...
If there is a straightforward way of accomplishing this reorganization, I would love to learn it. Many thanks for the help.
Recommended answer
If the 'year' and 'pop' columns alternate, we can subset with c(TRUE, FALSE) to get columns 1, 3, 5, etc. and with c(FALSE, TRUE) to get columns 2, 4, 6, etc., because the logical index is recycled across all the columns. Then we unlist the columns and create a new data.frame.
df2 <- data.frame(year=unlist(df1[c(TRUE, FALSE)]),
pop=unlist(df1[c(FALSE, TRUE)]))
row.names(df2) <- NULL
head(df2)
# year pop
#1
#2 16XX 4675,0
#3 17XX 4739,3
#4 17XX 4834,0
#5 180X 4930,0
#6 180X 5029,0
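If the columns do not strictly alternate, or if some of them were read in as factors (which may be why c(df$year1, df$year2, ...) returned integer codes in the question), a name-based variant of the same idea can help. The following is only a sketch under assumptions: the column names start with "year" and "pop", the two groups appear in matching order, and the small data frame and the helper to_chr are made up for illustration.

```r
# Hypothetical two-set example in the same layout as the question's table
df1 <- data.frame(
  year1 = c("16XX", "17XX"), pop1 = c("4675,0", "4739,3"),
  year2 = c(1900L, 1901L),   pop2 = c("6453,0", "6553,5"),
  stringsAsFactors = TRUE    # mimic factor columns from a scraped table
)

# Pick the columns by name instead of by position
year_cols <- grep("^year", names(df1), value = TRUE)
pop_cols  <- grep("^pop",  names(df1), value = TRUE)

# Convert every column to character before stacking, so that factor
# columns contribute their labels rather than their integer codes
to_chr <- function(cols) {
  unlist(lapply(df1[cols], as.character), use.names = FALSE)
}

df2 <- data.frame(year = to_chr(year_cols), pop = to_chr(pop_cols),
                  stringsAsFactors = FALSE)
df2
```

Both result columns are character vectors here; converting pop to numeric would additionally require handling the decimal comma.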
Another option is:
library(splitstackshape)
merged.stack(transform(df1, id=1:nrow(df1)), var.stubs=c('year', 'pop'),
sep='var.stubs')[order(.time_1), 3:4, with=FALSE]
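For readers who would rather avoid an extra package, base R's reshape() can probably do the same job. This is a hedged sketch rather than part of the original answer: the small data frame is hypothetical, and the varying argument assumes the year and pop columns are listed in matching order.

```r
# Hypothetical two-set example in the same layout as the question's table
df1 <- data.frame(
  year1 = c("16XX", "17XX"), pop1 = c("4675,0", "4739,3"),
  year2 = c(1900L, 1901L),   pop2 = c("6453,0", "6553,5"),
  stringsAsFactors = FALSE
)

# Each element of `varying` lists the wide columns that feed one long column
long <- reshape(
  df1,
  direction = "long",
  varying   = list(c("year1", "year2"), c("pop1", "pop2")),
  v.names   = c("year", "pop")
)

# Keep only the two stacked columns and drop the generated row names
long <- long[, c("year", "pop")]
row.names(long) <- NULL
long
```

reshape() also adds time and id columns, which are dropped here because the question only asks for year and pop.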
Data
df1 <- structure(list(year1 = c("", "16XX", "17XX", "17XX", "180X",
"180X", "181X", "181X", "182X", "182X"), pop1 = c("", "4675,0",
"4739,3", "4834,0", "4930,0", "5029,0", "5129,0", "5231,9", "5297,0",
"5362,0"), year2 = c(NA, 1900L, 1901L, 1902L, 1903L, 1904L, 1905L,
1906L, 1907L, 1908L), pop2 = c("", "6453,0", "6553,5", "6684,0",
"6818,0", "6955,0", "7094,0", "7234,7", "7329,0", "7422,0"),
year3 = c(NA, 1930L, 1931L, 1932L, 1933L, 1934L, 1935L, 1936L,
1937L, 1938L), pop3 = c("", "9981,2", "", "", "", "", "",
"", "", "")), .Names = c("year1", "pop1", "year2", "pop2",
"year3", "pop3"), class = "data.frame", row.names = c(NA, -10L))