Matching dataframes with data.table

Question

I need to fill a matrix (MA) with information from a long data frame (DF), using another matrix (MA.ID) as the identifier.

To give an idea of my three objects: MA.ID holds the identifiers used to look up the needed values in the big DF:

     a      b      c
a    ID.aa  ID.ab  ID.ac
b    ID.ba  ID.bb  ID.bc
c    ID.ca  ID.cb  ID.cc

The original big data frame contains rows I don't need, but it also has the rows I need to fill the target MA matrix:

ID     1990 1991 1992
ID.aa  10   11   12
ID.ab  13   14   15
ID.ac  16   17   18
ID.ba  19   20   21
ID.bb  22   23   24
ID.bc  25   26   27
ID.ca  28   29   30
ID.cb  31   32   33
ID.cc  34   35   36
ID.xx  40   40   55
ID.xy  50   51   45
....

MA should be filled with the cross-referenced values. In my example, for a chosen column of DF (say, 1990), it should look like this:

     a    b    c
a    10   13   16
b    19   22   25
c    28   31   34

I've tried to use match, but honestly it didn't work out:

MA$a = DF[match(MA.ID$a, DF$ID),2]

I was recommended to use the data.table package, but I couldn't see how that would help me.

Does anyone have a good way to approach this problem?

Recommended answer

Supposing that your inputs are dataframes, you could do the following:

library(data.table)
setDT(ma)[, lapply(.SD, function(x) x = unlist(df[match(x,df$ID), "1990"]))
          , .SDcols = colnames(ma)]

This returns:

    a  b  c
1: 10 13 16
2: 19 22 25
3: 28 31 34

Explanation:

  • With setDT(ma) you convert the dataframe into a data.table (which is an enhanced dataframe).
  • With .SDcols=colnames(ma) you specify the columns to which the transformation is applied.
  • lapply(.SD, function(x) x = unlist(df[match(x,df$ID),"1990"])) performs the matching operation on each column specified with .SDcols.
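For comparison, the same match lookup can be sketched in base R so that the result is a numeric matrix, as the question asks. This is only a sketch under stated assumptions: `fill_matrix`, `ma_id` and `df_long` are illustrative names that do not appear in the original post, and the data are small stand-ins copied from the question's example.

```r
# Stand-ins for the question's MA.ID and DF (values copied from the example)
ma_id <- data.frame(
  a = c("ID.aa", "ID.ba", "ID.ca"),
  b = c("ID.ab", "ID.bb", "ID.cb"),
  c = c("ID.ac", "ID.bc", "ID.cc"),
  stringsAsFactors = FALSE
)
rownames(ma_id) <- c("a", "b", "c")

df_long <- data.frame(
  ID     = c("ID.aa", "ID.ab", "ID.ac", "ID.ba", "ID.bb", "ID.bc",
             "ID.ca", "ID.cb", "ID.cc", "ID.xx", "ID.xy"),
  `1990` = c(10, 13, 16, 19, 22, 25, 28, 31, 34, 40, 50),
  check.names = FALSE,          # keep "1990" as the column name
  stringsAsFactors = FALSE
)

# fill_matrix() is a hypothetical helper, not part of any package
fill_matrix <- function(ids, lookup, year) {
  id_mat <- as.matrix(ids)               # character matrix of identifiers
  pos    <- match(id_mat, lookup$ID)     # row of each ID in lookup (NA if absent)
  # match() walks the matrix column by column, and matrix() fills
  # column by column, so every value lands in the cell of its ID
  matrix(lookup[[year]][pos], nrow = nrow(id_mat),
         dimnames = dimnames(id_mat))
}

ma_1990 <- fill_matrix(ma_id, df_long, "1990")
```

Passing "1991" or "1992" as `year` would fill the matrix from a different column, provided that column exists in the lookup table.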

An alternative approach with data.table is to first transform ma into a long data.table:

ma2 <- melt(setDT(ma), measure.vars = c("a","b","c"))
setkey(ma2, value)    # key 'ma2' by its ID column; this also sorts 'ma2' by 'value'
setDT(df, key="ID")   # convert 'df' to a data.table and key it by 'ID'

# join on the keys and overwrite each ID in the 'value' column of 'ma2'
# with the matching entry from the 1990 column of 'df'
ma2[df, value :=  `1990`]

This gives:

> ma2
   variable value
1:        a    10
2:        b    13
3:        c    16
4:        a    19
5:        b    22
6:        c    25
7:        a    28
8:        b    31
9:        c    34

The only drawback of this method is that the numeric values in the 'value' column are stored as character values, because 'value' held the character IDs before the join. You can correct this by extending the call as follows:

ma2[df, value :=  `1990`][, value := as.numeric(value)]
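Another way to sidestep the character coercion, sketched here as an assumption rather than as part of the original answer, is to write the looked-up values into a new column instead of overwriting `value`. The names `ids`, `lookup` and `y1990` are illustrative, and the join columns are given with `on =` instead of keys.

```r
library(data.table)

# Long table of identifiers, shaped like 'ma2' after melt()
ids <- data.table(
  variable = c("a", "b", "c", "a", "b", "c"),
  value    = c("ID.aa", "ID.ab", "ID.ac", "ID.ba", "ID.bb", "ID.bc")
)

# Lookup table shaped like 'df'; ID.xx has no partner in 'ids'
lookup <- data.table(
  ID     = c("ID.aa", "ID.ab", "ID.ac", "ID.ba", "ID.bb", "ID.bc", "ID.xx"),
  `1990` = c(10L, 13L, 16L, 19L, 22L, 25L, 40L)
)

# Update join: match ids$value against lookup$ID and store the 1990
# entry in a new column, which keeps the integer type of the source
ids[lookup, on = c(value = "ID"), y1990 := `1990`]
```

Because `value` is left untouched, the identifiers stay available for later joins against other year columns, and no `as.numeric()` step is needed.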

If you want to change it back to wide format, you could use the rowid function within dcast:

ma3 <- dcast(ma2, rowid(variable) ~ variable, value.var = "value")[, variable := NULL]

This gives:

> ma3
    a  b  c
1: 10 13 16
2: 19 22 25
3: 28 31 34


Used data:

ma <- structure(list(a = structure(1:3, .Label = c("ID.aa", "ID.ba", "ID.ca"), class = "factor"), 
                     b = structure(1:3, .Label = c("ID.ab", "ID.bb", "ID.cb"), class = "factor"), 
                     c = structure(1:3, .Label = c("ID.ac", "ID.bc", "ID.cc"), class = "factor")), 
                .Names = c("a", "b", "c"), class = "data.frame", row.names = c(NA, -3L))

df <- structure(list(ID = structure(1:9, .Label = c("ID.aa", "ID.ab", "ID.ac", "ID.ba", "ID.bb", "ID.bc", "ID.ca", "ID.cb", "ID.cc"), class = "factor"), 
                     `1990` = c(10L, 13L, 16L, 19L, 22L, 25L, 28L, 31L, 34L), 
                     `1991` = c(11L, 14L, 17L, 20L, 23L, 26L, 29L, 32L, 35L), 
                     `1992` = c(12L, 15L, 18L, 21L, 24L, 27L, 30L, 33L, 36L)), 
                .Names = c("ID", "1990", "1991", "1992"), class = "data.frame", row.names = c(NA, -9L))
