Join and overwrite data in one table with data from another table

Problem description

How to join and overwrite data appears to be a common request, but I have yet to find an elegant solution that applies to an entire dataset.

(Note: to simplify the data, I will use only 1s and NAs for values and a small subset of columns, but in reality I have hundreds of columns with different values).

I have one data table (d1) that has NA values in certain columns and rows.

library(data.table)
d1 = fread(
"r id v1 v2 v3
1  A  1  1  1
2  B  1  1  1
3  C  1 NA NA
4  D  1  1 NA
5  E  1 NA  1")[, r := NULL]

And I have another data table (d2) that consists of additional columns as well as data points missing from existing columns in d1.

d2 = fread(
"r id v2 v3 v4 v5
1  C  1  1  1  1
2  D  1  1  1  1
3  E  1  1  1  1")[, r := NULL ]

I would like to basically join + overwrite d1 with all the data in d2, making sure of course to match rows by id and columns by name, as shown below.

> d12
  id v1 v2 v3 v4 v5
1  A  1  1  1 NA NA
2  B  1  1  1 NA NA
3  C  1  1  1  1  1
4  D  1  1  1  1  1
5  E  1  1  1  1  1

Additional scenario: I'd also like to know how this can be done if you only want to update the NA values in d1, that is, make sure existing non-NA values are not overwritten. (To make this easier to visualize, I'm including new tables with both 1s and 0s).

For example, if we have d3:

d3 = fread(
"r id v1 v2 v3
1  A  1  1  1
2  B  1  1  1
3  C  1  0 NA
4  D  1  1  0
5  E  1 NA  1")[, r := NULL ]

And we want to join d2 and overwrite only NAs to get:

> d32
  id v1 v2 v3 v4 v5
1  A  1  1  1 NA NA
2  B  1  1  1 NA NA
3  C  1  0  1  1  1
4  D  1  1  0  1  1
5  E  1  1  1  1  1

FYI, below are some other posts addressing this problem but only for one or two columns. The solution I'm looking for should allow the data in one table to be overwritten by many if not all of the columns in another table.

Merge data frames and overwrite values

Merge two data frames and replace in R

A data.table-based solution would be preferred, but others are welcome.

Recommended answer

I think it's easiest to go to long form:

md1 = melt(d1, id="id")
md2 = melt(d2, id="id")

Then you can stack them and take the latest value:

res1 = unique(rbind(md1, md2), by=c("id", "variable"), fromLast=TRUE)
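To see how the stacked result maps back to the table the question asks for, here is a minimal end-to-end sketch. It assumes md1 is melted from d1 and md2 from d2, and it borrows the name d12 from the question's expected output. The explicit id.vars and value.var arguments are spelled out only for readability.

```r
library(data.table)

# Tables from the question (the helper row-number column r is dropped)
d1 = fread(
"r id v1 v2 v3
1  A  1  1  1
2  B  1  1  1
3  C  1 NA NA
4  D  1  1 NA
5  E  1 NA  1")[, r := NULL]

d2 = fread(
"r id v2 v3 v4 v5
1  C  1  1  1  1
2  D  1  1  1  1
3  E  1  1  1  1")[, r := NULL]

# Long form: one row per (id, variable) pair
md1 = melt(d1, id.vars = "id")
md2 = melt(d2, id.vars = "id")

# Stack d1 first and d2 last; fromLast = TRUE keeps the d2 value
# whenever both tables contain the same (id, variable) pair
res1 = unique(rbind(md1, md2), by = c("id", "variable"), fromLast = TRUE)

# Back to wide form; (id, variable) pairs with no row in res1 become NA
d12 = dcast(res1, id ~ variable, value.var = "value")
print(d12)
```

With the sample data this should reproduce the d12 table shown in the question: rows A and B keep their d1 values and get NA in v4 and v5, while rows C, D and E take every value that d2 provides.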

"I'd also like to know how this can be done if you only want to update the NA values in [d3], that is, make sure existing non-NA values are not overwritten."

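You can exclude rows from the update table, md2, if they already appear in md3. Melting d3 with na.rm=TRUE drops its NA cells from md3, so only those cells (and any new columns) are taken from md2:

md3 = melt(d3, id="id", na.rm=TRUE)

res3 = unique(rbind(md3, md2[!md3, on=.(id, variable)]),
  by=c("id", "variable"), fromLast=TRUE)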
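The NA-only variant can be checked the same way. This sketch makes one assumption explicit: the NA cells of d3 are dropped while melting (na.rm = TRUE), so that md3 lists only the values that must be protected. Without that, the anti-join would also discard the d2 values meant to fill the gaps. The names new_rows and d32 are illustrative (d32 follows the question's expected output).

```r
library(data.table)

d3 = fread(
"r id v1 v2 v3
1  A  1  1  1
2  B  1  1  1
3  C  1  0 NA
4  D  1  1  0
5  E  1 NA  1")[, r := NULL]

d2 = fread(
"r id v2 v3 v4 v5
1  C  1  1  1  1
2  D  1  1  1  1
3  E  1  1  1  1")[, r := NULL]

# Assumption: drop d3's NA cells so md3 holds only existing (protected) values
md3 = melt(d3, id.vars = "id", na.rm = TRUE)
md2 = melt(d2, id.vars = "id")

# Anti-join: keep the d2 rows whose (id, variable) pair is absent from md3,
# i.e. cells that are NA in d3 or belong to columns d3 does not have
new_rows = md2[!md3, on = .(id, variable)]

# Stack and de-duplicate as in the answer (no duplicates remain after the
# anti-join, so unique() is only a safeguard here)
res3 = unique(rbind(md3, new_rows), by = c("id", "variable"), fromLast = TRUE)

# Reshape to wide form to compare with the expected d32 table
d32 = dcast(res3, id ~ variable, value.var = "value")
print(d32)
```

On the sample data the 0s in rows C (v2) and D (v3) survive, while the NAs in rows C (v3) and E (v2) are filled from d2.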

dcast can be used to go back to wide format if necessary, e.g., dcast(res3, id ~ ...).
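If the round trip through long form is not wanted, the full-overwrite case can also be written as an update join on the wide tables. This is an alternative sketch, not the method the answer proposes. It covers only the scenario where d2 overwrites everything it has a value for, and the name cols is illustrative.

```r
library(data.table)

d1 = fread(
"r id v1 v2 v3
1  A  1  1  1
2  B  1  1  1
3  C  1 NA NA
4  D  1  1 NA
5  E  1 NA  1")[, r := NULL]

d2 = fread(
"r id v2 v3 v4 v5
1  C  1  1  1  1
2  D  1  1  1  1
3  E  1  1  1  1")[, r := NULL]

# Work on a copy, because := modifies a data.table by reference
d12 = copy(d1)

# Every non-key column of d2 is either overwritten in d12 or added to it
cols = setdiff(names(d2), "id")

# Update join: for each d12 row whose id matches a d2 row, assign the d2
# values (exposed with the i. prefix). Unmatched rows keep their values,
# and columns created by the assignment are NA for them.
d12[d2, on = "id", (cols) := mget(paste0("i.", cols))]
print(d12)
```

This keeps the row order of d1 and scales to many columns without listing them by hand, but it overwrites unconditionally, so the NA-only scenario still needs the long-form approach or a per-column coalesce.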
