连接数据框并在有多个匹配项时选择随机行 [英] Join data frames and select random row when there are multiple matches

查看:68
本文介绍了连接数据框并在有多个匹配项时选择随机行的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个参考数据框(df1),其中包含三列特性"(性别,年份,代码)和两列值"(金额,状态).看起来像这样,但是有很多行:

I have a reference data frame (df1) with three columns of "characteristics" (gender, year, code), and two columns of "values" (amount, status). It looks like this, but with many rows:

gender    year    code    amount   status
     M    2011       A        15      EMX
     M    2011       A       123      NOX
     F    2015       B         0      MIX
     F    2018       A        12      NOX
     F    2015       B        11      NOX

我还有另一个数据框(df2),其中只有三个特性"列.例如:

I have another data frame (df2) that just has the three "characteristics" columns. For example:

gender    year   code
     M    2011      A
     M    2011      A
     F    2018      A
     F    2015      B

对于df2中的每一行,我想基于df1的特征"中的匹配项来分配值".如果有多个匹配项,我想随机选择成对的值".因此,当df2中存在重复的特征"时,它们可能会以不同的值"对结尾,但是所有这些变量在df1中将具有完全匹配的值.本质上,对于特征的每种组合,我希望值的分布在两个表之间匹配.

For each row in df2, I want to assign "values" based on matches in "characteristics" to df1. Where there are multiple matches, I want to select pairs of "values" at random. So when there are duplicate "characteristics" in df2, they might end up with different pairs of "values", but all of them will have an exact match in df1. Essentially, for each combination of characteristics, I want the distribution of values to match between the two tables.

例如,"df2"中的最后一行(性别= F,年份= 2015,代码= B)与"df1"中的两行匹配:第三行(amont = 0,状态= MIX)和第五行(金额= 11,状态= NOX).然后,应随机选择这些匹配行之一.对于基于性别,年份和代码在"df2"和"df1"之间多次匹配的所有此类情况,应选择一个随机行.

For example, the last row in 'df2' (gender = F, year = 2015, code = B) matches two rows in 'df1': the third row (amont = 0, status = MIX) and the fifth row (amount = 11, status = NOX). Then one of these matching rows should be selected randomly. For all such cases of multiple matches between 'df2' and 'df1' based on gender, year and code, a random row should be selected.

到目前为止,我的方法是从使用dplyr在两个数据帧之间进行left_join开始.但是,这为df2中的每一行提供了所有可能的值",而不是随机选择一个.因此,我必须按特征分组并选择一个.这样会产生一个很大的中间表,而且效率似乎并不高.

So far, my approach has been to start by using dplyr to do a left_join between the two data frames. However, this provides all possible "values" for each row in df2, rather than selecting one at random. So I then have to group by characteristics and select one. This produces a very large intermediate table and doesn't seem very efficient.

我想知道是否有人建议使用一种更有效的方法?以前,我发现使用data.table软件包的速度更快,但是对软件包的了解并不多.我还想知道我是否应该进行联接,还是应该只使用sample函数?

I wondered if anyone had suggestions for a more efficient method? I've previously found that joining with the data.table package is quicker, but don't really have a good understanding of the package. I also wonder if I should be doing joins at all, or should just use the sample function?

非常感谢您的帮助.

推荐答案

基于'gender','year','code'(d1[d2, on = .(gender, year, code), ...])中的匹配项,使用'd2'在'd1'中查找行.对于每个匹配项(by = .EACHI),请采样一行(sample(.N, 1L)).用它来索引数量"和状态".

Use 'd2' to lookup rows in 'd1' based on matches in 'gender', 'year', 'code' (d1[d2, on = .(gender, year, code), ...]). For each match (by = .EACHI), sample one row (sample(.N, 1L)). Use this to index 'amount' and 'status'.

d1[d2, on = .(gender, year, code),
  {ri <- sample(.N, 1L)
  .(amount = amount[ri], status = status[ri])}, by = .EACHI]

# sample based on set.seed(1)
#    gender year code amount status
# 1:      M 2011    A     15    EMX
# 2:      M 2011    A     15    EMX
# 3:      F 2018    A     12    NOX
# 4:      F 2015    B     11    NOX


请注意,增强了mult参数的功能,即当x中的多行与i中的行匹配时如何处理情况.当前,有效选项为"all"(默认),"first""last".但是,如果/在实施问题时,可以使用mult = "random"(sample(.N, size = 1L))在匹配项中选择随机行(行).


Note that there is an open issue on Enhanced functionality of mult argument, i.e. how to handle cases when multiple rows in x match to the row in i. Currently, valid options are "all" (default), "first" or "last". But if/when the issue is implemented, mult = "random" (sample(.N, size = 1L)) may be used to select a random row (rows) among the matches.

这篇关于连接数据框并在有多个匹配项时选择随机行的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆