使用 data.table 的非 equi 连接:输出中缺少列 [英] Non-equi join using data.table: column missing from the output

查看:16
本文介绍了使用 data.table 的非 equi 连接:输出中缺少列的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用 data.table 进行左非等连接:

I am doing a left non-equi join using data.table:

OUTPUT <- DT2[DT1, on=.(DOB, FORENAME, SURNAME, POSTCODE, START_DATE <= MONTH, EXPIRY_DATE >= MONTH)]

OUTPUT 包含正确的左连接,但缺少 MONTH 列(存在于 DT1 中)除外.

The OUTPUT contains a correct left join, with the exception that the MONTH column (which is present in DT1) is missing.

这是 data.table 中的错误吗?

NB:当然,START_DATEEXPIRY_DATEMONTH 是相同的 YYYY-MM-DD、IDate 格式.根据这些非 equi 标准,连接的结果是正确的.只是缺少该列,我需要在进一步的工作中使用它.

NB: Of course, START_DATE, EXPIRY_DATE and MONTH are in the same YYYY-MM-DD, IDate format. The results of the join are correct based on these non-equi criteria. It is just that the column is missing and I need to use it in further work.

编辑 1:简化的可重现示例

DT1 <- structure(list(ID = c(1, 2, 3), FORENAME = c("JOHN", "JACK", 
"ROB"), SURNAME = c("JOHNSON", "JACKSON", "ROBINSON"), MONTH = structure(c(16953L, 
16953L, 16953L), class = c("IDate", "Date"))), .Names = c("ID", 
"FORENAME", "SURNAME", "MONTH"), row.names = c(NA, -3L), class = c("data.table", 
"data.frame"))

DT2 <- structure(list(CERT_NUMBER = 999, FORENAME = "JOHN", SURNAME = "JOHNSON", 
    START_DATE = structure(16801L, class = c("IDate", "Date")), 
    EXPIRY_DATE = structure(17166L, class = c("IDate", "Date"
    ))), .Names = c("CERT_NUMBER", "FORENAME", "SURNAME", "START_DATE", 
"EXPIRY_DATE"), row.names = c(NA, -1L), class = c("data.table", 
"data.frame"))

OUTPUT <- DT2[DT1, on=.(FORENAME, SURNAME, START_DATE <= MONTH, EXPIRY_DATE >= MONTH)]

> OUTPUT
   CERT_NUMBER FORENAME  SURNAME START_DATE EXPIRY_DATE ID
1:         999     JOHN  JOHNSON 2016-06-01  2016-06-01  1
2:          NA     JACK  JACKSON 2016-06-01  2016-06-01  2
3:          NA      ROB ROBINSON 2016-06-01  2016-06-01  3

  • FORENAMESURNAME 连接在一起并出现在输出中.
  • MONTH 也(非 equi)加入,输出中不存在.
    • FORENAME and SURNAME are joined on and are present in the output.
    • MONTH is also (non-equi) joined on, and is absent from the output.
    • 为什么会出现这种预期行为?

      Why is this expected behaviour?

      即使这是预期的行为,对我的情况也没有帮助,因为我需要保留 MONTH 以进行进一步的数据操作.

      Even if it is expected behaviour, it is not helpful in my case, because I need to retain MONTH for further data manipulation.

      我的预期输出将是同一张表,但 MONTH 列与 DT1 中的一样.毕竟,我对左连接的期望是从左表 (DT1) 中保留每一行和每一列,并从右表 (DT2) 中添加所有列和仅匹配的行.

      My expected output would be the same table, but with the MONTH column retained as it is in DT1. After all, what I expect from a left join is for every row and column to be retained from the left table (DT1) and for all columns and only matched rows to be added from the right table (DT2).

         CERT_NUMBER FORENAME  SURNAME START_DATE EXPIRY_DATE ID      MONTH
      1:         999     JOHN  JOHNSON 2016-01-01  2016-12-31  1 2016-06-01
      2:          NA     JACK  JACKSON       <NA>        <NA>  2 2016-06-01
      3:          NA      ROB ROBINSON       <NA>        <NA>  3 2016-06-01
      

      编辑 2:显然在我的代码产生的输出中,START 和 END 日期也是错误的!只有人 1 的证书开始日期为 1 月 1 日,结束日期为 12 月 31 日!预期的输出是应该的.但实际输出是 1 月 1 日.

      Edit 2: apparently in the output produced by my code, the START and END dates are also wrong! Only person 1 had a certificate with a start date on 1-Jan and end date on 31-Dec! The expected output is what it should be. But the actual output made everything 1-Jan.

      推荐答案

      在 data.table 中,x[i] 形式的连接传统上使用来自 i 的值,但是使用 x 中的列名.尽管这与返回两者的 SQL 不同,但此默认值对于 equi 连接很有意义,因为我们对来自 i 的所有行感兴趣,如果它们匹配,则两者都匹配data.tables 无论如何都有 equal 值,如果它们不相等,我们需要将 i 中那些不匹配的值保留在结果中.

      In data.table, joins of the form x[i] traditionally uses values from i but uses column names from x. Even though this is different from SQL which returns both, this default makes a lot of sense for equi joins since we are interested in all rows from i and if they match then both data.tables have equal values anyway, and if they don't we need to keep those unmatched values from i in result.

      但是对于 non-equi 连接,由于值可能不匹配完全,即可能在一个范围内,可能存在我们必须返回的情况类似于 SQL 的输出(或识别此类情况并返回用户期望的结果,类似于 equi 连接的情况).这还没有完成,但我现在已经为它做了一个规定,即使用 x. 前缀显式引用列.这不方便,我同意.希望这会很快自动处理.

      But for non-equi joins, since the values might not match exactly, i.e., can fall within a range, there might be cases where we will have to return outputs similar to SQL (or identify such cases and return the result user expects, similar to the case of equi joins). This hasn't been done yet, but I've placed a provision for it at the moment, which is to refer to the columns explicitly with a x. prefix. It is not convenient, I agree. Hopefully this'll be taken care of automatically soon.

      以下是使用 x. 前缀获取结果的方法.

      Here's how to get your result using x. prefix.

      ans <- DT2[DT1, .(CERT_NUMBER, FORENAME, SURNAME, x.START_DATE, x.EXPIRY_DATE, ID, MONTH), 
               on=.(FORENAME, SURNAME, START_DATE <= MONTH, EXPIRY_DATE >= MONTH)]
      

      IIRC 在 GitHub 项目页面上也有关于此的问题.

      IIRC there's an issue filed on the GitHub project page about this as well.

      这篇关于使用 data.table 的非 equi 连接:输出中缺少列的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆