dplyr::left_join produces NA values for new joined columns


Question

I have two tables I wish to left_join through the dplyr package. The issue is that it produces NA values for all new columns (the ones I'm after).

As you can see below, the left_join produces NA values for the new columns Incep.Price and DayCounter. Why does this happen, and how can it be resolved?

Update: Thanks to @akrun, using left_join(Avanza.XML, checkpoint, by = c('Firm' = 'Firm')) solves the issue and the columns are joined correctly.

However, the warning message is still the same. Could someone explain this behaviour? Why must one explicitly specify the join columns in this case, or otherwise get NA values?

> head(Avanza.XML)
                      Firm Gain.Month.1 Last.Price Vol.Month.1
1     Stockwik Förvaltning       131.25      0.074   131264420
2                 Novestra        37.14      7.200      605330
3       Bactiguard Holding        29.55     14.250     2815572
4              MSC Group B        20.87      3.070      671855
5 NeuroVive Pharmaceutical        18.07      9.800     3280944
6      Shelton Petroleum B        16.21      3.800     2135798

> head(checkpoint)
                      Firm Gain.Month.1 Last.Price Vol.Month.1 Incep.Price DayCounter
1     Stockwik Förvaltning        87.50       0.06    91270090    0.032000 2016-01-25
2                 Novestra        38.10       7.25      604683    5.249819 2016-01-25
3       Bactiguard Holding        29.09      14.20     2784161   11.000077 2016-01-25
4              MSC Group B        27.56       3.24      657699    2.539981 2016-01-25
5      Shelton Petroleum B        19.27       3.90     1985305    3.269892 2016-01-25
6 NeuroVive Pharmaceutical        16.87       9.70     3220303    8.299820 2016-01-25

> head(left_join(Avanza.XML, checkpoint))
Joining by: c("Firm", "Gain.Month.1", "Last.Price", "Vol.Month.1")
                      Firm Gain.Month.1 Last.Price Vol.Month.1 Incep.Price DayCounter
1     Stockwik Förvaltning       131.25      0.074   131264420          NA       <NA>
2                 Novestra        37.14      7.200      605330          NA       <NA>
3       Bactiguard Holding        29.55     14.250     2815572          NA       <NA>
4              MSC Group B        20.87      3.070      671855          NA       <NA>
5 NeuroVive Pharmaceutical        18.07      9.800     3280944          NA       <NA>
6      Shelton Petroleum B        16.21      3.800     2135798          NA       <NA>
Warning message:
In left_join_impl(x, y, by$x, by$y) :
  joining factors with different levels, coercing to character vector

Recommended answer

There are two issues.

  1. Not specifying the by argument in left_join: in this case, by default all columns common to both datasets are used as the join variables. The columns "Gain.Month.1", "Last.Price" and "Vol.Month.1" are all numeric and do not have matching values across the two datasets, so no row of checkpoint matches and the new columns are filled with NA. It is better to join by "Firm" only:

left_join(Avanza.XML, checkpoint, by = "Firm")

  2. The "Firm" column is a factor: we get a warning when the levels of a factor column differ between the datasets (if it is a variable that we join by). To remove the warning, we can either convert the "Firm" column in both datasets to character class:

    Avanza.XML$Firm <- as.character(Avanza.XML$Firm)
    checkpoint$Firm <- as.character(checkpoint$Firm)
    

    Or, if we still want to keep the column as a factor, change the levels of "Firm" in both datasets to include all the levels from both:

    lvls <- sort(unique(c(levels(Avanza.XML$Firm),
                          levels(checkpoint$Firm))))
    Avanza.XML$Firm <- factor(Avanza.XML$Firm, levels = lvls)
    checkpoint$Firm <- factor(checkpoint$Firm, levels = lvls)
    

    and then do the left_join.
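To see the first issue in isolation, here is a minimal sketch with made-up toy data (the data frames `current` and `snapshot` and their values are hypothetical stand-ins for Avanza.XML and checkpoint, not the asker's data). It assumes dplyr is installed; the exact wording of the "Joining by" message varies between dplyr versions.

```r
library(dplyr)

# Hypothetical toy data: the same firms appear in both tables,
# but the shared numeric column holds different values.
current <- data.frame(
  Firm = c("Novestra", "MSC Group B"),
  Last.Price = c(7.20, 3.07),
  stringsAsFactors = FALSE
)
snapshot <- data.frame(
  Firm = c("Novestra", "MSC Group B"),
  Last.Price = c(7.25, 3.24),
  Incep.Price = c(5.25, 2.54),
  stringsAsFactors = FALSE
)

# No `by`: dplyr joins on every column the two tables share,
# here both Firm and Last.Price. No row agrees on both,
# so the new column comes back as NA.
natural <- left_join(current, snapshot)
print(natural)

# Explicit key: rows match on Firm alone. The other shared column
# is kept from both sides with the default .x / .y suffixes.
keyed <- left_join(current, snapshot, by = "Firm")
print(keyed)
```

In the second result, `Last.Price.x` comes from the left table and `Last.Price.y` from the right one, which makes it easy to compare the two snapshots side by side.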
