dplyr::left_join produces NA values for new joined columns
Problem description
I have two tables I wish to left_join through the dplyr package. The issue is that it produces NA values for all new columns (the ones I'm after).
As you can see below, the left_join produces NA values for the new columns Incep.Price and DayCounter. Why does this happen, and how can it be resolved?
Update: Thanks to @akrun, using left_join(Avanza.XML, checkpoint, by = c('Firm' = 'Firm')) solves the issue and the columns are joined correctly.
However, the warning message is still the same. Could someone explain this behaviour? Why must one explicitly specify the join columns in this case to avoid getting NA values?
> head(Avanza.XML)
                      Firm Gain.Month.1 Last.Price Vol.Month.1
1     Stockwik Förvaltning       131.25      0.074   131264420
2                 Novestra        37.14      7.200      605330
3       Bactiguard Holding        29.55     14.250     2815572
4              MSC Group B        20.87      3.070      671855
5 NeuroVive Pharmaceutical        18.07      9.800     3280944
6      Shelton Petroleum B        16.21      3.800     2135798
> head(checkpoint)
                      Firm Gain.Month.1 Last.Price Vol.Month.1 Incep.Price DayCounter
1     Stockwik Förvaltning        87.50       0.06    91270090    0.032000 2016-01-25
2                 Novestra        38.10       7.25      604683    5.249819 2016-01-25
3       Bactiguard Holding        29.09      14.20     2784161   11.000077 2016-01-25
4              MSC Group B        27.56       3.24      657699    2.539981 2016-01-25
5      Shelton Petroleum B        19.27       3.90     1985305    3.269892 2016-01-25
6 NeuroVive Pharmaceutical        16.87       9.70     3220303    8.299820 2016-01-25
> head(left_join(Avanza.XML, checkpoint))
Joining by: c("Firm", "Gain.Month.1", "Last.Price", "Vol.Month.1")
                      Firm Gain.Month.1 Last.Price Vol.Month.1 Incep.Price DayCounter
1     Stockwik Förvaltning       131.25      0.074   131264420          NA       <NA>
2                 Novestra        37.14      7.200      605330          NA       <NA>
3       Bactiguard Holding        29.55     14.250     2815572          NA       <NA>
4              MSC Group B        20.87      3.070      671855          NA       <NA>
5 NeuroVive Pharmaceutical        18.07      9.800     3280944          NA       <NA>
6      Shelton Petroleum B        16.21      3.800     2135798          NA       <NA>
Warning message:
In left_join_impl(x, y, by$x, by$y) :
  joining factors with different levels, coercing to character vector
Recommended answer
There are two problems.
- Not specifying the by argument in left_join: In this case, by default, every column whose name appears in both datasets is used as a join variable (here "Firm", "Gain.Month.1", "Last.Price" and "Vol.Month.1", as the "Joining by" message shows). The columns "Gain.Month.1", "Last.Price" and "Vol.Month.1" are all of numeric class, and their values differ between the two datasets, so no row matches on all four columns. It is therefore better to join by "Firm" only:
left_join(Avanza.XML, checkpoint, by = "Firm")
- The "Firm" column class is factor: We get the warning when the levels of a factor column differ between the two datasets (if it is a variable that we join by). To remove the warning, we can either convert the "Firm" column in both datasets to character class:
Avanza.XML$Firm <- as.character(Avanza.XML$Firm)
checkpoint$Firm <- as.character(checkpoint$Firm)
Or, if we still want to keep the columns as factor, change the levels of "Firm" in both datasets to include all the levels found in either dataset:
lvls <- sort(unique(c(levels(Avanza.XML$Firm),
levels(checkpoint$Firm))))
Avanza.XML$Firm <- factor(Avanza.XML$Firm, levels=lvls)
checkpoint$Firm <- factor(checkpoint$Firm, levels=lvls)
and then do the left_join.
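Both points can be reproduced on a small scale. The following is only a sketch under stated assumptions: the data frames x and y and the factors fx and fy are hypothetical stand-ins for Avanza.XML and checkpoint, with made-up values, and the behaviour of the factor warning itself may differ between dplyr versions, so only the level alignment is shown.

```r
library(dplyr)

# Hypothetical toy data: same firms in both tables, but different prices
x <- data.frame(
  Firm = c("Novestra", "MSC Group B"),
  Last.Price = c(7.20, 3.07),
  stringsAsFactors = FALSE
)
y <- data.frame(
  Firm = c("Novestra", "MSC Group B"),
  Last.Price = c(7.25, 3.24),
  Incep.Price = c(5.25, 2.54),
  stringsAsFactors = FALSE
)

# No `by`: the join uses every shared column name (Firm and Last.Price).
# No row of y matches on both, so Incep.Price is NA in every row.
natural <- left_join(x, y)

# Explicit key: rows match on Firm alone. Last.Price exists in both inputs,
# so it is returned twice, as Last.Price.x and Last.Price.y.
keyed <- left_join(x, y, by = "Firm")

# Keeping only the key and the new column from y avoids the suffixed duplicates
clean <- left_join(x, select(y, Firm, Incep.Price), by = "Firm")

# Factor keys: give both factors the union of their levels before joining
fx <- factor(c("Novestra", "MSC Group B"))
fy <- factor(c("Novestra", "Shelton Petroleum B"))
lvls <- sort(unique(c(levels(fx), levels(fy))))
fx <- factor(fx, levels = lvls)
fy <- factor(fy, levels = lvls)
same_levels <- identical(levels(fx), levels(fy))
```

Note that joining by "Firm" alone keeps the overlapping numeric columns from both tables, distinguished by the default .x and .y suffixes, so selecting the wanted columns from the right-hand table first, as in clean, may give a tidier result.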