Complicated feature matrix and target vector join in R
Problem description
I asked a similar question here: Joining feature matrix and target vector in R with complicated logic
But I've created a new question because my earlier prompt was confusing and unclear.
I have a feature vector like this:
rest_id qtr cooking cleaning eating jumping
1 123 1 FALSE TRUE FALSE FALSE
2 123 2 FALSE TRUE FALSE FALSE
3 123 3 FALSE TRUE FALSE FALSE
4 123 4 FALSE TRUE FALSE FALSE
5 435 1 FALSE TRUE FALSE FALSE
6 435 2 FALSE TRUE FALSE FALSE
7 435 3 FALSE TRUE FALSE FALSE
8 435 4 FALSE TRUE FALSE FALSE
9 437 1 FALSE TRUE FALSE FALSE
10 437 2 FALSE TRUE FALSE FALSE
11 437 3 FALSE TRUE FALSE TRUE
12 437 4 FALSE TRUE FALSE FALSE
13 439 2 FALSE TRUE TRUE FALSE
14 508 1 FALSE TRUE TRUE FALSE
15 508 2 FALSE TRUE TRUE FALSE
16 234 2 FALSE TRUE TRUE FALSE
And a target vector like this:
rest_id qtr target
1 123 1 TRUE
2 123 2 FALSE
3 123 3 FALSE
4 123 4 TRUE
5 123 5 TRUE
6 435 1 TRUE
7 435 2 TRUE
8 435 3 TRUE
9 435 4 FALSE
10 435 5 FALSE
11 437 1 TRUE
12 437 2 TRUE
13 437 3 TRUE
14 437 4 FALSE
15 439 3 FALSE
16 508 3 FALSE
17 508 5 FALSE
18 234 3 TRUE
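To make the example reproducible, the two tables above could be built as data frames along these lines. This is only a sketch: the names feature_vector and target_vector are borrowed from the answer, and the column types (numeric ids, logical flags) are an assumption based on the printed output.

```r
# Feature table: one row per (rest_id, qtr) for which features exist.
feature_vector <- data.frame(
  rest_id  = c(rep(123, 4), rep(435, 4), rep(437, 4), 439, 508, 508, 234),
  qtr      = c(1:4, 1:4, 1:4, 2, 1, 2, 2),
  cooking  = FALSE,
  cleaning = TRUE,
  eating   = c(rep(FALSE, 12), rep(TRUE, 4)),
  jumping  = c(rep(FALSE, 10), TRUE, rep(FALSE, 5))
)

# Target table: note that some (rest_id, qtr) pairs are absent.
target_vector <- data.frame(
  rest_id = c(rep(123, 5), rep(435, 5), rep(437, 4), 439, 508, 508, 234),
  qtr     = c(1:5, 1:5, 1:4, 3, 3, 5, 3),
  target  = c(TRUE, FALSE, FALSE, TRUE, TRUE,
              TRUE, TRUE, TRUE, FALSE, FALSE,
              TRUE, TRUE, TRUE, FALSE,
              FALSE, FALSE, FALSE, TRUE)
)
```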
I want to join these two together such that:
Feature Q1 -> Target Q1, Q2
Feature Q2 -> Target Q2, Q3
Feature Q3 -> Target Q3, Q4
Feature Q4 -> Target Q4, Q5
For example, if the feature observation is in quarter 1, we check quarters 1 and 2 of the target vector for that rest_id: if they are both TRUE the target becomes TRUE, if they are both FALSE the target becomes FALSE, and if one is TRUE and the other is FALSE the target becomes TRUE. The same logic applies for Q2, Q3, Q4.
However, there are some missing quarters in the target vector. If we are looking at quarter 1 in our feature vector, we check the target for the same rest_id for Q1 and Q2. There are three cases that can happen:
Q1 is missing and Q2 is not missing ---> take the target value for Q2
Q1 is not missing and Q2 is missing ---> take the target value for Q1
Q1 and Q2 are both missing ---> should be N/A
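The rule for a single pair of quarters can be written as a small base R function. This is a minimal sketch; combine_targets is a hypothetical helper name, not something from the question. A plain | is not enough on its own, because in R FALSE | NA evaluates to NA rather than FALSE.

```r
# Hypothetical helper: combine the current and the next quarter's target.
# NA stands for a missing quarter.
combine_targets <- function(current, nxt) {
  ifelse(is.na(current), nxt,                  # current missing: use next (may be NA)
         ifelse(is.na(nxt), current,           # next missing: use current
                current | nxt))                # both present: logical OR
}

combine_targets(TRUE, FALSE)   # both present, one TRUE
combine_targets(NA, FALSE)     # only the next quarter is present
combine_targets(FALSE, NA)     # only the current quarter is present
combine_targets(NA, NA)        # both missing
```

The function is vectorised, so it can also be applied to whole columns at once.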
The expected output looks like this:
rest_id qtr cooking cleaning eating jumping target
123 1 FALSE TRUE FALSE FALSE TRUE
123 2 FALSE TRUE FALSE FALSE FALSE
123 3 FALSE TRUE FALSE FALSE TRUE
123 4 FALSE TRUE FALSE FALSE TRUE
435 1 FALSE TRUE FALSE FALSE TRUE
435 2 FALSE TRUE FALSE FALSE TRUE
435 3 FALSE TRUE FALSE FALSE TRUE
435 4 FALSE TRUE FALSE FALSE FALSE
437 1 FALSE TRUE FALSE FALSE TRUE
437 2 FALSE TRUE FALSE FALSE TRUE
437 3 FALSE TRUE FALSE TRUE TRUE
437 4 FALSE TRUE FALSE FALSE FALSE
439 2 FALSE TRUE TRUE FALSE FALSE
508 1 FALSE TRUE TRUE FALSE N/A
508 2 FALSE TRUE TRUE FALSE FALSE
234 2 FALSE TRUE TRUE FALSE TRUE
I can't do this with just a regular join in R because of the complicated logic I mentioned. What is the easiest way to do this?
Thanks!
Recommended answer
A tidyverse way (since the question is tagged with it):
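library(tidyverse)

# Build every rest_id x quarter (1 to 5) pair so that missing quarters become NA
expand_grid(rest_id = unique(feature_vector$rest_id), qtr = 1:5) %>%
  arrange(rest_id, qtr) %>%
  left_join(target_vector, by = c("rest_id", "qtr")) %>%
  group_by(rest_id) %>%
  # Next quarter's target within the same restaurant
  mutate(lead_target = lead(target)) %>%
  # Use whichever quarter is present; if both are present, take the logical OR
  mutate(aimed_target = case_when(
    !is.na(target) & is.na(lead_target) ~ target,
    is.na(target) & !is.na(lead_target) ~ lead_target,
    TRUE ~ target | lead_target
  )) %>%
  ungroup() %>%
  # Keep only the rows that exist in the feature table
  right_join(feature_vector, by = c("rest_id", "qtr")) %>%
  select(rest_id, qtr, cooking, cleaning, eating, jumping, aimed_target) %>%
  rename(target = aimed_target)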
First I create every combination of the rest_id values in the feature vector and qtr from 1 to 5 using expand_grid(). Then I use arrange() to sort the grid (this is redundant if rest_id is already sorted in the first place).
Then I use left_join() to join the target_vector to the aforementioned grid. These first two steps are done so that every missing rest_id and qtr combination gets an NA value in the target column.
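As an illustration of these two steps, here is a small sketch on a toy table (toy_target is a made-up name) that holds only restaurant 508, which has targets for quarters 3 and 5 in the question's data. It assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Toy target table: restaurant 508 only has quarters 3 and 5.
toy_target <- tibble(
  rest_id = c(508, 508),
  qtr     = c(3L, 5L),
  target  = c(FALSE, FALSE)
)

# Full grid of quarters 1 to 5, then attach the targets that exist.
grid <- expand_grid(rest_id = 508, qtr = 1:5) %>%
  left_join(toy_target, by = c("rest_id", "qtr"))

grid
```

Quarters 1, 2 and 4 have no match in toy_target, so their target is NA after the join.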
I create the column lead_target because you always want both the current quarter's and the next quarter's target value, and lead() puts both on the same row. Before that I use group_by() so that lead() only looks ahead within the same rest_id.
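The effect of grouping before lead() can be seen on a toy table with two restaurants. This is a sketch with made-up values (toy, with_lead and leaky are illustrative names), assuming dplyr is installed.

```r
library(dplyr)

toy <- tibble(
  rest_id = c(1, 1, 1, 2, 2),
  qtr     = c(1L, 2L, 3L, 1L, 2L),
  target  = c(TRUE, FALSE, NA, FALSE, TRUE)
)

# Grouped: the last row of each restaurant has no next quarter, so it gets NA.
with_lead <- toy %>%
  group_by(rest_id) %>%
  mutate(lead_target = lead(target)) %>%
  ungroup()

# Ungrouped: the last row of restaurant 1 wrongly picks up restaurant 2's first value.
leaky <- lead(toy$target)
```

Without group_by(), the third element of leaky comes from a different restaurant, which is exactly what the grouping prevents.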
aimed_target is created using the logic you specified. I use case_when() as a replacement for multiple nested ifelse() calls. The operator | is "or", in case you wonder.
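To check the rule against every possible input, the sketch below runs the same case_when() over all nine TRUE / FALSE / NA combinations. It also compares it with a shorter coalesce() formulation, which is an alternative suggested here rather than part of the original answer. It assumes dplyr is installed; combos is an illustrative name.

```r
library(dplyr)

# All nine combinations of current and next quarter values.
combos <- data.frame(
  target      = rep(c(TRUE, FALSE, NA), times = 3),
  lead_target = rep(c(TRUE, FALSE, NA), each = 3)
)

combos <- combos %>%
  mutate(
    via_case_when = case_when(
      !is.na(target) & is.na(lead_target) ~ target,
      is.na(target) & !is.na(lead_target) ~ lead_target,
      TRUE ~ target | lead_target
    ),
    # Alternative: try the OR first, then fall back to whichever side is present.
    via_coalesce = coalesce(target | lead_target, target, lead_target)
  )

combos
```

Both columns agree on every row, and the result is NA only when both quarters are missing.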
The rest of the code is straightforward: right_join() keeps only the rows present in the feature vector, select() drops the helper columns, and rename() turns aimed_target into target.