使用2个表格制作新功能 [英] Make new feature using 2 tables

查看:103
本文介绍了使用2个表格制作新功能的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

table1 <- data.frame(user_id=c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2),
                 product_id = c(14, 24, 38, 40, 66, 2, 19, 30, 71, 98, 7, 16),
                 first_order = c(1, 2, 1, 4, 5, 3, 2, 4, 2, 4, 2, 3),
                 last_order = c(4, 7, 5, 8, 8, 3, 4, 7, 5, 9, 4, 5))
table2 <- data.frame(user_id=c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
                 order_number=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6),
                 days_cumsum = c(0, 7, 15, 26, 34, 43, 53, 59, 66, 74, 82, 91, 5, 11, 17, 24, 29, 35))

我想使用table2向table1添加新功能。
新功能是每个用户订购每种产品的时间间隔。

例如,让我们看一下表1。第一行具有(user_id == 1),(product_id == 14),(第一订单== 1)和(最后订单== 4)。这表示产品14已按订单1、2、3、4进行订购。我们可以在表2中找到该订单号。
新功能是关于第一笔订单和最后一张订单之间的订单间隔。我们可以在表2中使用 days_cumsum来得出它,它是先前订单后的天数的累积总和。第一行的新功能值为26(= 26-0)。

For example, let's see the table1. First row has (user_id==1), (product_id==14), (first order==1) and (last order ==4). It means product 14 was ordered at order 1, 2, 3, 4. And we can find this order number in table 2. New feature is about order interval between first order and last order. We can derive it in table 2, using "days_cumsum" which is cumulative sum of days after prior order. First row's new feature value is 26(=26-0).

我认为可以通过join完成,但是 我不能使用join ,因为每个表都非常

I think it may be done with join, but I can't use join because each table is very big actually.

所以我在下面使用此函数和for循环:

So I'm using this function below with for loop :

f <- function(i){
  a <- table2 %>% 
    filter(user_id==table1[i, 1]) %>% 
    filter(order_number==table1[i, 3] | order_number==table1[i, 4])

  ifelse(nrow(a)==2, a[2, 3] - a[1, 3], 999999) # first_order==last_order
}

它会计算每个新功能值逐行,但是它非常慢并且需要很多计算。我经常遇到这个问题(使用两个表创建新功能),但是每次遇到困难时。

It calculates each new feature value row by row, but it is very slow and needs many calculation. I often encounter this problem(make new feature using two tables), but every time I have difficulties.

是否有更好的代码?我正在等待您的帮助。

Is there better code? I'm waiting for your help.

推荐答案

可以共享 join的运行时/计算时间比较吗? 以及使用循环的版本?

Can you share the runtime/computation time comparisons for join and the version using your loop?

下面是使用联接的解决方案。

Below is a solution using joins.

library(tidyverse)

df1 <- as.data.frame(table1)
df2 <- as.data.frame(table2)


df1 %>%
  left_join(df2, by = c("user_id"="user_id", "first_order" = "order_number")) %>%
  rename(dayMin = days_cumsum) %>%
  left_join(df2, by = c("user_id"="user_id", "last_order" = "order_number")) %>%
  rename(dayMax = days_cumsum) %>%
  mutate(newVar = dayMax-dayMin) %>%
  select(user_id, product_id, first_order, last_order, newVar)

其中提供:

   user_id product_id first_order last_order newVar
     <dbl>      <dbl>       <dbl>      <dbl>  <dbl>
 1       1         14           1          4     26
 2       1         24           2          7     46
 3       1         38           1          5     34
 4       1         40           4          8     33
 5       1         66           5          8     25
 6       1          2           3          3      0
 7       1         19           2          4     19
 8       1         30           4          7     27
 9       1         71           2          5     27
10       1         98           4          9     40
11       2          7           2          4     13
12       2         16           3          5     12

这篇关于使用2个表格制作新功能的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆