在日期范围内按组ID左加入 [英] Left join on date range by group ID

查看:54
本文介绍了在日期范围内按组ID左加入的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

考虑此令牌数据集:

Data = structure(list(txs = c(-50, -750, -35, -5.96, -61.5, -42.07, 
-142.4, -500, 132, -154.89, -109.51, -2000, -50, -40, -24.98, 
-15.6, -50, -147.72, -20, -6.6, -5, -20, -13.48, -7.25, -54.09, 
-200, -124.11, -30, -50, -30, 400, -10, -0.95, -4.1, -10000, 
30, -1.99, 74.03, -6.95, -2.96, -29, -403.6, -6, -6, 5250, -513.57, 
-300, -10, -500, -20, -6.45, -7.26, -40, -50, -13.14, 321.29, 
-18, 100, -5.5, -25, -59.2, -10.75, -3.2, 270, 65.8, -11.6, -104.78, 
-99.39, 0.1, -50, -80, -50, -371.44, -78, 270, -6.3, 40, -2.5, 
-29.99, -189.48, -400, -0.29, -20, -6.55, -987.37, -1400, -0.49, 
-20, -29.04, -65, -40, -27.5, -17.37, -10, -1092.84, -5.5, -69.93, 
-15.07, -400, -4.8), week = structure(c(1439157600, 1454281200, 
1471212000, 1445205600, 1448233200, 1451862000, 1449442800, 1453676400, 
1460325600, 1460930400, 1445205600, 1454281200, 1460930400, 1444600800, 
1462140000, 1471816800, 1443996000, 1448838000, 1479682800, 1453071600, 
1447023600, 1473631200, 1465768800, 1433109600, 1445205600, 1433714400, 
1466978400, 1441576800, 1459116000, 1451862000, 1436133600, 1440367200, 
1456095600, 1458514800, 1456700400, 1450047600, 1440972000, 1446418800, 
1465164000, 1441576800, 1442181600, 1453071600, 1461535200, 1460930400, 
1438552800, 1464559200, 1447628400, 1434924000, 1437343200, 1436738400, 
1443391200, 1438552800, 1440972000, 1446418800, 1446418800, 1453071600, 
1453071600, 1457305200, 1444600800, 1462140000, 1435528800, 1457305200, 
1437948000, 1440972000, 1437948000, 1433109600, 1461535200, 1453676400, 
1454886000, 1454281200, 1441576800, 1441576800, 1471212000, 1453071600, 
1451862000, 1442786400, 1443391200, 1439762400, 1436133600, 1461535200, 
1442181600, 1468188000, 1442181600, 1453676400, 1466373600, 1443391200, 
1450652400, 1454886000, 1439157600, 1441576800, 1463954400, 1442181600, 
1446418800, 1454886000, 1476050400, 1461535200, 1456700400, 1456700400, 
1435528800, 1456700400), class = c("POSIXct", "POSIXt"), tzone = ""), 
    num_c = c(1219, 1257, 1195, 33, 1105, 1223, 1257, 1317, 486, 
    1227, 477, 1039, 1238, 1008, 1137, 1294, 1070, 596, 1295, 
    1354, 1010, 1294, 1348, 1254, 19, 1185, 24, 1287, 1198, 955, 
    1324, 1293, 1343, 1162, 1272, 972, 972, 179, 1343, 1105, 
    1085, 1020, 947, 1375, 1005, 477, 596, 1198, 928, 1137, 1263, 
    1237, 1054, 1288, 1185, 1115, 1257, 1301, 1294, 1185, 1039, 
    957, 1131, 33, 477, 1258, 477, 1039, 1362, 1246, 596, 1010, 
    972, 1238, 477, 1296, 972, 1148, 1105, 24, 553, 1297, 1288, 
    1223, 789, 1298, 1082, 1353, 1030, 1287, 1203, 1008, 1294, 
    1227, 1298, 1203, 1346, 1010, 19, 1303)), .Names = c("txs", 
"week", "num_c"), row.names = c(NA, -100L), class = "data.frame")

它具有三列:

一个叫 num_c :这是一个客户编号

  • one called num_c: this is a client number

一个称为 week 的日期:这是放置
num_c 的星期几

one called week: this is the date of the Monday of the weeks in which num_c placed an order.

现在,对于每个客户,我想使用dplyr扩展(加入)此数据集
,以便在
的日期范围内,每周获得一行该客户已下订单。

Now, for every client I would like to use dplyr to expand (join) this data set so that I get one row for each week in the range of dates during which that client has placed orders.

此外,空白单元格(用于TX)应填充NA。
与原始(未扩展)数据集中存在的周/客户相对应的tx值应保留其原始值;

Moreover, the new empty cells (for txs) should be filled with NA's. The values of txs corresponding to those weeks/clients that are present in the original (no expanded) data set should retain their original values;

我尝试过:

 library(dplyr)
 stretch_Data = Data %>%
                group_by(num_c) %>%
                right_join(seq(min(week), max(week), by = 'week'), by = "week")

但是我明白了。

Error in seq(min(week), max(week), by = "week") : object 'week' not found

这很愚蠢,因为数据确实包含一个星期列(非常感谢您)。

which is silly because Data does contain a week column (thank you very much).

我在做什么错?

感谢@ mt1022解决方案(以下评论)。这是安静的聪明。但是仍然存在一个问题:是否可以包含Data中的所有列? -为了说明我的观点,我添加了一个:txs –然后,缺失值(与没有购买的几周相对应的值)应由NA填充(例如在联合中)。与我们购买的客户*周日期相对应的单元格应仅保留其原始值。

Thanks to @mt1022 for his solution (comment below). This is quiet clever. But one problem remains: is it possible to include all the colums from Data? --To make my point I've added one: txs-- Then the missing values (those corresponding to weeks with no purchase) should be filled by NA's (like in a joint). The cells corresponding to client * week dates for which we had a purchase should just retain their original values.

实质上,对新的(扩展表)执行na.omit()应该返回原始表;就像在关节上一样;

In essence doing na.omit() on the new (expanded table) should return the original table; like it would on a joint;

推荐答案

使用@ mt1002中的解决方案并添加一个简单的加法即可获得 txs 列。我已经通过 sum(txs)获得了一个单一值,其中也可能是 min(txs) max(txs),具体取决于您的需求。

Using the solution from @mt1002 with one simple addition to obtain the txs column. I have obtained a single value by sum(txs), where this could also be a single value of the min(txs) or max(txs) depending on your needs.

Data %>% group_by(num_c) %>% 
  summarise(week = list(seq(min(week), max(week), by = 'week')),
            txs = sum(txs)) %>% 
  unnest(week)

在澄清之后,这是我想出的解决方案是,除了每个用户每周的订单数之外,还有几周没有订单的NA值。您还可以使用包含上述对df的查询的左连接,将订单由 num_c 加入的周列表加入。

After the clarification, this is the solution I came up with where there are NA values for weeks there were no orders, in addition to the number of orders per user per week. You could also join the list of weeks with orders by num_c using a left-join including the above query to the df.

library(lubridate)
a <- data.frame(week = rep(seq(1,52,1)))
Data %>% 
  group_by(num_c) %>% 
  mutate(week_num = week(week)) %>% 
  group_by(num_c, week_num) %>% 
  summarise(txs = sum(txs),
            number_orders = n()) %>% 
  full_join(a, by = c("week_num"="week")) %>% 
  ungroup() %>% 
  arrange(week_num)

这篇关于在日期范围内按组ID左加入的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆