Apply a ranking window function in dbplyr backend

Problem description

I want to seamlessly identify new orders (acquisitions) and returns in my transactional database table.

This sounds like the perfect job for a window function; I would like to perform this operation in dbplyr.

My current process is:

  1. Create a query string that I then pass to dbGetQuery(); this query contains a standard rank() window function, as commonly seen in PostgreSQL.
  2. Ingest the result of this query into my R environment.
  3. Then, using ifelse() inside the mutate() verb, identify the first orders (aka acquisition orders) as the ones marked with 1 by the window function, and "repeat" orders otherwise.

query <-
  "SELECT o.user_id,
          o.id,
          o.completed_at,
          rank() over (partition by o.user_id order by o.completed_at asc) as order_number
   FROM orders as o"

df <- dbGetQuery(db, query) %>%
  mutate(order_type = ifelse(order_number == '1', 'acquisition', 'repeat'))


I assume there is a way to condense this process using dbplyr, but at the moment I don't know exactly how.

This is the output of the query:

    id    user_id completed_at        order_number
1   58051      68 2019-02-02 09:45:59            1
2   78173    7173 2019-03-28 08:30:16            1
3   79585    7173 2019-04-15 21:59:51            2
4  105261    7173 2019-07-15 13:51:44            3
5   57158    7181 2019-01-02 08:30:12            1
6   64316    7185 2019-02-24 14:54:26            1
7   77556    7185 2019-03-26 08:30:26            2
8   91287    7185 2019-04-25 08:30:25            3
9   55781    7191 2018-12-04 09:21:42            1
10  57039    7191 2019-01-01 08:30:11            2
11  55947    7204 2018-12-10 20:56:41            1
12 106126    7204 2019-06-28 15:10:27            2
13 112490    7204 2019-07-19 14:38:16            3
14 112514    7204 2019-07-19 16:24:09            4

You can find test data in this gdoc -> link.

Recommended answer

One of the key challenges here is that dbplyr can only translate a limited set of R commands into SQL. R functions outside of dplyr are unlikely to translate to SQL effectively.

Assuming db is your database connection, I would try something like the following:

# create a local pointer to the database table
orders <- tbl(db, from = "orders")

# define table using dplyr commands
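df <- orders %>%
  group_by(user_id) %>%
  # number each user's orders from earliest to latest
  mutate(order_number = row_number(completed_at)) %>%
  ungroup() %>%
  # label while order_number is still available, then keep the output columns
  mutate(order_type = ifelse(order_number == 1, 'acquisition', 'repeat')) %>%
  select(user_id, id, completed_at, order_type)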

# check underlying sql and confirm it is correct
df %>% show_query()

# load data into local R from database
df <- collect(df)
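
The generated SQL can also be inspected without a live connection. The sketch below is an illustration rather than part of the original answer: it assumes dbplyr's simulated PostgreSQL backend (lazy_frame() with simulate_postgres()), the column values are placeholders, and the object names orders_sim, ranked and sql_text are hypothetical. min_rank() is included because it is the dplyr counterpart of SQL's rank(), which the query in the question uses. The exact SQL text differs between dbplyr versions.

```r
library(dplyr)
library(dbplyr)

# Stand-in for the orders table: only the column names and types matter,
# the values are placeholders and no database is contacted.
# The table appears under lazy_frame()'s default name in the rendered SQL.
orders_sim <- lazy_frame(
  user_id = 1L,
  id = 1L,
  completed_at = "2019-01-01 00:00:00",
  con = simulate_postgres()
)

ranked <- orders_sim %>%
  group_by(user_id) %>%
  mutate(
    order_number = row_number(completed_at),  # unique 1, 2, 3, ... per user
    order_rank   = min_rank(completed_at),    # ties share a rank, gaps follow
    order_dense  = dense_rank(completed_at)   # ties share a rank, no gaps
  ) %>%
  ungroup() %>%
  mutate(order_type = ifelse(order_number == 1, "acquisition", "repeat"))

# Render the generated SQL as a string and print it for review
sql_text <- as.character(sql_render(ranked))
cat(sql_text, "\n")
```

If two orders of the same user can share a completed_at timestamp, row_number() still marks exactly one of them as order 1, whereas min_rank() and dense_rank() would mark both.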

Depending on your application, you may prefer dense_rank to row_number. I have not tested the dbplyr translation of either to SQL. Using only operations I know work in my R & database environment, I would write:

orders <- tbl(db, from = "orders")

df_acquisition <- orders %>%
  group_by(user_id) %>%
  # previous order's timestamp for the same user
  mutate(tmp = lag(completed_at, 1, order_by = completed_at)) %>%
  # only the earliest record per user will lack a tmp value
  filter(is.na(tmp)) %>%
  ungroup() %>%
  select(user_id, id, completed_at) %>%
  mutate(order_type = 'acquisition')

And then create the repeat orders separately.
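
A minimal end-to-end sketch of that split follows. It is an illustration under stated assumptions, not the answerer's tested code: an in-memory SQLite database (via the DBI and RSQLite packages) stands in for the real connection, the rows are copied from the query output in the question, and the names con, orders_local, with_prev, prev_completed_at, df_repeat and result are hypothetical. It uses lag(), so the order with no predecessor is each user's first order.

```r
library(dplyr)
library(dbplyr)

# Assumption: in-memory SQLite as a stand-in for the real connection `db`
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# A few rows taken from the query output in the question
orders_local <- data.frame(
  id = c(58051L, 78173L, 79585L, 105261L),
  user_id = c(68L, 7173L, 7173L, 7173L),
  completed_at = c("2019-02-02 09:45:59", "2019-03-28 08:30:16",
                   "2019-04-15 21:59:51", "2019-07-15 13:51:44"),
  stringsAsFactors = FALSE
)
orders <- copy_to(con, orders_local, name = "orders")

# Timestamp of the previous order per user; NULL for the user's first order
with_prev <- orders %>%
  group_by(user_id) %>%
  mutate(prev_completed_at = lag(completed_at, 1, order_by = completed_at)) %>%
  ungroup()

# First order per user: no previous order exists
df_acquisition <- with_prev %>%
  filter(is.na(prev_completed_at)) %>%
  select(user_id, id, completed_at) %>%
  mutate(order_type = "acquisition")

# Every later order: a previous order exists
df_repeat <- with_prev %>%
  filter(!is.na(prev_completed_at)) %>%
  select(user_id, id, completed_at) %>%
  mutate(order_type = "repeat")

# Stack both sets inside the database, then pull the result into R
result <- union_all(df_acquisition, df_repeat) %>%
  arrange(user_id, completed_at) %>%
  collect()

print(result)
DBI::dbDisconnect(con)
```

Because both branches are built from the same lazy table, nothing is executed until collect() is called, so the labelling runs in the database rather than in R.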
