重写“模糊连接"从 R 到 SQL 的函数 [英] Re-Writing "Fuzzy Join" Functions from R to SQL

查看:14
本文介绍了重写“模糊连接"从 R 到 SQL 的函数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在 R 编程语言中,我对执行模糊连接"感兴趣并通过 SQL 连接传递它:

In the R programming language, I am interested in performing a "fuzzy join" and passing this through a SQL Connection:

library(fuzzyjoin)
library(dplyr)

library(RODBC)
library(sqldf)

con = odbcConnect("some name", uid = "some id", pwd = "abc")

sample_query = sqlQuery( stringdist_inner_join(table_1, table_2, by = "id2", max_dist = 2) %>%
  filter(date_1 >= date_2, date_1 <= date_3) )

view(sample_query)

然而,我认为这是不可能的,因为我们用于模糊连接"的函数没有(stringdist_inner_join) 不受 Netezza 支持.

However, I do not think this is possible, because the function which us being used for the "fuzzy join" (stringdist_inner_join) is not supported by Netezza.

我试图找到这个模糊连接"的源代码;函数,并在这里找到它:https://rdrr.io/cran/fuzzyjoin/src/R/stringdist_join.R

I tried to find the source code for this "fuzzy join" function, and found it over here: https://rdrr.io/cran/fuzzyjoin/src/R/stringdist_join.R

我的问题:有谁知道是否可以(手动)转换这个模糊连接"?函数转换为 Netezza 可识别的 SQL 格式?有什么快速的方法可以重写这个函数 (stringdist_inner_join) 以便 Netezza 可以识别它?是否有任何预先存在的方法可以做到这一点?

My Question: Does anyone know if it is possible to (manually) convert this "fuzzy join" function into an SQL format that will be recognized by Netezza? Are there any quick ways to re-write this function (stringdist_inner_join) such that it can be recognized by Netezza? Are there any pre-existing ways to do this?

现在我只能执行sample_query";在本地 - 重写此函数(stringdist_inner_join)将让执行sample_query";快得多.

Right now I can only execute "sample_query" on locally - re-writing this function (stringdist_inner_join) would let perform the "sample_query" much faster.

有人知道这是否可能吗?

Does anyone know if this is possible?

注意:

我的数据如下所示:

table_1 = data.frame(id1 = c("123 A", "123BB", "12 5", "12--5"), id2 = c("11", "12", "14", "13"),
date_1 = c("2010-01-31","2010-01-31", "2015-01-31", "2018-01-31" ))

table_1$id1 = as.factor(table_1$id1)
table_1$id2 = as.factor(table_1$id2)
table_1$date_1 = as.factor(table_1$date_1)

table_2 = data.frame(id1 = c("0123", "1233", "125  .", "125_"), id2 = c("111", "112", "14", "113"),
date_2 = c("2009-01-31","2010-01-31", "2010-01-31", "2010-01-31" ),
date_3 = c("2011-01-31","2010-01-31", "2020-01-31", "2020-01-31" ))


table_2$id1 = as.factor(table_2$id1)
table_2$id2 = as.factor(table_2$id2)
table_2$date_2 = as.factor(table_2$date_2)
table_2$date_3 = as.factor(table_2$date_3)

推荐答案

根据您关于此问题的其他帖子,解决了如何构建 SQL 查询的问题的解决方案:

Based on your other post about this issue, a solution to the question of how to structure the SQL query was solved:

SAS:模糊连接

select a.*, b.* 
from table_a a
inner join table_b b
  on (a.date_1 between b.date_2 and b.date_3)
      and (le_dst(a.id1, b.id1) = 1 or a.id2 = b.id2)

为了让它在 R 脚本中运行,我建议使用 dbplyr 并使用 tbl 这样您就可以继续对其进行基本操作,就好像它是一个 data.frame 一样,dbplyr 会将其转换为 SQL(至少是基本命令),然后将所有内容组合到一个查询中并最终从中提取数据带有 collect() 函数的查询.

To get this to run in an R script, I would recommend using dbplyr and creating this using tbl so you can continue doing basic manipulation of it as if it were a data.frame and dbplyr will translate it into SQL (at least basic commands), then combine everything into a query and eventually pull the data from the query with the collect() function.

请注意,tbl 命令将开始构建 SQL 语句并获取列名,但在您输入 collect()<之前它不会运行它来提取数据/code> 此时,R 将查询发送到服务器,服务器将运行查询并发送数据.

Just a note, the tbl command will start building a SQL statement and get column names, but it won't run it to pull data until you enter collect() at which point, R will send the query to the server, the server will run the query and send the data.

请记住这一点,因为如果 dbplyr 无法将某些内容转换为 SQL,它会假定它是一个 SQL 命令并尝试发送它,因此您在尝试收集之前不会知道有错误.例如,stringr 包中的函数 str_dect 未在 dbplyr 中实现,因此 dbplyr 将将该命令发送到数据库,它会抛出错误,因为它不知道那是什么,但只有在运行 collect() 之后才会抛出错误.查看上面链接的 dbplyr 页面了解详情.

Just keep this in mind because if dbplyr can't translate something to SQL, it will assume it's a SQL command and try to send it, so you won't know there's an error until you try to collect. For example, a function from the stringr package, str_dectect, isn't implemented in dbplyr and so dbplyr would send that command to the database, which would throw an error because it doesn't know what that is, but only after running collect(). Check out the dbplyr page linked above for details.


library(dbplyr)

new_con<- dbConnect(
  odbc(),
  Driver= "ODBC Driver 17 for SQL Server (as an example)",
  Server = "Server name here",
  uid = "some_id",
  pwd = "abc"
)


sample_query<- dbplyr::tbl(
  new_con,
  dbplyr::sql(
  "select a.*, b.* 
    from table_a a
    inner join table_b b
      on (a.date_1 between b.date_2 and b.date_3)"

sample_data<-sample_query %>%
 filter(silly_example==TRUE) %>%
 collect()  

这篇关于重写“模糊连接"从 R 到 SQL 的函数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆