如何在dplyr中制作left_join(...,copy = TRUE)而不授予INSERT特权 [英] How to make a left_join(..., copy=TRUE) in dplyr without granting INSERT privileges

查看:405
本文介绍了如何在dplyr中制作left_join(...,copy = TRUE)而不授予INSERT特权的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我们需要将数据库表与excel文件的内容连接起来.使用dplyr left_join很简单,但是由于连接的数据不是来自同一源,因此需要在联接中设置copy=TRUE.反过来,这意味着该代码仅在数据库用户具有INSERT特权时才运行,以便left_join可以在/tmp文件夹中创建临时表.

We need to join a database table with the content of an excel file. This is straightforward using a dplyr left_join, but requires that copy=TRUE is set in the join because the data do not come from the same source. This in turn means that the code will only run when the database user has INSERT privileges so that left_join can create temporary tables in the /tmp folder.

是否可以在不授予INSERT特权的情况下执行此copy=TRUE left_join?访问数据进行分析的数据库用户实际上应该只是读者.

Is there any way to do this copy=TRUE left_join without granting INSERT privileges? The database user accessing the data for analysis really should be a reader only.

如果未授予插入特权,则错误如下:

If insert privileges are not granted an error like:

Error in .local(conn, statement, ...) : 
  could not run statement: INSERT command denied to user 
  'reader'@'192.168.135.1' for table 'utiexriryc'

将出现(每个连接具有不同的随机表名称).

will appear (with a different random table name on each connect).

一个简单的可复制示例很困难,因为它需要数据库连接(在这种情况下为mysql wti src_mysql())和excel文件(在这种情况下为readxl读取).联接看起来像:

A short reproducible example is difficult as it requires a database connection (in this case to mysql wtih src_mysql()) and an excel file (in this case read with readxl). The join looks like:

df.biozones <- db.sites %>% 
  left_join(ef.join_site_ids, by=c("site_id"="id"), copy=TRUE) %>% 
  collect()

其中db.sites是mysql数据库表,而ef.join_site_ids是用readxl读取的excel电子表格中的data.frame.

Where db.sites is a mysql database table and ef.join_site_ids an data.frame from an excel spreadsheet read with readxl.

是否可以在dplyr中执行上述操作,并避免为读取数据的用户授予INSERT特权?

Is there any way to do the above in dplyr and avoid granting INSERT privileges for the user reading the data?

推荐答案

尽管没有办法直接完成我所问的问题,就像@hadley所说的那样,但是有另一种方法来解决问题.我花了一些时间来理解的是dplyr将每个连接都视为差异源(或简称为src). src始终存在,它是内存中对象的集合.

Although there is no way to do what I asked directly, as @hadley said, there is a different way to look at the problem. What took me some time to understand is that dplyr treats each connection as a difference source (or src for short). There is on src that is always there which is the collection of objects in memory.

所以在我的问题中,我实际上是在处理三个来源:

So in my question I was actually dealing with three sources:

  1. 内存中的对象
  2. 数据库_1
  3. 数据库_2

这三个中的任何一个之间的任何联接都将需要dplyr的权限来创建临时表并将其插入其中.这三个部分中的任何一个都不会成为问题.因此,当您无法获得数据库的其他权限时,有(至少)三种可能的方法来解决此问题.

Any join between either of these three will require the permissions for dplyr to make temporary tables and insert into them. Any join within any of these three will not be an issue. So there are (as least) three possible ways to deal with this when you cannot get additional permissions to the databases.

1.将所有内容加载到内存中

如果您将两个数据库中所需的所有对象都存储到内存中,则可以将它们带到共享的src中,并且可以对它们执行任何连接. Dplyr提供了collect()函数来实现:

If you source all the objects you need from both databases into memory, then you will bring them into a shared src and will be able to do any joins on them. Dplyr offers the collect() function to do so:

db_table_of_interest <- tbl(Database_1, "table_of_interest")
df_table_of_interest <- collect(db_table_of_interest).

实际上,您会将表转换为本地数据框.您将需要对所有感兴趣的数据库中的所有感兴趣的表执行此操作.只要您有足够的RAM来容纳正在读取的所有数据,它就可以正常工作.

You will in fact turn the table into a local dataframe. You will need to do this for all tables of interest, in all databases of interest. It will work fine as long as you have enough RAM to harbour all the data you are reading.

2.在数据库之间复制标识符

如果(在我们的情况下)如果您需要在两个具有相同ID的相同站点的数据库之间加入记录标识符(例如站点或样本ID),那么最简单的方法是将两个表中的ID都加倍数据库.因此,与其使用外部工作表在其中列出每个数据库中每个站点的所有ID并设置匹配项,不如在每个数据库的site_table中添加一列,并引用另一列.

If (as was our case) you need to join the record identifiers (like site or sample ID's) between two databasas that have the same sites under different ID's, then the easiest way is to double the ID's in a table in both database. So instead of using an external sheet where you list all the ID's of each site in each database, and set up the matches, add one column in the site_table in each database with a reference to the other.

只要您不需要使用实际数据设置联接表,就可以了.例如,如果您需要读取按物种分类的站点和按环境分类的表,其中物种和环境数据存储在单独的数据库中,并且每个站点具有不同的ID.

As long as you do not need to set up joined tables with the actual data, then you will be fine. For example if you need to read in a sites-by-species and a sites-by-environment table where the species and environment data are stored in a separate database, and have different ID's for each site.

如果确实需要连接表,那么至少可以在执行collect()并读取内存中的所有内容之前在两个数据库中进行预选择.当有大量数据要读取时,这将节省您的内存需求.

If you do need to join tables, then at least you will be able to pre-select in both databases before doing a collect() and reading everything in memory. That will save you memory requirements when there is a lot of data to read.

3.将所有内容复制到一个分析数据库中

如果由于某种原因无法再提供更多的内存,那么从理论上讲,您可以先将所有数据复制到单个数据库中.我无法想象这在任何情况下都是可行的选择,除了只有几个感兴趣的表之外,这样就可以将它们合并到一个新的联合数据库中.也许在数据集非常大且不会更改的情况下.

If for some reason more memory is not possible, then in theory you could copy all the data into a single database first. I cannot imagine that this is a viable option in any situation other than that there are only a few tables of interest so that it would be feasible to merge them into a new joint database. Perhaps in cases where the data set is very large and does not change.

如果此列表中没有我提供的选项,我很高兴知道您可能还会找到其他解决方案.

If there is an option I have missed in this list, I would be happy to know what other solutions you may have found.

这篇关于如何在dplyr中制作left_join(...,copy = TRUE)而不授予INSERT特权的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆