创建连续的整数,然后使用dplyr为存储在sqlserver中的表创建索引 [英] create consecutive integer and then create index to a table stored in sqlserver using dplyr

查看:219
本文介绍了创建连续的整数,然后使用dplyr为存储在sqlserver中的表创建索引的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在对存储在sqlserver中的一些大型表进行一些数据处理,创建索引有时会减少某些R脚本运行所需的时间。我尝试使用 dplyr mutate 函数来创建一个新列( idx )连续编号,然后使用 idx 列作为索引。但mutate函数似乎不起作用并且经常给我这个错误:

I am doing some data processing of some large tables stored in sqlserver that creating an index sometimes reduce the time needed for some R script to run. I try to use the mutate function of dplyr to create a new column (idx) with consecutive number, then use that idxcolumn as index. But the mutate function seems not working and constantly give me this error:

> tbl(channel,'tbl_iris') %>% mutate(idx=1:n())
Error in from:to : NA/NaN argument
In addition: Warning message:
In 1:n() : NAs introduced by coercion\

现在我正在做一些似乎这样对我来说非常愚蠢,绕过上面的错误信息:

Right now I am doing something that seems quite stupid to me like this, to "bypass" the above error message:

iris <- tbl(channel,'tbl_iris') %>% 
  collect %>%
  mutate(idx=1:n())

try(db_drop_table(channel,'##iris'))
copy_to(channel,iris,'##iris',temporary=FALSE)
db_create_index(channel,'##iris',columns='idx')

有没有更好的方法呢?谢谢!

Is there any better way of doing this? Thanks!

我试过 mutate(idx = row_number()) 根据@Phil的建议,它无效并显示以下错误消息:

I tried mutate(idx = row_number()) as suggested by @Phil, it is not working and show the following error message:

> tbl(channel,'##iris') %>%
+   mutate(idx=row_number())
Error: <SQL> 'SELECT  TOP 10 "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species", row_number() OVER () AS "idx"
FROM "##iris"'
  nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC SQL Server Driver][SQL Server]The function 'row_number' must have an OVER clause with ORDER BY. 
> tbl(channel,'##iris') %>%
+   arrange(Species) %>%
+   mutate(idx=row_number())
Error: <SQL> 'SELECT  TOP 10 "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species", row_number() OVER (ORDER BY "Species") AS "idx"
FROM (SELECT *
FROM "##iris"
ORDER BY "Species") "kwtundzona"'
  nanodbc/nanodbc.cpp:1587: 42000: [Microsoft][ODBC SQL Server Driver][SQL Server]The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified. 



更新02



我试过的方式由@Moody_Mudskipper建议,似乎有效

Update 02

I tried the way suggested by @Moody_Mudskipper, and it seems working

> try(db_drop_table(channel,'##iris'))
[1] 0
> copy_to(channel,iris,'##iris',temporary=FALSE)
> tbl(channel,'##iris') %>% head(.,1)
# Source:   lazy query [?? x 5]
# Database: Microsoft SQL Server 11.00.6251[dbo@WCDCHCMS9999\CMSAH_DC7_999/data_xx_yyy]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <chr>  
1         5.10        3.50         1.40       0.200 setosa 
> 
> DBI::dbSendQuery(channel,"ALTER TABLE ##iris ADD idx INT IDENTITY(1,1) NOT NULL")
<OdbcResult>
  SQL  ALTER TABLE ##iris ADD idx INT IDENTITY(1,1) NOT NULL
  ROWS Fetched: 0 [complete]
       Changed: 0
> db_create_index(channel,'##iris',columns='idx')
[1] 0
Warning message:
In new_result(connection@ptr, statement) : Cancelling previous query
> tbl(channel,'##iris') %>% head(.,5)
# Source:   lazy query [?? x 6]
# Database: Microsoft SQL Server 11.00.6251[dbo@WCDCHCMS9999\CMSAH_DC7_999/data_xx_yyy]
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species   idx
         <dbl>       <dbl>        <dbl>       <dbl> <chr>   <int>
1         5.10        3.50         1.40       0.200 setosa      1
2         4.90        3.00         1.40       0.200 setosa      2
3         4.70        3.20         1.30       0.200 setosa      3
4         4.60        3.10         1.50       0.200 setosa      4
5         5.00        3.60         1.40       0.200 setosa      5

我将尝试修改我的脚本,看看这是否会提供类似的性能提升与我以前更愚蠢的方法相比。

I will try to modify my script to see if this gives similar performance boost when compared to my previous more silly method.

除了下面显示的错误信息外,我希望事情按计划进行。

Other than the error message shown as below, I hope things are working as planned.

Warning message:
In new_result(connection@ptr, statement) : Cancelling previous query


推荐答案

据我所知,您无法使用 dbplyr 向服务器端的现有表添加列,但是对于这样的简单查询就像使用 DBI :: dbSendQuery 一样容易获得所需的效果。以下行将创建一个id列:

To my knowledge you can't add a column to an existing table on server side with dbplyr but for a simple query like this it is just as easy to use DBI::dbSendQuery for the desired effect. The following line will create an id column:

DBI::dbSendQuery(channel, "ALTER TABLE tbl_iris ADD ID INT IDENTITY(1,1) NOT NULL")

然后你可以使用创建索引dplyr :: db_create_index 或发送另一个查询:

Then you can create the index by using dplyr::db_create_index or send another query :

 DBI::dbSendQuery(channel, "CREATE INDEX id ON tbl_iris (id);")

这篇关于创建连续的整数,然后使用dplyr为存储在sqlserver中的表创建索引的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆