Count number of rows when using dplyr to access an SQL table/query


Question

What is an efficient way to count the number of rows when using dplyr to access an SQL table? The MWE below uses SQLite, but I use PostgreSQL and have the same issue. Basically, dim() is not very consistent.

I used dim(). This works for a table read directly from the database schema (first case), but not when I create a tbl from an SQL query against the same table (second case). My row count is in the millions, but I see the same behaviour with as few as 1,000 rows: I get NA or ??. Is there anything I am missing?

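#MWE
library(dplyr)          # provides src_sqlite(), copy_to(), tbl()
library(nycflights13)   # provides the flights data frame

# Create (or open) the SQLite database file
test_db <- src_sqlite("test_db.sqlite3", create = TRUE)

# Copy flights into the database as a permanent, indexed table
flights_sqlite <- copy_to(test_db, flights, temporary = FALSE, indexes = list(
  c("year", "month", "day"), "carrier", "tailnum"))

# Reference the table by name
flights_postgres <- tbl(test_db, "flights")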

First case (table read directly from the schema)

 > flights_postgres
 Source: postgres 9.3.5 []
 From: flights [336,776 x 16]

   year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight    origin dest air_time distance hour minute
  1  2013     1   1      517         2      830        11      UA  N14228   1545    EWR  IAH      227     1400    5     17
  2  2013     1   1      533         4      850        20      UA  N24211   1714    LGA  IAH      227     1416    5     33

#using dim()
> dim(flights_postgres)
[1] 336776     16

The above works and returns the number of rows.

Second case (table from SQL query)

 ## same flights table as above, but a query like this can also create other variables (e.g. lag, lead) at run time
 flight_postgres_2 <- tbl(test_db, sql("SELECT * FROM flights"))

  >flight_postgres_2
 Source: postgres 9.3.5 []
 From: <derived table> [?? x 16]

  year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight     origin dest air_time distance hour minute
   1  2013     1   1      517         2      830        11      UA  N14228   1545    EWR  IAH      227     1400    5     17
   2  2013     1   1      533         4      850        20      UA  N24211   1714    LGA  IAH      227     1416    5     33

> dim(flight_postgres_2)
[1] NA 16

As you can see, the row count prints as either ?? or NA, which is not very helpful.

I got around this by using collect(), or by converting the output to a data frame with as.data.frame(), and then checking the dimensions. But neither method is ideal, given how long it can take with a large number of rows.

Recommended answer

I think the answer is what @alistaire suggests: do it in the database.

> flight_postgres_2 %>% summarize(n())
Source: sqlite 3.8.6 [test_db.sqlite3]
From: <derived table> [?? x 1]

      n()
    (int)
1  336776
..    ...

Asking dim() to do this would be having your cake (lazy evaluation of SQL with dplyr, keeping the data in the database) and eating it too (having full access to the data in R).
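If the count is needed as a plain R number rather than a lazy one-row table, one option is to let the database do the counting and pull only the result into R. The following is a minimal sketch, not the original poster's setup: it assumes a recent dplyr with dbplyr, DBI and RSQLite installed, and uses an in-memory SQLite database with the built-in mtcars data as a stand-in. The names con, cars_tbl, cars_query and n_rows are made up for the illustration.

```r
library(dplyr)   # dbplyr must also be installed for database backends

# Assumption: an in-memory SQLite database stands in for the real one
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars, "cars_tbl")

# A tbl built from an SQL query: its row count is not known up front
cars_query <- tbl(con, sql("SELECT * FROM cars_tbl"))

# The count runs in the database; only the one-row result reaches R
n_rows <- cars_query %>%
  summarise(n = n()) %>%
  pull(n)

n_rows   # 32 for mtcars

DBI::dbDisconnect(con)
```

Only a single value crosses from the database to R, so this should avoid the cost of collect() on millions of rows.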

Note that this uses @alistaire's approach underneath:

> flight_postgres_2 %>% summarize(n()) %>% explain()
<SQL>
SELECT "n()"
FROM (SELECT COUNT() AS "n()"
FROM (SELECT * FROM flights) AS "zzz11") AS "zzz13"


<PLAN>
  selectid order from                                                         detail
1        0     0    0 SCAN TABLE flights USING COVERING INDEX flights_year_month_day
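The generated SQL wraps the original query in a subquery and counts its rows. The same idea can be written by hand and sent through DBI. This is only a sketch under assumptions: it uses an in-memory SQLite database and the built-in mtcars data rather than the flights table, it writes the standard COUNT(*) form, and con, cars_tbl, res and n_rows are hypothetical names.

```r
# Assumption: an in-memory SQLite database stands in for the real one
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "cars_tbl", mtcars)

# Wrap the original query as a subquery and count its rows in the database
res <- DBI::dbGetQuery(
  con,
  "SELECT COUNT(*) AS n FROM (SELECT * FROM cars_tbl) AS q"
)

# dbGetQuery() returns a one-row data frame; extract the count
n_rows <- res$n[1]

DBI::dbDisconnect(con)
```

For a real query, the inner SELECT would be replaced by the query whose rows need counting.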
