Count number of rows when using dplyr to access sql table/query
Question
What would be an efficient way to count the number of rows when using dplyr to access a SQL table? The MWE below uses SQLite, but I use PostgreSQL and have the same issue. Basically, dim() is not very consistent.
I used dim(). This works for a table that exists in the database schema (first case), but it is not consistent when I create a tbl from a SQL query against the same table (second case). My row counts are in the millions, but I see this even with as few as 1,000 rows: I get NA or ??. Is there anything I am missing?
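#MWE
library(dplyr)         # needed for src_sqlite(), copy_to() and tbl()
library(nycflights13)  # provides the flights data frame
# create (or open) a SQLite database file
test_db <- src_sqlite("test_db.sqlite3", create = TRUE)
# copy flights into the database as a permanent, indexed table
flights_sqlite <- copy_to(test_db, flights, temporary = FALSE, indexes = list(
  c("year", "month", "day"), "carrier", "tailnum"))
# reference the table by name (called flights_postgres to match the PostgreSQL output below)
flights_postgres <- tbl(test_db, "flights")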
First case (table from direct schema)
> flights_postgres
Source: postgres 9.3.5 []
From: flights [336,776 x 16]
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2 2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
#using dim()
> dim(flights_postgres)
[1] 336776 16
The above works and returns the number of rows.
Second case (table from SQL query)
## uses the flights table above; a raw SQL query like this can also create other variables (like lag, lead) at run time
flight_postgres_2 <- tbl(test_db, sql("SELECT * FROM flights"))
> flight_postgres_2
Source: postgres 9.3.5 []
From: <derived table> [?? x 16]
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
1 2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2 2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
>
> dim(flight_postgres_2)
[1] NA 16
As you can see, the row count prints as either ?? or NA, which is not very helpful.
I got around this by either using collect() or converting the output to a data frame with as.data.frame() and then checking the dimensions. But neither method is ideal, given the time it may take for a large number of rows.
Recommended answer
I think the answer is what @alistaire suggests: do it in the database.
> flight_postgres_2 %>% summarize(n())
Source: sqlite 3.8.6 [test_db.sqlite3]
From: <derived table> [?? x 1]
n()
(int)
1 336776
.. ...
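If the count is needed as a plain R value rather than a one-row remote table, one option is to pull the single column back after summarising. The following is only a minimal sketch under stated assumptions: it uses an in-memory SQLite database with the built-in mtcars data as a stand-in for the flights table, connects through DBI rather than src_sqlite(), and the names con, cars_q and n_rows are illustrative only. It assumes the dbplyr, DBI and RSQLite packages are installed.

```r
library(dplyr)  # the database backend also needs dbplyr, DBI and RSQLite installed

# Assumption: in-memory SQLite with mtcars stands in for the flights database
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars, "cars")

# A tbl built from a raw SQL query, so dim() cannot know its row count
cars_q <- tbl(con, sql("SELECT * FROM cars"))

# Count inside the database, then bring back only the single number
n_rows <- cars_q %>%
  summarise(n = n()) %>%
  pull(n)

n_rows

DBI::dbDisconnect(con)
```

Only one row travels from the database to R here, so this avoids the cost of collect() on the full result.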
Asking dim to do this would be having your cake (lazy evaluation of SQL with dplyr, keeping data in the database) and eating it too (having full access to the data in R).
Note that this is doing @alistaire's approach underneath:
> flight_postgres_2 %>% summarize(n()) %>% explain()
<SQL>
SELECT "n()"
FROM (SELECT COUNT() AS "n()"
FROM (SELECT * FROM flights) AS "zzz11") AS "zzz13"
<PLAN>
selectid order from detail
1 0 0 0 SCAN TABLE flights USING COVERING INDEX flights_year_month_day
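The generated SQL above simply wraps the original query in a COUNT over a subquery, so the same count can also be issued by hand. This is a hedged sketch, not part of the original answer: it assumes a DBI connection to an in-memory SQLite database holding mtcars as a stand-in table, and the names con, query and n_rows are hypothetical.

```r
library(DBI)  # assumes the RSQLite package is installed as the driver

# Assumption: in-memory SQLite with mtcars stands in for the flights database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "cars", mtcars)

# Wrap the original query as a subquery and count its rows in the database
query <- "SELECT * FROM cars"
count_sql <- paste0("SELECT COUNT(*) AS n FROM (", query, ") AS q")
n_rows <- dbGetQuery(con, count_sql)$n

n_rows

dbDisconnect(con)
```

Either way the database does the counting and returns a single row, which is why it scales to tables with millions of rows.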