处理分页的SQL查询结果 [英] Handling paginated SQL query results

查看:240
本文介绍了处理分页的SQL查询结果的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

在我的论文数据收集中,来源之一是外部管理的系统,该系统基于Web表单来提交 SQL查询.使用RRCurl,我实现了一个自动数据收集框架,在其中模拟了上述形式.当我限制结果数据集的大小时,一切工作都很好.但是,当我尝试超过100000条记录(在下面的代码中为RQ_SIZE)时,串联的我的代码-他们的系统"开始没有响应(挂起").

因此,我决定使用 SQL分页功能(LIMIT ... OFFSET ...)提交一系列请求,然后希望将分页结果合并为目标数据框架.但是,相应地更改代码后,我看到的输出只是一个分页进度字符(*),然后没有更多输出.如果您能帮助我确定意外行为的可能原因,我们将不胜感激.我无法提供可复制的示例,因为提取功能非常困难,更不用说数据了,但是我希望以下代码片段足以揭示问题(或者至少是解决问题的方向). /p>

# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
                DATA_SEP, ADD_SQL)
assign(dataName, srdaGetData()) # retrieve result
data <- get(dataName)
numRequests <- as.numeric(data) %/% RQ_SIZE + 1

# Now, we can request & retrieve data via SQL pagination
for (i in 1:numRequests) {

  # setup SQL pagination
  if (rq$where == '') rq$where <- '1=1'
  rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE*(i-1))

  # Submit data request
  srdaRequestData(queryURL, rq$select, rq$from, rq$where,
                  DATA_SEP, ADD_SQL)
  assign(dataName, srdaGetData()) # retrieve result
  data <- get(dataName)

  # some code

  # add current data frame to the list
  dfList <- c(dfList, data)
  if (DEBUG) message("*", appendLF = FALSE)
}

# merge all the result pages' data frames
data <- do.call("rbind", dfList)

# save current data frame to RDS file
saveRDS(data, rdataFile)

解决方案

我正在回答我自己的问题,因为最后,我弄清楚了问题的真正来源是什么.我的调查显示,程序的意外等待状态是由于PostgreSQL被格式错误的SQL查询(其中包含多个LIMITOFFSET关键字)所迷惑.

原因非常简单:我在for循环的内部和外部都使用了rq$where,这使得paste()将先前迭代的WHERE子句与当前子句连接在一起.我已经通过处理WHERE子句的内容并将其保存在循环之前,然后安全地在循环的每次迭代中使用保存的值来固定代码,因为它变得独立于原始的WHERE子句的值了. /p>

此调查还帮助我修复了代码中的其他一些缺陷并进行了改进(例如使用子选择来正确处理SQL查询,该查询返回具有聚合函数的查询的记录数).故事的道德:您在软件开发中永远不能太谨慎. 非常感谢帮助这个问题的好人.

For my dissertation data collection, one of the sources is an externally-managed system, which is based on Web form for submitting SQL queries. Using R and RCurl, I have implemented an automated data collection framework, where I simulate the above-mentioned form. Everything worked well while I was limiting the size of the resulting dataset. But, when I tried to go over 100000 records (RQ_SIZE in the code below), the tandem "my code - their system" started being unresponsive ("hanging").

So, I have decided to use SQL pagination feature (LIMIT ... OFFSET ...) to submit a series of requests, hoping then to combine the paginated results into a target data frame. However, after changing my code accordingly, the output that I see is only one pagination progress character (*) and then no more output. I'd appreciate, if you could help me identify the probable cause of the unexpected behavior. I cannot provide reproducible example, as it's very difficult to extract the functionality, not to mention the data, but I hope that the following code snippet would be enough to reveal the issue (or, at least, a direction toward the problem).

# First, retrieve total number of rows for the request
srdaRequestData(queryURL, "COUNT(*)", rq$from, rq$where,
                DATA_SEP, ADD_SQL)
assign(dataName, srdaGetData()) # retrieve result
data <- get(dataName)
numRequests <- as.numeric(data) %/% RQ_SIZE + 1

# Now, we can request & retrieve data via SQL pagination
for (i in 1:numRequests) {

  # setup SQL pagination
  if (rq$where == '') rq$where <- '1=1'
  rq$where <- paste(rq$where, 'LIMIT', RQ_SIZE, 'OFFSET', RQ_SIZE*(i-1))

  # Submit data request
  srdaRequestData(queryURL, rq$select, rq$from, rq$where,
                  DATA_SEP, ADD_SQL)
  assign(dataName, srdaGetData()) # retrieve result
  data <- get(dataName)

  # some code

  # add current data frame to the list
  dfList <- c(dfList, data)
  if (DEBUG) message("*", appendLF = FALSE)
}

# merge all the result pages' data frames
data <- do.call("rbind", dfList)

# save current data frame to RDS file
saveRDS(data, rdataFile)

解决方案

I'm answering my own question, as, finally, I have figured out what has been the real source of the problem. My investigation revealed that the unexpected waiting state of the program was due to PostgreSQL becoming confused by malformed SQL queries, which contained multiple LIMIT and OFFSET keywords.

The reason of that is pretty simple: I used rq$where both outside and inside the for loop, which made paste() concatenate previous iteration's WHERE clause with the current one. I have fixed the code by processing contents of the WHERE clause and saving it before the loop and then using the saved value in each iteration of the loop safely, as it became independent from the value of the original WHERE clause.

This investigation also helped me to fix some other deficiencies in my code and make improvements (such as using sub-selects to properly handle SQL queries returning number of records for queries with aggregate functions). The moral of the story: you can never be too careful in software development. Big thank you to those nice people who helped with this question.

这篇关于处理分页的SQL查询结果的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆