清理代码并防止null值崩溃导致read.csv.sql [英] Clean up code and keep null values from crashing read.csv.sql

查看:235
本文介绍了清理代码并防止null值崩溃导致read.csv.sql的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在使用 read.csv.sql 有条件地读取数据(我的数据集非常大,所以这是我选择对其进行过滤并将其减少的解决方案读取数据之前的大小)。通过读取完整数据然后对其进行过滤,我遇到了内存问题,这就是为什么使用条件读取以便读取子集而不是完整数据集很重要的原因。

I am using read.csv.sql to conditionally read in data (my data set is extremely large so this was the solution I chose to filter it and reduce it in size prior to reading the data in). I was running into memory issues by reading in the full data and then filtering it so that is why it is important that I use the conditional read so that the subset is read in, versus the full data set.

这是一个小数据集,因此可以重现我的问题:

Here is a small data set so my problem can be reproduced:

write.csv(iris, "iris.csv", row.names = F)
library(sqldf)
csvFile <- "iris.csv"

我发现使用 read.csv.sql 必须使用的符号极其尴尬,以下是操作方法我正在阅读文件:

I am finding that the notation you have to use is extremely awkward using read.csv.sql the following is the how I am reading in the file:

# Step 1 (Assume these values are coming from UI)
spec <- 'setosa'
petwd <- 0.2

# Add quotes and make comma-separated:
spec <- toString(sprintf("'%s'", spec)) 
petwd <- toString(sprintf("'%s'", petwd)) 

# Step 2 - Conditionally read in the data, store in 'd'
d <- fn$read.csv.sql(csvFile, sql='select * from file where 
                                  "Species" in ($spec)'
                                  and "Petal.Width" in ($petwd)',
                     filter = list('gawk -f prog', prog = '{ gsub(/"/, ""); print }'))

我的 main 问题是(从UI)上面的值为null,则它将无法正确读入数据,因为这段代码都是硬编码的。

我想将其更改为:步骤1-检查哪些值为null且不过滤掉它们,然后使用 read.csv.sql 过滤对应列上的所有非空值。

My main problem is that if any of the values above (from UI) are null then it won't read in the data properly, because this chunk of code is all hard coded.
I would like to change this into: Step 1 - check which values are null and do not filter off of them, then filter using read.csv.sql for all non-null values on corresponding columns.

注意::我正在重用此类似的问题

更新

我想清除根据我的要求。这是我要尝试的操作:

UPDATE
I want to clear up what I am asking. This is what I am trying to do:

如果某个字段(例如 spec )通过 NA (表示用户未选择输入),然后我希望它进行过滤(默认为 spec == EVERY SPEC ):

If a field, say spec comes through as NA (meaning the user did not pick input) then I want it to filter as such (default to spec == EVERY SPEC):

# Step 2 - Conditionally read in the data, store in 'd'
d <- fn$read.csv.sql(csvFile, sql='select * from file where 
                                  "Petal.Width" in ($petwd)',
                     filter = list('gawk -f prog', prog = '{ gsub(/"/, ""); print }'))

由于如果您尝试过滤/读取匹配 spec == NA NA c>它将读取空数据集,因为我的数据中没有NA值,因此破坏了代码和程序。希望这可以清除更多内容。

Since spec is NA, if you try to filter/read in a file matching spec == NA it will read in an empty data set since there are no NA values in my data, hence breaking the code and program. Hope this clears it up more.

推荐答案

有几个问题:


  • 问题w的链接中提供的一些简化

  • spec 是标量,因此人们只能使用'$ spec'

  • petwd 是数字标量,SQL不需要数字引号,因此只需使用 $ petwd

  • 该问题指出您要处理空字段,但不是要处理的字段,因此我们使用csvfix将其映射为-1并去除引号。 (或者让他们输入并在R中执行操作。空数字将通过0来输入,而空字符字段将以零长度字符字段来通过。)

  • 您可以使用[... ]代替SQL中的 ...

  • some of the simplifications provided in the link in the question were not followed.
  • spec is a scalar so one can just use '$spec'
  • petwd is a numeric scalar and SQL does not require quotes around numbers so just use $petwd
  • the question states you want to handle empty fields but not how so we have used csvfix to map them to -1 and also strip off quotes. (Alternately let them enter and do it in R. Empty numerics will come through as 0 and empty character fields will come through as zero length character fields.)
  • you can use [...] in place of "..." in SQL

下面的代码在使用bash shell的Windows和Ubuntu Linux中都对我有用

The code below worked for me in both Windows and Ubuntu Linux with the bash shell.

library(sqldf)

spec <- 'setosa'
petwd <- 0.2

d <- fn$read.csv.sql(
  "iris.csv", 
  sql = "select * from file where [Species] = '$spec' and [Petal.Width] = $petwd", 
  verbose = TRUE, 
  filter = 'csvfix map -smq -fv "" -tv -1'
)



更新



关于问题末尾的更新明确指出,NA可能位于 spec 中,而不是位于正在读取的数据中,并且如果 spec 是NA,则包含 spec 的条件应被视为TRUE。在那种情况下,只需扩展SQL where 条件即可,如下处理。

Update

Regarding the update at the end of the question it was clarified that the NA could be in spec as opposed to being in the data being read in and that if spec is NA then the condition involving spec should be regarded as TRUE. In that case just expand the SQL where condition to handle that as follows.

spec <- NA
petwd <- 0.2

d <- fn$read.csv.sql(
  "iris.csv", 
  sql = "select * from file 
         where ('$spec' == 'NA' or [Species] = '$spec') and [Petal.Width] = $petwd", 
  verbose = TRUE, 
  filter = 'csvfix echo -smq'
)

以上将返回 Petal.Width 为0.2的所有行。

The above will return all rows for which Petal.Width is 0.2 .

这篇关于清理代码并防止null值崩溃导致read.csv.sql的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆