dplyr package: How can I query a large data frame using SQL-style like '%xyz%' syntax?
dplyr is the only package that can handle my 843k-row data frame and query it quickly. I can filter fine on numeric and equality conditions, but I also need to implement a pattern search on text.
I need something like this sqldf query:
library(sqldf)
head(iris)
sqldf("select * from iris where lower(Species) like '%nica%'")
I could not find how to do this in the dplyr help. I am looking for something like:
filter(iris,Species like '%something%')
The leading and trailing % are very important. Also note that the data frame has 800k+ rows, so traditional R functions may run slowly. It has to be a dplyr-based solution.
What about this:
library(dplyr)
data(iris)
filter(iris, grepl("nica",Species))
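As a rough guide to how the SQL pattern maps onto grepl(): an unanchored regular expression already behaves like '%nica%', and the anchors ^ and $ stand in for a missing leading or trailing %. The sketch below assumes the built-in iris data and simple literal patterns; the variable names are only illustrative.

```r
library(dplyr)

# SQL: lower(Species) like '%nica%'  -> substring match, case-insensitive
contains_nica <- filter(iris, grepl("nica", Species, ignore.case = TRUE))

# SQL: Species like 'vir%'  -> anchor the pattern at the start with ^
starts_vir <- filter(iris, grepl("^vir", Species))

# SQL: Species like '%color'  -> anchor the pattern at the end with $
ends_color <- filter(iris, grepl("color$", Species))

# fixed = TRUE treats the pattern as a literal string instead of a regex,
# which is useful when the search text contains characters such as . or (
literal_nica <- filter(iris, grepl("nica", Species, fixed = TRUE))
```

Note that fixed = TRUE and ignore.case = TRUE should not be combined; for a case-insensitive literal match, one option is to lower-case the column first with tolower().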
EDIT: Another option is the %like% function from the data.table package:
library(dplyr)
data(iris)
## Build a 750,000-row test set by repeating each row of iris 5000 times
Iris <- iris[
rep(seq_len(nrow(iris)),each=5000),
]
dim(Iris)
## [1] 750000 5
##
library(microbenchmark)
library(data.table)
## Keyed data.table copy of the same data
Dt <- data.table(Iris)
setkeyv(Dt,cols="Species")
## dplyr::filter with base grepl
foo <- function(){
subI <- filter(Iris, grepl("nica",Species))
}
## data.table subsetting with %like%
foo2 <- function(){
subI <- Dt[Species %like% "nica"]
}
## dplyr::filter with data.table's %like%
foo3 <- function(){
subI <- filter(Iris, Species %like% "nica")
}
Res <- microbenchmark(
foo(),foo2(),foo3(),
times=100L)
##
> Res
Unit: milliseconds
   expr       min        lq    median        uq      max neval
  foo() 114.31080 122.12303 131.15523 136.33254 214.0405   100
 foo2()  23.00508  30.33685  39.77843  41.49121 129.9125   100
 foo3()  18.84933  22.47958  29.39228  35.96649 114.4389   100
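One plausible reading of these timings is that Species is a factor with only three levels, so a matcher that tests each level once and then maps the result back to the rows does far less regex work than one that tests all 750,000 values. The sketch below illustrates that idea in base R; like_levels is a hypothetical helper written for this example, and whether it accounts for the measured difference is an assumption rather than something the answer states.

```r
library(dplyr)

# Hypothetical helper: run the regex once per factor level,
# then expand the per-level result to one logical value per row.
like_levels <- function(x, pattern) {
  stopifnot(is.factor(x))
  hit <- grepl(pattern, levels(x))  # one test per distinct level
  hit[as.integer(x)]                # look up each row's level code
}

# Same 750,000-row test data as in the benchmark above
Iris <- iris[rep(seq_len(nrow(iris)), each = 5000), ]

subI <- filter(Iris, like_levels(Species, "nica"))
```

This only helps when the column is a factor (or is converted to one) and has few distinct values relative to the number of rows; for a character column with mostly unique strings, plain grepl() does the same amount of work.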