Using 'fastmatch' package in R
Question
I have to find indices for 1MM numeric values within a vector of roughly 10MM values. I found the package fastmatch, but when I use the function fmatch(), it only returns the index of the first match.
Can someone help me use this function to find all values, not just the first? I realize this is a basic question, but online documentation is pretty sparse and fmatch has cut down the computing time considerably.
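To illustrate the first-match behaviour (a minimal sketch with a toy haystack/needles pair; fmatch() is documented as a drop-in replacement for base match(), so it returns one index per looked-up value rather than every occurrence):

library(fastmatch)
haystack <- c(10, 20, 10, 30, 10)
needles  <- c(10, 30)
fmatch(needles, haystack)      ## 1 4    -- only the first position of each value
which(haystack %in% needles)   ## 1 3 4 5 -- all positions, if that is what is needed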
Thanks very much!
Here is some sample data - for the purposes of this exercise, let's call this data frame A:
DateTime Address Type ID
1 2014-03-04 20:21:03 982076970 1 2752394
2 2014-03-04 20:21:07 98174238211 1 2752394
3 2014-03-04 20:21:08 76126162197 1 2752394
4 2014-03-04 20:21:16 6718053253 1 2752394
5 2014-03-04 20:21:17 98210219176 1 2752510
6 2014-03-04 20:21:20 7622877100 1 2752510
7 2014-03-04 20:21:23 2425126157 1 2752510
8 2014-03-04 20:21:23 2425126157 1 2752510
9 2014-03-04 20:21:25 701838650 1 2752394
10 2014-03-04 20:21:27 98210219176 1 2752394
What I wish to do is to find the number of unique Type values for each Address. There are several million rows of data with roughly 1MM unique Address values... on average, each Address appears about 6 times in the data set. And, though the Type values listed above are all 1, they can take any value from 0:5. I also realize the Address values are quite long, which adds to the time required for the matching.
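(To make the target output concrete: on the small sample A above, a base-R expression such as the one below gives the per-Address counts I am after; this is only an illustration of the desired result, not a fix for the speed problem at the 10MM scale.)

## number of distinct Type values per Address (illustration only)
counts <- tapply(A$Type, A$Address, function(x) length(unique(x)))
head(counts)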
I tried the following:
uvals <- unique(A$Address)
utypes <- matrix(0, length(uvals), 2)
utypes[, 1] <- uvals
for (i in 1:length(uvals)) {
  b <- which(A$Address %in% uvals[i])   ## rows belonging to this address
  c <- length(unique(A$Type[b]))        ## number of distinct Type values for it
  utypes[i, 2] <- c
}
However, the code above is not very efficient - if I am looping over 1MM values, I estimate this will take 10-15 hours.
I have tried this, as well, within the loop... but it is not considerably faster.
b <- which(A$Address == uvals[i])
I know there is a more elegant/faster way; I am fairly new to R and would appreciate any help.
Answer
This can be done using the unique function in data.table, followed by an aggregation. I'll illustrate it using more or less the sample data generated by @Chinmay:
set.seed(100L)
dat = data.frame(
  address = sample(1e6L, 1e7L, TRUE),
  value = sample(1:5, 1e7L, TRUE, prob=c(0.5, 0.3, 0.1, 0.07, 0.03))
)
data.table solution:
require(data.table) ## >= 1.9.2
dat.u = unique(setDT(dat), by=c("address", "value"))
ans = dat.u[, .N, by=address]
Explanation:
- The setDT function converts a data.frame to a data.table by reference (which is very fast). Calling unique on a data.table invokes the unique.data.table method, which is incredibly fast compared to base:::unique. Now we have only the unique value entries for every address.
- All that's left to do is to aggregate, or group by, address and get the number of observations in each group. The by=address part groups by address, and .N is a built-in data.table variable that provides the number of observations for that group. (A compact one-liner variant is sketched right after this list.)
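On newer data.table versions (around 1.9.6 or later; treat the exact version as an assumption), the two steps can be collapsed using uniqueN(). A minimal sketch, reusing dat from above:

require(data.table)
## count distinct `value` entries per address in a single grouping pass
ans.alt = setDT(dat)[, .(N = uniqueN(value)), by = address]

This should give the same counts as ans, just computed in one expression instead of a unique() followed by .N.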
Benchmarks:
I'll create functions to generate the data as a data.table and as a data.frame, to benchmark the data.table answer against the dplyr solution (a) proposed by @beginneR, although I don't see the need for arrange(.) there and will therefore skip that part.
## function to create data
foo <- function(type = "df") {
  set.seed(100L)
  dat = data.frame(
    address = sample(1e6L, 1e7L, TRUE),
    value = sample(1:5, 1e7L, TRUE, prob=c(0.5, 0.3, 0.1, 0.07, 0.03))
  )
  if (type == "dt") setDT(dat)
  dat
}

## DT function
dt_sol <- function(x) {
  unique(x, by=c("address", "value"))[, .N, by=address]
}

## dplyr function
dplyr_sol <- function(x) {
  distinct(x) %>% group_by(address) %>% summarise(N = n_distinct(value))
}
The timings reported here are three consecutive runs of system.time(.) on each function.
## benchmark timings in seconds
## pkg run-01 run-02 run-03 command
## data.table 2.4 2.3 2.4 system.time(ans1 <- dt_sol(foo("dt")))
## dplyr 15.3 16.3 15.7 system.time(ans2 <- dplyr_sol(foo()))
For some reason, dplyr automatically orders the result by the grouping variable. So, in order to compare the results, I'll also order the result from data.table:
system.time(setkey(ans1, address)) ## 0.102 seconds
identical(as.data.frame(ans1), as.data.frame(ans2)) ## TRUE
So, data.table is ~6x faster here.
Note that bit64:::integer64 is also supported in data.table - since you mention the address values are quite long, you can also store them as integer64.
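As a rough sketch of that idea (assuming the bit64 package is installed; the address strings are just the sample values from the question, and the two-step counting mirrors the data.table solution above):

require(bit64)
require(data.table)
## store long addresses as 64-bit integers rather than doubles or characters
adr = as.integer64(c("982076970", "98174238211", "76126162197", "982076970"))
dt  = data.table(address = adr, value = c(1L, 1L, 2L, 3L))
unique(dt, by = c("address", "value"))[, .N, by = address]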