Using 'fastmatch' package in R


Question


I have to find the indices of 1MM numeric values within a vector of roughly 10MM values. I found the package fastmatch, but when I use the function fmatch(), it only returns the index of the first match.


Can someone help me use this function to find all values, not just the first? I realize this is a basic question, but the online documentation is pretty sparse, and fmatch has cut down the computing time considerably.

Many thanks!


Here is some sample data - for the purposes of this exercise, let's call this data frame A:

              DateTime     Address       Type     ID
1  2014-03-04 20:21:03   982076970          1  2752394
2  2014-03-04 20:21:07 98174238211          1  2752394
3  2014-03-04 20:21:08 76126162197          1  2752394
4  2014-03-04 20:21:16  6718053253          1  2752394
5  2014-03-04 20:21:17 98210219176          1  2752510
6  2014-03-04 20:21:20  7622877100          1  2752510
7  2014-03-04 20:21:23  2425126157          1  2752510
8  2014-03-04 20:21:23  2425126157          1  2752510
9  2014-03-04 20:21:25   701838650          1  2752394
10 2014-03-04 20:21:27 98210219176          1  2752394

我想做的是找到每个Address的唯一Type值的数量.大约有几百万行数据具有大约1MM的唯一地址值...平均而言,每个地址在数据集中显示大约6次.而且,尽管上面列出的Type值都是1,但是它们可以取0:5的任何值.我还意识到Address值很长,这增加了匹配所需的时间.

What I wish to do is to find the number of unique Type values for each Address. There are several million rows of data with roughly 1MM unique Address values... on average, each Address appears about 6 times in the data set. And, though the Type values listed above are all 1, they can take any value from 0:5. I also realize the Address values are quite long, which adds to the time required for the matching.

I tried the following:

uvals <- unique(A$Address)
utypes <- matrix(0, length(uvals), 2)
utypes[, 1] <- uvals

for (i in 1:length(uvals)) {
    b <- which(A$Address %in% uvals[i])   # row indices for this address
    c <- length(unique(A$Type[b]))        # number of distinct Type values
    utypes[i, 2] <- c
}


However, the code above is not very efficient - if I am looping over 1MM values, I estimate this will take 10-15 hours.


I have tried this, as well, within the loop... but it is not considerably faster.

b <- which(A$Address == uvals[i])  


I know there is a more elegant/faster way. I am fairly new to R and would appreciate any help.

Answer


This can be done using the unique function in data.table, followed by an aggregation. I'll illustrate it using more or less the sample data generated by @Chinmay:

set.seed(100L)
dat = data.frame(
         address = sample(1e6L, 1e7L, TRUE), 
           value = sample(1:5, 1e7L, TRUE, prob=c(0.5, 0.3, 0.1, 0.07, 0.03))
      )

data.table solution:

require(data.table) ## >= 1.9.2
dat.u = unique(setDT(dat), by=c("address", "value"))
ans   = dat.u[, .N, by=address]

Explanation:

  • The setDT function converts a data.frame to a data.table by reference (which is very fast).
  • Calling unique on a data.table invokes the unique.data.table method, which is incredibly fast compared to base:::unique. After this step we have only the unique value entries for each address.
  • All that's left to do is to aggregate, or group by, address and count the observations in each group. The by=address part groups by address, and .N is a built-in data.table variable that gives the number of observations in each group (see the toy example below).



Benchmarks:

I'll create functions to generate the data as both data.table and data.frame, to benchmark the data.table answer against the dplyr solution (a) proposed by @beginneR, although I don't see the need for arrange(.) there and will therefore skip that part.

## function to create data
foo <- function(type = "df") {
    set.seed(100L)
    dat = data.frame(
             address = sample(1e6L, 1e7L, TRUE), 
               value = sample(1:5, 1e7L, TRUE, prob=c(0.5, 0.3, 0.1, 0.07, 0.03))
          )
    if (type == "dt") setDT(dat)
    dat
} 

## DT function
dt_sol <- function(x) {
    unique(x, by=c("address", "value"))[, .N, by=address]
}

## dplyr function
dplyr_sol <- function(x) {
    distinct(x) %>% group_by(address) %>% summarise(N = n_distinct(value))
}


The timings reported here are from three consecutive system.time(.) runs of each function.

## benchmark timings in seconds
##        pkg   run-01   run-02   run-03                                 command
## data.table     2.4       2.3      2.4  system.time(ans1 <- dt_sol(foo("dt")))
##      dplyr    15.3      16.3     15.7   system.time(ans2 <- dplyr_sol(foo()))


For some reason, dplyr automatically orders the result by the grouping variable. So in order to compare the results, I'll also order them in the result from data.table:

system.time(setkey(ans1, address)) ## 0.102 seconds
identical(as.data.frame(ans1), as.data.frame(ans2)) ## TRUE


So, data.table is ~6x faster here.


Note that bit64:::integer64 is also supported in data.table. Since you mention the address values are quite long, you can also store them as integer64.
