Accelerate performance and speed of string match in R

Problem description

I have a performance issue I need help with. Please bear with me for the explanation:

I have a database of known Car Vin# and years (only first 4 lines of ~5,000 shown for ease):

>vinDB
>ToyotaCarola 2008
 IJDINJNDJIJKNDJIMKDK0897
 NissanAltima 1998
 LJIODJJNJDJNJDNJNJDJ7765

I also have a .txt document that shows a unique DMV ID, a VIN number, and a reference number in the following way (only 8 lines of ~55 million shown for ease):

>carFile
>#DMVcorrNumber33:1245638:563892:6378
 IJDINJNDJIJKNDJIMKDK0897
 +
 VIN#IDref6388546
 #DMVcorrNumber33:1245638:563892:6378
 LJIODJJNJDJNJDNJNJDJ7765
 +
 VIN#IDref2453663

What I would like to do is scan every second line (the VIN#) from my 'vinDB' file against every fourth line (starting with line two) of my 'carFile' file for a perfect match. If the match exists, I would like to output the name of the car, and how many times it is seen in the 'carFile' file.

So basically, I need this:

    Car          Year     NumTimesFound
ToyotaCarola     2008          238
NissanAltima     1998          1755

So far I have the following code, which works on a truncated 'carFile' file, but crashes my R program when I try it with all ~55 million lines:

VinCounter<-function(carFile, vinDB)

{
i=1   #index inner while loop
j=1   #index outer while loop
m=2   #index of vinDB, starts at '2' because first VIN# is on line 2
s=2   #index of carFile
count=0

while(j<=length(rownames(vinDB))/2)  # VIN# is on every 2nd line in vinDB file
{
  while(i<=length(rownames(carFile))/4)# VIN# is on every 4th line in carFile file
  {
    if(vinDB[m,1]==carFile[s,1])
      {
      count=count+1
      s=s+4
      }
    else
      {
      s=s+4
      }
    i=i+1
  }
 print(vinDB[m-1,1])
 print(count)
 count=0
 s=2
 i=1
 m=m+2
 j=j+1
 }  

}

So, basically, I would like to figure out how to:

1) Make the code above quicker and more efficient.

2) How to have my output be stored in a .txt or .csv file (because right now, it just shows me the output on the screen).

Thanks!

Recommended answer

You can do this relatively easily with data.table:

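# Split the interleaved rows: names sit on odd rows of vinDB, VINs on even rows,
# and VINs sit on every fourth row of carFile, starting at row 2.
vin.names <- vinDB[seq(1, nrow(vinDB), 2), ]
vin.vins <- vinDB[seq(2, nrow(vinDB), 2), ]
car.vins <- carFile[seq(2, nrow(carFile), 4), ]

library(data.table)
# Key the lookup table by VIN so the join below can use the key.
dt <- data.table(vin.names, vin.vins, key="vin.vins")
# Join the carFile VINs against dt, then count joined rows per car name.
dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]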
#         vin.names NumTimesFound
#  1:     Ford 2014            15
#  2: Chrysler 1998            10
#  3:       GM 1998             9
#  4:     Ford 1998            11
#  5:   Toyota 2000            12
# ---                            
# 75:   Toyota 2007             7
# 76: Chrysler 1995             4
# 77:   Toyota 2010             5
# 78:   Toyota 2008             1
# 79:       GM 1997             5    

The main thing to understand is that with J(car.vins) we are creating a one-column data.table of the VINs to match (J is just shorthand for data.table, so long as you use it within a data.table). By using that data.table inside dt, we are joining the list of VINs to the list of cars, because we keyed dt by "vin.vins" in the prior step. The last argument says to group the joined set by vin.names, and the middle argument says that we want the number of instances, .N, for each group (.N is a special data.table variable).
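
To make the join-and-count step concrete, here is a minimal base R sketch of the same idea that also writes the result to a .csv file, which was the second part of the question. It swaps the data.table join for match() and tabulate(); it is not the code from the answer. The three-car lookup table, the VIN strings, and the output file name are made-up values for illustration only.

```r
# Toy stand-ins for the vectors built in the answer (all values are made up).
vin.names <- c("ToyotaCarola 2008", "NissanAltima 1998", "FordFocus 2010")
vin.vins  <- c("AAA", "BBB", "CCC")
car.vins  <- c("AAA", "BBB", "AAA", "ZZZ", "AAA")

# For each carFile VIN, find the position of the matching vinDB VIN (NA if unknown).
hit <- match(car.vins, vin.vins)
hit <- hit[!is.na(hit)]

# Count how many times each vinDB position was hit.
counts <- tabulate(hit, nbins = length(vin.vins))

# Keep only cars that were actually seen, as the join in the answer does here.
result <- data.frame(Car = vin.names, NumTimesFound = counts,
                     stringsAsFactors = FALSE)
result <- result[result$NumTimesFound > 0, ]

# Store the table in a file instead of printing it (hypothetical file name).
out_path <- file.path(tempdir(), "vin_counts.csv")
write.csv(result, out_path, row.names = FALSE)
```

This sketch counts per VIN. If several VINs share one name, as they do in the junk data, the per-VIN counts would still need to be summed per name, for example with aggregate().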

Also, I made some junk data to run this on. In the future, please provide data like this.

set.seed(1)
makes <- c("Toyota", "Ford", "GM", "Chrysler")
years <- 1995:2014
cars <- paste(sample(makes, 500, replace=TRUE), sample(years, 500, replace=TRUE))
vins <- unlist(replicate(500, paste0(sample(LETTERS, 16), collapse="")))
vinDB <- data.frame(c(cars, vins)[order(rep(1:500, 2))])
carFile <-
  data.frame(
    c(rep("junk", 1000), sample(vins, 1000, replace=TRUE), rep("junk", 2000))[order(rep(1:1000, 4))]
  )
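
The junk data above fits in memory easily. The real carFile in the question has about 55 million lines, and loading it in one go may be what crashes R. One possible workaround, sketched below in base R, is to read the text file in chunks and accumulate the counts. This is an assumption-laden sketch rather than part of the answer: count_vins_chunked, chunk_records, and the demo file are hypothetical names, and the code assumes the file is plain text with exactly four lines per record and the VIN on the second line of each record.

```r
# Hypothetical helper: count VIN hits by reading the file a chunk at a time.
count_vins_chunked <- function(path, vin.vins, chunk_records = 100000L) {
  counts <- integer(length(vin.vins))
  con <- file(path, open = "r")
  on.exit(close(con))
  repeat {
    # Read a whole number of 4-line records so VINs stay on lines 2, 6, 10, ...
    lines <- readLines(con, n = 4L * chunk_records)
    if (length(lines) == 0L) break
    vins <- lines[seq_along(lines) %% 4L == 2L]
    hit <- match(vins, vin.vins)
    counts <- counts + tabulate(hit[!is.na(hit)], nbins = length(vin.vins))
  }
  counts
}

# Tiny demo file in the carFile layout (made-up contents).
demo_path <- tempfile(fileext = ".txt")
writeLines(c("#id1", "AAA", "+", "ref1",
             "#id2", "BBB", "+", "ref2",
             "#id3", "AAA", "+", "ref3"), demo_path)

# chunk_records = 1L forces several chunks so the accumulation is exercised.
demo_counts <- count_vins_chunked(demo_path, c("AAA", "BBB", "CCC"),
                                  chunk_records = 1L)
```

The returned vector lines up with vin.vins, so it can be combined with vin.names to build the final table. Memory use is bounded by the chunk size, not by the size of the file.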
