Fast way to group variables based on direct and indirect similarities in multiple columns


Question

I have a relatively large data set (1,750,000 rows, 5 columns) containing records with unique ID values (the first column), described by four criteria (the other 4 columns). A small example:

# example
library(data.table)
dt <- data.table(id=c("a1","b3","c7","d5","e3","f4","g2","h1","i9","j6"), 
                 s1=c("a","b","c","l","l","v","v","v",NA,NA), 
                 s2=c("d","d","e","k","k","o","o","o",NA,NA),
                 s3=c("f","g","f","n","n","s","r","u","w","z"),
                 s4=c("h","i","j","m","m","t","t","t",NA,NA))

which looks like this:

   id   s1   s2 s3   s4
 1: a1    a    d  f    h
 2: b3    b    d  g    i
 3: c7    c    e  f    j
 4: d5    l    k  n    m
 5: e3    l    k  n    m
 6: f4    v    o  s    t
 7: g2    v    o  r    t
 8: h1    v    o  u    t
 9: i9 <NA> <NA>  w <NA>
10: j6 <NA> <NA>  z <NA>

My ultimate goal is to find all records that share a character in any of the description columns (disregarding NAs) and group them under a new ID, so that I can easily identify duplicated records. These IDs are constructed by concatenating the IDs of each row.

Things get messier because records with duplicated descriptions can be linked both directly and indirectly. Therefore, I am currently doing this operation in two steps.

# grouping ids with duplicated info in any of the columns
# sorry, I could not find a way to search for duplicates using multiple columns simultaneously...
dt[!is.na(s1), ids1 := paste(id, collapse = "|"), by = list(s1)]
dt[!is.na(s1), ids2 := paste(id, collapse = "|"), by = list(s2)]
dt[!is.na(s1), ids3 := paste(id, collapse = "|"), by = list(s3)]
dt[!is.na(s1), ids4 := paste(id, collapse = "|"), by = list(s4)]

# getting a unique duplicated ID for each row
dt$new.id <- apply(dt[,.(ids1,ids2,ids3,ids4)], 1, paste, collapse="|")
dt$new.id <- apply(dt[, "new.id", drop = FALSE], 1, function(x) paste(unique(strsplit(x, "\\|")[[1]]), collapse = "|"))

This operation results in the following, with the unique duplicated ID defined as "new.id":

   id   s1   s2 s3   s4     ids1     ids2  ids3     ids4   new.id
 1: a1    a    d  f    h       a1    a1|b3 a1|c7       a1 a1|b3|c7
 2: b3    b    d  g    i       b3    a1|b3    b3       b3    b3|a1
 3: c7    c    e  f    j       c7       c7 a1|c7       c7    c7|a1
 4: d5    l    k  n    m    d5|e3    d5|e3 d5|e3    d5|e3    d5|e3
 5: e3    l    k  n    m    d5|e3    d5|e3 d5|e3    d5|e3    d5|e3
 6: f4    v    o  s    t f4|g2|h1 f4|g2|h1    f4 f4|g2|h1 f4|g2|h1
 7: g2    v    o  r    t f4|g2|h1 f4|g2|h1    g2 f4|g2|h1 f4|g2|h1
 8: h1    v    o  u    t f4|g2|h1 f4|g2|h1    h1 f4|g2|h1 f4|g2|h1
 9: i9 <NA> <NA>  w <NA>     <NA>     <NA>  <NA>     <NA>       NA
10: j6 <NA> <NA>  z <NA>     <NA>     <NA>  <NA>     <NA>       NA

Note that records "b3" and "c7" are duplicated indirectly through "a1" (all the other examples are direct duplicates that should remain the same). That is why we need the next step.

# filtering the relevant columns for the indirect search
dt = dt[, .(id, new.id)]

# creating the patterns to be used by grepl() for the look-up of each row
dt[, patt := paste(paste("^", id, "\\||", sep = ""),
                   paste("\\|", id, "\\||", sep = ""),
                   paste("\\|", id, "$", sep = ""),
                   collapse = "", sep = ""), by = list(id)]
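To make the pattern construction concrete, here is a small illustration (not part of the original code) of what the look-up pattern evaluates to for id "a1": it matches "a1" at the start, in the middle, or at the end of a pipe-delimited string, but not as a substring of a longer id. Note the doubled backslashes required in R string literals to get a literal `\|` into the regex.

```r
# Illustration only: the look-up pattern built for id "a1"
patt <- paste(paste("^", "a1", "\\||", sep = ""),
              paste("\\|", "a1", "\\||", sep = ""),
              paste("\\|", "a1", "$", sep = ""),
              collapse = "", sep = "")
patt                       # "^a1\\||\\|a1\\||\\|a1$"
grepl(patt, "a1|b3|c7")    # TRUE:  "a1" at the start
grepl(patt, "b3|a1")       # TRUE:  "a1" at the end
grepl(patt, "fa1|g2")      # FALSE: "fa1" is a different id
```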

# Converting the ID vector to factor and setting it as a 'key' on the data.table (to speed up processing)
dt$new.id = as.factor(dt$new.id)
setkeyv(dt, c("new.id"))

# Performing the look-up loop
library(stringr)
for(i in 1:nrow(dt)) {
  pat = dt$patt[i] # retrieving the search pattern
  tmp = dt[new.id %like% pat] # searching for the pattern using grepl()
  if(dim(tmp)[1] > 1) {
    x = which.max(str_count(tmp$new.id, "\\|"))
    dt$new.id[i] = as.character(tmp$new.id[x])
  }
}

#filtering the final columns 
dt = dt[,.(id,new.id)]

The final table looks like this:

   id   new.id
 1: a1 a1|b3|c7
 2: b3 a1|b3|c7
 3: c7 a1|b3|c7
 4: d5    d5|e3
 5: e3    d5|e3
 6: f4 f4|g2|h1
 7: g2 f4|g2|h1
 8: h1 f4|g2|h1
 9: i9       NA
10: j6       NA

Note that now the first three records ("a1", "b3", "c7") are grouped under a broader duplicated ID, which contains both direct and indirect records.

Everything works fine, but my code is horrendously slow. It took two entire days to run on half of the data set (~800,000 rows). I could parallelize the loop across cores, but it would still take hours. And I am almost sure I could use data.table functionality in a better way, perhaps using set inside the loop. I spent hours today trying to implement the same code with data.table, but I am new to its syntax and I am really struggling. Any suggestions on how to optimize this code?

Note: the slowest part of the code is the loop, and inside the loop the most inefficient step is the grepl() of the patterns against the data.table. Setting a 'key' on the data.table seems like it should speed things up, but in my case it did not change the time the grepl() took.
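As a side note on why the key makes no difference here: a key accelerates equality look-ups via binary search, whereas %like% simply runs grepl() over the entire column, regardless of any key. A minimal sketch (with a made-up two-value column; row counts are illustrative) contrasting the two access patterns:

```r
library(data.table)

# 1,000,000 rows with only two distinct values
dt2 <- data.table(new.id = rep(c("a1|b3|c7", "d5|e3"), 5e5))
setkey(dt2, new.id)

# equality look-up: uses the key (fast binary search)
n1 <- nrow(dt2[.("d5|e3"), on = .(new.id)])

# pattern look-up: ignores the key, scans all rows with grepl()
n2 <- nrow(dt2[new.id %like% "\\|e3$"])

c(n1, n2)   # both 500000, but obtained very differently
```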

Answer

You may approach this as a network problem. Here I use functions from the igraph package. The basic steps:

  1. melt the data to long format.

  2. Use graph_from_data_frame to create a graph, where the 'id' and 'value' columns are treated as an edge list.

  3. Use components to get the connected components of the graph, i.e. which 'id' are connected via their criteria, directly or indirectly.

  4. Select the membership element to get "the cluster id to which each vertex belongs".

  5. Join the membership back onto the original data.

  6. Concatenate 'id' grouped by cluster membership.


library(igraph)

# melt data to long format, remove NA values
d <- melt(dt, id.vars = "id", na.rm = TRUE)

# convert to graph
g <- graph_from_data_frame(d[ , .(id, value)])

# get components and their named membership id 
mem <- components(g)$membership

# add membership id to original data
dt[.(names(mem)), on = .(id), mem := mem] 

# for groups of length one, set 'mem' to NA
dt[dt[, .I[.N == 1], by = mem]$V1, mem := NA]

If desired, concatenate 'id' by the 'mem' column (for non-NA 'mem') (IMHO this just makes further data manipulation more difficult ;) ). Anyway, here we go:

dt[!is.na(mem), id2 := paste(id, collapse = "|"), by = mem]

#     id   s1   s2 s3   s4  mem      id2
#  1: a1    a    d  f    h    1 a1|b3|c7
#  2: b3    b    d  g    i    1 a1|b3|c7
#  3: c7    c    e  f    j    1 a1|b3|c7
#  4: d5    l    k  n    m    2    d5|e3
#  5: e3    l    k  n    m    2    d5|e3
#  6: f4    v    o  s    t    3 f4|g2|h1
#  7: g2    v    o  r    t    3 f4|g2|h1
#  8: h1    v    o  u    t    3 f4|g2|h1
#  9: i9 <NA> <NA>  w <NA>   NA     <NA>
# 10: j6 <NA> <NA>  z <NA>   NA     <NA>
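Putting the pieces of the answer together, the whole approach runs end to end on the toy data in a few lines (a consolidated sketch; assumes the data.table and igraph packages are installed):

```r
library(data.table)
library(igraph)

dt <- data.table(id = c("a1","b3","c7","d5","e3","f4","g2","h1","i9","j6"),
                 s1 = c("a","b","c","l","l","v","v","v",NA,NA),
                 s2 = c("d","d","e","k","k","o","o","o",NA,NA),
                 s3 = c("f","g","f","n","n","s","r","u","w","z"),
                 s4 = c("h","i","j","m","m","t","t","t",NA,NA))

d   <- melt(dt, id.vars = "id", na.rm = TRUE)    # long format, NAs dropped
g   <- graph_from_data_frame(d[, .(id, value)])  # ids and values become vertices
mem <- components(g)$membership                  # connected component per vertex

dt[.(names(mem)), on = .(id), mem := mem]        # attach component id to rows
dt[dt[, .I[.N == 1], by = mem]$V1, mem := NA]    # singleton groups get NA
dt[!is.na(mem), id2 := paste(id, collapse = "|"), by = mem]

dt[id == "b3", id2]   # "a1|b3|c7" -- b3 is linked to c7 only indirectly, via a1
```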


A basic plot of the graph in this small example, just to illustrate the connected components:

plot(g, edge.arrow.size = 0.5, edge.arrow.width = 0.8, vertex.label.cex = 2, edge.curved = FALSE)
