Fast way to group variables based on direct and indirect similarities in multiple columns



I have a relatively large data set (1,750,000 lines, 5 columns) which contains records with unique ID values (first column), described by four criteria (4 other columns). A small example would be:

# example
library(data.table)
dt <- data.table(id=c("a1","b3","c7","d5","e3","f4","g2","h1","i9","j6"), 
                 s1=c("a","b","c","l","l","v","v","v",NA,NA), 
                 s2=c("d","d","e","k","k","o","o","o",NA,NA),
                 s3=c("f","g","f","n","n","s","r","u","w","z"),
                 s4=c("h","i","j","m","m","t","t","t",NA,NA))

which looks like this:

   id   s1   s2 s3   s4
 1: a1    a    d  f    h
 2: b3    b    d  g    i
 3: c7    c    e  f    j
 4: d5    l    k  n    m
 5: e3    l    k  n    m
 6: f4    v    o  s    t
 7: g2    v    o  r    t
 8: h1    v    o  u    t
 9: i9 <NA> <NA>  w <NA>
10: j6 <NA> <NA>  z <NA>

My ultimate goal is to find all records with the same character on any description columns (disregarding NAs), and group them under a new ID, so that I can easily identify duplicated records. These IDs are constructed by concatenating the IDs of each row.

Things get messier because records with duplicated descriptions can be found both directly and indirectly. Therefore, I am currently doing this operation in two steps.

STEP 1 - Constructing duplicated IDs based on direct duplicates

# grouping ids with duplicated info in any of the columns
# sorry, I could not find a way to search for duplicates using multiple columns simultaneously...
dt[!is.na(dt$s1),ids1:= paste(id,collapse="|"), by = list(s1)]
dt[!is.na(dt$s1),ids2:= paste(id,collapse="|"), by = list(s2)]
dt[!is.na(dt$s1),ids3:= paste(id,collapse="|"), by = list(s3)]
dt[!is.na(dt$s1),ids4:= paste(id,collapse="|"), by = list(s4)]

# getting a unique duplicated ID for each row
dt$new.id <- apply(dt[,.(ids1,ids2,ids3,ids4)], 1, paste, collapse="|")
dt$new.id <- apply(dt[,"new.id",drop=FALSE], 1, function(x) paste(unique(strsplit(x,"\\|")[[1]]),collapse="|"))
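As an aside, the four near-identical `:=` calls above can be collapsed into a loop over the criterion columns. Note that this sketch filters NAs per column, unlike the original, which reuses `s1`'s NA filter for every column; the data here is a small illustrative subset of the example:

```r
library(data.table)
dt <- data.table(id = c("a1", "b3", "c7"),
                 s1 = c("a", "b", "c"),
                 s2 = c("d", "d", "e"))

# one := call per criterion column instead of four copies;
# each column is NA-filtered on its own values
for (col in c("s1", "s2")) {
  idcol <- paste0("ids", substr(col, 2, 2))
  dt[!is.na(get(col)), (idcol) := paste(id, collapse = "|"), by = c(col)]
}
```

With the toy data above, `ids1` stays per-row (all `s1` values are unique) while `ids2` groups `a1` and `b3`, which share `d`.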

This operation results in the following, with the unique duplicated ID defined as "new.id":

   id   s1   s2 s3   s4     ids1     ids2  ids3     ids4   new.id
 1: a1    a    d  f    h       a1    a1|b3 a1|c7       a1 a1|b3|c7
 2: b3    b    d  g    i       b3    a1|b3    b3       b3    b3|a1
 3: c7    c    e  f    j       c7       c7 a1|c7       c7    c7|a1
 4: d5    l    k  n    m    d5|e3    d5|e3 d5|e3    d5|e3    d5|e3
 5: e3    l    k  n    m    d5|e3    d5|e3 d5|e3    d5|e3    d5|e3
 6: f4    v    o  s    t f4|g2|h1 f4|g2|h1    f4 f4|g2|h1 f4|g2|h1
 7: g2    v    o  r    t f4|g2|h1 f4|g2|h1    g2 f4|g2|h1 f4|g2|h1
 8: h1    v    o  u    t f4|g2|h1 f4|g2|h1    h1 f4|g2|h1 f4|g2|h1
 9: i9 <NA> <NA>  w <NA>     <NA>     <NA>  <NA>     <NA>       NA
10: j6 <NA> <NA>  z <NA>     <NA>     <NA>  <NA>     <NA>       NA

Note that records "b3" and "c7" are duplicated indirectly through "a1" (all other examples are direct duplicates that should remain the same). That is why we need the next step.

STEP 2 - Updating the duplicated IDs based on indirect duplicates

#filtering the relevant columns for the indirect search
dt = dt[,.(id,new.id)]

#creating the patterns to be used by grepl() for the look-up for each row
dt[,patt:= .(paste(paste("^",id,"\\||",sep=""),paste("\\|",id,"\\||",sep=""),paste("\\|",id,"$",sep=""),collapse = "" ,sep="")), by = list(id)]

#Transforming the ID vector into a factor and setting it as the data.table 'key' (to speed up processing)
dt$new.id = as.factor(dt$new.id)
setkeyv(dt, c("new.id"))

#Performing the look-up loop
library(stringr)
for(i in 1:nrow(dt)) {
  pat = dt$patt[i] # retrieving the search pattern
  tmp = dt[new.id %like% pat] # searching the pattern using grepl()
  if(dim(tmp)[1]>1) {
    x = which.max(str_count(tmp$new.id, "\\|"))
    dt$new.id[i] = as.character(tmp$new.id[x])
  }
}
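Before restructuring the whole loop, one constant-factor improvement worth trying: pad each `new.id` with a leading and trailing `|` once, then test id membership with a fixed (non-regex) substring match instead of the three-branch regex alternation. A hedged sketch; `padded` and `hits` are illustrative names:

```r
# pad once so every id is delimited on both sides, e.g. "|a1|b3|c7|";
# membership then becomes a fixed substring test, which grepl()
# evaluates much faster than a regular expression
new.id <- c("a1|b3|c7", "d5|e3")
padded <- paste0("|", new.id, "|")
hits <- grepl("|b3|", padded, fixed = TRUE)
# hits is TRUE, FALSE: only the first group contains id "b3"
```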

#filtering the final columns 
dt = dt[,.(id,new.id)]

The final table looks like:

   id   new.id
 1: a1 a1|b3|c7
 2: b3 a1|b3|c7
 3: c7 a1|b3|c7
 4: d5    d5|e3
 5: e3    d5|e3
 6: f4 f4|g2|h1
 7: g2 f4|g2|h1
 8: h1 f4|g2|h1
 9: i9       NA
10: j6       NA

Note that now the first three records ("a1","b3","c7") are grouped under a broader duplicated ID, which contains both direct and indirect records.

Everything is working out fine, but my code is horrendously slow. It took 2 entire days to run half of the data set (~800,000 rows). I could parallelize the loop across different cores, but it would still take hours. And I am almost sure that I could use data.table functionality in a better way, maybe by using set() inside the loop. I spent hours today trying to implement the same code using data.table, but I am new to its syntax and I am really having a hard time here. Any suggestions on how I could optimize this code?

Note: The slowest part of the code is the loop, and inside the loop the most inefficient step is the grepl() of the patterns inside the data.table. It seems that setting a 'key' on the data.table can speed up the process, but in my case it did not change the time the grepl() took.

Solution

You may approach this as a network problem. Here I use functions from the igraph package. The basic steps:

  1. melt the data to long format.

  2. Use graph_from_data_frame to create a graph, where 'id' and 'value' columns are treated as an edge list.

  3. Use components to get connected components of the graph, i.e. which 'id' are connected via their criteria, directly or indirectly.

  4. Select the membership element to get "the cluster id to which each vertex belongs".

  5. Join membership to original data.

  6. Concatenate 'id' grouped by cluster membership.


library(igraph)

# melt data to long format, remove NA values
d <- melt(dt, id.vars = "id", na.rm = TRUE)

# convert to graph
g <- graph_from_data_frame(d[ , .(id, value)])

# get components and their named membership id 
mem <- components(g)$membership

# add membership id to original data
dt[.(names(mem)), on = .(id), mem := mem] 

# for groups of length one, set 'mem' to NA
dt[dt[, .I[.N == 1], by = mem]$V1, mem := NA]
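The last line uses a common data.table idiom: the inner call returns the row indices (`.I`) of groups with exactly one member, and the outer call sets `mem` to NA on those rows. A minimal sketch on toy data (names are illustrative):

```r
library(data.table)
x <- data.table(id = c("a1", "b3", "i9"), mem = c(1L, 1L, 2L))

# .I[.N == 1] yields the row indices of singleton groups;
# here only row 3 (mem == 2) is in a group of size one
singletons <- x[, .I[.N == 1], by = mem]$V1
x[singletons, mem := NA]
# mem is now 1, 1, NA: the singleton group was blanked out
```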

If desired, concatenate 'id' by 'mem' column (for non-NA 'mem') (IMHO this just makes further data manipulation more difficult ;) ). Anyway, here we go:

dt[!is.na(mem), id2 := paste(id, collapse = "|"), by = mem]

#     id   s1   s2 s3   s4  mem      id2
#  1: a1    a    d  f    h    1 a1|b3|c7
#  2: b3    b    d  g    i    1 a1|b3|c7
#  3: c7    c    e  f    j    1 a1|b3|c7
#  4: d5    l    k  n    m    2    d5|e3
#  5: e3    l    k  n    m    2    d5|e3
#  6: f4    v    o  s    t    3 f4|g2|h1
#  7: g2    v    o  r    t    3 f4|g2|h1
#  8: h1    v    o  u    t    3 f4|g2|h1
#  9: i9 <NA> <NA>  w <NA>   NA     <NA>
# 10: j6 <NA> <NA>  z <NA>   NA     <NA>


A basic plot of the graph in this small example, just to illustrate the connected components:

plot(g, edge.arrow.size = 0.5, edge.arrow.width = 0.8, vertex.label.cex = 2, edge.curved = FALSE)
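For completeness: if adding an igraph dependency is undesirable, the same connected components can be computed with a classic union-find (disjoint-set) pass over the melted id-value pairs. This is a minimal, unoptimized sketch (no path compression or union by rank); `find_root` and `group_ids` are illustrative names, not part of any package:

```r
# follow parent pointers until reaching a self-parented node (the root)
find_root <- function(parent, x) {
  while (parent[[x]] != x) x <- parent[[x]]
  x
}

# union each id with each of its criterion values; ids sharing any
# value, directly or transitively, end up under the same root
group_ids <- function(id, value) {
  nodes <- unique(c(id, value))
  parent <- setNames(nodes, nodes)
  for (i in seq_along(id)) {
    r1 <- find_root(parent, id[i])
    r2 <- find_root(parent, value[i])
    if (r1 != r2) parent[[r2]] <- r1
  }
  vapply(id, function(x) find_root(parent, x), character(1))
}
```

Called as `group_ids(d$id, as.character(d$value))` on the melted data above, ids in the same component share a root, which plays the role of igraph's membership id.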
