How to construct an edge list from a list of visited places (efficiently)?

Views: 84  Posted: 2020/7/7 5:17:42  Tags: r data.table igraph sna tidygraph

Problem description

My original data.table consists of three columns: site, observation_number and id.

For example, the following shows all the observations for id = z:

|site|observation_number|id|
|a   |                 1|z |
|b   |                 2|z |
|c   |                 3|z |

This means that id z has traveled from a to b to c.

There is no fixed number of sites per id.

I wish to transform the data into an edge list like this:

|from|to|id|
|a   |b |z |
|b   |c |z |

Mock data

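library(data.table)

# The observation column is named observation_number to match the description and the loop below
sox <- data.table(site               = c('a','b','c','a','c','c','a','d','e'),
                  observation_number = c(1,2,3,1,2,1,2,3,4),
                  id                 = c('z','z','z','y','y','k','k','k','k'))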

The way I am currently doing this feels convoluted and is very slow (sox has 1.5 million rows and dt_out has ca. 7.5 million rows). I basically use a for loop over observation_number to split the data into chunks where each id is present only once (that is, only one journey, from-to). Then I cast the data and rbind all the chunks into a new data.table.

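dt_out <- data.table()
maksimum <- sox[, max(observation_number)]
# seq_len(maksimum - 1) avoids the precedence trap in 1:maksimum-1, which is 0:(maksimum - 1)
for (i in seq_len(maksimum - 1)) {
  mini <- i
  maxi <- i + 1
  # keep only the two consecutive observations that make up one journey
  sox_t <- sox[observation_number == maxi | observation_number == mini, ]
  # restrict to ids that have both observations, label them from/to, and cast to wide
  temp_dt <- dcast(sox_t[id %in% sox_t[, .N, by = id][N >= 2]$id,
                         .SD[, list(site, observation_number, a = rep(c('from', 'to')))], by = id],
                   formula = id ~ a, value.var = 'site')
  dt_out <- rbind(dt_out, temp_dt)
}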

I hope someone can help me optimize this, and preferably create a function where I can pass in the data.table, the site column, the observation number column, and the id column. For some reason I can't get such a function to work.

Using system.time (and running it a few times):

                             User   System  Elapsed
make_edgelist (data.table):  5.38     0.00     5.38
data.table with shift:      13.96     0.06    14.08
dplyr with arrange:          6.06     0.36     6.44
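
For readers who want to reproduce this kind of measurement, here is a minimal sketch of timing one variant with system.time. The synthetic table, its size (n_ids), and the column name obsnum are assumptions made for illustration; they are not the data behind the numbers above.

```r
library(data.table)

set.seed(1)
n_ids <- 1e4  # hypothetical size, far smaller than the 1.5 million rows in the question

# Synthetic journeys: every id visits five randomly drawn sites
synth <- data.table(id = rep(seq_len(n_ids), each = 5L))
synth[, obsnum := seq_len(.N), by = id]
synth[, site := sample(letters, .N, replace = TRUE)]

# system.time reports user, system and elapsed time in seconds
timing <- system.time(
  edges <- synth[order(obsnum), list(from = site[-.N], to = site[-1L]), by = id]
)
print(timing[c("user.self", "sys.self", "elapsed")])
```

Running the call several times, as done for the table above, helps smooth out run-to-run noise.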

P.S. make_edgelist was updated to order the data.table.

make_edgelist <- function(DT, site_var = "site", id_var = "id", obsnum_var = "rn1") {
  # sort by observation number, then pair each site with its successor within each id
  DT[order(get(obsnum_var)),
     list(from = get(site_var)[-.N], to = get(site_var)[-1]), by = id_var]
}
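
As a usage sketch, the function can be applied to a small table shaped like the mock data. The column name observation_number follows the problem description and is passed explicitly, because the default "rn1" is not a column of this table.

```r
library(data.table)

# Small table in the shape described in the question
sox <- data.table(site               = c('a','b','c','a','c','c','a','d','e'),
                  observation_number = c(1,2,3,1,2,1,2,3,4),
                  id                 = c('z','z','z','y','y','k','k','k','k'))

# make_edgelist as posted above
make_edgelist <- function(DT, site_var = "site", id_var = "id", obsnum_var = "rn1") {
  DT[order(get(obsnum_var)),
     list(from = get(site_var)[-.N], to = get(site_var)[-1]), by = id_var]
}

edges <- make_edgelist(sox, obsnum_var = "observation_number")
print(edges)
```

On this input the result has six rows: two edges for z, one for y, and three for k. An id with a single observation would contribute no rows, since both `[-.N]` and `[-1]` leave an empty vector.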

I was surprised that dplyr (with lead) was almost as fast as make_edgelist and much faster than data.table with shift. I guess this means that dplyr will actually be faster with more complex lead/lag/shift operations.
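
The shift-based code that was timed is not shown here. The following is only a sketch of what such a variant might look like; the function name edges_shift and its defaults are hypothetical. It assumes the site column contains no missing values, because the NA that shift leaves on the last row of each id is what marks the rows to drop.

```r
library(data.table)

sox <- data.table(site               = c('a','b','c','a','c','c','a','d','e'),
                  observation_number = c(1,2,3,1,2,1,2,3,4),
                  id                 = c('z','z','z','y','y','k','k','k','k'))

# Hypothetical shift-based variant (not the code that was benchmarked above)
edges_shift <- function(DT, site_var = "site", id_var = "id",
                        obsnum_var = "observation_number") {
  DT <- copy(DT)                        # avoid modifying the caller's table by reference
  setorderv(DT, c(id_var, obsnum_var))  # sort by id, then by observation number
  # the next site within each id; the last row of each id gets NA
  DT[, to := shift(get(site_var), type = "lead"), by = id_var]
  out <- DT[!is.na(to), c(id_var, site_var, "to"), with = FALSE]
  setnames(out, site_var, "from")
  out[]
}

edges <- edges_shift(sox)
print(edges)
```

Unlike make_edgelist, this version returns the ids in sorted order (k, y, z), because the table is sorted by id before the lead is computed.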

I also find it puzzling, though I don't know enough to say whether it is significant, that dplyr used more 'system' time than either of the two data.table solutions.

Input data: 1.5 million rows. Result: 0.6 million rows.
