How to construct an edge list from a list of visited places (efficiently)?


Problem description

My original data.table consists of three columns:
site, observation_number and id.

For example, the following shows all the observations for id = z:

|site|observation_number|id|
|a   |                 1|z |
|b   |                 2|z |
|c   |                 3|z |

This means that id z has traveled from a to b to c.

There is no fixed number of sites per id.

I want to transform the data into an edge list like this:

|from|to|id|
|a   |b |z |
|b   |c |z |
Mock data

library(data.table)

sox <- data.table(site   = c('a','b','c','a','c','c','a','d','e'),
                  obsnum = c(1,2,3,1,2,1,2,3,4),
                  id     = c('z','z','z','y','y','k','k','k','k'))

The way I am currently doing this feels convoluted and is very slow (sox has 1.5 million rows and dt_out has about 7.5 million rows). I basically use a for loop over observation_number to split the data into chunks in which each id is present only once (that is, only one journey, from and to). Then I cast the data and rbind all the chunks into a new data.table.

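dt_out <- data.table()
maksimum <- sox[, max(obsnum)]  # obsnum is the observation number column in the mock data
# Note: 1:maksimum-1 would evaluate to 0:(maksimum - 1), so use seq_len() instead
for (i in seq_len(maksimum - 1)) {
  mini <- i
  maxi <- i + 1
  # Keep only the two consecutive observations i and i + 1
  sox_t <- sox[obsnum == maxi | obsnum == mini]
  # Keep ids that have both observations, label their rows from/to, then cast to wide
  temp_dt <- dcast(sox_t[id %in% sox_t[, .N, by = id][N >= 2]$id,
                         .SD[, list(site, obsnum, a = rep(c('from', 'to')))], by = id],
                   formula = id ~ a, value.var = 'site')
  dt_out <- rbind(dt_out, temp_dt)
}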

I hope someone can help me optimize this, and preferably create a function where I can pass in the data.table, the site column, the observation number column, and the id column. For some reason I can't get any function I write to work.

Timings from system.time (run a few times):

                              User   System   Elapsed
make_edgelist (data.table):   5.38     0.00      5.38
data.table with shift:       13.96     0.06     14.08
dplyr with arrange:           6.06     0.36      6.44

P.S. make_edgelist was updated to order the data.table:

make_edgelist <- function(DT, site_var = "site", id_var = "id", obsnum_var = "rn1") {
  DT[order(get(obsnum_var)),
     list(from = get(site_var)[-.N], to = get(site_var)[-1]),
     by = id_var]
}

I was surprised that dplyr (with lead) was almost as fast as make_edgelist and much faster than data.table with shift. I guess this means that dplyr will actually be faster for more complex lead/lag/shift operations.
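
The code behind the "data.table with shift" timing is not shown in the question. A minimal sketch of what such a variant might look like, assuming the mock sox table from above and data.table's shift() with type = "lead", is given below. It is an illustration only and not necessarily the code that was timed.

```r
library(data.table)

sox <- data.table(site   = c('a','b','c','a','c','c','a','d','e'),
                  obsnum = c(1,2,3,1,2,1,2,3,4),
                  id     = c('z','z','z','y','y','k','k','k','k'))

# Assumed shift-based variant: pair each site with the next site of the same id,
# then drop the last row of each id, where there is no next site (to is NA).
edges_shift <- sox[order(obsnum),
                   .(from = site, to = shift(site, type = "lead")),
                   by = id][!is.na(to)]
print(edges_shift)
```

Unlike the negative-indexing approach, this version first builds one row per observation and filters afterwards, which may be part of why it was slower in the timings above.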

I also find it puzzling, though I don't know enough to tell whether it is significant, that dplyr used more 'system' time than either of the two data.table solutions.

Input data: 1.5 million rows. Result: 0.6 million rows.

Recommended answer

Is this what you are looking for?

sox[, .(from = site[-.N], to = site[-1]), by = id]

#    id from to
# 1:  z    a  b
# 2:  z    b  c
# 3:  y    a  c
# 4:  k    c  a
# 5:  k    a  d
# 6:  k    d  e
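
The indexing trick may be easier to see on a single vector. The following is a minimal sketch in base R (the vector visits is a made-up example, not part of the original data): dropping the last element gives every starting point, dropping the first gives every destination, and pairing the two yields the consecutive edges. Inside a data.table j expression, .N plays the role of length(visits) for the current group.

```r
# Hypothetical visit sequence for one id, already in observation order
visits <- c("a", "b", "c", "d")

from <- visits[-length(visits)]  # all but the last site
to   <- visits[-1]               # all but the first site

edges <- data.frame(from = from, to = to, stringsAsFactors = FALSE)
print(edges)
```

An id with a single observation contributes no rows, because both vectors are then empty.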

Wrapped in a function:

make_edgelist <- function(DT, site_var = "site", id_var = "id") {
  DT[, .(from = get(site_var)[-.N], to = get(site_var)[-1]), by = id_var]
}

Note: This solution assumes the data is already ordered by observation number. To avoid this assumption, add order(obsnum) before the first comma.
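
As a sketch of that suggestion, the call below sorts by obsnum inside i before building the edges. The shuffled copy sox_shuffled and the set.seed value are made up for illustration; the point is only that the set of edges no longer depends on the row order of the input.

```r
library(data.table)

sox <- data.table(site   = c('a','b','c','a','c','c','a','d','e'),
                  obsnum = c(1,2,3,1,2,1,2,3,4),
                  id     = c('z','z','z','y','y','k','k','k','k'))

# Shuffle the rows to simulate unsorted input (hypothetical test setup)
set.seed(1)
sox_shuffled <- sox[sample(.N)]

# order(obsnum) in i sorts the rows before the grouped j expression runs
edges <- sox_shuffled[order(obsnum),
                      .(from = site[-.N], to = site[-1]),
                      by = id]
print(edges[order(id)])
```

The order in which the ids appear in the result can differ from the sorted-input case, since groups are returned in order of first appearance, but the edges within each id are the same.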
