How to construct an edge list from a list of visited places (efficiently)?
Question
My original data.table consists of three columns: site, observation_number, and id.
For example, the following are all the observations for id = z:
|site|observation_number|id|
|a|1|z|
|b|2|z|
|c|3|z|
This means that ID z has traveled from a to b to c.
There is no fixed number of sites per id.
I wish to transform the data into an edge list like this:
|from|to|id|
|a|b|z|
|b|c|z|
Mock data:
library(data.table)

sox <- data.table(site   = c('a','b','c','a','c','c','a','d','e'),
                  obsnum = c(1,2,3,1,2,1,2,3,4),
                  id     = c('z','z','z','y','y','k','k','k','k'))
The way I am currently doing this feels convoluted and is very slow (sox has 1.5 million rows and dt_out has about 7.5 million rows).
I basically use a for loop over observation_number to split the data into chunks in which each ID is present only once (that is, only one journey, from one site to the next).
Then I cast the data and rbind all the chunks into a new data.table.
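dt_out <- data.table()
# observation_number is the column called obsnum in the mock data above
maksimum <- sox[, max(observation_number)]
# seq_len(maksimum - 1) yields 1, 2, ..., maksimum - 1;
# 1:maksimum-1 would parse as (1:maksimum) - 1 and start at 0
for (i in seq_len(maksimum - 1)) {
  mini <- i
  maxi <- i + 1
  # keep only the two consecutive observation numbers for this step
  sox_t <- sox[observation_number == maxi | observation_number == mini, ]
  # keep ids that have both observations, label the rows 'from' and 'to', cast to wide
  temp_dt <- dcast(sox_t[id %in% sox_t[, .N, by = id][N >= 2]$id,
                         .SD[, list(site, observation_number, a = rep(c('from', 'to')))],
                         by = id],
                   id ~ a, value.var = 'site')
  dt_out <- rbind(dt_out, temp_dt)
}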
I hope someone can help me optimize this, and preferably create a function where I can pass in the data.table, the site column, the observation number column, and the id column. For some reason, I have not been able to write a function that works.
Timings from system.time (run a few times), in seconds:
|method|user|system|elapsed|
|make_edgelist (data.table)|5.38|0.00|5.38|
|data.table with shift|13.96|0.06|14.08|
|dplyr with arrange|6.06|0.36|6.44|
P.S. make_edgelist was updated to order the data.table:
make_edgelist <- function(DT, site_var = "site", id_var = "id", obsnum_var = "rn1") {
DT[order(get(obsnum_var)),
list(from = get(site_var)[-.N], to = get(site_var)[-1]), by = id_var]
}
I was surprised that dplyr (with lead) was almost as fast as make_edgelist and much faster than data.table with shift. I guess this means that dplyr will actually be faster with more complex lead/lag/shift operations.
I also find it puzzling, though I don't know enough to tell whether it is significant, that dplyr used more 'system' time than either of the two data.table solutions.
Input data: 1.5 million rows. Result: 0.6 million rows.
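The shift-based variant timed above is not shown in the post. The following is only a guess at what such a version might look like, run on the mock data: each site is paired with the next site of the same id via shift(type = "lead"), and the trailing row of each id (which has no successor) is dropped. The name edges_shift is made up for illustration.

```r
library(data.table)

# Mock data from the question
sox <- data.table(site   = c('a','b','c','a','c','c','a','d','e'),
                  obsnum = c(1,2,3,1,2,1,2,3,4),
                  id     = c('z','z','z','y','y','k','k','k','k'))

# Assumed shift-based variant (not taken from the post):
# 'to' is the following site within the same id, NA for the last observation
edges_shift <- sox[order(obsnum),
                   .(from = site, to = shift(site, type = "lead")),
                   by = id]

# Drop the last observation of each id, which has no destination
edges_shift <- edges_shift[!is.na(to)]
print(edges_shift)
```

Unlike the negative-indexing approach, this sketch creates one extra row per id and filters it out afterwards. Whether that accounts for any of the timing difference is not something the post establishes.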
Recommended answer
Is this what you are looking for?
sox[, .(from = site[-.N], to = site[-1]), by = id]
# id from to
# 1: z a b
# 2: z b c
# 3: y a c
# 4: k c a
# 5: k a d
# 6: k d e
Wrapped in a function:
make_edgelist <- function(DT, site_var = "site", id_var = "id") {
DT[, .(from = get(site_var)[-.N], to = get(site_var)[-1]), by = id_var]
}
Note: This solution assumes the data is already ordered by observation number. To avoid this assumption, add order(obsnum) before the first comma.
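To illustrate the note, here is a small sketch that applies the same expression, with order(obsnum) added, to a deliberately shuffled copy of the mock data. The shuffled row order and the name sox_shuffled are made up for illustration.

```r
library(data.table)

# Mock data from the question
sox <- data.table(site   = c('a','b','c','a','c','c','a','d','e'),
                  obsnum = c(1,2,3,1,2,1,2,3,4),
                  id     = c('z','z','z','y','y','k','k','k','k'))

# Hypothetical shuffled copy, to mimic input that is not sorted
sox_shuffled <- sox[c(9, 1, 5, 3, 7, 2, 8, 4, 6)]

# order(obsnum) in i sorts the rows before the grouped j expression runs,
# so within each id the sites are paired in observation order
edges <- sox_shuffled[order(obsnum),
                      .(from = site[-.N], to = site[-1]),
                      by = id]
print(edges)
```

An id with only one observation contributes no rows, because both site[-.N] and site[-1] are then empty.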