使用igraph采样不同大小的子图 [英] sampling subgraphs from different sizes using igraph
问题描述
我有一个igraph对象mygraph
,具有约10,000个节点和约145,000个边,并且需要从该图创建许多子图,但大小不同.
我需要的是根据确定的大小(从5个节点到500个节点)创建子图,其中每个子图中的所有节点均已连接.我需要为每个大小创建约1,000个子图(即,大小5的1000个子图,大小6的1000个子图,依此类推),然后根据不同的节点属性为每个图计算一些值.
我有一些代码,但是所有的计算都需要很长时间.我曾想过使用graphlets
函数来获取不同的大小,但是每次在计算机上运行它时,由于内存问题而崩溃.
I have an igraph object mygraph
with ~10,000 nodes and ~145,000 edges, and I need to create a number of subgraphs from this graph but with different sizes.
What I need is to create subgraphs from a determined size (from 5 nodes to 500 nodes) where all the nodes are connected in each subgraph. I need to create ~1,000 subgraphs for each size (i.e, 1000 subgraphs for size5, 1000 for size 6, and so on), and then calculate some values for each graph according to different node attributes.
I have some code but it takes a long time to do all the calculations. I thought in using the graphlets
function in order to get the different sizes but every time I run it on my computer it crash due to memory issues.
这是我正在使用的代码:
Here is the code I am using:
第一步是创建一个函数,以创建不同大小的子图并进行所需的计算.
First step was to create a function to create the subgraphs of different sizes and do the calculations needed.
random_network<-function(size,G){
score_fun<-function(g){
subsum <- sum(V(g)$weight*V(g)$RWRNodeweight)/sqrt(sum(V(g)$RWRNodeweight^2))
subsum
}
genes.idx <- V(G)$name
perm <- c()
while(length(perm)<1000){
seed<-sample(genes.idx,1)
while( length(seed)<size ){
tmp.neigh <- V(G)[unlist(neighborhood(G,1,seed))]$name
tmp.neigh <- setdiff(tmp.neigh, seed)
if( length(tmp.neigh)>0 )
seed<-c(seed,sample(tmp.neigh,1)) else break
}
if( length(seed)==size )
perm <- c(perm,score_fun(induced.subgraph(G,seed)))
}
perm
}
第二步是将该功能应用于实际图形
Second step was to apply the function to the actual graph
### generate some example data
library(igraph)
my_graph <- erdos.renyi.game(10000, 0.0003)
V(my_graph)$name <- 1:vcount(my_graph)
V(my_graph)$weight <- rnorm(10000)
V(my_graph)$RWRNodeweight <- runif(10000, min=0, max=0.05)
### Run the code to get the subgraphs from different size and do calculations based on nodes
genesets.length<- seq(5:500)
genesets.length.null.dis <- list()
for(k in 5:max(genesets.length){
genesets.length.null.dis[[as.character(k)]] <- random_network(size=k,G=my_graph)
}
推荐答案
使用Rcpp软件包是一种比Base R进一步提高代码速度的方法.考虑以下完整操作的Rcpp实现.输入以下内容:
One approach to speed up your code further than what's possible in base R would be to use the Rcpp package. Consider the following Rcpp implementation of the full operation. It takes as input the following:
-
valid
:足够大的组件中所有节点的索引 -
el
,deg
,firstPos
:图形的边缘列表的表示形式(节点i
的邻居是el[firstPos[i]]
,el[firstPos[i]+1]
,...,el[firstPos[i]+deg[i]-1]
). li>
-
size
:要采样的子图大小 -
nrep
:重复次数 -
weights
:存储在V(G)$weight
中的边缘权重
-
RWRNodeweight
:存储在V(G)$RWRNodeweight
中的边缘权重
valid
: The indices of all nodes that are in a large-enough componentel
,deg
,firstPos
: A representation of the graph's edge list (nodei
's neighbors areel[firstPos[i]]
,el[firstPos[i]+1]
, ...,el[firstPos[i]+deg[i]-1]
).size
: The subgraph size to samplenrep
: The number of repetitionsweights
: The edge weights stored inV(G)$weight
RWRNodeweight
: The edge weights stored inV(G)$RWRNodeweight
library(Rcpp)
cppFunction(
"NumericVector scores(IntegerVector valid, IntegerVector el, IntegerVector deg,
IntegerVector firstPos, const int size, const int nrep,
NumericVector weights, NumericVector RWRNodeweight) {
const int n = deg.size();
std::vector<bool> used(n, false);
std::vector<bool> neigh(n, false);
std::vector<int> neighList;
std::vector<double> scores(nrep);
for (int outerIter=0; outerIter < nrep; ++outerIter) {
// Initialize variables
std::fill(used.begin(), used.end(), false);
std::fill(neigh.begin(), neigh.end(), false);
neighList.clear();
// Random first node
int recent = valid[rand() % valid.size()];
used[recent] = true;
double wrSum = weights[recent] * RWRNodeweight[recent];
double rrSum = RWRNodeweight[recent] * RWRNodeweight[recent];
// Each additional node
for (int idx=1; idx < size; ++idx) {
// Add neighbors of recent
for (int p=firstPos[recent]; p < firstPos[recent] + deg[recent]; ++p) {
if (!neigh[el[p]] && !used[el[p]]) {
neigh[el[p]] = true;
neighList.push_back(el[p]);
}
}
// Compute new node to add from all neighbors
int newPos = rand() % neighList.size();
recent = neighList[newPos];
used[recent] = true;
wrSum += weights[recent] * RWRNodeweight[recent];
rrSum += RWRNodeweight[recent] * RWRNodeweight[recent];
// Remove from neighList
neighList[newPos] = neighList[neighList.size() - 1];
neighList.pop_back();
}
// Compute score from wrSum and rrSum
scores[outerIter] = wrSum / sqrt(rrSum);
}
return NumericVector(scores.begin(), scores.end());
}
")
现在我们在基数R中需要做的就是为scores
生成参数,这很容易做到:
Now all that we need to do in base R is generate the arguments for scores
, which can be done pretty easily:
josilber.rcpp <- function(size, num.rep, G) {
n <- length(V(G)$name)
# Determine which nodes fall in sufficiently large connected components
comp <- components(G)
valid <- which(comp$csize[comp$membership] >= size)
# Construct an edge list representation for use in the Rcpp code
el <- get.edgelist(G, names=FALSE) - 1
el <- rbind(el, el[,2:1])
el <- el[order(el[,1]),]
deg <- degree(G)
first.pos <- c(0, cumsum(head(deg, -1)))
# Run the proper number of replications
scores(valid-1, el[,2], deg, first.pos, size, num.rep,
as.numeric(V(G)$weight), as.numeric(V(G)$RWRNodeweight))
}
与原始代码和迄今为止我们看到的所有igraph
解决方案相比,执行1000次复制的时间非常快(请注意,对于大多数基准测试,我已经针对以下测试了原始的josilber
和random_network
函数) 1次复制而不是1000次复制,因为测试1000次会花费非常长的时间):
The time to perform 1000 replications is blazing fast compared to the original code and all igraph
solutions we've seen so far (note that for much of this benchmarking I tested the original josilber
and random_network
functions for 1 replication instead of 1000 because testing for 1000 would take a prohibitively long time):
- 大小= 10:0.06秒(比我先前建议的
josilber
函数高1200倍,比原始random_network
函数高4000倍) - 大小= 100:0.08秒(比我以前建议的
josilber
函数高8700倍,比原始random_network
函数高162000倍) - 大小= 1000:0.13秒(比我以前建议的
josilber
函数快32000倍,比原始random_network
函数快 2040万倍) - 大小= 5000:0.32秒(比我以前建议的
josilber
函数快68000倍,比原始random_network
函数快 2.9亿倍)
- Size=10: 0.06 seconds (a 1200x speedup over my previously proposed
josilber
function and a 4000x speedup over the originalrandom_network
function) - Size=100: 0.08 seconds (a 8700x speedup over my previously proposed
josilber
function and a 162000x speedup over the originalrandom_network
function) - Size=1000: 0.13 seconds (a 32000x speedup over my previously proposed
josilber
function and a 20.4 million times speedup over the originalrandom_network
function) - Size=5000: 0.32 seconds (a 68000x speedup over my previously proposed
josilber
function and a 290 million times speedup over the originalrandom_network
function)
简而言之,Rcpp可能使得对于从5到500的每个大小计算1000个重复项是可行的(此操作可能总共花费大约1分钟),而为每个大小计算1000个重复项将太慢了使用到目前为止已经提出的纯R代码来确定大小.
In short, Rcpp probably makes it feasible to compute 1000 replicates for each size from 5 to 500 (this operation will probably take about 1 minute in total), while it would be prohibitively slow to compute the 1000 replicates for each of these sizes using the pure R code that's been proposed so far.
这篇关于使用igraph采样不同大小的子图的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!