How to compute the Topological Overlap Measure [TOM] for a weighted adjacency matrix in Python?

Question

I'm trying to calculate the weighted topological overlap for an adjacency matrix, but I cannot figure out how to do it correctly using NumPy. The R function with the correct implementation is TOMsimilarity from WGCNA (https://www.rdocumentation.org/packages/WGCNA/versions/1.67/topics/TOMsimilarity). The formula for computing this is (I think) detailed in equation 4, which I believe is correctly reproduced below.

Does anyone know how to implement this correctly so that it reproduces the WGCNA version?

Yes, I know about rpy2, but I'm trying to keep this lightweight if possible.

For starters, my diagonal is not 1, and my values do not differ from the WGCNA values by any consistent amount (e.g. they are not all off by x).

When I computed this in R, I used the following:

> library(WGCNA, quiet=TRUE)
> df_adj = read.csv("https://pastebin.com/raw/sbAZQsE6", row.names=1, header=TRUE, check.names=FALSE, sep="\t")
> df_tom = TOMsimilarity(as.matrix(df_adj), TOMType="unsigned", TOMDenom="min")
# ..connectivity..
# ..matrix multiplication (system BLAS)..
# ..normalization..
# ..done.
# I've uploaded it to this url: https://pastebin.com/raw/HT2gBaZC

I'm not sure where my code is incorrect. The source code for the R version is here, but it appears to use C backend scripts, which are very difficult for me to interpret.

Here is my implementation in Python:

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

def get_iris_data():
    iris = load_iris()
    # Iris dataset
    X = pd.DataFrame(iris.data,
                     index = [*map(lambda x:f"iris_{x}", range(150))],
                     columns = [*map(lambda x: x.split(" (cm)")[0].replace(" ","_"), iris.feature_names)])

    y = pd.Series(iris.target,
                           index = X.index,
                           name = "Species")
    return X, y

# Get data
X, y = get_iris_data()

# Create an adjacency network
# df_adj = np.abs(X.T.corr()) # I've uploaded this part to this url: https://pastebin.com/raw/sbAZQsE6
df_adj = pd.read_csv("https://pastebin.com/raw/sbAZQsE6", sep="\t", index_col=0)
A_adj = df_adj.values

# Correct TOM from WGCNA for the A_adj
# See above for code
# https://www.rdocumentation.org/packages/WGCNA/versions/1.67/topics/TOMsimilarity
df_tom__wgcna = pd.read_csv("https://pastebin.com/raw/HT2gBaZC", sep="\t", index_col=0)

# My attempt
A = A_adj.copy()
dimensions = A.shape
assert dimensions[0] == dimensions[1]
d = dimensions[0]

# np.fill_diagonal(A, 0)

# Equation (4) from http://dibernardo.tigem.it/files/papers/2008/zhangbin-statappsgeneticsmolbio.pdf
A_tom = np.zeros_like(A)
for i in range(d):
    a_iu = A[i]
    k_i = a_iu.sum()
    for j in range(i+1, d):
        a_ju = A[:,j]
        k_j = a_ju.sum()
        l_ij = np.dot(a_iu, a_ju)
        a_ij = A[i,j]
        numerator = l_ij + a_ij
        denominator = min(k_i, k_j) + 1 - a_ij
        w_ij = numerator/denominator
        A_tom[i,j] = w_ij
A_tom = (A_tom + A_tom.T)

There is a package called GTOM (https://github.com/benmaier/gtom), but it is not for weighted adjacencies. The author of GTOM also took a look at this problem with a much more sophisticated/efficient NumPy implementation, but it still does not produce the expected results.

Does anyone know how to replicate the WGCNA implementation?

2019.06.20 I've adapted some of the code from @scleronomic and @benmaier, with credits in the docstring. The function is available in soothsayer from v2016.06 onward. Hopefully this will make it easier to use topological overlap in Python instead of only being able to use R.

https://github.com/jolespin/soothsayer/blob/master/soothsayer/networks/networks.py

import numpy as np
import soothsayer as sy
df_adj = sy.io.read_dataframe("https://pastebin.com/raw/sbAZQsE6")
df_tom = sy.networks.topological_overlap_measure(df_adj)
df_tom__wgcna = sy.io.read_dataframe("https://pastebin.com/raw/HT2gBaZC")
np.allclose(df_tom, df_tom__wgcna)
# True

Recommended answer

First let's look at the parts of the equation for the case of a binary adjacency matrix a_ij:

  • a_ij: indicates whether node i is connected to node j
  • k_i: count of the neighbors of node i (connectivity)
  • l_ij: count of the common neighbors of node i and node j

So w_ij measures how many of the neighbors of the node with the lower connectivity are also neighbors of the other node (i.e. w_ij measures "their relative inter-connectedness").
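To make these three quantities concrete, here is a minimal sketch on a small binary graph. The 4-node graph is a made-up example (not data from the question), and the formula is equation 4 with the min denominator, as used in the question's code.

```python
import numpy as np

# Hypothetical 4-node binary graph with a zero diagonal.
# Edges: 0-1, 0-2, 1-2, 2-3
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
])

k = A.sum(axis=1)  # k_i: number of neighbors of each node
L = A @ A          # L[i, j] = l_ij: number of neighbors shared by i and j

i, j = 0, 1
w_ij = (L[i, j] + A[i, j]) / (min(k[i], k[j]) + 1 - A[i, j])

print("k:", k)
print("l_01:", L[i, j])
print("w_01:", w_ij)
```

Nodes 0 and 1 are connected, and the only other neighbor of node 0 (node 2) is also a neighbor of node 1, so the overlap for this pair comes out at its maximum.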

My guess is that they define the diagonal of A to be zero instead of one. With this assumption I can reproduce the values of WGCNA.

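# A (copy of the adjacency matrix), d (its size), np and pd come from the question's code
A[range(d), range(d)] = 0  # Assumption: the diagonal of A is zero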
L = A @ A  # Could be done smarter by using the symmetry
K = A.sum(axis=1)

A_tom = np.zeros_like(A)
for i in range(d):
    for j in range(i+1, d):  
        numerator = L[i, j] + A[i, j]
        denominator = min(K[i], K[j]) + 1 - A[i, j]
        A_tom[i, j] = numerator / denominator
    
A_tom += A_tom.T
A_tom[range(d), range(d)] = 1  # Set diagonal to 1 by default

A_tom__wgcna = np.array(pd.read_csv("https://pastebin.com/raw/HT2gBaZC", 
                        sep="\t", index_col=0))
print(np.allclose(A_tom, A_tom__wgcna))
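The double loop can also be written without explicit Python loops. The following is a sketch under the same assumptions as above (zero diagonal on the input, min denominator, result diagonal set to one); the name tom_similarity is a hypothetical helper, not part of WGCNA or any other library, and the input is assumed to be a symmetric matrix with weights in [0, 1].

```python
import numpy as np

def tom_similarity(adjacency):
    """Unsigned TOM with a 'min' denominator (hypothetical helper, not the WGCNA API)."""
    A = np.array(adjacency, dtype=float)  # copy, so the caller's matrix is untouched
    if A.ndim != 2 or A.shape[0] != A.shape[1]:
        raise ValueError("adjacency must be a square matrix")
    np.fill_diagonal(A, 0.0)              # assumption from the answer: zero diagonal
    L = A @ A                             # shared-neighbor weights l_ij
    k = A.sum(axis=1)                     # connectivities k_i
    k_min = np.minimum.outer(k, k)        # min(k_i, k_j) for every pair (i, j)
    tom = (L + A) / (k_min + 1.0 - A)
    np.fill_diagonal(tom, 1.0)            # diagonal set to one by convention
    return tom

# Usage on a random symmetric matrix with weights in [0, 1]
rng = np.random.default_rng(0)
M = rng.random((5, 5))
M = (M + M.T) / 2
T = tom_similarity(M)
print(np.allclose(T, T.T), np.allclose(np.diag(T), 1.0))
```

np.minimum.outer builds the full matrix of pairwise minima in one call, which replaces the per-pair min(K[i], K[j]) of the loop version.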

The intuition for why the diagonal of A should be zero instead of one can be seen in a simple example with a binary A:

 Graph      Case Zero    Case One
   B          A B C D      A B C D  
 /   \      A 0 1 1 1    A 1 1 1 1  
A-----D     B 1 0 0 1    B 1 1 0 1  
 \   /      C 1 0 0 1    C 1 0 1 1  
   C        D 1 1 1 0    D 1 1 1 1  

The description given with equation 4 explains:

Note that w_ij = 1 if the node with fewer connections satisfies two conditions:

  • (a) all of its neighbors are also neighbors of the other node, and
  • (b) it is connected to the other node.

In contrast, w_ij = 0 if i and j are un-connected and the two nodes do not share any neighbors.

So the connection between A and D should fulfill this criterion and give w_14 = 1.

  • Case Zero: w_14 = (2 + 1) / (min(3, 3) + 1 - 1) = 1
  • Case One: w_14 = (4 + 1) / (min(4, 4) + 1 - 1) = 5/4
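For the graph above, the value of w_14 (the pair A-D) under both diagonal conventions can be checked numerically. This is a small sketch that applies equation 4 to a single pair; the helper name w is introduced here for illustration only.

```python
import numpy as np

def w(A, i, j):
    """Equation 4 for a single pair (i, j), using A exactly as given."""
    k = A.sum(axis=1)
    l_ij = A[i] @ A[:, j]
    return (l_ij + A[i, j]) / (min(k[i], k[j]) + 1 - A[i, j])

# Node order: A, B, C, D
case_zero = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
])
case_one = case_zero + np.eye(4, dtype=int)  # same graph, diagonal set to one

print(w(case_zero, 0, 3))  # 1.0
print(w(case_one, 0, 3))   # 1.25
```

Only the zero-diagonal convention gives w_14 = 1, which is the value the description of equation 4 calls for.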

What is still missing when applying the formula is that the diagonal values don't match. I set them to one by default. What is the inter-connectedness of a node with itself anyway? A value other than one (or zero, depending on the definition) doesn't make sense to me. Neither Case Zero nor Case One results in w_ii = 1 in the simple example. In Case Zero it would be necessary that k_i + 1 == l_ii, and in Case One it would be necessary that k_i == l_ii + 1, both of which seem wrong to me.
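As a quick illustration, the following sketch evaluates equation 4 on the diagonal of the example graph for both conventions. The helper name raw_diagonal is made up for this example.

```python
import numpy as np

def raw_diagonal(A):
    """w_ii from equation 4 for every node, without any special-casing."""
    k = A.sum(axis=1)
    l_ii = (A * A.T).sum(axis=1)  # l_ii = sum_u a_iu * a_ui
    a_ii = np.diag(A)
    return (l_ii + a_ii) / (k + 1 - a_ii)

# Node order: A, B, C, D (same graph as in the example above)
case_zero = np.array([
    [0, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
])
case_one = case_zero + np.eye(4, dtype=int)

print(raw_diagonal(case_zero))  # every entry is below 1
print(raw_diagonal(case_one))   # every entry is above 1
```

For a binary matrix l_ii equals k_i, so the raw formula gives k_i / (k_i + 1) with a zero diagonal and (k_i + 1) / k_i with a unit diagonal; neither can equal one.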

So, to summarize, I would set the diagonal of the adjacency matrix to zero, use the given equation, and set the diagonal of the result to one by default.
