Transcribing large datasets into Neo4j via Python (Py2neo)

Problem Description

I've spent the last few weeks trying to load a genomic dataset into Neo4j using the Scikit-Allel library. I've managed to load all variants for the exomes in a VCF file, along with all subjects and their related phenotype data; I'm now attempting to create the relationships between the variants and the subjects. I'm not very experienced with Python, but I don't think this question requires a deep understanding of either genomics or the Scikit-Allel library, so don't be scared off by that.

The code below works, but it is incredibly slow. I think creating a dataframe, or a set of lists for each subject, might be a better approach than creating a graph transaction for each element of the j loop, but any advice on how best to do this would be appreciated. I have also looked at using Numba for this, but wouldn't know where to start.

This process is creating roughly one new subject per hour per chromosome, out of ~1750 subjects and 23 chromosomes, so the current set-up will take a very long time to finish.

import pandas as pd
import csv
import math
import allel
import zarr
from py2neo import Graph, Node, Relationship, NodeMatcher

zarr_path = '.../chroms.zarr'

callset = zarr.open_group(zarr_path, mode='r')

graph = Graph(user="neo4j", password="password")
matcher = NodeMatcher(graph)  # matcher is used below, so it must be created here

chrom_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 'X']

for chrom in chrom_list:

    samples = callset[chrom]['samples']  # moved inside the loop, where `chrom` is defined
    variants = allel.VariantChunkedTable(callset[chrom]['variants'], names=['AC','AF_AFR', 'AF_AMR', 'AF_ASN', 'AF_EUR', 'AF_MAX', 'CGT', 'CLR', 'CSQ', 'DP', 'DP4', 'ESP_MAF', 'FILTER_LowQual', 'FILTER_MinHWE', 'FILTER_MinVQSLOD', 'FILTER_PASS', 'HWE', 'ICF', 'ID', 'IS', 'PC2', 'PCHI2', 'POS', 'PR', 'QCHI2', 'QUAL', 'REF', 'ALT', 'INDEL', 'SHAPEIT', 'SNP_ID', 'TYPE', 'UGT', 'VQSLOD', 'dbSNPmismatch', 'is_snp', 'numalt'], index='POS')
    pos = variants['POS'][:]
    SNPid = variants['ID'][:]
    ref = variants['REF'][:]
    alt = variants['ALT'][:]
    dp = variants['DP'][:]
    ac = variants['AC'][:]
    vartype = variants['TYPE'][:]
    qual = variants['QUAL'][:]
    vq = variants['VQSLOD'][:]
    numalt = variants['numalt'][:]
    csq = variants['CSQ'][:]
    vcfv = 'VCFv4.1'
    refv = 'file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta'
    dpz = callset[chrom]['calldata/DP']
    psz = callset[chrom]['calldata/PS']
    plz = callset[chrom]['calldata/PL']
    gpz = callset[chrom]['calldata/GP']
    calldata = callset[chrom]['calldata']
    gt = allel.GenotypeDaskArray(calldata['GT'])
    hap = gt.to_haplotypes()
    hap1 = hap[:, ::2]
    hap2 = hap[:, 1::2]
    for i in range(len(samples)):
        subject = samples[i]
        subject_node = matcher.match("Subject", subject_id= subject)
        if subject_node.first() is None:
            continue
        seq_tech = 'Illumina HiSeq 2000 (ILLUMINA)'
        dp = dpz[:, i]
        ps = psz[:, i]
        pl = plz[:, i]
        gp = gpz[:, i]
        list_h1 = hap1[:, i].compute()
        list_h2 = hap2[:, i].compute()
        chrom_label = "Chromosome_" + str(chrom)
        for j in range(len(pos)):
            h1 = int(list_h1[j])
            h2 = int(list_h2[j])
            read_depth = int(dp[j])
            ps1 = int(ps[j])
            PL0 = int(pl[j][0])
            PL1 = int(pl[j][1])
            PL2 = int(pl[j][2])
            genotype = str(h1) + '|' + str(h2)
            GP0 = float(gp[j][0])
            GP1 = float(gp[j][1])
            GP2 = float(gp[j][2])
            k = int(pos[j])
            l = str(ref[j])
            m = str(alt[j][h1-1])
            o = str(alt[j][h2-1])

            if h1 == 0 and h2 == 0:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = l)
                r2 = Relationship(subject_node.first(), "Homozygous", a1.first(), HTA=h1, HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)

            elif h1 == 0 and h2 > 0:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = l)
                r2 = Relationship(subject_node.first(), "Heterozygous", a1.first(), HTA=h1, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)
                a2 = matcher.match(chrom_label, "Allele", pos= k, bp = o)
                r3 = Relationship(subject_node.first(), "Heterozygous", a2.first(), HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r3)

            elif h1 > 0 and h2 == 0:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = m)
                r2 = Relationship(subject_node.first(), "Heterozygous", a1.first(), HTA=h1, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)
                a2 = matcher.match(chrom_label, "Allele", pos= k, bp = l)
                r3 = Relationship(subject_node.first(), "Heterozygous", a2.first(), HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r3)

            elif h1 == h2 and h1 > 0:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = m)
                r2 = Relationship(subject_node.first(), "Homozygous", a1.first(), HTA = h1, HTB = h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)

            else:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = m)
                r2 = Relationship(subject_node.first(), "Heterozygous", a1.first(), HTA=h1, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)

                a2 = matcher.match(chrom_label, "Allele", pos= k, bp = o)
                r3 = Relationship(subject_node.first(), "Heterozygous", a2.first(), HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r3)
        print("Subject " + subject + " completed.")
    print(chrom_label + " completed.")

Any help is greatly appreciated!

Answer

Creating a lot of elements individually will always be slow, largely because of the number of network round trips required. You're also carrying out a match between each creation, which further increases the time.

The best approach to this kind of problem is batching, for both reads and writes. While you can't do everything in one go, grouping your operations into batches of at least a few hundred at a time will have a significant effect. In your case, you will probably want to alternate: a bulk read, then a bulk write, and so on.
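The write side of this can be sketched as follows. This is a minimal illustration, not your exact schema: the `batched` helper, the `rows` dicts, and the relationship properties shown here are assumptions modelled on your code, and only `graph.run` is py2neo's actual API. The idea is to collect one plain dict per relationship inside the loops, then send each chunk to Neo4j in a single parameterised `UNWIND` query instead of one `graph.create` per relationship:

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# One plain dict per relationship to create; in the real script these
# would be built inside the j loop instead of calling graph.create().
rows = [{"subject_id": "HG%05d" % i, "pos": 10_000 + i, "gt": "0|1"}
        for i in range(2500)]

# Parameterised Cypher: one round trip creates a whole batch.
cypher = """
UNWIND $rows AS row
MATCH (s:Subject {subject_id: row.subject_id})
MATCH (a:Allele {pos: row.pos})
CREATE (s)-[:Heterozygous {GT: row.gt}]->(a)
"""

batches = list(batched(rows, 1000))
# With a live connection you would then run, once per batch:
#     graph.run(cypher, rows=batch)
```

One round trip per thousand relationships, rather than per relationship, is where most of the speed-up comes from.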

So, specifically: carry out a match against multiple entities at once (you might be able to use an `IN` clause for this, or you may need to drop into raw Cypher). For writes, build up a subgraph locally with the relevant nodes and relationships and create it in a single call.
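The bulk-read side could look something like this. Again a sketch under assumptions: the labels and property names follow your code, `graph` is assumed to be an open py2neo connection, and the `allele_lookup_query` helper is hypothetical. One `IN` query fetches a whole batch of alleles, which you then resolve from a local dict instead of calling `matcher.match()` once per position:

```python
def allele_lookup_query(chrom_label):
    """Cypher that fetches every Allele at any of $positions in one round trip."""
    return (
        f"MATCH (a:{chrom_label}:Allele) "
        "WHERE a.pos IN $positions "
        "RETURN id(a) AS node_id, a.pos AS pos, a.bp AS bp"
    )

query = allele_lookup_query("Chromosome_1")

# With a live graph, build a local index once per batch, then look up
# each (pos, bp) pair in memory inside the j loop:
#     index = {(r["pos"], r["bp"]): r["node_id"]
#              for r in graph.run(query, positions=batch_positions)}
```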

Your optimal batch size will only be found through experimentation, so you probably won't get it right the first time. But batching is definitely the key here.
