Transcribing large datasets into Neo4j via Python (Py2neo)

Problem Description

I've spent the last few weeks trying to load a genomic dataset into Neo4j using the Scikit-Allel library. I've managed to load all variants for the exomes in a VCF file, along with all subjects and their related phenotype data; I'm now attempting to create the relationships between the variants and the subjects. I'm not very experienced with Python, but I don't think this question requires a deep understanding of either genomics or the Scikit-Allel library, so don't be scared off by that.

The code below works, but it is incredibly slow. I think creating a dataframe, or a set of lists for each subject, might be a better approach than creating a graph transaction for each element of the j loop, but any advice on how best to do this would be appreciated. I have also looked at using Numba for this, but wouldn't know where to start.

This process is creating roughly one new subject per hour per chromosome, out of ~1750 subjects and 23 chromosomes, so the current set-up will take a very long time to finish.

import pandas as pd
import csv
import math
import allel
import zarr
from py2neo import Graph, Node, Relationship, NodeMatcher

zarr_path = '.../chroms.zarr'

callset = zarr.open_group(zarr_path, mode='r')

graph = Graph(user="neo4j", password="password")
matcher = NodeMatcher(graph)  # matcher is used below, so it must be created here

chrom_list = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 'X']

for chrom in chrom_list:

    samples = callset[chrom]['samples']  # moved inside the loop, where `chrom` is defined
    variants = allel.VariantChunkedTable(callset[chrom]['variants'], names=['AC','AF_AFR', 'AF_AMR', 'AF_ASN', 'AF_EUR', 'AF_MAX', 'CGT', 'CLR', 'CSQ', 'DP', 'DP4', 'ESP_MAF', 'FILTER_LowQual', 'FILTER_MinHWE', 'FILTER_MinVQSLOD', 'FILTER_PASS', 'HWE', 'ICF', 'ID', 'IS', 'PC2', 'PCHI2', 'POS', 'PR', 'QCHI2', 'QUAL', 'REF', 'ALT', 'INDEL', 'SHAPEIT', 'SNP_ID', 'TYPE', 'UGT', 'VQSLOD', 'dbSNPmismatch', 'is_snp', 'numalt'], index='POS')
    pos = variants['POS'][:]
    SNPid = variants['ID'][:]
    ref = variants['REF'][:]
    alt = variants['ALT'][:]
    dp = variants['DP'][:]
    ac = variants['AC'][:]
    vartype = variants['TYPE'][:]
    qual = variants['QUAL'][:]
    vq = variants['VQSLOD'][:]
    numalt = variants['numalt'][:]
    csq = variants['CSQ'][:]
    vcfv = 'VCFv4.1'
    refv = 'file:///lustre/scratch105/projects/g1k/ref/main_project/human_g1k_v37.fasta'
    dpz = callset[chrom]['calldata/DP']
    psz = callset[chrom]['calldata/PS']
    plz = callset[chrom]['calldata/PL']
    gpz = callset[chrom]['calldata/GP']
    calldata = callset[chrom]['calldata']
    gt = allel.GenotypeDaskArray(calldata['GT'])
    hap = gt.to_haplotypes()
    hap1 = hap[:, ::2]
    hap2 = hap[:, 1::2]
    for i in range(len(samples)):
        subject = samples[i]
        subject_node = matcher.match("Subject", subject_id= subject)
        if subject_node.first() is None:
            continue
        seq_tech = 'Illumina HiSeq 2000 (ILLUMINA)'
        dp = dpz[:, i]
        ps = psz[:, i]
        pl = plz[:, i]
        gp = gpz[:, i]
        list_h1 = hap1[:, i].compute()
        list_h2 = hap2[:, i].compute()
        chrom_label = "Chromosome_" + str(chrom)
        for j in range(len(pos)):
            h1 = int(list_h1[j])
            h2 = int(list_h2[j])
            read_depth = int(dp[j])
            ps1 = int(ps[j])
            PL0 = int(pl[j][0])
            PL1 = int(pl[j][1])
            PL2 = int(pl[j][2])
            genotype = str(h1) + '|' + str(h2)
            GP0 = float(gp[j][0])
            GP1 = float(gp[j][1])
            GP2 = float(gp[j][2])
            k = int(pos[j])
            l = str(ref[j])
            m = str(alt[j][h1-1])
            o = str(alt[j][h2-1])

            if h1 == 0 and h2 == 0:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = l)
                r2 = Relationship(subject_node.first(), "Homozygous", a1.first(), HTA=h1, HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)

            elif h1 == 0 and h2 > 0:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = l)
                r2 = Relationship(subject_node.first(), "Heterozygous", a1.first(), HTA=h1, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)
                a2 = matcher.match(chrom_label, "Allele", pos= k, bp = o)
                r3 = Relationship(subject_node.first(), "Heterozygous", a2.first(), HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r3)

            elif h1 > 0 and h2 == 0:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = m)
                r2 = Relationship(subject_node.first(), "Heterozygous", a1.first(), HTA=h1, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)
                a2 = matcher.match(chrom_label, "Allele", pos= k, bp = l)
                r3 = Relationship(subject_node.first(), "Heterozygous", a2.first(), HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r3)

            elif h1 == h2 and h1 > 0:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = m)
                r2 = Relationship(subject_node.first(), "Homozygous", a1.first(), HTA = h1, HTB = h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)

            else:
                a1 = matcher.match(chrom_label, "Allele", pos= k, bp = m)
                r2 = Relationship(subject_node.first(), "Heterozygous", a1.first(), HTA=h1, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r2)

                a2 = matcher.match(chrom_label, "Allele", pos= k, bp = o)
                r3 = Relationship(subject_node.first(), "Heterozygous", a2.first(), HTB=h2, GT=genotype, seq_tech=seq_tech, dp=read_depth, phase_set=ps1, PL0=PL0, PL1=PL1, PL2=PL2, GP0=GP0, GP1=GP1, GP2=GP2)
                graph.create(r3)
        print("Subject " + subject + " completed.")
    print(chrom_label + " completed.")

Any help is greatly appreciated!

Answer

Creating a lot of elements individually will always be slow, largely because of the number of network round trips required. You're also carrying out a match between each creation, which further increases the time.

The best approach to this kind of problem is batching, for both reads and writes. While you can't do everything in one go, grouping your operations into batches of at least a few hundred at a time will have a significant effect. In your case, you will probably want to alternate: a bulk read, then a bulk write, and so on.
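The write side of this can be sketched as follows. This is a minimal illustration, not your exact schema: the `batched` helper, the `rows` dicts, and the relationship properties shown here are assumptions modelled on your code, and only `graph.run` is py2neo's actual API. The idea is to collect one plain dict per relationship inside the loops, then send each chunk to Neo4j in a single parameterised `UNWIND` query instead of one `graph.create` per relationship:

```python
from itertools import islice

def batched(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# One plain dict per relationship to create; in the real script these
# would be built inside the j loop instead of calling graph.create().
rows = [{"subject_id": "HG%05d" % i, "pos": 10_000 + i, "gt": "0|1"}
        for i in range(2500)]

# Parameterised Cypher: one round trip creates a whole batch.
cypher = """
UNWIND $rows AS row
MATCH (s:Subject {subject_id: row.subject_id})
MATCH (a:Allele {pos: row.pos})
CREATE (s)-[:Heterozygous {GT: row.gt}]->(a)
"""

batches = list(batched(rows, 1000))
# With a live connection you would then run, once per batch:
#     graph.run(cypher, rows=batch)
```

One round trip per thousand relationships, rather than per relationship, is where most of the speed-up comes from.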

So, specifically: carry out a match against multiple entities at once (you might be able to use an `IN` clause for this, or you may need to drop into raw Cypher). For writes, build up a subgraph locally with the relevant nodes and relationships and create it in a single call.
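The bulk-read side could look something like this. Again a sketch under assumptions: the labels and property names follow your code, `graph` is assumed to be an open py2neo connection, and the `allele_lookup_query` helper is hypothetical. One `IN` query fetches a whole batch of alleles, which you then resolve from a local dict instead of calling `matcher.match()` once per position:

```python
def allele_lookup_query(chrom_label):
    """Cypher that fetches every Allele at any of $positions in one round trip."""
    return (
        f"MATCH (a:{chrom_label}:Allele) "
        "WHERE a.pos IN $positions "
        "RETURN id(a) AS node_id, a.pos AS pos, a.bp AS bp"
    )

query = allele_lookup_query("Chromosome_1")

# With a live graph, build a local index once per batch, then look up
# each (pos, bp) pair in memory inside the j loop:
#     index = {(r["pos"], r["bp"]): r["node_id"]
#              for r in graph.run(query, positions=batch_positions)}
```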

Your optimal batch size will only be found through experimentation, so you probably won't get it right the first time. But batching is definitely the key here.
