Extract sample data from VCF files


Question

I have a large Variant Call Format (VCF) file (> 4 GB) that contains data for several samples.

I have searched Google and Stack Overflow, and tried the VariantAnnotation package in R, looking for a way to extract the data for one particular sample only, but have not found any information on how to do that in R.

Has anybody tried something like this, or does anyone know of another package that would enable it?

Recommended answer

In VariantAnnotation, use a ScanVcfParam to specify the data that you'd like to extract. Using the sample VCF file included with the package

library(VariantAnnotation)
vcfFile = system.file(package="VariantAnnotation", "extdata", "chr22.vcf.gz")

discover information about the file

scanVcfHeader(vcfFile)
## class: VCFHeader 
## samples(5): HG00096 HG00097 HG00099 HG00100 HG00101
## meta(1): fileformat
## fixed(0):
## info(22): LDAF AVGPOST ... VT SNPSOURCE
## geno(3): GT DS GL
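
The header can also be queried programmatically, which helps to confirm that the samples and fields you plan to request actually exist in the file. The following is a minimal sketch using the header accessors samples(), info() and geno(); the variable names hdr and wanted are only illustrative.

```r
library(VariantAnnotation)

# Example file shipped with the package (same file as above)
vcfFile <- system.file(package="VariantAnnotation", "extdata", "chr22.vcf.gz")
hdr <- scanVcfHeader(vcfFile)

samples(hdr)          # character vector of sample names
rownames(info(hdr))   # IDs of the INFO fields
rownames(geno(hdr))   # IDs of the genotype (FORMAT) fields

# Illustrative check: stop early if a requested sample is not in the file
wanted <- c("HG00097", "HG00101")
stopifnot(all(wanted %in% samples(hdr)))
```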

Formulate a request for the "LDAF" and "AVGPOST" info fields and the "GT" genotype field, for samples "HG00097" and "HG00101", restricted to variants on chromosome 22 between coordinates 50300000 and 50400000

param = ScanVcfParam(
    info=c("LDAF", "AVGPOST"),
    geno="GT",
    samples=c("HG00097", "HG00101"),
    which=GRanges("22", IRanges(50300000, 50400000)))

read the requested data

vcf = readVcf(vcfFile, "hg19", param=param)

and extract the relevant data from the resulting VCF object

head(geno(vcf)[["GT"]])
##             HG00097 HG00101
## rs7410291   "0|0"   "0|0"  
## rs147922003 "0|0"   "0|0"  
## rs114143073 "0|0"   "0|0"  
## rs141778433 "0|0"   "0|0"  
## rs182170314 "0|0"   "0|0"  
## rs115145310 "0|0"   "0|0"  
head(info(vcf)[["LDAF"]])
## [1] 0.3431 0.0091 0.0098 0.0062 0.0041 0.0117
ranges(vcf)
## IRanges of length 1169
##           start      end width             names
## [1]    50300078 50300078     1         rs7410291
## [2]    50300086 50300086     1       rs147922003
## [3]    50300101 50300101     1       rs114143073
## [4]    50300113 50300113     1       rs141778433
## [5]    50300166 50300166     1       rs182170314
## ...         ...      ...   ...               ...
## [1165] 50364310 50364312     3 22:50364310_GCA/G
## [1166] 50364311 50364313     3 22:50364311_CAT/C
## [1167] 50364464 50364464     1       rs150069372
## [1168] 50364465 50364465     1       rs146661152
## [1169] 50364609 50364609     1       rs184235324
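
If a flat table is easier to work with, the pieces extracted above can be combined into an ordinary data.frame with one row per variant. This is only a sketch under the assumption that each requested info field holds a single value per variant, as "LDAF" does here; the names gt, ids and tbl are illustrative.

```r
library(VariantAnnotation)

vcfFile <- system.file(package="VariantAnnotation", "extdata", "chr22.vcf.gz")
param <- ScanVcfParam(
    info=c("LDAF", "AVGPOST"),
    geno="GT",
    samples=c("HG00097", "HG00101"),
    which=GRanges("22", IRanges(50300000, 50400000)))
vcf <- readVcf(vcfFile, "hg19", param=param)

# Genotype calls: one row per variant, one column per requested sample
gt <- geno(vcf)[["GT"]]
ids <- rownames(gt)      # variant identifiers
rownames(gt) <- NULL     # keep identifiers in a column instead of row names

tbl <- data.frame(
    id=ids,
    pos=start(rowRanges(vcf)),   # start coordinate of each variant
    LDAF=info(vcf)[["LDAF"]],
    gt,                          # adds one column per sample
    check.names=FALSE, stringsAsFactors=FALSE)
head(tbl)
```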

Maybe you're only interested in a single genotype element (for example "DS", one of the geno fields listed in the header above) as a simple R matrix. In that case, just specify the samples and / or ranges you're interested in and use readGeno (or readGT or readInfo for similar specialized queries).
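
As a rough sketch of that shortcut, the call below asks readGeno for the "DS" field of the same two samples and the same region. "DS" is chosen only because it appears in the header of the example file; the variable name ds is illustrative.

```r
library(VariantAnnotation)

vcfFile <- system.file(package="VariantAnnotation", "extdata", "chr22.vcf.gz")

# Restrict by sample and genomic range only; readGeno selects the field itself
param <- ScanVcfParam(
    samples=c("HG00097", "HG00101"),
    which=GRanges("22", IRanges(50300000, 50400000)))

# Returns a plain matrix: variants in rows, samples in columns
ds <- readGeno(vcfFile, "DS", param=param)
dim(ds)
head(ds)
```

readGT(vcfFile, param=param) and readInfo(vcfFile, "LDAF", param=param) follow the same pattern for genotype calls and a single info field.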

There is extensive documentation in the VariantAnnotation vignettes and reference manual; see also ?ScanVcfParam and example(ScanVcfParam).
