Read FASTQ file into a Spark dataframe
Question
I'm trying to read FASTQ files into Spark dataframes. I'm having some difficulty because FASTQ is a multi-line format.
Example:
@seq1
AGTCAGTCGAC
+
?@@FFBFFDDH
@seq2
CCAGCGTCTCG
+
?88ADA?BDF8
Is there a way to get this data into a Spark dataframe like the following?
+------------+-------------+-------------+
| identifier | sequence    | quality     |
+------------+-------------+-------------+
| seq1       | AGTCAGTCGAC | ?@@FFBFFDDH |
| seq2       | CCAGCGTCTCG | ?88ADA?BDF8 |
+------------+-------------+-------------+
Thanks for your time.
Recommended answer
I would use sliding:
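import org.apache.spark.mllib.rdd.RDDFunctions._  // adds sliding to RDDs
import spark.implicits._  // encoders for createDataset (already in scope in spark-shell)

// Group every 4 consecutive lines into one record and drop the "+" separator line
spark.createDataset(sc.textFile(path).sliding(4, 4).map {
  case Array(id, seq, _, qual) => (id, seq, qual)
}).toDF("identifier", "sequence", "quality")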
// +----------+-----------+-----------+
// |identifier| sequence| quality|
// +----------+-----------+-----------+
// | @seq1|AGTCAGTCGAC|?@@FFBFFDDH|
// | @seq2|CCAGCGTCTCG|?88ADA?BDF8|
// +----------+-----------+-----------+
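
Note that the output above keeps the leading "@" on each identifier, while the table in the question shows seq1 and seq2 without it. The snippet also assumes every record is exactly four lines. Below is a minimal sketch of a per-record helper that strips the "@" and rejects windows that do not look like a FASTQ record. FastqRecord, FastqParsing and parseRecord are hypothetical names introduced for illustration, not part of Spark or of the original answer, and the checks (an "@" header, a "+" separator, equal sequence and quality lengths) are assumptions about what counts as well-formed.

```scala
// Hypothetical record type for illustration: one FASTQ entry per four input lines.
final case class FastqRecord(identifier: String, sequence: String, quality: String)

object FastqParsing {
  // Returns Some(record) for a well-formed four-line window, None otherwise.
  // Assumed checks: header starts with "@", separator starts with "+",
  // and the sequence and quality strings have the same length.
  def parseRecord(lines: Seq[String]): Option[FastqRecord] = lines match {
    case Seq(id, seq, sep, qual)
        if id.startsWith("@") && sep.startsWith("+") && seq.length == qual.length =>
      Some(FastqRecord(id.stripPrefix("@"), seq, qual))
    case _ => None
  }
}

object FastqParsingDemo extends App {
  // The two records from the question, as plain lines.
  val lines = Seq(
    "@seq1", "AGTCAGTCGAC", "+", "?@@FFBFFDDH",
    "@seq2", "CCAGCGTCTCG", "+", "?88ADA?BDF8"
  )

  // grouped(4) plays the role of sliding(4, 4) on a local collection.
  val records = lines.grouped(4).toList.flatMap(w => FastqParsing.parseRecord(w))
  records.foreach(println)
}
```

With the RDD pipeline from the answer, a helper like this should slot in as sliding(4, 4).flatMap(w => FastqParsing.parseRecord(w.toSeq)) in place of the pattern-matching map. Malformed windows would then be dropped rather than raising a MatchError, which may or may not be what you want.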