Read FASTQ file into a Spark dataframe


Question

I'm trying to read FASTQ files into Spark dataframes. I'm having some difficulty because FASTQ is a multi-line format: each record spans four lines.

Example:

@seq1
AGTCAGTCGAC
+
?@@FFBFFDDH
@seq2
CCAGCGTCTCG
+
?88ADA?BDF8

Is there a way to get this data into a Spark dataframe like the following?

+-------------+-------------+------------+
| identifier  | sequence    | quality    |
+-------------+-------------+------------+
|seq1         |AGTCAGTCGAC  |?@@FFBFFDDH |
|seq2         |CCAGCGTCTCG  |?88ADA?BDF8 |
+-------------+-------------+------------+

Thanks for your time.

Recommended answer

I'd use sliding:

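import org.apache.spark.mllib.rdd.RDDFunctions._
import spark.implicits._  // tuple encoders; already in scope in spark-shell

// Group every 4 consecutive lines into one record (window size 4, step 4)
spark.createDataset(sc.textFile(path).sliding(4, 4).map {
  // Fields: identifier, sequence, "+" separator (dropped), quality
  case Array(id, seq, _, qual) => (id, seq, qual)
}).toDF("identifier", "sequence", "quality")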


// +----------+-----------+-----------+
// |identifier|   sequence|    quality|
// +----------+-----------+-----------+
// |     @seq1|AGTCAGTCGAC|?@@FFBFFDDH|
// |     @seq2|CCAGCGTCTCG|?88ADA?BDF8|
// +----------+-----------+-----------+
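
Note that the output above keeps the leading "@" in the identifier column, while the table in the question omits it. The following is a minimal sketch of the per-record parsing step in plain Scala collections (not Spark), where grouped(4) stands in for sliding(4, 4) on the RDD. It assumes every record occupies exactly four lines; FastqRecord, FastqSketch, parseRecord, and parseAll are hypothetical names introduced for illustration and are not part of the original answer or of any Spark API.

```scala
// Hypothetical record type; field names mirror the columns requested in the question.
final case class FastqRecord(identifier: String, sequence: String, quality: String)

object FastqSketch {
  // Turns one group of four lines into a record, or None if the group is malformed
  // (wrong line count, missing "@" or "+" marker, or sequence/quality length mismatch).
  def parseRecord(group: Seq[String]): Option[FastqRecord] = group match {
    case Seq(id, seq, sep, qual)
        if id.startsWith("@") && sep.startsWith("+") && seq.length == qual.length =>
      Some(FastqRecord(id.stripPrefix("@"), seq, qual))
    case _ => None
  }

  // Assumption: each record is exactly four lines, so fixed-size grouping is enough.
  def parseAll(lines: Seq[String]): List[FastqRecord] =
    lines.grouped(4).flatMap(g => parseRecord(g).toList).toList

  // The two records from the question.
  val sample: Seq[String] = Seq(
    "@seq1", "AGTCAGTCGAC", "+", "?@@FFBFFDDH",
    "@seq2", "CCAGCGTCTCG", "+", "?88ADA?BDF8"
  )

  def main(args: Array[String]): Unit =
    parseAll(sample).foreach(println)
}
```

In the Spark version, the same effect could presumably be had by calling id.stripPrefix("@") inside the existing map. Files whose sequence or quality strings wrap across several lines would break the four-line assumption and need a different approach.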
