How to split a sequence file in Spark


Question


I'm new to Spark and am trying to read a sequence file and use it in a classification problem. Here is how I read the sequence file:

  val tfidf = sc.sequenceFile("/user/hadoop/strainingtesting/tfidf-vectors", classOf[Text], classOf[VectorWritable])


I don't know how to split each line of the sequence file by tab, i.e. how do I get the Text value?


How can I use it with the NaiveBayes classifier in MLlib?

Recommended Answer


sc.sequenceFile returns an RDD of key/value tuples (Tuple2 objects in Scala). So if you just want an RDD of the text keys, you can map over it and pick out the key from each record:

val text = tfidf.map(_._1)
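One caveat worth knowing: Hadoop's record reader reuses the same Writable instance for every record, so if you cache or collect the RDD you should first copy each key into an immutable value (e.g. with toString). A minimal sketch, assuming a SparkContext named sc and the path from the question:

```scala
import org.apache.hadoop.io.Text
import org.apache.mahout.math.VectorWritable

val tfidf = sc.sequenceFile("/user/hadoop/strainingtesting/tfidf-vectors",
                            classOf[Text], classOf[VectorWritable])

// Copy each reused Text object into a plain String before caching/collecting.
val text = tfidf.map { case (key, _) => key.toString }
```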


NaiveBayes expects an RDD of labeled vectors as input. Since there is no trivial way to convert your VectorWritable objects, you could use Mahout's vectordump utility to convert your sequence files into text files, and then read those into Spark:

mahout vectordump \
-i /user/hadoop/strainingtesting/tfidf-vectors \
-d /user/hadoop/strainingtesting/dictionary.file-\* \
-dt sequencefile -c csv -p true \
-o /user/hadoop/strainingtesting/tf-vectors.txt


Now read the text files into Spark using sc.textFile and perform the necessary transformations.
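As a sketch of those transformations: assuming each line of the dumped file is comma-separated with the label in the first field and the tf-idf feature values in the rest (the exact layout depends on your vectordump flags and data, so treat this as illustrative), you could build LabeledPoints and train the MLlib NaiveBayes model like so:

```scala
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.textFile("/user/hadoop/strainingtesting/tf-vectors.txt")

// Hypothetical CSV layout: label,feature1,feature2,...
val points = data.map { line =>
  val fields = line.split(',')
  LabeledPoint(fields.head.toDouble,
               Vectors.dense(fields.tail.map(_.toDouble)))
}

val model = NaiveBayes.train(points, lambda = 1.0)
```

Note that NaiveBayes in MLlib requires non-negative feature values, which tf-idf vectors satisfy.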

