Spark MLlib TF-IDF implementation for LogisticRegression

Problem description

I am trying to use the new TF-IDF algorithm that Spark 1.1.0 offers. I'm writing my MLlib job in Java, but I can't figure out how to get the TF-IDF implementation working. For some reason IDFModel.transform only accepts a JavaRDD as input, not a single Vector. How can I use the given classes to build TF-IDF vectors for my LabeledPoints?

Note: each document line has the format [label;text].

This is my code so far:

        // 1.) Load the documents
        JavaRDD<String> data = sc.textFile("/home/johnny/data.data.new"); 

        // 2.) Hash all documents
        HashingTF tf = new HashingTF();
        JavaRDD<Tuple2<Double, Vector>> tupleData = data.map(new Function<String, Tuple2<Double, Vector>>() {
            @Override
            public Tuple2<Double, Vector> call(String v1) throws Exception {
                String[] data = v1.split(";");
                List<String> myList = Arrays.asList(data[1].split(" "));
                return new Tuple2<Double, Vector>(Double.parseDouble(data[0]), tf.transform(myList));
            }
        });

        tupleData.cache();

        // 3.) Create a flat RDD with all vectors
        JavaRDD<Vector> hashedData = tupleData.map(new Function<Tuple2<Double,Vector>, Vector>() {
            @Override
            public Vector call(Tuple2<Double, Vector> v1) throws Exception {
                return v1._2;
            }
        });

        // 4.) Create an IDFModel out of our flat vector RDD
        IDFModel idfModel = new IDF().fit(hashedData);

        // 5.) Create LabeledPoint RDD with TF-IDF
        ???

Solution, based on Sean Owen's answer:

        // 1.) Load the documents; each line has the form "label;text"
        JavaRDD<String> data = sc.textFile("/home/johnny/data.data.new");

        // 2.) Hash all documents into term-frequency vectors, keeping the label
        HashingTF tf = new HashingTF();
        JavaRDD<LabeledPoint> tupleData = data.map(v1 -> {
            String[] datas = v1.split(";");
            List<String> myList = Arrays.asList(datas[1].split(" "));
            return new LabeledPoint(Double.parseDouble(datas[0]), tf.transform(myList));
        });

        // 3.) Create a flat RDD with all vectors
        JavaRDD<Vector> hashedData = tupleData.map(point -> point.features());

        // 4.) Create an IDFModel out of our flat vector RDD
        IDFModel idfModel = new IDF().fit(hashedData);

        // 5.) Create the TF-IDF RDD
        JavaRDD<Vector> idf = idfModel.transform(hashedData);

        // 6.) Create the LabeledPoint RDD by zipping the TF-IDF vectors back onto their labels
        JavaRDD<LabeledPoint> idfTransformed = idf.zip(tupleData).map(t -> {
            return new LabeledPoint(t._2.label(), t._1);
        });
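
To make step 2 more concrete, here is a minimal plain-Java sketch of the hashing trick behind a call like tf.transform(myList). It is an illustration under assumptions, not MLlib's implementation: the class HashingTfSketch, its method names, the use of String.hashCode(), and the bucket count of 2^20 are all choices made for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only; this is not MLlib's HashingTF.
class HashingTfSketch {

    // Maps a term to a bucket in [0, numFeatures) via a non-negative modulus of its hash code.
    static int indexOf(String term, int numFeatures) {
        return Math.floorMod(term.hashCode(), numFeatures);
    }

    // Counts how often each bucket is hit; the map plays the role of a sparse term-frequency vector.
    static Map<Integer, Double> transform(List<String> terms, int numFeatures) {
        Map<Integer, Double> tf = new TreeMap<>();
        for (String term : terms) {
            tf.merge(indexOf(term, numFeatures), 1.0, Double::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("spark is fast spark is fun".split(" "));
        // Prints the bucket index and count for each distinct term in the document.
        System.out.println(transform(doc, 1 << 20));
    }
}
```

Because terms are identified only by their bucket index, two different words can collide in the same bucket; a larger bucket count makes that less likely at the cost of a wider vector.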

Recommended answer

IDFModel.transform() accepts a JavaRDD or RDD of Vector, as you can see. It does not make sense to compute a model over a single Vector, so that's not what you're looking for, right?
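
A small plain-Java sketch may help show why the model has to be fitted over a whole collection: inverse document frequency depends on how many documents contain each term, which a single vector cannot tell you. The class IdfSketch and its fit method are hypothetical names for this example, it works on raw terms rather than hashed indices for readability, and the smoothed formula log((m + 1) / (df + 1)) is the one described in the MLlib feature-extraction documentation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; this is not MLlib's IDF class.
class IdfSketch {

    // "Fits" an IDF table: counts the documents containing each term, then applies
    // idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents.
    static Map<String, Double> fit(List<List<String>> corpus) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : corpus) {
            // A set ensures each term is counted at most once per document.
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        int m = corpus.size();
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            idf.put(e.getKey(), Math.log((m + 1.0) / (e.getValue() + 1.0)));
        }
        return idf;
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("spark", "mllib"),
                Arrays.asList("spark"),
                Arrays.asList("spark", "java"));
        // A term that appears in every document gets weight 0; rarer terms get larger weights.
        System.out.println(fit(corpus));
    }
}
```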

I assume you're working in Java, so you mean you want to apply this to a JavaRDD<LabeledPoint>. A LabeledPoint contains a Vector and a label. IDF is not a classifier or regressor, so it needs no label. You can map over the LabeledPoints to extract just their Vectors.

But you already have a JavaRDD<Vector> above. TF-IDF is merely a way of mapping words to real-valued features based on word frequencies in the corpus. It does not output a label either. Maybe you mean that you want to develop a classifier from TF-IDF-derived feature vectors and some other labels you already have?
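
For reference, the mapping from words to real-valued features can be written as below. The symbols t (term), d (document), and D (corpus) are notation introduced here, and the smoothed IDF is the form given in the MLlib feature-extraction documentation; other libraries may use a different variant.

```latex
\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \cdot \mathrm{IDF}(t, D),
\qquad
\mathrm{IDF}(t, D) = \log \frac{|D| + 1}{\mathrm{DF}(t, D) + 1}
```

Here TF(t, d) is the number of times t occurs in d, and DF(t, D) is the number of documents in D that contain t. Neither factor involves a label, which is why the labels have to be carried alongside the vectors separately.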

Maybe that clears things up; otherwise you'd have to clarify considerably what you are trying to achieve with TF-IDF.
