Spark MLLib TFIDF implementation for LogisticRegression


Question

I am trying to use the new TF-IDF algorithm that Spark 1.1.0 offers. I'm writing my MLlib job in Java, but I can't figure out how to get the TF-IDF implementation working. For some reason, IDFModel only accepts a JavaRDD as input for the method transform, and not a simple Vector. How can I use the given classes to model a TF-IDF vector for my LabeledPoints?

Note: The document lines are in the format [Label; Text]


Here is my code so far:

        // 1.) Load the documents
        JavaRDD<String> data = sc.textFile("/home/johnny/data.data.new"); 

        // 2.) Hash all documents
        HashingTF tf = new HashingTF();
        JavaRDD<Tuple2<Double, Vector>> tupleData = data.map(new Function<String, Tuple2<Double, Vector>>() {
            @Override
            public Tuple2<Double, Vector> call(String v1) throws Exception {
                String[] data = v1.split(";");
                List<String> myList = Arrays.asList(data[1].split(" "));
                return new Tuple2<Double, Vector>(Double.parseDouble(data[0]), tf.transform(myList));
            }
        });

        tupleData.cache();

        // 3.) Create a flat RDD with all vectors
        JavaRDD<Vector> hashedData = tupleData.map(new Function<Tuple2<Double,Vector>, Vector>() {
            @Override
            public Vector call(Tuple2<Double, Vector> v1) throws Exception {
                return v1._2;
            }
        });

        // 4.) Create an IDFModel out of our flat vector RDD
        IDFModel idfModel = new IDF().fit(hashedData);

        // 5.) Create LabeledPoint RDD with TF-IDF
        ???

Solution from Sean Owen:

        // 1.) Load the documents
        JavaRDD<String> data = sc.textFile("/home/johnny/data.data.new"); 

        // 2.) Hash all documents
        HashingTF tf = new HashingTF();
        JavaRDD<LabeledPoint> tupleData = data.map(v1 -> {
                String[] datas = v1.split(";");
                List<String> myList = Arrays.asList(datas[1].split(" "));
                return new LabeledPoint(Double.parseDouble(datas[0]), tf.transform(myList));
        }); 
        // 3.) Create a flat RDD with all vectors
        JavaRDD<Vector> hashedData = tupleData.map(point -> point.features());
        // 4.) Create an IDFModel out of our flat vector RDD
        IDFModel idfModel = new IDF().fit(hashedData);
        // 5.) Create TF-IDF RDD
        JavaRDD<Vector> idf = idfModel.transform(hashedData);
        // 6.) Create LabeledPoint RDD by re-pairing each TF-IDF vector with its label
        JavaRDD<LabeledPoint> idfTransformed = idf.zip(tupleData).map(t -> {
            return new LabeledPoint(t._2.label(), t._1);
        });

Answer

IDFModel.transform() accepts a JavaRDD or RDD of Vector, as you see. It does not make sense to compute a model over a single Vector, so that's not what you're looking for, right?
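
A small sketch of why a single Vector is not enough to fit an IDF model. It assumes the smoothed formula idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t; this is the definition given in the MLlib feature-extraction guide, but treat it as an assumption here. The class and method names are hypothetical and no Spark is involved.

```java
class SingleDocIdf {
    // Smoothed IDF, log((m + 1) / (df + 1)); assumed to match MLlib's definition.
    static double idf(long numDocs, long docFreq) {
        return Math.log((numDocs + 1.0) / (docFreq + 1.0));
    }

    public static void main(String[] args) {
        // Corpus of one document: every term it contains has docFreq == numDocs == 1,
        // so the weight is log(2 / 2) = 0 and the TF-IDF value vanishes.
        System.out.println(idf(1, 1));
        // Corpus of four documents, term present in only one: a positive weight.
        System.out.println(idf(4, 1));
    }
}
```

With one document, every term that occurs is weighted 0 and every other term has a term frequency of 0, so the resulting TF-IDF vector carries no information. The weights only become meaningful when they are fitted over a whole corpus.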

I assume you're working in Java, so you mean you want to apply this to a JavaRDD<LabeledPoint>. LabeledPoint contains a Vector and a label. IDF is not a classifier or regressor, so it needs no label. You can map a bunch of LabeledPoint to just extract their Vector.
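
The pattern described here (drop the labels, transform the features, then put the labels back by position) can be sketched without Spark. The `Labeled` class below is a hypothetical stand-in for LabeledPoint, and the per-dimension `weights` array stands in for whatever a fitted IDFModel would apply; neither is MLlib's actual API.

```java
import java.util.ArrayList;
import java.util.List;

class LabelReattachSketch {
    // Hypothetical stand-in for MLlib's LabeledPoint.
    static final class Labeled {
        final double label;
        final double[] features;

        Labeled(double label, double[] features) {
            this.label = label;
            this.features = features;
        }
    }

    // Reweights features while leaving labels untouched.
    static List<Labeled> reweight(List<Labeled> points, double[] weights) {
        // 1) Drop the labels: the feature transform never needs them.
        List<double[]> features = new ArrayList<>();
        for (Labeled p : points) {
            features.add(p.features);
        }

        // 2) Transform the features only (stand-in for IDFModel.transform).
        List<double[]> transformed = new ArrayList<>();
        for (double[] f : features) {
            double[] out = new double[f.length];
            for (int j = 0; j < f.length; j++) {
                out[j] = f[j] * weights[j];
            }
            transformed.add(out);
        }

        // 3) Re-pair each transformed vector with its label by position (like zip).
        List<Labeled> result = new ArrayList<>();
        for (int i = 0; i < points.size(); i++) {
            result.add(new Labeled(points.get(i).label, transformed.get(i)));
        }
        return result;
    }
}
```

In Spark the positional re-pairing is done with zip, which requires both RDDs to have the same number of partitions and the same number of elements in each partition. That holds in the solution in the question because the TF-IDF RDD is derived from the labeled RDD through element-wise transformations only.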

But you already have a JavaRDD<Vector> above. TF-IDF is merely a way of mapping words to real-valued features based on word frequencies in the corpus. It also does not output a label. Maybe you mean you want to develop a classifier from TF-IDF-derived feature vectors, and some other labels you already have?
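
To make the "words to real-valued features" mapping concrete, here is a plain-Java sketch of the whole idea: hash words into a fixed number of buckets, count them, fit IDF weights over the corpus, and multiply. It is an illustration under stated assumptions, not MLlib's implementation: the bucket function (`String.hashCode` modulo the feature count), the smoothed IDF formula log((m + 1) / (df + 1)), and all names are chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TfIdfSketch {
    // Hashing trick: map each word to a bucket and count occurrences per bucket.
    static double[] hashingTf(List<String> words, int numFeatures) {
        double[] tf = new double[numFeatures];
        for (String w : words) {
            tf[Math.floorMod(w.hashCode(), numFeatures)] += 1.0;
        }
        return tf;
    }

    // Fits one IDF weight per bucket over the whole corpus of TF vectors.
    static double[] fitIdf(List<double[]> corpus, int numFeatures) {
        double[] idf = new double[numFeatures];
        for (int j = 0; j < numFeatures; j++) {
            int docFreq = 0;
            for (double[] doc : corpus) {
                if (doc[j] > 0) {
                    docFreq++;
                }
            }
            // Assumed smoothed formula: log((m + 1) / (df + 1)).
            idf[j] = Math.log((corpus.size() + 1.0) / (docFreq + 1.0));
        }
        return idf;
    }

    // Scales a TF vector by the fitted IDF weights.
    static double[] transform(double[] tf, double[] idf) {
        double[] out = new double[tf.length];
        for (int j = 0; j < tf.length; j++) {
            out[j] = tf[j] * idf[j];
        }
        return out;
    }

    public static void main(String[] args) {
        int numFeatures = 16;
        List<double[]> corpus = new ArrayList<>();
        corpus.add(hashingTf(Arrays.asList("a", "a", "b"), numFeatures));
        corpus.add(hashingTf(Arrays.asList("a", "c"), numFeatures));

        double[] idf = fitIdf(corpus, numFeatures);
        for (double[] tf : corpus) {
            System.out.println(Arrays.toString(transform(tf, idf)));
        }
    }
}
```

Nothing in this computation reads or produces a label. A term that appears in every document ("a" above) ends up with weight 0, while rarer terms keep a positive weight, and the labels have to be carried alongside the vectors separately.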

Maybe that clears things up, but otherwise you'd have to greatly clarify what you are trying to achieve with TF-IDF.
