How to work with Java Apache Spark MLlib when DataFrame has columns?


Problem Description

So I'm new to Apache Spark and I have a file that looks like this:

Name     Size    Records 
File1    1,000   104,370 
File2    950     91,780 
File3    1,500   109,123 
File4    2,170   113,888
File5    2,000   111,974
File6    1,820   110,666
File7    1,200   106,771 
File8    1,500   108,991 
File9    1,000   104,007
File10   1,300   107,037
File11   1,900   111,109
File12   1,430   108,051
File13   1,780   110,006
File14   2,010   114,449
File15   2,017   114,889

This is my sample/test data. I'm working on an anomaly detection program, and I have to test other files with the same format but different values and detect which ones have anomalies in their size and records values (if the size/records in another file differ a lot from the standard one, or if size and records are not proportional to each other). I decided to start trying different ML algorithms, and I wanted to begin with the k-means approach. I tried putting this file in the following line:

KMeansModel model = kmeans.fit(file)

file has already been parsed into a Dataset variable. However, I get an error, and I'm pretty sure it has to do with the structure/schema of the file. Is there a way to work with structured/labeled/organized data when trying to fit it to a model?

I get the following error: Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.

Here is the code:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class practice {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Anomaly Detection").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        SparkSession spark = SparkSession
                .builder()
                .appName("Anomaly Detection")
                .getOrCreate();

        String day1 = "C:\\Users\\ZK0GJXO\\Documents\\day1.txt";

        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("delimiter", "\t")
                .csv(day1);
        df.show();

        KMeans kmeans = new KMeans().setK(2).setSeed(1L);
        KMeansModel model = kmeans.fit(df);
    }
}

Thanks

Recommended Answer

By default, all Spark ML models train on a column called "features". One can specify a different input column name via the setFeaturesCol method: http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/clustering/KMeans.html#setFeaturesCol(java.lang.String)
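
For example, pointing KMeans at a differently named vector column could look like the sketch below (the column name "fileStats" is hypothetical, just for illustration):

import org.apache.spark.ml.clustering.KMeans;

KMeans kmeans = new KMeans()
        .setK(2)
        .setSeed(1L)
        .setFeaturesCol("fileStats"); // hypothetical column holding the assembled feature vectors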

Update:

One can combine multiple columns into a single feature vector using VectorAssembler:

VectorAssembler assembler = new VectorAssembler()
        .setInputCols(new String[]{"size", "records"})
        .setOutputCol("features");

Dataset<Row> vectorized_df = assembler.transform(df);

KMeans kmeans = new KMeans().setK(2).setSeed(1L);
KMeansModel model = kmeans.fit(vectorized_df);
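
One caveat for this particular sample: spark.read().csv() parses every column as a string, and the values above contain thousands separators such as "1,000", which VectorAssembler cannot consume as-is. A minimal sketch of stripping the separators and casting to doubles before assembling (column names taken from the sample header):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// Strip the "," thousands separators and cast the string columns to doubles
// so that VectorAssembler receives numeric input.
Dataset<Row> numeric_df = df
        .withColumn("Size", functions.regexp_replace(df.col("Size"), ",", "").cast("double"))
        .withColumn("Records", functions.regexp_replace(df.col("Records"), ",", "").cast("double"));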

One can further streamline and chain these feature transformations with the Pipeline API: https://spark.apache.org/docs/latest/ml-pipeline.html#example-pipeline
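
A minimal sketch of chaining the assembler and the k-means estimator from above into a single Pipeline (reusing the assembler and kmeans variables defined earlier):

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;

// fit() runs each stage in order: assemble the feature vector, then train k-means.
Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[]{assembler, kmeans});
PipelineModel pipelineModel = pipeline.fit(df);

// The fitted KMeansModel is the last stage of the fitted pipeline.
KMeansModel model = (KMeansModel) pipelineModel.stages()[1];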

