How to change attribute type to String (WEKA - CSV to ARFF)

Question

I'm trying to make an SMS SPAM classifier using the WEKA library. I have a CSV file with "label" and "text" headings. When I use the code below, it creates an ARFF file with two attributes:

@attribute label {ham,spam}
@attribute text {'Go until jurong point','Ok lar...', etc.}

Currently, it seems that the text attribute is formatted as a nominal attribute with each message's text as a value. But I need the text attribute to be a String attribute, not a list of all of the text from all instances. Having the text attribute as a String will allow me to use the StringToWordVector filter for training a classifier.

// load CSV
CSVLoader loader = new CSVLoader();
loader.setSource(new File(args[0]));
Instances data = loader.getDataSet();

// save ARFF
ArffSaver saver = new ArffSaver();
saver.setInstances(data);
saver.setFile(new File(args[1]));
saver.setDestination(new File(args[1]));
saver.writeBatch();

I know I can create a String attribute like this:

Attribute tmp = new Attribute("tmp", (FastVector) null);

But I don't know how to replace the current attribute, or set the attribute type before reading in the CSV.

I tried inserting a new String attribute and deleting the current nominal attribute, but this deletes all of the SMS text along with it. I also tried using renameAttributeValue, but this doesn't seem to work for changing the attribute type.

EDIT:
I suspect that this NominalToString filter will do the job, but I'm not sure how to use it.

Any suggestions would be much appreciated. Thanks!

Recommended answer

This did the trick. It changed the text attribute type, but not the label attribute type (though I'm not sure why it did one but not the other).

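import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToString;
// convert nominal to string; no attribute range is set here, so the filter's default range applies
NominalToString filter1 = new NominalToString();
filter1.setInputFormat(data);
data = Filter.useFilter(data, filter1);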
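
As for why only one attribute changed: as far as I can tell, NominalToString converts only the attribute range it is given, and that range defaults to "last", which here happens to be the text column. The sketch below makes the range explicit instead of relying on the default. It assumes Weka 3.7 or later (List-based Attribute constructor, DenseInstance), and the class name, helper names and the tiny in-memory dataset are made up for illustration; they stand in for what CSVLoader produced.

```java
import java.util.ArrayList;
import java.util.Arrays;

import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToString;

public class NominalToStringSketch {

    // Hypothetical stand-in for the CSVLoader output: both columns are nominal.
    static Instances buildNominalData() {
        Attribute label = new Attribute("label",
                new ArrayList<>(Arrays.asList("ham", "spam")));
        Attribute text = new Attribute("text",
                new ArrayList<>(Arrays.asList("Go until jurong point", "Ok lar...")));
        ArrayList<Attribute> attrs = new ArrayList<>(Arrays.asList(label, text));

        Instances data = new Instances("sms", attrs, 2);
        // Nominal values are stored as indices into each attribute's value list.
        data.add(new DenseInstance(1.0, new double[] {0, 0}));
        data.add(new DenseInstance(1.0, new double[] {0, 1}));
        return data;
    }

    // Converts the attributes in the given 1-based range from nominal to string.
    static Instances convert(Instances data, String range) throws Exception {
        NominalToString filter = new NominalToString();
        filter.setAttributeIndexes(range); // set options before setInputFormat
        filter.setInputFormat(data);
        return Filter.useFilter(data, filter);
    }

    public static void main(String[] args) throws Exception {
        // "2" selects the text column only; the label column stays nominal.
        Instances converted = convert(buildNominalData(), "2");
        System.out.println(converted);
    }
}
```

Passing a range such as "first-last" would convert the label column as well, which is usually not what you want for the class attribute.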

There is a short note about this here:

By default, non-numerical attributes get imported as NOMINAL attributes, which is not necessarily desired for textual data, especially if one wants to use the StringToWordVector filter. In order to change the attribute to STRING, one can run the NominalToString filter (package weka.filters.unsupervised.attribute) on the data, specifying the attribute index or range of indices that should be converted (NB: this filter does not exclude the class attribute from conversion!).
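
An alternative that avoids the extra filter step is to tell CSVLoader at load time which columns should be strings. This is only a sketch: it assumes a Weka version whose CSVLoader offers setStringAttributes (recent releases do, as far as I know), and the class name, the helper and the temporary two-row CSV are invented for the example.

```java
import java.io.File;
import java.nio.file.Files;
import java.util.Arrays;

import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class CsvStringColumnSketch {

    // Loads a CSV and forces the given 1-based column range to type STRING.
    static Instances load(File csv, String stringColumns) throws Exception {
        CSVLoader loader = new CSVLoader();
        loader.setSource(csv);
        loader.setStringAttributes(stringColumns);
        return loader.getDataSet();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sample data with the same headings as in the question.
        File csv = File.createTempFile("sms", ".csv");
        csv.deleteOnExit();
        Files.write(csv.toPath(), Arrays.asList(
                "label,text",
                "ham,Go until jurong point",
                "spam,Free entry now"));

        // Column 2 is the text column; column 1 keeps the default type.
        Instances data = load(csv, "2");
        System.out.println(data);
    }
}
```

The resulting Instances can then be written out with ArffSaver exactly as in the question, and the text attribute should appear as a string attribute in the ARFF header.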
