How to represent text for classification in Weka?


Question

Can you please let me know how to represent the attributes and the class for text classification in Weka? Which attributes should I use for classification: word frequencies or just the words themselves? What would a possible ARFF structure look like? Can you give me a few example lines of that structure?

Thank you very much in advance.

Recommended answer

One of the easiest approaches is to start with an ARFF file for a two-class problem, like this:


@relation corpus 

@attribute text string
@attribute class {pos,neg}

@data
'long text with words ... ',pos

The text is represented as a string attribute, and the class is a nominal attribute with two values.
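If the documents live in program memory rather than in a hand-written file, the same two-attribute layout can be generated. The following is a minimal sketch in plain Java, not Weka's own API: the class name ArffSketch and its methods are hypothetical, the labels are assumed to be exactly pos and neg, and line breaks inside a text are simply replaced by spaces so that each instance stays on one line.

```java
import java.util.List;

// Hypothetical helper (not part of Weka): writes the two-attribute ARFF layout
// with one string attribute for the text and one nominal class attribute.
class ArffSketch {

    // Wrap a text value in single quotes, escaping backslashes and single
    // quotes, and flattening line breaks into spaces.
    static String quote(String text) {
        String escaped = text
                .replace("\\", "\\\\")
                .replace("'", "\\'")
                .replace("\r", " ")
                .replace("\n", " ");
        return "'" + escaped + "'";
    }

    // Each document is a {text, label} pair; labels are assumed to be "pos" or "neg".
    static String toArff(List<String[]> docs) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation corpus\n\n");
        sb.append("@attribute text string\n");
        sb.append("@attribute class {pos,neg}\n\n");
        sb.append("@data\n");
        for (String[] doc : docs) {
            sb.append(quote(doc[0])).append(',').append(doc[1]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up example documents, for illustration only.
        List<String[]> docs = List.of(
                new String[] {"I really liked this movie", "pos"},
                new String[] {"It's a waste of time", "neg"});
        System.out.print(toArff(docs));
    }
}
```

Each line under @data holds one document and its label, so adding more training examples only means adding more rows.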

Then you can apply two filters:


  1. StringToWordVector, which transforms the texts into a word vector representation. The filter creates one attribute for each word. You can tweak its parameters to choose a binary or frequency representation, stemming, or stopword removal. The best representation depends on the problem. If the texts are not long, a binary representation is usually enough.
  2. Reorder, which moves the class attribute to the last position, where Weka assumes it is.
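To make the difference between the binary and the frequency representation concrete, here is a minimal sketch in plain Java. It is not Weka's implementation of StringToWordVector: the class WordVectorSketch, its tokenizer (lowercasing and splitting on non-alphanumeric characters), and the absence of stemming and stopword handling are all simplifying assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical illustration of a word vector representation (not Weka code).
class WordVectorSketch {

    // Lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // The vocabulary is the sorted set of all words in the corpus;
    // each word becomes one attribute.
    static List<String> vocabulary(List<String> texts) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String text : texts) {
            vocab.addAll(tokenize(text));
        }
        return new ArrayList<>(vocab);
    }

    // binary = true:  1 if the word occurs in the text, otherwise 0.
    // binary = false: number of times the word occurs in the text.
    static int[] vectorize(String text, List<String> vocab, boolean binary) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokenize(text)) {
            counts.merge(t, 1, Integer::sum);
        }
        int[] vector = new int[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            int count = counts.getOrDefault(vocab.get(i), 0);
            vector[i] = binary ? Math.min(count, 1) : count;
        }
        return vector;
    }

    public static void main(String[] args) {
        // Made-up example texts, for illustration only.
        List<String> texts = List.of("good good movie", "bad movie");
        List<String> vocab = vocabulary(texts);
        System.out.println("vocabulary: " + vocab);
        for (String text : texts) {
            System.out.println(text
                    + " | binary: " + Arrays.toString(vectorize(text, vocab, true))
                    + " | frequency: " + Arrays.toString(vectorize(text, vocab, false)));
        }
    }
}
```

With short texts most words occur at most once, so the two vectors are nearly identical, which is consistent with the advice that a binary representation is usually enough in that case.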

You can find more information and other approaches to transforming your data on this Weka wiki page: http://weka.wikispaces.com/Text+categorization+with+WEKA
