Information Gain Calculation for a Text File?

Question

I'm working on text categorization using information gain, PCA, and a genetic algorithm. After preprocessing the documents (stemming, stopword removal, TFIDF), I'm confused about how to move ahead with the information gain part.

My output file contains each word and its TFIDF value:

WORD - TFIDF VALUE
together - 0.235
come - 0.2548

When using Weka for information gain (InfoGainAttributeEval.java), it requires the .arff file format as input.

Is there any way to convert a text file into .arff format, or any other way to perform information gain besides Weka?

Is there any other open-source tool for calculating information gain for a document?

Recommended answer

I found my answer: we have to generate an .arff file.

In the .arff file:

The header section (the @RELATION line followed by the @ATTRIBUTE declarations) declares one attribute for every word present in your whole document collection after preprocessing. Each word is of type real because a tfidf value is a real number.

The @data section contains the tfidf values calculated during preprocessing, one row per document. For example, the first row contains the tfidf value of every declared word for the first document (0.0 if the word does not occur in it), and the last column holds the document category.

@RELATION filename
@ATTRIBUTE word1 real
@ATTRIBUTE word2 real
@ATTRIBUTE word3 real
% ... and so on, one @ATTRIBUTE line per word
@ATTRIBUTE class {cacm,cisi,cran,med}

@data
0.5545479562,0.27,0.554544479562,0.4479562,cacm
0.5545479562,0.27,0.554544479562,0.4479562,cacm
0.55454479562,0.1619617,0.579562,0.5542,cisi
0.5545479562,0.27,0.554544479562,0.4479562,cisi
0.0,0.2396113617,0.44479562,0.2,cran
0.5545479562,0.27,0.554544479562,0.4479562,cran
0.5545177444479562,0.26196113617,0.0,0.0,med
0.5545479562,0.27,0.554544479562,0.4479562,med
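
A file with this layout can be produced by a short script. The following is only a minimal sketch in Python, not part of the original answer: the function name `build_arff`, the per-document dictionaries, and the sample values are all hypothetical, and it assumes every word is a plain alphanumeric token (anything containing spaces or special characters would need quoting in ARFF) and that no word is literally named `class`.

```python
def build_arff(relation, docs, labels, classes):
    """Return ARFF text for a list of documents.

    docs    -- one dict per document, mapping word -> tfidf value (assumed input shape)
    labels  -- the category of each document, in the same order as docs
    classes -- every allowed category, used for the nominal class attribute
    """
    # The vocabulary is every word seen in any document, in a stable order.
    vocab = sorted({word for doc in docs for word in doc})

    lines = [f"@RELATION {relation}", ""]
    lines += [f"@ATTRIBUTE {word} real" for word in vocab]
    lines.append("@ATTRIBUTE class {" + ",".join(classes) + "}")
    lines += ["", "@DATA"]

    # One row per document: a value for every word (0.0 if absent), then the class.
    for doc, label in zip(docs, labels):
        values = [str(doc.get(word, 0.0)) for word in vocab]
        lines.append(",".join(values + [label]))

    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample_docs = [{"together": 0.235, "come": 0.2548}, {"come": 0.1}]
    print(build_arff("demo", sample_docs, ["cacm", "cisi"], ["cacm", "cisi", "cran", "med"]))
```

Writing the returned string to a file ending in .arff gives something Weka should be able to load, provided the labels all appear in the declared class list.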

After you generate this file, you can give it as input to InfoGainAttributeEval.java. This worked for me.
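
As for computing information gain without Weka, the quantity itself is simple to calculate by hand. The sketch below is an illustration under stated assumptions, not Weka's method: it turns each word into a binary feature (tfidf above a hypothetical `threshold`, default 0.0, meaning "word present") and measures how much knowing that feature reduces the entropy of the class. Weka discretizes numeric attributes in its own way, so its scores may differ from these.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold=0.0):
    """Information gain of the binary feature `value > threshold`.

    values -- the tfidf value of one word in each document
    labels -- the category of each document, in the same order
    """
    # Split the class labels by whether the word counts as present.
    present = [y for v, y in zip(values, labels) if v > threshold]
    absent = [y for v, y in zip(values, labels) if v <= threshold]

    # Weighted entropy of the class after the split (empty parts contribute nothing).
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (present, absent) if part)
    return entropy(labels) - remainder
```

Calling `information_gain` once per word column and sorting the words by score gives a ranking that can be used to keep only the most informative terms.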
