Python file format for email classification with svm-light


Problem description



I am working with email subjects: I have 20 emails I want to classify, and a file with 20 lines, one email subject per line. I have been working on it, but I am unable to figure out what the features refer to and what format the svm-light input file should have. Any tips on how to proceed would be helpful. Thanks in advance!

Edit: I have taken the tf-idf of the first 500 subject lines as a trial. However, according to svm-light format, we need:

<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>

I have only the tf-idf features for 500 lines. Sadly, svm-light cannot read this file as it is, because it needs feature:value pairs. Any ideas on what the value could be, or how I can change the file so that it can be read?

A sample of the file I have (features of the first 5 emails):

1 201 1.0
2 280 0.123165672613
2 313 0.343915400191
2 515 0.157569797284
2 588 0.343915400191
2 652 0.343915400191
2 657 0.343915400191
2 774 0.23622904941
2 921 0.283118375032
2 1158 0.254849368195
2 1240 0.343915400191
2 1348 0.343915400191
2 1362 0.222321349873
3 57 0.342220321154
3 185 0.391349077827
3 244 0.391349077827
3 300 0.391349077827
3 693 0.391349077827
3 730 0.342220321154
3 1391 0.391349077827
4 57 0.342220321154
4 185 0.391349077827
4 244 0.391349077827
4 300 0.391349077827
4 693 0.391349077827
4 730 0.342220321154
4 1391 0.391349077827
5 32 0.323558487577
5 102 0.323558487577
5 157 0.364177022553
5 160 0.364177022553
5 718 0.151013895297
5 1171 0.364177022553
5 1277 0.323558487577
5 1308 0.364177022553
5 1336 0.364177022553

Please help!

Solution

If you make a feature out of each word, first create a list of all unique words w(1)..w(n). Feature(i) then gets the value 1 if w(i) occurs in the sample you are examining, and 0 otherwise. (You could also make the value equal to the number of occurrences, so that a feature which occurs multiple times gets more weight.)

Assuming the following samples:

1 My hovercraft is full of eels
2 Your account is suspended
3 This is it!

... you could extract the following dictionary:

001 My
002 hovercraft
003 is
 :
 :
009 suspended
010 This
011 it!

(The leading zeros are just to make the features look different than the other numbers in this exposition. Normally there should probably not be any leading zeros.)

The features for sample 1 are 001 through 006; for sample 3 they are 010, 003, and 011. The other features get the value 0. So the full representation of sample 3 would look like

3 001:0 002:0 003:1 004:0 005:0 ...

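(though I don't think you need to specify the zero, i.e. absent, features; svm-light lets you skip them, as long as the remaining feature:value pairs appear in increasing feature order). Note that the leading 3 above is just the sample number; per the format in the question, an actual input line starts with the target (the class label) instead.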
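As a rough sketch of the steps above, the following Python builds the word dictionary from the three example subjects and writes one sparse svm-light line per sample. The function names and the +1/-1 targets are made up for illustration (the question does not say which classes the emails belong to), and tokenization is a plain whitespace split.

```python
def build_vocabulary(samples):
    """Map each unique word to a 1-based feature index, in order of first appearance."""
    vocab = {}
    for text in samples:
        for word in text.split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def to_svmlight_line(target, text, vocab, binary=True):
    """Format one sample as '<target> <feature>:<value> ...', omitting zero features."""
    counts = {}
    for word in text.split():
        idx = vocab.get(word)
        if idx is None:
            continue  # word not in the dictionary: no feature for it
        counts[idx] = counts.get(idx, 0) + 1
    # Pairs are emitted in increasing feature order.
    pairs = " ".join(
        f"{idx}:{1 if binary else n}" for idx, n in sorted(counts.items())
    )
    return f"{target:+d} {pairs}"


samples = [
    "My hovercraft is full of eels",
    "Your account is suspended",
    "This is it!",
]
targets = [-1, 1, -1]  # hypothetical class labels, one per sample

vocab = build_vocabulary(samples)
lines = [to_svmlight_line(t, s, vocab) for t, s in zip(targets, samples)]
print("\n".join(lines))
```

With `binary=False` the value becomes the number of occurrences instead of 1, which corresponds to the weighting variant mentioned earlier.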

However, given the small sample size (just subject lines), it's unlikely that you will get very good results. Perhaps you'd be better off using e.g. character bigram or trigram features (split each word using a sliding window; "trigram" becomes tri, rig, igr, gra, ram).
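The sliding-window idea can be sketched in a few lines of Python. The helper names are hypothetical, and two choices here are assumptions rather than part of the suggestion: subjects are lowercased, and a word shorter than the window is kept whole.

```python
def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of a word (empty if the word is shorter than n)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def subject_ngrams(subject, n=3):
    """Collect character n-grams for every whitespace-separated word in a subject line."""
    grams = []
    for word in subject.lower().split():
        # Assumption: a word shorter than n is kept whole instead of being dropped.
        grams.extend(char_ngrams(word, n) or [word])
    return grams


print(char_ngrams("trigram"))
print(subject_ngrams("This is it!"))
```

Each distinct n-gram would then get its own feature number, in the same way as whole words do in the dictionary approach.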

I don't think it makes sense to try to mix tf-idf with SVM; they are different approaches to the same fundamental problem.
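Whichever values you end up using, a file of "sample feature value" triples like the one in the question can be regrouped mechanically into one line per sample. The sketch below assumes that the first column is the sample number, that the second column is a feature number starting at 1 or higher (svm-light does not accept feature number 0), and that you supply a target for every sample yourself; the function name and the example targets are hypothetical.

```python
from collections import defaultdict


def triples_to_svmlight(triple_lines, targets):
    """Group 'sample feature value' triples into one svm-light line per sample.

    targets maps each sample number to its class label (+1 or -1).
    """
    rows = defaultdict(dict)
    for line in triple_lines:
        if not line.strip():
            continue  # skip blank lines
        sample_id, feature_id, value = line.split()
        rows[int(sample_id)][int(feature_id)] = float(value)

    out = []
    for sample_id in sorted(rows):
        # Feature numbers must appear in increasing order on each line.
        pairs = " ".join(f"{f}:{v}" for f, v in sorted(rows[sample_id].items()))
        out.append(f"{targets[sample_id]:+d} {pairs}")
    return out


triples = [
    "1 201 1.0",
    "2 280 0.123165672613",
    "2 313 0.343915400191",
]
example_targets = {1: 1, 2: -1}  # hypothetical labels for samples 1 and 2
for row in triples_to_svmlight(triples, example_targets):
    print(row)
```

In practice the triples would be read from the existing file, for example with `open(path)` passed in place of the list, and the result written out line by line.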
