Neural networks for email spam detection

Question

Let's say you have access to an email account with a history of emails received over the last few years (~10k emails), classified into two groups:

  • Genuine emails
  • Spam

How would you approach the task of creating a neural network solution that could be used for spam detection, that is, classifying any email as either spam or not spam?

Let's assume that the email fetching is already in place and we need to focus on the classification part only.

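The main points I hope to get answered are:

  1. Which parameters should be chosen as the input for the NN, and why?
  2. Which NN structure is most likely to work best for this task?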

Any resource recommendations or existing implementations (preferably in C#) are also more than welcome.

Thanks

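EDIT

  • I am set on using neural networks, because the main aspect of the project is to test how the NN approach works for spam detection
  • It is also a toy problem, meant simply to explore the subject of neural networks and spam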

Recommended answer

If you insist on NNs... I would calculate some features for every email.

Character-based, word-based, and vocabulary features (about 97 by my count):

  1. Total number of characters (C)
  2. Total number of alphabetic characters / C (ratio of alphabetic characters)
  3. Total number of digit characters / C
  4. Total number of whitespace characters / C
  5. Frequency of each letter or digit / C (36 keyboard characters: A-Z, 0-9)
  6. Frequency of special characters (10 characters: *, _, +, =, %, $, @, ـ, \, /)
  7. Total number of words (M)
  8. Total number of short words (two letters or fewer) / M
  9. Total number of characters in words / C
  10. Average word length
  11. Average sentence length in characters
  12. Average sentence length in words
  13. Word length frequency distribution / M (ratio of words of length n, for n between 1 and 15)
  14. Type-token ratio (number of unique words / M)
  15. Hapax legomena (frequency of once-occurring words)
  16. Hapax dislegomena (frequency of twice-occurring words)
  17. Yule's K measure
  18. Simpson's D measure
  19. Sichel's S measure
  20. Brunet's W measure
  21. Honore's R measure
  22. Frequency of punctuation (18 punctuation characters: . ، ; ? ! : ( ) – " « » < > [ ] { })
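
As a rough illustration of the list above, here is a minimal Python sketch that computes a subset of the character-based and word-based features for one email body. The function name `basic_features`, the word regex, and the naive sentence splitting are assumptions made for this example, not something prescribed by the answer; the same calculations port directly to C#.

```python
import re
import string
from collections import Counter


def basic_features(text):
    """Compute some of the character- and word-based features for one email."""
    c = len(text) or 1  # C: total characters (fall back to 1 to avoid dividing by zero)
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    m = len(words) or 1  # M: total words
    # Assumption: sentences end at '.', '!' or '?'; this is not a real tokenizer.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_sent = len(sentences) or 1

    features = {
        "total_chars": len(text),
        "alpha_ratio": sum(ch.isalpha() for ch in text) / c,
        "digit_ratio": sum(ch.isdigit() for ch in text) / c,
        "whitespace_ratio": sum(ch.isspace() for ch in text) / c,
        "total_words": len(words),
        "short_word_ratio": sum(len(w) <= 2 for w in words) / m,
        "chars_in_words_ratio": sum(len(w) for w in words) / c,
        "avg_word_length": sum(len(w) for w in words) / m,
        "avg_sentence_chars": sum(len(s) for s in sentences) / n_sent,
        "avg_sentence_words": len(words) / n_sent,
        "type_token_ratio": len(set(words)) / m,
    }
    # Frequency of each letter and digit relative to C (36 features).
    char_counts = Counter(text.lower())
    for ch in string.ascii_lowercase + string.digits:
        features["freq_" + ch] = char_counts[ch] / c
    # Share of words having length n, for n from 1 to 15 (15 features).
    length_counts = Counter(len(w) for w in words)
    for n in range(1, 16):
        features["wordlen_%d" % n] = length_counts[n] / m
    return features
```

Calling `basic_features(body)` returns a dictionary; sorting its keys gives a stable ordering for the network's input vector. The special-character and punctuation frequencies can be added the same way as the letter frequencies.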

You could also add more features based on the formatting used: colors, fonts, sizes, and so on.

Most of these measures can be found online, in papers, or even on Wikipedia (they're all simple calculations, probably based on the other features).
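
To show how simple these calculations are, here is a hedged Python sketch of the vocabulary-richness measures, using the definitions as they are commonly stated in the stylometry literature. Definitions vary between papers (in particular the exponent in Brunet's W, taken here as 0.172, and the exact form of Sichel's S), so treat the formulas as assumptions to check against whichever source you follow. The function name `vocabulary_richness` is hypothetical.

```python
import math
from collections import Counter


def vocabulary_richness(words):
    """Vocabulary measures for a list of word tokens (needs at least 2 tokens)."""
    n = len(words)  # N: number of tokens
    if n < 2:
        raise ValueError("need at least two tokens")
    freq = Counter(words)              # word -> number of occurrences
    v = len(freq)                      # V: number of distinct words
    spectrum = Counter(freq.values())  # i -> V_i, count of words occurring i times
    v1, v2 = spectrum[1], spectrum[2]

    # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2
    yule_k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
    # Simpson's D = sum_i V_i * (i / N) * ((i - 1) / (N - 1))
    simpson_d = sum(vi * (i / n) * ((i - 1) / (n - 1)) for i, vi in spectrum.items())
    # Sichel's S = V_2 / V
    sichel_s = v2 / v
    # Brunet's W = N ^ (V ^ -a), with a = 0.172 assumed here
    brunet_w = n ** (v ** -0.172)
    # Honore's R = 100 * ln(N) / (1 - V_1 / V); undefined when every word is a hapax
    honore_r = 100 * math.log(n) / (1 - v1 / v) if v1 != v else float("inf")

    return {
        "hapax_legomena": v1,
        "hapax_dislegomena": v2,
        "yule_k": yule_k,
        "simpson_d": simpson_d,
        "sichel_s": sichel_s,
        "brunet_w": brunet_w,
        "honore_r": honore_r,
    }
```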

So with about 100 features, you need 100 inputs, some number of nodes in a hidden layer, and one output node.
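
A minimal sketch of such a network in plain Python, assuming a tanh hidden layer, a sigmoid output read as the probability of spam, and cross-entropy loss trained by stochastic gradient descent. The class name `TinyMLP`, the hidden-layer size of 20, the initial weight range, and the learning rate are all arbitrary choices for illustration; the answer leaves them open.

```python
import math
import random


class TinyMLP:
    """One hidden layer (tanh) feeding a single sigmoid output unit."""

    def __init__(self, n_inputs=100, n_hidden=20, seed=0):
        rng = random.Random(seed)
        # Small random initial weights; biases start at zero.
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _hidden(self, x):
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def _output(self, h):
        z = sum(w * hj for w, hj in zip(self.w2, h)) + self.b2
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x):
        """Return the estimated probability that feature vector x is spam."""
        return self._output(self._hidden(x))

    def train_step(self, x, y, lr=0.05):
        """One gradient step on cross-entropy loss; y is 1 for spam, 0 otherwise."""
        h = self._hidden(x)
        p = self._output(h)
        d_out = p - y  # gradient of the loss w.r.t. the output pre-activation
        # Hidden-layer gradients, computed before w2 is modified.
        d_hid = [d_out * w * (1.0 - hj * hj) for w, hj in zip(self.w2, h)]
        for j, hj in enumerate(h):
            self.w2[j] -= lr * d_out * hj
        self.b2 -= lr * d_out
        for j, dj in enumerate(d_hid):
            row = self.w1[j]
            for i, xi in enumerate(x):
                row[i] -= lr * dj * xi
            self.b1[j] -= lr * dj
        return p
```

Training then amounts to looping over the training emails for several passes and calling `train_step(features, label)` on each; an email is flagged as spam when `predict` returns more than some threshold such as 0.5.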

The inputs would need to be normalized according to your current pre-classified corpus.
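
One common way to do this, shown here as an assumption since the answer does not name a method, is z-score standardization: compute each feature's mean and standard deviation over the corpus used for training, then apply those same statistics to every email classified later. The helper names `fit_scaler` and `transform` are hypothetical.

```python
import math


def fit_scaler(rows):
    """Return per-feature means and standard deviations for a list of feature rows."""
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        variance = sum((r[j] - means[j]) ** 2 for r in rows) / n
        # A constant feature has zero spread; use 1.0 to avoid dividing by zero.
        stds.append(math.sqrt(variance) or 1.0)
    return means, stds


def transform(rows, means, stds):
    """Standardize rows using statistics that were fitted on the training corpus."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]
```

Min-max scaling to the range 0 to 1 would be an equally reasonable choice; either way the statistics should come from the training data only.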

I'd split the corpus into two groups, using one for training and the other for testing, and never mix them. Maybe use a 50/50 train/test split, with similar spam/non-spam ratios in each group.
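
A split that keeps the spam/non-spam ratio similar in both groups is usually called a stratified split. A minimal sketch, assuming the emails and their labels are held in two parallel lists; the function name `stratified_split` and the fixed seed are illustrative choices.

```python
import random


def stratified_split(emails, labels, train_fraction=0.5, seed=0):
    """Split into train/test lists while keeping each label's share similar in both."""
    rng = random.Random(seed)
    # Group the indices by label so each class is split separately.
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        cut = int(round(len(indices) * train_fraction))
        train_idx += indices[:cut]
        test_idx += indices[cut:]

    train = [(emails[i], labels[i]) for i in train_idx]
    test = [(emails[i], labels[i]) for i in test_idx]
    return train, test
```

Each email ends up in exactly one of the two lists, which is what keeps the test results honest.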
