Neural networks for email spam detection
Question
Let's say you have access to an email account with a history of emails received over the last few years (~10k emails), classified into 2 groups:
- genuine emails
- spam
How would you approach the task of creating a neural network solution for spam detection, that is, classifying any email as either spam or not spam?
Let's assume that email fetching is already in place and we need to focus on the classification part only.
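The main points I would like answered are:
- Which parameters should be chosen as inputs for the NN, and why?
- Which NN structure is most likely to work best for this kind of task?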
Any resource recommendations or existing implementations (preferably in C#) are also more than welcome.
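Thanks
Edit
- I am set on using neural networks, because the main aspect of the project is to test how the NN approach works for spam detection
- It is also a "toy problem", simply a way to explore the subject of neural networks and spam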
Recommended answer
If you insist on NNs... I would calculate some features for every email.
Character-based, word-based, and vocabulary features (about 97, by my count):
- Total number of characters (C)
- Total number of alphabetic characters / C (ratio of alphabetic characters)
- Total number of digit characters / C
- Total number of whitespace characters / C
- Frequency of each letter / C (36 alphanumeric characters: A-Z, 0-9)
- Frequency of special characters (10 characters: *, _, +, =, %, $, @, ـ, \, /)
- Total number of words (M)
- Total number of short words / M (two letters or fewer)
- Total number of characters in words / C
- Average word length
- Average sentence length in characters
- Average sentence length in words
- Word length frequency distribution / M (ratio of words of length n, for n between 1 and 15)
- Type-token ratio (number of unique words / M)
- Hapax legomena (frequency of once-occurring words)
- Hapax dislegomena (frequency of twice-occurring words)
- Yule's K measure
- Simpson's D measure
- Sichel's S measure
- Brunet's W measure
- Honoré's R measure
- Frequency of punctuation (18 punctuation characters: . ، ; ? ! : ( ) – " « » < > [ ] { })
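A few of the features in this list could be computed along the following lines. This is a minimal Python sketch, not a full implementation of all 97 features: the function name `basic_features`, the tokenizing regex, and the sentence-splitting rule are assumptions made for illustration, and the hapax legomena feature is returned as a raw count.

```python
import re
from collections import Counter


def basic_features(text: str) -> dict:
    """Compute a subset of the character- and word-based features."""
    c = len(text) or 1  # total characters (C); "or 1" avoids division by zero
    # Assumed tokenizer: runs of letters, digits, and apostrophes, lowercased.
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    m = len(words) or 1  # total words (M)
    # Assumed sentence splitter: break on runs of ., ! or ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_sentences = len(sentences) or 1
    counts = Counter(words)

    features = {
        "alpha_ratio": sum(ch.isalpha() for ch in text) / c,
        "digit_ratio": sum(ch.isdigit() for ch in text) / c,
        "space_ratio": sum(ch.isspace() for ch in text) / c,
        "short_word_ratio": sum(len(w) <= 2 for w in words) / m,
        "avg_word_length": sum(len(w) for w in words) / m,
        "avg_sentence_words": len(words) / n_sentences,
        "type_token_ratio": len(counts) / m,
        # Raw count of words that occur exactly once.
        "hapax_legomena": sum(1 for n in counts.values() if n == 1),
    }
    # Word-length distribution: share of words of length n, n = 1..15.
    for n in range(1, 16):
        features[f"word_len_{n}"] = sum(len(w) == n for w in words) / m
    # Frequency of each of the 36 alphanumeric characters, relative to C.
    lowered = text.lower()
    for ch in "abcdefghijklmnopqrstuvwxyz0123456789":
        features[f"freq_{ch}"] = lowered.count(ch) / c
    return features
```

Calling `basic_features(email_body)` on each message gives one dictionary per email; sorting the keys yields a fixed-order feature vector.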
You could also add more features based on the formatting: the colors, fonts, sizes, and so on that are used.
Most of these measures can be found online, in papers, or even on Wikipedia. They are all simple calculations, and several are probably derived from the other features.
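As an illustration of how simple these calculations are, here is a sketch of four of the vocabulary measures, using the formulas as they are commonly given in the stylometry literature (N tokens, V distinct words, V_i distinct words that occur exactly i times). The function name and the zero fallbacks for degenerate inputs are assumptions for illustration. Brunet's W is left out because the exponent constant used for it varies between sources.

```python
import math
from collections import Counter


def vocabulary_richness(words):
    """Vocabulary richness measures from a list of word tokens."""
    n = len(words)                       # number of tokens (N)
    counts = Counter(words)
    v = len(counts)                      # number of distinct words (V)
    spectrum = Counter(counts.values())  # V_i: distinct words seen i times
    v1 = spectrum.get(1, 0)              # hapax legomena
    v2 = spectrum.get(2, 0)              # hapax dislegomena

    # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2
    yule_k = 0.0
    if n:
        yule_k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)

    # Simpson's D = sum_i V_i * (i / N) * ((i - 1) / (N - 1))
    simpson_d = 0.0
    if n > 1:
        simpson_d = sum(vi * (i / n) * ((i - 1) / (n - 1)) for i, vi in spectrum.items())

    # Sichel's S = V_2 / V
    sichel_s = v2 / v if v else 0.0

    # Honoré's R = 100 * ln(N) / (1 - V_1 / V); undefined when V_1 == V,
    # so this sketch falls back to 0.0 in that case (an assumption).
    honore_r = 100 * math.log(n) / (1 - v1 / v) if v and v1 < v else 0.0

    return {
        "yule_k": yule_k,
        "simpson_d": simpson_d,
        "sichel_s": sichel_s,
        "honore_r": honore_r,
    }
```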
So with about 100 features, you need 100 inputs, some number of nodes in a hidden layer, and one output node.
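The shape of such a network can be sketched in plain Python as follows. This shows only the forward pass of a single-hidden-layer network with sigmoid units; the class name `TinyMLP`, the hidden-layer size of 20, and the weight initialization range are assumptions, and training (for example by backpropagation) is not shown.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


class TinyMLP:
    """100 inputs -> one hidden layer -> one output (spam probability)."""

    def __init__(self, n_inputs=100, n_hidden=20, seed=0):
        rng = random.Random(seed)
        # One weight row per hidden node, one weight per input feature.
        self.w_hidden = [
            [rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
            for _ in range(n_hidden)
        ]
        self.b_hidden = [0.0] * n_hidden
        # Single output node: one weight per hidden node.
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
        self.b_out = 0.0

    def forward(self, x):
        """Return a value in (0, 1); values near 1 mean 'spam'."""
        hidden = [
            sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.w_hidden, self.b_hidden)
        ]
        return sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.b_out)
```

With untrained weights the output is meaningless; the point is only the 100-to-hidden-to-1 layout. A threshold such as 0.5 on the output would turn it into a spam/not-spam decision.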
The inputs would need to be normalized according to your current pre-classified corpus.
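One common way to do this is z-score scaling, where each feature's mean and standard deviation are computed from the corpus and then applied to every email. The answer does not say which normalization to use, so the choice of z-scores and the function names below are assumptions.

```python
import statistics


def fit_scaler(rows):
    """Compute per-feature mean and standard deviation from corpus rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant feature has zero deviation; use 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(row, means, stds):
    """Scale one feature vector with previously fitted statistics."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]
```

The statistics are fitted once on the emails used for training and then reused unchanged for any email classified later, so new messages are scaled the same way as the data the network learned from.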
I'd split the corpus into two groups, using one for training and the other for testing, and never mix them. Maybe use a 50/50 train/test split, with similar spam/non-spam ratios in both groups.
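Keeping the spam/non-spam ratio similar in both groups is usually called a stratified split. A minimal sketch, assuming each example is a `(features, is_spam)` pair with a boolean label (the function name and data layout are hypothetical):

```python
import random


def stratified_split(examples, train_fraction=0.5, seed=0):
    """Split (features, is_spam) pairs, preserving the class ratio."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (True, False):
        # Split each class separately so both halves get the same ratio.
        group = [ex for ex in examples if ex[1] == label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    # Shuffle again so the classes are not grouped together.
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test
```

Each example lands in exactly one of the two lists, so the training and testing groups never overlap.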