Neural networks for email spam detection

Views: 148 Published: 2020/5/4 9:06:37 Tags: machine-learning neural-network classification spam-prevention

Question

Let's say you have access to an email account with a history of the emails received over the last few years (~10k emails), classified into 2 groups:

genuine email
spam

How would you approach the task of creating a neural network solution that could be used for spam detection, basically classifying any email as either spam or not spam?

Let's assume that the email fetching is already in place and we need to focus on the classification part only.

The main points I would hope to get answered are:

Which parameters should be chosen as the input for the NN, and why?
What structure of NN would most likely work best for such a task?

Any resource recommendations or existing implementations (preferably in C#) are also more than welcome.

Thank you

EDIT

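I am set on using neural networks, as the main aspect of the project is to test how the NN approach would work for spam detection.
It is also a "toy problem", meant simply to explore the subject of neural networks and spam.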

Recommended answer

If you insist on NNs... I would calculate some features for every email.

Character-based, word-based, and vocabulary features (about 97 by my count):

Total number of characters (C)
Total number of alphabetic characters / C: ratio of alphabetic characters
Total number of digit characters / C
Total number of whitespace characters / C
Frequency of each letter / C (36 characters: A-Z, 0-9)
Frequency of special characters (10 characters: *, _, +, =, %, $, @, ـ, \, /)
Total number of words (M)
Total number of short words / M: words of two letters or less
Total number of characters in words / C
Average word length
Average sentence length in characters
Average sentence length in words
Word length frequency distribution / M: ratio of words of length n, for n between 1 and 15
Type-token ratio: number of unique words / M
Hapax legomena: frequency of once-occurring words
Hapax dislegomena: frequency of twice-occurring words
Yule's K measure
Simpson's D measure
Sichel's S measure
Brunet's W measure
Honore's R measure
Frequency of punctuation (18 punctuation characters: . ، ; ? ! : ( ) – " « » < > [ ] { })
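As a rough illustration, a handful of the features listed above could be computed along these lines. This is only a sketch in Python (not the C# the question asks for); the function name `extract_features`, the tokenizing regular expression, the sentence splitter, and the ASCII-only subset of special characters are all assumptions made for the example, not part of the answer.

```python
import re
from collections import Counter

# Assumption: only the ASCII special characters from the list are counted here.
SPECIAL_CHARS = "*_+=%$@\\/"


def extract_features(text):
    """Return a dict with a few character-based and word-based features."""
    c = len(text) or 1  # total characters (C); "or 1" avoids division by zero
    # Assumption: a "word" is a run of letters, digits, or apostrophes.
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    m = len(words) or 1  # total words (M)
    counts = Counter(words)
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_sentences = len(sentences) or 1

    features = {
        "total_chars": len(text),
        "alpha_ratio": sum(ch.isalpha() for ch in text) / c,
        "digit_ratio": sum(ch.isdigit() for ch in text) / c,
        "whitespace_ratio": sum(ch.isspace() for ch in text) / c,
        "total_words": len(words),
        "short_word_ratio": sum(len(w) <= 2 for w in words) / m,
        "avg_word_length": sum(len(w) for w in words) / m,
        "avg_sentence_length_words": len(words) / n_sentences,
        "type_token_ratio": len(counts) / m,
        "hapax_legomena": sum(1 for n in counts.values() if n == 1),
        "hapax_dislegomena": sum(1 for n in counts.values() if n == 2),
    }
    # One frequency feature per special character.
    for ch in SPECIAL_CHARS:
        features[f"special_{ch}"] = text.count(ch) / c
    return features
```

Calling `extract_features(body)` on each email body yields one row of numbers per email; the remaining features in the list (letter frequencies, word-length distribution, punctuation counts) would follow the same pattern.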

You could also add some more features based on the formatting used: colors, fonts, sizes, ...

Most of these measures can be found online, in papers, or even on Wikipedia (they're all simple calculations, probably based on the other features).
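The answer does not spell out the vocabulary-richness formulas. The sketch below uses the definitions commonly given in the stylometry literature, with N the number of word tokens, V the number of distinct words, and V_i the number of distinct words that occur exactly i times; treat both the exact forms and the Brunet exponent of 0.172 as assumptions to check against whichever paper you follow. The function name `vocabulary_richness` is made up for the example.

```python
import math
from collections import Counter


def vocabulary_richness(words):
    """Compute commonly cited vocabulary-richness measures for a token list."""
    n = len(words)  # N: total number of tokens
    if n == 0:
        raise ValueError("need at least one word")
    counts = Counter(words)
    v = len(counts)  # V: number of distinct words
    # spectrum[i] = V_i, the number of distinct words occurring exactly i times
    spectrum = Counter(counts.values())
    v1 = spectrum.get(1, 0)
    v2 = spectrum.get(2, 0)

    # Yule's K = 10^4 * (sum_i i^2 * V_i - N) / N^2
    yule_k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / (n * n)
    # Simpson's D = sum_i V_i * (i / N) * ((i - 1) / (N - 1))
    if n > 1:
        simpson_d = sum(
            vi * (i / n) * ((i - 1) / (n - 1)) for i, vi in spectrum.items()
        )
    else:
        simpson_d = 0.0
    # Sichel's S = V_2 / V
    sichel_s = v2 / v
    # Brunet's W = N ** (V ** -a), with a = 0.172 assumed here
    brunet_w = n ** (v ** -0.172)
    # Honore's R = 100 * ln(N) / (1 - V_1 / V); undefined when every word is a hapax
    honore_r = 100 * math.log(n) / (1 - v1 / v) if v1 < v else float("inf")

    return {
        "yule_k": yule_k,
        "simpson_d": simpson_d,
        "sichel_s": sichel_s,
        "brunet_w": brunet_w,
        "honore_r": honore_r,
    }
```

Each of these reduces to a few sums over word counts, which is the sense in which they are simple calculations built on the other features.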

So with about 100 features, you need 100 inputs, some number of nodes in a hidden layer, and one output node.
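To make that shape concrete, here is a minimal forward pass for such a network in plain Python. The hidden-layer size of 20, the tanh and sigmoid activations, and the class name `TinyMLP` are assumptions chosen for illustration; the answer only fixes the 100 inputs and the single output. Training (for example by backpropagation) is not shown.

```python
import math
import random


class TinyMLP:
    """One hidden layer and a single sigmoid output node (forward pass only)."""

    def __init__(self, n_inputs=100, n_hidden=20, seed=0):
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(n_inputs)  # small random initial weights
        # w1[h][i] connects input i to hidden node h
        self.w1 = [
            [rng.uniform(-scale, scale) for _ in range(n_inputs)]
            for _ in range(n_hidden)
        ]
        self.b1 = [0.0] * n_hidden
        # w2[h] connects hidden node h to the output node
        self.w2 = [rng.uniform(-scale, scale) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        """Map one feature vector to a score in (0, 1), read as P(spam)."""
        hidden = [
            math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        z = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return 1.0 / (1.0 + math.exp(-z))
```

With untrained weights the output is meaningless; after training, a score above some threshold (0.5 is the usual starting point) would be treated as spam.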

The inputs would need to be normalized according to your current pre-classified corpus.
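One common way to do this is z-score scaling: compute each feature's mean and standard deviation over the corpus, then reuse those same statistics for every email scored later. The answer does not say which normalization it has in mind, so the choice of z-scores and the helper names `fit_scaler` and `transform` are assumptions.

```python
import math


def fit_scaler(rows):
    """Compute per-feature mean and standard deviation from corpus rows."""
    n = len(rows)
    n_features = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(n_features)]
    stds = []
    for j in range(n_features):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / n
        # A constant feature has zero spread; fall back to 1.0 to avoid dividing by zero.
        stds.append(math.sqrt(var) or 1.0)
    return means, stds


def transform(rows, means, stds):
    """Scale rows with previously fitted statistics."""
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]
```

If the corpus is split into training and test groups, fitting the statistics on the training group only and applying them unchanged to the test group keeps information from the test emails out of training.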

I'd split the corpus into two groups, using one for training and the other for testing, and never mix them. A 50/50 train/test split, with similar spam/non-spam ratios in each group, might be a reasonable starting point.
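A split that keeps the spam/non-spam ratio similar in both groups is usually called a stratified split. A minimal sketch, assuming the labels are held in a list aligned with the emails; the function name `stratified_split` and the fixed seed are illustrative choices:

```python
import random


def stratified_split(labels, train_fraction=0.5, seed=0):
    """Return (train_indices, test_indices) with similar class ratios in each."""
    rng = random.Random(seed)
    # Group email indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        cut = int(round(len(indices) * train_fraction))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)
```

Because each index lands in exactly one of the two lists, the training and test groups never overlap.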
