Are the feature values in the KDD99 data set wrong?

Posted: 2020/5/4 10:19:38. Tags: machine-learning, dataset, classification, intrusion-detection, network-security


Question

In the KDD99 data set, a huge number of connections have 32nd and 33rd feature values greater than 100.

I can't understand how a connection window of 100 connections can produce a value greater than 100. I searched through a lot of material but found nothing.

Recommended answer

The features in the data set were obtained by preprocessing TCP dump files.

To do so, packet information in the TCP dump files was summarized into connections. Specifically (http://kdd.ics.uci.edu/databases/kddcup99/task.html):

a connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows from a source IP address to a target IP address under some well defined protocol.

Some of the features (the so-called time-based traffic features) were calculated over a 2-second time window.

Other features (host-based traffic features) use a historical window estimated over a number of connections (in this case 100).

Host-based features are useful for attacks that span intervals longer than 2 seconds.

The 2-second and 100-connection values are somewhat arbitrary.

The values of these two classes of features have no upper limit (e.g. the number of connections to the same host over the 2-second interval can be greater than 100).
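To make the time-window case concrete, here is a minimal Python sketch of a count over a 2-second sliding window. It is not the original KDD extraction code: the `(timestamp, dst_host)` record layout, the function name `time_window_counts`, and the choice to include the current connection in its own count are all assumptions made for illustration.

```python
from collections import deque

def time_window_counts(connections, window_seconds=2.0):
    """For each connection, count the connections to the same destination
    host seen within the last `window_seconds` (current one included).

    `connections` is an iterable of (timestamp, dst_host) tuples sorted by
    timestamp. This record layout is hypothetical.
    """
    recent = deque()   # (timestamp, dst_host) pairs still inside the window
    per_host = {}      # dst_host -> number of entries currently in `recent`
    counts = []
    for ts, host in connections:
        # Drop connections that have fallen out of the time window.
        while recent and ts - recent[0][0] > window_seconds:
            _, old_host = recent.popleft()
            per_host[old_host] -= 1
        recent.append((ts, host))
        per_host[host] = per_host.get(host, 0) + 1
        counts.append(per_host[host])
    return counts

# 150 connections to one host within 1.5 seconds, e.g. a flood.
burst = [(i * 0.01, "10.0.0.5") for i in range(150)]
counts = time_window_counts(burst)
print(max(counts))  # 150: nothing caps the count at 100
```

Because the window is bounded by time and not by a number of connections, a sufficiently fast burst produces arbitrarily large counts.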

Features 32 and 33 are defined as:

32. | dst host count     | count of connections having the same destination host
33. | dst host srv count | count of connections having the same destination host and using the same service
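A literal reading of these two definitions over a window of the most recent connections could be sketched as follows. This is an illustration only, not the procedure used to build KDD99: the `(dst_host, service)` record layout, the function name `host_window_counts`, and the decision to count only the preceding connections (excluding the current one) are assumptions.

```python
from collections import deque

def host_window_counts(connections, window_size=100):
    """Return a (dst_host_count, dst_host_srv_count) pair per connection,
    computed over the `window_size` connections that preceded it.

    `connections` is an iterable of (dst_host, service) tuples in arrival
    order. This record layout is hypothetical.
    """
    window = deque(maxlen=window_size)  # oldest entries fall out automatically
    result = []
    for dst_host, service in connections:
        same_host = sum(1 for h, _ in window if h == dst_host)
        same_host_srv = sum(
            1 for h, s in window if h == dst_host and s == service
        )
        result.append((same_host, same_host_srv))
        window.append((dst_host, service))
    return result

# 300 connections to the same host and service.
flood = [("10.0.0.5", "http")] * 300
print(host_window_counts(flood)[-1])  # (100, 100)
```

Under this reading neither count can exceed `window_size`, so values above 100 in the data set imply that the original extraction did something different from this literal interpretation.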

The problem is that there is no documentation explaining the details of the KDD feature extraction. The main reference is:

A Framework for Constructing Features and Models for Intrusion Detection Systems - Wenke Lee / Salvatore J. Stolfo

Evidently, the bro-ids tool was used:

used Bro as the packet filtering and reassembling engine. We extended Bro to handle ICMP packets, and made changes to its packet fragment inspection modules since it crashed when processing data that contains Teardrop or Ping-of-Death attacks. We used a Bro "connection finished" event handler to output a summarized record for each connection.

and

In the Bro event handlers, we added functions that inspect data exchanges of interactive TCP connections (e.g., telnet, ftp, smtp, etc.). These functions assign values to a set of "content" features to indicate whether the data contents suggest suspicious behavior.

But this is not enough.

Both dst host count and dst host srv count are in the [0, 255] range.
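One way to check this range yourself is to scan the two columns of the data file. The sketch below assumes the usual comma-separated KDD99 layout (41 features followed by a label, with features 32 and 33 at 1-based positions 32 and 33). The helper names `feature_range` and `make_row` are hypothetical, and the sample rows are synthetic stand-ins, not real KDD99 records.

```python
import csv
import io

# 1-based positions of the two features in a KDD99 record.
DST_HOST_COUNT = 32
DST_HOST_SRV_COUNT = 33

def feature_range(rows, position):
    """Return (min, max) of the feature at 1-based `position`."""
    values = [int(float(row[position - 1])) for row in rows]
    return min(values), max(values)

def make_row(dst_host_count, dst_host_srv_count):
    """Build a synthetic 42-field record (hypothetical test data)."""
    fields = ["0"] * 41 + ["normal."]
    fields[DST_HOST_COUNT - 1] = str(dst_host_count)
    fields[DST_HOST_SRV_COUNT - 1] = str(dst_host_srv_count)
    return ",".join(fields)

sample = io.StringIO(
    "\n".join([make_row(9, 9), make_row(255, 120), make_row(100, 3)])
)
rows = list(csv.reader(sample))
print(feature_range(rows, DST_HOST_COUNT))      # (9, 255)
print(feature_range(rows, DST_HOST_SRV_COUNT))  # (3, 120)
```

To run it on the real data, replace `sample` with an open handle to your local copy of the data set file; the path depends on where you stored it.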

The AI-IDS/kdd99_feature_extractor project on GitHub can extract the 32nd and 33rd features from raw data (take a look at the stats*.cpp files), but:

Some features might not be calculated in exactly the same way as in KDD

A related question on Stack Overflow is:
