Method to find the "cleanest" subset of data, i.e. the subset with the lowest variability

Question

I am trying to find a trend in several datasets. The trends involve finding the best-fit line, but I imagine the procedure would not be too different for any other model (just possibly more time-consuming).

There are 3 possible scenarios:

  1. All good data where all the data fits a single trend with a low variability
  2. All bad data where all or most of the data exhibits tremendous variability and the entire dataset must be discarded.
  3. Partial good data where some of the data may be good while the rest needs to be discarded.

If the net percentage of data with extreme variability is too high, then the entire set must be discarded. In other words, there is essentially only one kind of dataset, and only the percentage of bad data varies:

0% bad = Case 1
100% bad = Case 2

I am only looking for contiguous sections with low variability; i.e. I do not care about isolated individual points that happen to fit the trend.

What I am looking for is a smart way to subsection the dataset and search for the specified trend. By the nature of the problem, I am not looking for sections that best fit the overall trend. I understand that the subsection with "cleaner" data will end up having slightly different trendline properties than the overall dataset (which would contain the outliers). This is exactly what I want, since this part of the data would best reflect the actual trend.

I am fluent in C++ but, since I am trying to make the code open source and cross-platform, I am sticking to ISO C++ standards. This implies no .NET, but if you have a .NET example I would appreciate it if you could also help me convert it to ISO C++. I also have knowledge of Java, some assembly and Fortran.

The datasets themselves are not huge, but there are about 150 million of them, so brute force may not be the best approach.

Thanks in advance.

I understand that I have left some things up in the air and so let me clarify:

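  • Each dataset may, and probably will, have a different trend; i.e. I am not looking for the same trend in all datasets.
  • The user of the program will define how good a fit they want.
  • The user of the program will define how large a contiguous subset must be before it is considered for a trend fit.
  • If the program is extended to allow any type of fit (not just linear), the user will define the model to fit. This is not a priority and, if the query above is solved, I am sure this extension will be relatively trivial.
  • The outliers result from the nature of the experiments and the data acquisition technique: data from the "bad" sections must still be collected even though these regions are known to produce outliers. Discarding these outliers does not imply that the data is being manipulated to fit any trend (statistical disclaimer, hehe).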

Recommended answer

If I understand you correctly, the RANSAC algorithm is one approach to what you're looking for: http://en.wikipedia.org/wiki/RANSAC
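
To give a rough idea of how this could look in ISO C++, here is a minimal sketch for a straight-line model. It is only an illustration under some assumptions: the names (`Point`, `Line`, `RansacResult`, `ransacLine`, `longestInlierRun`) are made up for this example, residuals are measured vertically (in y) rather than perpendicular to the line, and the tolerance, iteration count and seed are parameters you would have to choose yourself. Since you only care about contiguous sections, the second function picks out the longest unbroken run of inliers.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

struct Point { double x, y; };
struct Line  { double slope, intercept; };

// Outcome of one RANSAC run (hypothetical type for this sketch).
struct RansacResult {
    Line line{0.0, 0.0};
    std::vector<bool> inlier;  // inlier[i] is true if point i is within tol of the line
    std::size_t count = 0;     // number of inliers for the best candidate
};

// Repeatedly fit a line through two randomly chosen points and keep the
// candidate that the largest number of points agree with (|residual| <= tol).
RansacResult ransacLine(const std::vector<Point>& pts, double tol,
                        int iterations, unsigned seed) {
    RansacResult best;
    best.inlier.assign(pts.size(), false);
    if (pts.size() < 2) return best;

    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, pts.size() - 1);

    for (int it = 0; it < iterations; ++it) {
        const Point& a = pts[pick(rng)];
        const Point& b = pts[pick(rng)];
        if (a.x == b.x) continue;  // same point or vertical pair: no usable slope

        Line cand;
        cand.slope = (b.y - a.y) / (b.x - a.x);
        cand.intercept = a.y - cand.slope * a.x;

        std::vector<bool> mask(pts.size(), false);
        std::size_t count = 0;
        for (std::size_t i = 0; i < pts.size(); ++i) {
            // Vertical residual; an assumption of this sketch.
            const double r = pts[i].y - (cand.slope * pts[i].x + cand.intercept);
            if (std::fabs(r) <= tol) { mask[i] = true; ++count; }
        }
        if (count > best.count) {
            best.line = cand;
            best.inlier = mask;
            best.count = count;
        }
    }
    return best;
}

// Longest contiguous run of inliers, returned as the half-open range [begin, end).
std::pair<std::size_t, std::size_t>
longestInlierRun(const std::vector<bool>& inlier) {
    std::size_t bestBegin = 0, bestLen = 0, runBegin = 0, runLen = 0;
    for (std::size_t i = 0; i < inlier.size(); ++i) {
        if (inlier[i]) {
            if (runLen == 0) runBegin = i;
            ++runLen;
            if (runLen > bestLen) { bestLen = runLen; bestBegin = runBegin; }
        } else {
            runLen = 0;
        }
    }
    return std::make_pair(bestBegin, bestBegin + bestLen);
}
```

You would call `ransacLine` once per dataset, compare the length of the run returned by `longestInlierRun` against your user-defined minimum contiguous size, and discard the dataset if the run is too short. A sensible follow-up step, not shown here, is to refit the line by ordinary least squares using only the points in that run.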
