Is it a good idea to exclude noisy data from the dataset to train the model?


Question

Is it a good idea to exclude noisy data (data that may reduce model accuracy or cause unexpected output on the test dataset) from a dataset before generating the training and validation sets?

Assumption: the noisy data is already known to us.

Any suggestions are greatly appreciated!

Recommended answer

It depends on your application. If the noisy data is valid, then definitely include it to find the best model.

However, if the noisy data is invalid, it should be cleaned out before fitting your model.

Noise is a broad term; it is better to think of the observations as inliers or outliers instead.

Most outlier detection algorithms specify a threshold and rank the outlier candidates according to some score. In that case, you can choose to remove only the most extreme values, for example those more than 3 standard deviations from the mean (assuming, of course, that your data is roughly Gaussian).
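As a rough illustration of the 3-standard-deviation rule, here is a minimal sketch using only the Python standard library. The function name `three_sigma_filter` and the sample readings are made up for this example, and the rule only makes sense if the data is approximately Gaussian.

```python
import statistics

def three_sigma_filter(values, k=3.0):
    """Split values into (kept, removed) using a k-standard-deviation rule.

    Hypothetical helper for illustration; assumes roughly Gaussian data.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # All values are identical, so nothing can be extreme.
        return list(values), []
    kept, removed = [], []
    for v in values:
        if abs(v - mean) > k * std:
            removed.append(v)
        else:
            kept.append(v)
    return kept, removed

# Made-up sensor readings: 19 values near 10 and one extreme value.
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 10.1, 9.9,
            10.0, 10.2, 9.8, 10.1, 10.0, 9.9, 10.3, 10.0, 9.8, 55.0]

kept, removed = three_sigma_filter(readings)
print(removed)    # the single extreme reading, 55.0
print(len(kept))  # the 19 remaining readings
```

Keep in mind that an extreme value also inflates the mean and standard deviation it is measured against, so on very small samples this rule can fail to flag anything.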

So my suggestion is to base your judgement on two things:

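  1. Your business concept and its logic about what is valid versus invalid. For example: a house's size, area, or price cannot be negative.
  2. Your mathematical/algorithmic logic. For example: detecting extreme values against some threshold to decide (with or without point 1) whether an observation is valid.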
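The two points above can be combined in a small pipeline: first drop records that break the business rules, then flag statistically extreme records among what remains. This is only a sketch; the field names `area` and `price`, the helper names, and the sample data are assumptions for illustration.

```python
import statistics

def is_valid_by_business_rules(house):
    """Point 1: domain logic. Area and price must be strictly positive."""
    return house["area"] > 0 and house["price"] > 0

def extreme_price_flags(houses, k=3.0):
    """Point 2: flag prices more than k standard deviations from the mean."""
    prices = [h["price"] for h in houses]
    mean = statistics.fmean(prices)
    std = statistics.pstdev(prices)
    return [std > 0 and abs(p - mean) > k * std for p in prices]

# Made-up records: 12 ordinary houses, one valid but extreme, two invalid.
houses = [{"area": 80 + i, "price": 290_000 + 2_000 * i} for i in range(12)]
houses.append({"area": 95, "price": 5_000_000})  # valid but extreme price
houses.append({"area": -50, "price": 300_000})   # invalid: negative area
houses.append({"area": 70, "price": -1})         # invalid: negative price

# Step 1: remove records that cannot be real observations.
valid = [h for h in houses if is_valid_by_business_rules(h)]

# Step 2: among valid records, separate the statistically extreme ones.
flags = extreme_price_flags(valid)
needs_review = [h for h, flagged in zip(valid, flags) if flagged]
clean = [h for h, flagged in zip(valid, flags) if not flagged]

print(len(valid), len(clean), needs_review)
```

Applying the business rules first matters here: otherwise the invalid records would distort the mean and standard deviation used in the second step. Whether the flagged records are dropped or kept is still a judgement call for your application.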

Noisy data does not cause a huge problem by itself. Extremely noisy data (i.e. extreme values / outliers) is what you should really be concerned about! Such points shift the hypothesis of your model while it is fitting the data, so the results may be drastically shifted or incorrect.
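To see how much a single extreme point can shift a fitted model, here is a small sketch with an ordinary least-squares line written by hand. The helper `fit_line` and the data are invented for this example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = list(range(10))
ys = [2 * x + 1 for x in xs]  # points lying exactly on y = 2x + 1

slope_clean, intercept_clean = fit_line(xs, ys)

# Corrupt only the last observation: 100 instead of the true value 19.
ys_with_outlier = ys[:]
ys_with_outlier[-1] = 100

slope_outlier, intercept_outlier = fit_line(xs, ys_with_outlier)

print(slope_clean)    # 2.0 (up to floating-point rounding)
print(slope_outlier)  # about 6.4: one point more than triples the slope
```

One corrupted observation out of ten is enough to change the fitted slope from 2 to roughly 6.4, which is the kind of shift the paragraph above warns about.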

Finally, you can look at PyOD, an open-source Python toolbox that contains many different algorithms implemented off the shelf. (You can choose more than one algorithm and create a voting pool to decide how extreme the observations are.)
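The voting-pool idea can be sketched without any library. Note that the three detectors below are simple hand-written stand-ins (z-score, Tukey's IQR fences, and a median-absolute-deviation score); they are not PyOD's API, and all names and data are assumptions for illustration. With PyOD, the binary labels from each fitted detector would take the place of these hand-written votes.

```python
import statistics

def zscore_votes(values, k=3.0):
    """Vote True for values more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [std > 0 and abs(v - mean) > k * std for v in values]

def iqr_votes(values, k=1.5):
    """Vote True for values outside Tukey's fences (k * IQR beyond Q1/Q3)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]

def mad_votes(values, k=3.5):
    """Vote True when the modified z-score (based on the MAD) exceeds k."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    return [mad > 0 and 0.6745 * abs(v - median) / mad > k for v in values]

def vote(values, detectors, min_votes):
    """Return the values flagged by at least min_votes detectors."""
    vote_lists = [detector(values) for detector in detectors]
    counts = [sum(1 for votes in vote_lists if votes[i])
              for i in range(len(values))]
    return [v for v, c in zip(values, counts) if c >= min_votes]

# Made-up readings: mostly near 10, one mild outlier (11.0), one extreme (55.0).
values = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 10.1, 9.9,
          11.0, 10.2, 9.8, 10.1, 10.0, 9.9, 10.3, 10.0, 9.8, 55.0]
detectors = [zscore_votes, iqr_votes, mad_votes]

majority = vote(values, detectors, min_votes=2)   # at least 2 of 3 agree
unanimous = vote(values, detectors, min_votes=3)  # all 3 agree

print(majority)   # [11.0, 55.0]
print(unanimous)  # [55.0]
```

Here the z-score detector misses 11.0 because the extreme value 55.0 inflates the standard deviation, while the two median-based detectors still catch it. Requiring a majority rather than unanimity is one possible design choice; a stricter or looser vote count changes how aggressive the cleaning is.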
