ML: how many features are enough?


Question

I'm trying to train a random forest on an accelerometer dataset. I calculate features such as the mean, standard deviation, correlation between axes, area under the curve, and others. I'm an ML noob.

I'm trying to understand two things:

1. If I split the dataset from one person into train and test sets and run the RF prediction, the accuracy is high (> 90%). However, if I train the RF with data from different people and then predict, the accuracy is low (< 50%). Why? How do I debug this? I'm not sure what I'm doing wrong.

2. In the example above, how many features are "enough" to reach 90% accuracy? How much data is "enough"?

I can furnish more details. The dataset is from 10 people, in large files of labelled data. I have limited myself to the above features to avoid a lot of computation.

Recommended answer

1. Most probably your classifier overfits. When you train it on only one person it does not generalize well: it may simply "memorize" the dataset with its labels instead of capturing the general rules of the distribution, such as how each feature correlates with the others and how they affect the result. Maybe you need more data, or more features.
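One way to check whether the model generalizes across people is to hold out every sample from one person at a time and train on the rest. The following is a minimal sketch of that idea in plain Python, not the asker's pipeline. The names `leave_one_subject_out`, `cross_subject_accuracy`, `fit`, and `predict` are hypothetical and chosen for illustration; `fit` and `predict` stand in for whatever classifier is used, for example a random forest.

```python
from collections import defaultdict


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx), one fold per subject.

    subject_ids[i] says which person produced sample i (assumed input).
    """
    by_subject = defaultdict(list)
    for i, subject in enumerate(subject_ids):
        by_subject[subject].append(i)
    for held_out in sorted(by_subject):
        test_idx = by_subject[held_out]
        # Every sample from every other subject goes into the training fold.
        train_idx = sorted(
            i
            for subject, idxs in by_subject.items()
            if subject != held_out
            for i in idxs
        )
        yield held_out, train_idx, test_idx


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def cross_subject_accuracy(X, y, subject_ids, fit, predict):
    """Return {subject: accuracy on that subject when it was held out}.

    fit(X_train, y_train) -> model and predict(model, X_test) -> labels
    are caller-supplied placeholders for the real classifier.
    """
    scores = {}
    for subject, train_idx, test_idx in leave_one_subject_out(subject_ids):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        y_pred = predict(model, [X[i] for i in test_idx])
        scores[subject] = accuracy([y[i] for i in test_idx], y_pred)
    return scores
```

If the per-subject scores are much lower than the accuracy from a random split within one person, the features are likely capturing person-specific patterns rather than the activity itself. scikit-learn offers `LeaveOneGroupOut` in `sklearn.model_selection` for the same kind of split.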

This is not an easy question. It is the generalization problem, and there is a lot of theoretical research on it, for example Vapnik–Chervonenkis theory and the Akaike information criterion. Even with knowledge of such theories you cannot answer this question exactly. The main principle of most of them is: the more data you have, the less flexible (lower-variance) the model you are trying to fit, and the smaller the gap between training and test accuracy, the higher these theories will rank your model. For example, if you want to minimize the difference between accuracy on the test and training sets (to make sure that accuracy on test data will not collapse), you need to increase the amount of data, provide more meaningful features (with respect to your model), or fit a less flexible model. If you are interested in a more detailed explanation of the theoretical side, you can watch the Caltech lectures, starting from CaltechX - CS1156x Learning from Data.
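For reference, the two results named above are usually written as follows. The symbols are standard textbook notation and do not appear in the original answer: $E_{\text{in}}$ and $E_{\text{out}}$ are the training and out-of-sample errors of the chosen hypothesis $g$, $N$ is the number of training samples, $m_{\mathcal{H}}$ is the growth function of the model class, $d_{\text{vc}}$ is its VC dimension, $k$ is the number of fitted parameters, and $\hat{L}$ is the maximized likelihood.

```latex
% VC generalization bound: holds with probability at least 1 - \delta
E_{\text{out}}(g) \le E_{\text{in}}(g)
  + \sqrt{\frac{8}{N}\,\ln\frac{4\, m_{\mathcal{H}}(2N)}{\delta}},
\qquad m_{\mathcal{H}}(2N) \le (2N)^{d_{\text{vc}}} + 1

% Akaike information criterion: lower is better
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

Both expressions show the same trade-off: the penalty term shrinks as $N$ grows and grows with model complexity ($d_{\text{vc}}$ or $k$). These bounds are typically too loose to give a concrete number of features or samples for a 90% target, which is consistent with the point that the question cannot be answered exactly.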
