How to find the best straight line separating two regions having points with 2 different properties


Problem description


I have a bunch of points in a 2D plot. The red points indicate when my experiment is stable, the black ones when it is unstable. The two regions are clearly separated by a line in this log-log plot, and I would like to find the best "separating line", i.e. the line that gives the criterion to separate the two regions with the minimum error on that criterion. I searched various books and online but could not find any approach to this problem. Are you aware of any tool? First of all, one has to define the error. One thing that comes to mind: if the unknown line is ax + by + c = 0, then for each point (x0, y0) we define an error function as follows:

E = 0  if the point lies on the correct side of the line.
E = distance(a*x + b*y + c = 0, (x0, y0)) = |a*x0 + b*y0 + c| / sqrt(a^2 + b^2)
    if the point lies on the wrong side.
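This error function can be sketched in Python; the function name, the sample points, and the label convention (+1 for the side where ax + by + c > 0, -1 for the other side) are illustrative, not part of the original question:

```python
import numpy as np

def separation_error(a, b, c, points, labels):
    """Sum of perpendicular distances to the line a*x + b*y + c = 0
    over the points that fall on the wrong side.

    labels: +1 for points expected where a*x + b*y + c > 0, -1 otherwise.
    """
    points = np.asarray(points, dtype=float)
    # signed perpendicular distance of each point to the line
    signed = (a * points[:, 0] + b * points[:, 1] + c) / np.hypot(a, b)
    # E = 0 on the correct side, |distance| on the wrong side
    wrong = np.sign(signed) != np.sign(labels)
    return float(np.abs(signed[wrong]).sum())

# Example with the line x - y = 0: (2, 1) lies on the positive side,
# (1, 2) on the negative side, so both are classified correctly.
pts = [(2.0, 1.0), (1.0, 2.0)]
print(separation_error(1.0, -1.0, 0.0, pts, np.array([1, -1])))  # -> 0.0
```

Minimizing this sum over (a, b, c) is the non-smooth optimization the question alludes to; the piecewise-zero region is what makes it harder than ordinary least squares.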


and we minimize the sum of the errors. This is not simple, though, since there is a threshold. If you know of some reference or link to approaches that solve this problem, that would be appreciated. Cheers, A.

Answer


Some refs: Wikipedia Linear classifier and Support vector machine (SVM);
scikit-learn SVM, an example with 3 classes;
questions/tagged/classification on SO;
3000 more questions/tagged/classification on stats.stackexchange;
400 more questions/tagged/classification on datascience.stackexchange.


For your 2-class problem, do these steps:


  1. find the midpoints Rmid of the red points, Bmid of the black, Mid of the lot


  2. draw the line L from Rmid to Bmid


  3. the (hyper)plane through Mid, perpendicular to line L, is what you want: a linear classifier.

Or you can just compare the distances |x - Rmid| and |x - Bmid|: call x nearer Rmid red, nearer Bmid black.
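The three steps above amount to a nearest-centroid rule. A minimal numpy sketch (the sample clusters are made up for illustration):

```python
import numpy as np

red = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0]])    # stable points
black = np.array([[5.0, 5.0], [6.0, 4.5], [5.5, 6.0]])  # unstable points

rmid = red.mean(axis=0)    # Rmid: centroid of the red points
bmid = black.mean(axis=0)  # Bmid: centroid of the black points
mid = (rmid + bmid) / 2    # Mid: midpoint of the segment Rmid-Bmid

def classify(x):
    """Red if x is nearer Rmid than Bmid, else black.

    Equivalent to cutting with the hyperplane through Mid
    perpendicular to the line L from Rmid to Bmid.
    """
    x = np.asarray(x, dtype=float)
    return "red" if np.linalg.norm(x - rmid) <= np.linalg.norm(x - bmid) else "black"

print(classify([1.2, 1.8]))  # near the red cluster
print(classify([6.0, 5.0]))  # near the black cluster
```

The equivalence holds because |x - Rmid| < |x - Bmid| exactly when x lies on the Rmid side of the perpendicular bisector of the segment, which is the hyperplane through Mid.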


But there's more to be said. Projecting all the data points onto line L gives a 1-dimensional problem:

rrrrrrrrrrbrrrrrrrrbbrrr | rrbbbbbbbbbbbbbbb


It's a good idea to plot all the points on this line, to see and better understand the data.
(For point clouds in say 5 or 10 dimensions, it might be fun and/or informative to look at 2d or 3d slices from different angles.)
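Projecting onto L reduces each 2-D point to one scalar coordinate along the line; a sketch with illustrative centroids and points:

```python
import numpy as np

rmid = np.array([1.5, 1.5])   # illustrative red centroid
bmid = np.array([5.5, 5.0])   # illustrative black centroid
direction = (bmid - rmid) / np.linalg.norm(bmid - rmid)  # unit vector along L

points = np.array([[1.0, 2.0], [5.0, 5.5], [3.0, 3.0]])
# scalar coordinate of each point along L, measured from Rmid;
# sorting these gives the 1-d picture of r's and b's shown above
t = (points - rmid) @ direction
print(t)
```

Plotting the labels against t (or just printing the sorted sequence of r's and b's) is the quickest way to see how clean the separation really is.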


Each cut, "|" above, gives a "confusion matrix" of 4 numbers:

R-correct   R-called-B  e.g.  490   10
B-called-R  B-correct          50  450


This gives a rough idea of the error rate of your predictions red / black; print it, discuss it.
The best cut depends on costs, e.g. if calling an R a B is 10 times or 100 times worse than calling a B an R.
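Scanning every candidate cut along the projected line and scoring it with such asymmetric costs can be sketched as follows (the 1-d positions, labels, and the 10:1 cost ratio are illustrative):

```python
import numpy as np

# projected 1-d positions along line L, with their labels
t = np.array([0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.9])
labels = np.array(["R", "R", "B", "R", "B", "B", "B"])

cost_R_called_B = 10.0  # calling an R a B is 10x worse...
cost_B_called_R = 1.0   # ...than calling a B an R

best = None
for cut in (t[:-1] + t[1:]) / 2:         # candidate cuts between neighbours
    pred = np.where(t < cut, "R", "B")   # left of the cut -> red
    r_called_b = np.sum((labels == "R") & (pred == "B"))  # confusion-matrix
    b_called_r = np.sum((labels == "B") & (pred == "R"))  # off-diagonal counts
    cost = cost_R_called_B * r_called_b + cost_B_called_R * b_called_r
    if best is None or cost < best[1]:
        best = (cut, cost)

print(best)  # (best cut position, its total cost)
```

With these costs the optimum tolerates one B-called-R rather than any R-called-B; flipping the cost ratio moves the cut the other way, which is the point of the sentence above.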


If the red points and the black points have different scatter / covariance, see Fisher's linear discriminant.
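Fisher's direction w ∝ Sw⁻¹ (Rmid − Bmid), where Sw is the within-class scatter, can be computed directly with numpy; the synthetic clusters below (deliberately elongated in y) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
red = rng.normal([1.0, 1.0], [0.5, 2.0], size=(50, 2))    # elongated scatter
black = rng.normal([4.0, 4.0], [0.5, 2.0], size=(50, 2))

# within-class scatter (sum of the two class covariances, up to scaling)
sw = np.cov(red.T) + np.cov(black.T)
# Fisher direction: Sw^-1 applied to the difference of the class means
w = np.linalg.solve(sw, red.mean(axis=0) - black.mean(axis=0))
w /= np.linalg.norm(w)

print(w)  # unit vector; project the data onto w, then cut in 1-d as above
```

Unlike the plain midpoint rule, this tilts the separating direction away from the high-variance axis, which is exactly the correction needed when the two clouds have unequal scatter.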


("SVM" is jargon for a class of methods for "good" separating hyperplanes / hypersurfaces -- there's no "machine".)
