用 r 在双峰分布中找到局部最小值 [英] Find local minimum in bimodal distribution with r
问题描述
我的数据是预处理的图像数据,我想分开两个类.理论上(并希望在实践中)最佳阈值是双峰分布数据中两个峰值之间的局部最小值.
My data are pre-processed image data and I want to seperate two classes. In therory (and hopefully in practice) the best threshold is the local minimum between the two peaks in the bimodal distributed data.
我的测试数据是:http://www.file-upload.net/download-9365389/data.txt.html
我试图遵循 此线程:我绘制了直方图并计算了核密度函数:
I tried to follow this thread: I plotted the histogram and calculated the kernel density function:
datafile <- read.table("....txt")
data <- data$V1
hist(data)
d <- density(data) # returns the density data with defaults
hist(data,prob=TRUE)
lines(d) # plots the results
但是如何继续呢?
我会计算密度函数的一阶和二阶导数以找到局部极值,特别是局部最小值.但是我不知道如何在 R 中执行此操作,并且 density(test)
似乎不是正常功能.因此请帮助我:如何计算导数并找到密度函数密度(测试)
中两个峰值之间的凹坑的局部最小值?
I would calculate the first and second derivates of the density function to find the local extrema, specifically the local minimum. However I have no idea how to do this in R and density(test)
seems not to be a normal function. Thus please help me: how can I calculate the derivates and find the local minimum of the pit between the two peaks in the density function density(test)
?
推荐答案
有几种方法可以做到这一点.
There are a few ways to do this.
首先,在您的问题中使用 d
作为密度,d$x
和 d$y
包含 x 和 y 值密度图.当导数 dy/dx = 0 时出现最小值.由于 x 值等距,我们可以使用 diff(d$y)
估计 dy,并寻找 d$x
其中 abs(diff(d$y))
最小化:
First, using d
for the density as in your question, d$x
and d$y
contain the x and y values for the density plot. The minimum occurs when the derivative dy/dx = 0. Since the x-values are equally spaced, we can estimate dy using diff(d$y)
, and seek d$x
where abs(diff(d$y))
is minimized:
d$x[which.min(abs(diff(d$y)))]
# [1] 2.415785
问题是当 dy/dx = 0 时,密度曲线也可以最大化.在这种情况下,最小值很浅,但最大值达到了峰值,所以它可以工作,但你不能指望这一点.
The problem is that the density curve could also be maximized when dy/dx = 0. In this case the minimum is shallow but the maxima are peaked, so it works, but you can't count on that.
所以第二种方法使用 optimize(...)
在给定的间隔内寻找局部最小值.optimize(...)
需要一个函数作为参数,所以我们使用 approxfun(d$x,d$y)
来创建一个插值函数.
So a second way uses optimize(...)
which seeks a local minimum in a given interval. optimize(...)
needs a function as argument, so we use approxfun(d$x,d$y)
to create an interpolation function.
optimize(approxfun(d$x,d$y),interval=c(1,4))$minimum
# [1] 2.415791
最后,我们证明这确实是最小值:
Finally, we show that this is indeed the minimum:
hist(data,prob=TRUE)
lines(d, col="red", lty=2)
v <- optimize(approxfun(d$x,d$y),interval=c(1,4))$minimum
abline(v=v, col="blue")
另一种实际上更受欢迎的方法是使用 k-means 聚类.
Another approach, which is preferred actually, uses k-means clustering.
df <- read.csv(header=F,"data.txt")
colnames(df) = "X"
# bimodal
km <- kmeans(df,centers=2)
df$clust <- as.factor(km$cluster)
library(ggplot2)
ggplot(df, aes(x=X)) +
geom_histogram(aes(fill=clust,y=..count../sum(..count..)),
binwidth=0.5, color="grey50")+
stat_density(geom="line", color="red")
数据实际上看起来比双峰更三峰.
The data actually looks more trimodal than bimodal.
# trimodal
km <- kmeans(df,centers=3)
df$clust <- as.factor(km$cluster)
library(ggplot2)
ggplot(df, aes(x=X)) +
geom_histogram(aes(fill=clust,y=..count../sum(..count..)),
binwidth=0.5, color="grey50")+
stat_density(geom="line", color="red")
这篇关于用 r 在双峰分布中找到局部最小值的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!