Kullback-Leibler distance between 2 samples


Question


I am working with data that belong to 2 groups, A and B. I am trying to find the variable that shows the biggest difference between the 2 populations, and I thought the Kullback-Leibler distance would be a good measure for that. Here's a sample that represents my data:

df1 <- structure(list(Var1 = c(2L, 3L, 5L, 7L, 2L, 1L, 0L, 0L, 0L, 1L, 
3L, 4L), VarA = c(0.56, 0.43, 0.25, 0.12, 0.78, 0.55, 0.35, 0.36, 
0.3, 0.41, 0.43, 0.5), VarT = c(10L, 11L, 15L, 12L, 8L, 7L, 7L, 
7L, 6L, 5L, 1L, 2L), Var3 = c(152L, 187L, 149L, 132L, 132L, 178L, 
240L, 205L, 137L, 125L, 124L, 56L), group = structure(c(1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "A", class = "factor")), .Names = c("Var1", 
"VarA", "VarT", "Var3", "group"), class = "data.frame", row.names = c(NA, 
-12L))

df2 <- structure(list(Var1 = c(5L, 8L, 7L, 4L, 5L, 2L, 1L, 2L, 6L, 5L
), VarA = c(0.24, 0.76, 0.43, 0, 0.52, 0.63, 0.46, 0.64, 0.55, 
0.78), VarT = c(10L, 8L, 9L, 5L, 11L, 14L, 12L, 1L, 7L, 7L), 
    Var3 = c(205L, 120L, 531L, 203L, 215L, 224L, 211L, 212L, 
    134L, 222L), group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L), .Label = "B", class = "factor")), .Names = c("Var1", 
"VarA", "VarT", "Var3", "group"), class = "data.frame", row.names = c(NA, 
-10L))


I am thinking of applying the Kullback-Leibler distance in a for loop over the matching columns, to find the variable that shows the largest distance between the two groups.
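The loop idea above could be sketched as follows. This is only a sketch under assumptions: it assumes the FNN package is installed, and that `KLx.dist()` returns one estimate per neighborhood size from 1 to `k`, of which the last element is taken here as the k-NN estimate.

```r
# Sketch: compute a KL distance per shared column and pick the largest.
# Assumes df1 and df2 are defined as above and FNN is installed.
library(FNN)

vars <- c("Var1", "VarA", "VarT", "Var3")
dists <- sapply(vars, function(v) {
  d <- KLx.dist(df1[[v]], df2[[v]], k = 5)
  tail(d, 1)  # take the estimate for the largest neighborhood size
})
dists[which.max(dists)]  # variable with the largest estimated distance
```

Note that `which.max()` ignores `NaN`/`-Inf` values only if at least one finite estimate exists; with samples this small, the estimates may not be meaningful at all (see the answer below).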


To start with, I have tried to run this command from the FNN package:

require(FNN)
X <- df1[,2]
Y <- df2[,2]
KLx.dist(X, Y, k = 5)
[1]        NaN       -Inf -0.1928958  0.0312911  0.1972085


The result is quite odd: none of these distances are even close to each other! My question here would be: am I applying the test correctly? If yes, why do the distances differ so much?


Note: If any other tests can do the job, I am happy to try them.

Many thanks,

Answer


The problem is that you don't have enough data to accurately compute KL-divergence using nearest neighbors. Even for large datasets, this particular distance measure jumps around when the number of nearest neighbors is small. For example:

set.seed(123)
x <- rnorm(50000)
y <- rnorm(50000) + 0.1
plot(KLx.dist(x, y, 100))


You have 12 datapoints, so even choosing 6 nearest neighbors would be half the dataset. Have you considered simply using a t-test, which works with small samples?
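The t-test suggestion could look like this, as a minimal sketch using base R's `t.test()` (Welch's two-sample t-test by default) on each shared column, ranking variables by p-value:

```r
# Sketch: per-column Welch t-tests between the two groups.
# Assumes df1 and df2 are defined as in the question.
vars <- c("Var1", "VarA", "VarT", "Var3")
pvals <- sapply(vars, function(v) t.test(df1[[v]], df2[[v]])$p.value)
sort(pvals)  # smallest p-value = strongest evidence of a group difference
```

With only 12 and 10 observations per group, a t-test at least has well-understood small-sample behavior, unlike the nearest-neighbor KL estimator; running several tests like this does raise multiple-comparison concerns, so a correction such as `p.adjust(pvals, method = "holm")` may be worth considering.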

