Spearman rank correlation with missing values?


Question

I have two lists of words, each ordered by number of occurrences.

The orderings were generated by counting each word in two files sampled at different points in time.

I would like to calculate Spearman's rank correlation to see how well the ordering in the first file is preserved in the second file.

For example:

File a: 1) is 2) went 3) work

File b: 1) is 2) work 3) went

Because the ordering is different, I would not get a score of 1.0, but I would still expect a score suggesting that the two samples are rather similar.

My problem now is missing values: a word in file A might not exist in file B. Can I use Spearman's rank correlation in this case, or would another correlation measure be better suited?

Recommended answer

When it comes to ranks, your application does not need to have missing values. When a word occurs in one file but not in the other, you can give it the last rank in the other file (or an equal last rank when several words are missing).
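The last-rank idea above can be sketched in plain Python. This is only an illustration under some assumptions: each input list holds unique words ordered from most to least frequent, tied ranks are averaged (the usual convention for Spearman with ties), and the function names `average_ranks`, `pearson`, and `spearman_with_missing` are made up for this example.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of the positions they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when all ranks are equal")
    return cov / (var_a * var_b) ** 0.5


def spearman_with_missing(list_a, list_b):
    """Spearman correlation over the union of both word lists.

    Assumption: a word absent from a list is placed just past the end of
    that list, so all absent words tie for last rank in it.
    """
    vocab = sorted(set(list_a) | set(list_b))
    pos_a = {w: i for i, w in enumerate(list_a)}
    pos_b = {w: i for i, w in enumerate(list_b)}
    raw_a = [pos_a.get(w, len(list_a)) for w in vocab]
    raw_b = [pos_b.get(w, len(list_b)) for w in vocab]
    # Spearman's coefficient is the Pearson correlation of the ranks.
    return pearson(average_ranks(raw_a), average_ranks(raw_b))


file_a = ["is", "went", "work"]
file_b = ["is", "work", "went"]
print(spearman_with_missing(file_a, file_b))
```

For the two three-word lists from the question this prints 0.5. Dropping "went" from `file_b` gives the same value, because the missing word is simply ranked last there.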

However, I am not sure how lots of missing values (lots of tied last ranks) would affect the Spearman value. You might instead consider using a standard correlation/regression on the raw relative frequencies rather than the Spearman coefficient.

Example...

Say file x has m = 113 words and file y has n = 234 words. We can create a table of relative word frequencies like so:

word          x         y

is            5/113     23/234
[...]         4/113     45/234
a             4/113     17/234
Farnarkling   1/113     0/234
elbow         0/113     2/234
...
==============================
Total         113/113   234/234

You would then calculate:

word          x         y         u = x*y      v = x*x

is            5/113     23/234    115/26442    25/12769
[...]         4/113     45/234    180/26442    16/12769
a             4/113     17/234    68/26442     16/12769
Farnarkling   1/113     0/234     0/26442      1/12769
elbow         0/113     2/234     0/26442      0/12769
...
=============================================================
Total         113/113   234/234   s = (sum of u)   t = (sum of v)

Your answer is given by s/t, which is the least-squares slope of a line through the origin fitting y to x. Because x and y are relative frequencies, a value close to 1 suggests a good correspondence.
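As a rough sketch, the s/t calculation can be done with exact fractions in Python. The function name `frequency_slope` and the toy word counts below are invented for illustration (they are not the counts from the table, which is only partly shown), and the sketch assumes both files are non-empty.

```python
from fractions import Fraction


def frequency_slope(counts_x, counts_y):
    """Return s/t, where s = sum(x*y) and t = sum(x*x) over all words.

    counts_x and counts_y map each word to its raw count in file x and
    file y; x and y are the relative frequencies (count / file total).
    """
    m = sum(counts_x.values())  # total words in file x
    n = sum(counts_y.values())  # total words in file y
    s = Fraction(0)
    t = Fraction(0)
    for word in set(counts_x) | set(counts_y):
        # A word missing from a file simply has relative frequency 0 there.
        x = Fraction(counts_x.get(word, 0), m)
        y = Fraction(counts_y.get(word, 0), n)
        s += x * y
        t += x * x
    return s / t


# Hypothetical toy counts, chosen only to keep the arithmetic small.
toy_x = {"is": 5, "a": 4, "nest": 1}
toy_y = {"is": 10, "a": 8, "elbow": 2}
print(frequency_slope(toy_x, toy_y))  # 41/42
# Identical relative frequencies give exactly 1.
print(frequency_slope(toy_x, toy_x))  # 1
```

Using `Fraction` keeps the arithmetic in the same form as the table (for example 115/26442) and avoids floating-point rounding; call `float()` on the result if a decimal is preferred.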

Some possibly useful links are:

https://statistics.laerd.com/statistical-guides/spearmans-rank-order-correlation-statistical-guide.php

http://en.wikipedia.org/wiki/Simple_linear_regression
