R - Inconsistent p-value in running Spearman correlation
Problem description
My problem is that when I compute a running correlation, for some odd reason I do not get the same p-value for the same estimate/correlation values.
My goal is to calculate a running Spearman correlation on two vectors in the same data.frame (subject1 and subject2 in the example below). In addition, my window (the length of the vector) and stride (the jump/step between consecutive windows) are constant. As such, looking at the formula below (from wiki), I should get the same critical t, and hence the same p-value, for the same Spearman correlation. This is because n stays the same (it is the same window size) and r is the same. However, my final p-value is different.
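For reference, the t approximation usually quoted for Spearman's rho, which is presumably the formula meant here (an assumption, since the formula itself is not reproduced in the post), is:

```latex
t = r \sqrt{\frac{n - 2}{1 - r^{2}}}
```

Under the null hypothesis this statistic is approximately Student-t distributed with n - 2 degrees of freedom, so for a fixed window size n the p-value depends only on r.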
#Needed pkgs
require(tidyverse)
require(pspearman)
require(gtools)
#Sample data
set.seed(528)
subject1 <- rnorm(40, mean = 85, sd = 5)
set.seed(528)
subject2 <- c(
  lag(subject1[1:21]) - 10,
  rnorm(n = 6, mean = 85, sd = 5),
  lag(subject1[length(subject1):28]) - 10)
df <- data.frame(subject1 = subject1,
                 subject2 = subject2) %>%
  rowid_to_column(var = "Time")
df[is.na(df)] <- subject1[1] - 10
rm(subject1, subject2)
#Function for Spearman
psSpearman <- function(x, y)
{
  out <- pspearman::spearman.test(x, y,
                                  alternative = "two.sided",
                                  approximation = "t-distribution") %>%
    broom::tidy()
  return(data.frame(estimate = out$estimate,
                    statistic = out$statistic,
                    p.value = out$p.value))
}
#Running correlation along the subjects
dfRunningCor <- running(df$subject1, df$subject2,
                        fun = psSpearman,
                        width = 20,
                        allow.fewer = FALSE,
                        by = 1,
                        pad = FALSE,
                        align = "right") %>%
  t() %>%
  as.data.frame()
#Arranging the Results into easy to handle data.frame
Results <- do.call(rbind.data.frame, dfRunningCor) %>%
  t() %>%
  as.data.frame() %>%
  rownames_to_column(var = "Win") %>%
  gather(CorValue, Value, -Win) %>%
  separate(Win, c("fromIndex", "toIndex")) %>%
  mutate(fromIndex = as.numeric(substring(fromIndex, 2)),
         toIndex = as.numeric(toIndex)) %>%
  spread(CorValue, Value) %>%
  arrange(fromIndex) %>%
  select(fromIndex, toIndex, estimate, statistic, p.value)
My problem is that when I plot the Results with the estimates (Spearman rho; estimate) against the window number (fromIndex) and color by p-value, I should get something like a "tunnel"/"path" of the same color across the same area, but I don't.
For example, in the picture below, points at the same height in the red circle should have the same color, but they don't.
Code for the graph:
Results %>%
ggplot(aes(fromIndex, estimate, color = p.value)) +
geom_line()
What I have found so far is that it might be due to:
1. Functions like Hmisc::rcorr() tend not to give the same p.value with small samples or many ties. This is why I use pspearman::spearman.test, which, from what I read here, is supposed to solve this problem.
2. Small sample size - I tried using a bigger sample size. I still get the same problem.
3. I tried rounding my p-values - I still get the same problem.
Thank you for your help!
Could it be "pseudo" coloring by ggplot? Could it be that ggplot just carries the "last" color forward until the next point? Is that why I get "light blue" from point 5 to 6 but "dark blue" from point 7 to 8?
Recommended answer
The results you obtain for the p.value variable are consistent with the estimate values.
You can check it as follows:
Results$orderestimate <- order(-abs(Results$estimate))
Results$orderp.value <- order(abs(Results$p.value))
identical(Results$orderestimate ,Results$orderp.value)
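A complementary check, as a minimal sketch using only base R: under the t approximation the two-sided p-value is a function of rho and n alone, so equal estimates in equal-sized windows must give equal p-values. The helper name spearman_t_pvalue and the simulated x and y are illustrative assumptions, and stats::cor.test with exact = FALSE stands in here for pspearman::spearman.test.

```r
# Hypothetical helper: two-sided p-value from the t approximation,
# given only Spearman's rho and the window size n.
spearman_t_pvalue <- function(rho, n) {
  t_stat <- rho * sqrt((n - 2) / (1 - rho^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

# Illustrative data (not the data from the question)
set.seed(1)
x <- rnorm(20)
y <- x + rnorm(20)

rho <- cor(x, y, method = "spearman")
manual_p <- spearman_t_pvalue(rho, n = length(x))

# Base R uses the same t approximation when exact = FALSE
builtin_p <- cor.test(x, y, method = "spearman", exact = FALSE)$p.value

# The two p-values should agree up to floating-point tolerance
isTRUE(all.equal(manual_p, builtin_p))
```

Because the helper uses abs(t_stat), estimates of equal magnitude but opposite sign also map to the same p-value, which matches ordering the estimates by -abs(estimate) in the check above.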
I don't think you should use colour for the p.value in the graph; it is an unnecessary visual distraction and it is hard to interpret.
If I were you, I would display only the p.value and perhaps add a point to indicate the sign of the estimate variable.
p <- Results %>%
ggplot(aes(fromIndex, p.value)) +
geom_line()
# If you want to display the sign of the estimate
Results$estimate.sign <- as.factor(sign(Results$estimate))
p+geom_point( aes(color = estimate.sign ))