Fisher's and Pearson's test for independence
Question
In R I have two datasets: group1 and group2.
For group1 I have 10 game_id values, each the id of a game, and a column number, which is the number of times that game has been played in group1.
So if we type group1 we get this output:
game_id number
1 758565
2 235289
...
10 87084
For group2 we get:
game_id number
1 79310
2 28564
...
10 9048
If I want to test whether there is a statistical difference between group1 and group2 for the first 2 game_id values, I can use Pearson's chi-squared test.
In R I simply create the matrix:
# The first 2 'numbers' in group1
a <- c( group1[1,2] , group1[2,2] )
# The first 2 'numbers' in group2
b <- c( group2[1,2], group2[2,2] )
# Creating it on matrix-form
m <- rbind(a,b)
So m gives us:
a 758565 235289
b 79310 28564
Here I can test H: "a is independent of b", that is, whether users in group1 split their play between game_id 1 and game_id 2 in the same proportion as users in group2.
In R we type chisq.test(m) and get a very low p-value, meaning that we can reject H: a and b are not independent.
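As a self-contained sketch of the test described above, the matrix can be built directly from the four counts shown in the question. The column names game_1 and game_2 and the variable row_share are illustrative names, not part of the original code.

```r
# Counts copied from the question: rows are the groups, columns are game_id 1 and 2
m <- rbind(a = c(758565, 235289),
           b = c(79310, 28564))
colnames(m) <- c("game_1", "game_2")  # illustrative labels

# Share of each game within each group (every row sums to 1)
row_share <- prop.table(m, margin = 1)
print(round(row_share, 4))

# Pearson chi-squared test of independence between group and game
res <- chisq.test(m)
print(res)
```

Looking at row_share next to the p-value shows in which direction the two groups differ, which the test statistic alone does not.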
How should one find the game_id values that are played significantly more in group1 than in group2?
Recommended answer
I created a simpler version with only 3 games. I'm using a chi-squared test and a test of equal proportions. Personally, I prefer the second one, as it shows you the percentages you're comparing. Run the script and make sure you understand the process.
# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
number_games = c(758565,235289,87084))
dt_group1
# game_id number_games
# 1 1 758565
# 2 2 235289
# 3 3 87084
# add extra variables
dt_group1$number_rest_games = sum(dt_group1$number_games) - dt_group1$number_games # needed for chisq.test
dt_group1$number_all_games = sum(dt_group1$number_games) # needed for prop.test
dt_group1$Prc = dt_group1$number_games / dt_group1$number_all_games # just to get an idea about the percentages
dt_group1
# game_id number_games number_rest_games number_all_games Prc
# 1 1 758565 322373 1080938 0.70176550
# 2 2 235289 845649 1080938 0.21767113
# 3 3 87084 993854 1080938 0.08056336
# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
number_games = c(79310,28564,9048))
# add extra variables
dt_group2$number_rest_games = sum(dt_group2$number_games) - dt_group2$number_games
dt_group2$number_all_games = sum(dt_group2$number_games)
dt_group2$Prc = dt_group2$number_games / dt_group2$number_all_games
# input the game id you want to investigate
input_game_id = 1
# create a table of successes (games played) and failures (games not played)
dt_test = rbind(c(dt_group1$number_games[dt_group1$game_id==input_game_id], dt_group1$number_rest_games[dt_group1$game_id==input_game_id]),
                c(dt_group2$number_games[dt_group2$game_id==input_game_id], dt_group2$number_rest_games[dt_group2$game_id==input_game_id]))
# perform chi sq test
chisq.test(dt_test)
# Pearson's Chi-squared test with Yates' continuity correction
#
# data: dt_test
# X-squared = 275.9, df = 1, p-value < 2.2e-16
# create a vector of successes (games played) and vector of total games
x = c(dt_group1$number_games[dt_group1$game_id==input_game_id], dt_group2$number_games[dt_group2$game_id==input_game_id])
y = c(dt_group1$number_all_games[dt_group1$game_id==input_game_id], dt_group2$number_all_games[dt_group2$game_id==input_game_id])
# perform test of proportions
prop.test(x,y)
# 2-sample test for equality of proportions with continuity correction
#
# data: x out of y
# X-squared = 275.9, df = 1, p-value < 2.2e-16
# alternative hypothesis: two.sided
# 95 percent confidence interval:
# 0.02063233 0.02626776
# sample estimates:
# prop 1 prop 2
# 0.7017655 0.6783155
The main point is that chisq.test is a test that compares counts/proportions, so you need to provide the number of "successes" and "failures" for the groups you compare (a contingency table as input). prop.test is another counts/proportions test, for which you provide the number of "successes" and the "totals".
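To make the relationship between the two inputs concrete, here is a minimal base-R sketch for game 1 that feeds the same counts to both functions. The names g1, g2, played, totals and tab are illustrative and not from the script above; the counts are the ones used in the answer.

```r
# Counts per game for each group, as in the answer
g1 <- c(758565, 235289, 87084)
g2 <- c(79310, 28564, 9048)

# Game 1: "successes" and "totals" per group
played <- c(g1[1], g2[1])
totals <- c(sum(g1), sum(g2))

# chisq.test wants successes and failures (a 2 x 2 contingency table)
tab <- cbind(played = played, not_played = totals - played)
chi <- chisq.test(tab)

# prop.test wants successes and totals
prp <- prop.test(played, totals)

# For two groups the two calls should give the same test statistic
print(all.equal(unname(chi$statistic), unname(prp$statistic)))
```

The only difference is how the data are passed in; prop.test additionally reports the two estimated proportions and a confidence interval for their difference.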
Now that you're happy with the result and have seen how the process works, I'll add a more efficient way to perform those tests.
The first one uses the dplyr and broom packages:
library(dplyr)
library(broom)
# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
number_games = c(758565,235289,87084),
group_id = 1) ## adding the id of the group
# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
number_games = c(79310,28564,9048),
group_id = 2) ## adding the id of the group
# combine datasets
dt = rbind(dt_group1, dt_group2)
dt %>%
  group_by(group_id) %>%                                        # for each group id
  mutate(number_all_games = sum(number_games),                  # create new columns
         number_rest_games = number_all_games - number_games,
         Prc = number_games / number_all_games) %>%
  group_by(game_id) %>%                                         # for each game
  do(tidy(prop.test(.$number_games, .$number_all_games))) %>%   # perform the test
  ungroup()
# game_id estimate1 estimate2 statistic p.value parameter conf.low conf.high
# (int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 1 0.70176550 0.67831546 275.89973 5.876772e-62 1 0.020632330 0.026267761
# 2 2 0.21767113 0.24429962 435.44091 1.063385e-96 1 -0.029216006 -0.024040964
# 3 3 0.08056336 0.07738492 14.39768 1.479844e-04 1 0.001558471 0.004798407
The other one uses the data.table and broom packages:
library(data.table)
library(broom)
# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
number_games = c(758565,235289,87084),
group_id = 1) ## adding the id of the group
# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
number_games = c(79310,28564,9048),
group_id = 2) ## adding the id of the group
# combine datasets
dt = data.table(rbind(dt_group1, dt_group2))
# create new columns for each group
dt[, number_all_games := sum(number_games), by=group_id]
dt[, `:=`(number_rest_games = number_all_games - number_games,
Prc = number_games / number_all_games) , by=group_id]
# for each game id compare percentages
dt[, tidy(prop.test(.SD$number_games, .SD$number_all_games)) , by=game_id]
# game_id estimate1 estimate2 statistic p.value parameter conf.low conf.high
# 1: 1 0.70176550 0.67831546 275.89973 5.876772e-62 1 0.020632330 0.026267761
# 2: 2 0.21767113 0.24429962 435.44091 1.063385e-96 1 -0.029216006 -0.024040964
# 3: 3 0.08056336 0.07738492 14.39768 1.479844e-04 1 0.001558471 0.004798407
You can see that each row represents one game, and the comparison is between group 1 and group 2. You can get the p-values from the corresponding column, along with the other information about the test/comparison.
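Since the question asks specifically for games played significantly more in group1, one possible follow-up is a one-sided version of the same comparison with a multiple-testing adjustment. This is a sketch, not part of the original answer: the one-sided alternative, the Holm correction and the 0.05 threshold are assumptions, and the names g1, g2, p_one_sided, result and more_in_group1 are illustrative.

```r
# Counts per game for each group, as in the answer
g1 <- c(758565, 235289, 87084)
g2 <- c(79310, 28564, 9048)
game_id <- 1:3

# One-sided test per game: is the share in group 1 greater than in group 2?
p_one_sided <- sapply(seq_along(game_id), function(i) {
  prop.test(x = c(g1[i], g2[i]),
            n = c(sum(g1), sum(g2)),
            alternative = "greater")$p.value
})

# Adjust for running one test per game (Holm is an assumed choice)
result <- data.frame(
  game_id      = game_id,
  share_group1 = g1 / sum(g1),
  share_group2 = g2 / sum(g2),
  p_adj        = p.adjust(p_one_sided, method = "holm")
)

# Games whose share is significantly larger in group 1 at the assumed 5% level
more_in_group1 <- result$game_id[result$p_adj < 0.05]
print(result)
print(more_in_group1)
```

With the three games above this should select games 1 and 3, which matches the positive confidence intervals in the output shown earlier, while game 2 has the larger share in group 2.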