Fisher's and Pearson's test for independence


Question

In R I have 2 datasets: group1 and group2.

For group1 I have 10 game_id values, each the id of a game, and a column number, which is the number of times that game has been played in group1.

So if we type

group1

we get this output

game_id  number
1        758565
2        235289
...
10       87084

For group2 we get

game_id  number
1        79310
2        28564
...
10       9048

If I want to test whether there is a statistical difference between group1 and group2 for the first 2 game_id values, I can use Pearson's chi-square test.

In R I simply create the matrix

# The first 2 'numbers' in group1
a <- c( group1[1,2] , group1[2,2] )
# The first 2 'numbers' in group2
b <- c( group2[1,2], group2[2,2] )
# Creating it on matrix-form
m <- rbind(a,b)

So m gives us

a 758565  235289
b 79310  28564

Here I can test H: "a is independent of b", meaning that the ratio of plays of game_id 1 to plays of game_id 2 is the same in group1 as in group2.

In R we type chisq.test(m) and get a very low p-value, so we can reject H and conclude that a and b are not independent.
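
As a self-contained sketch of the steps above, the matrix can be built directly from the counts shown in the question rather than from the group1 and group2 objects. The column labels and the name res are added here for illustration only and are not part of the original code.

```r
# Counts for game_id 1 and 2, copied from the output shown above
a <- c(758565, 235289)  # group1
b <- c(79310, 28564)    # group2

# 2 x 2 contingency table: rows are groups, columns are games
m <- rbind(a, b)
colnames(m) <- c("game_1", "game_2")  # labels added for illustration only

# For a 2 x 2 table chisq.test applies Yates' continuity correction by default
res <- chisq.test(m)
res$p.value  # very small here, so H is rejected

# Row-wise shares show the direction of the difference between the groups
prop.table(m, margin = 1)
```

The row-wise shares from prop.table are worth printing alongside the test, because the p-value alone does not say which group plays game_id 1 relatively more.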

How should one find the game_ids that are played significantly more in group1 than in group2?

Recommended answer

I created a simpler version with only 3 games. I'm using a chi-squared test and a test for comparing proportions. Personally, I prefer the second one, as it gives you an idea of which percentages you're comparing. Run the script and make sure you understand the process.

# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
                       number_games = c(758565,235289,87084))

dt_group1

#   game_id number_games
# 1       1       758565
# 2       2       235289
# 3       3        87084


# add extra variables
dt_group1$number_rest_games = sum(dt_group1$number_games) - dt_group1$number_games   # needed for chisq.test
dt_group1$number_all_games = sum(dt_group1$number_games)  # needed for prop.test
dt_group1$Prc = dt_group1$number_games / dt_group1$number_all_games  # just to get an idea about the percentages

dt_group1

#   game_id number_games number_rest_games number_all_games        Prc
# 1       1       758565            322373          1080938 0.70176550
# 2       2       235289            845649          1080938 0.21767113
# 3       3        87084            993854          1080938 0.08056336



# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
                       number_games = c(79310,28564,9048))

# add extra variables
dt_group2$number_rest_games = sum(dt_group2$number_games) - dt_group2$number_games
dt_group2$number_all_games = sum(dt_group2$number_games)
dt_group2$Prc = dt_group2$number_games / dt_group2$number_all_games




# input the game id you want to investigate
input_game_id = 1

# create a table of successes (games played) and failures (games not played)
dt_test = rbind(c(dt_group1$number_games[dt_group1$game_id==input_game_id], dt_group1$number_rest_games[dt_group1$game_id==input_game_id]),
                c(dt_group2$number_games[dt_group2$game_id==input_game_id], dt_group2$number_rest_games[dt_group2$game_id==input_game_id]))

# perform chi sq test
chisq.test(dt_test)

# Pearson's Chi-squared test with Yates' continuity correction
# 
# data:  dt_test
# X-squared = 275.9, df = 1, p-value < 2.2e-16


# create a vector of successes (games played) and vector of total games
x = c(dt_group1$number_games[dt_group1$game_id==input_game_id], dt_group2$number_games[dt_group2$game_id==input_game_id])
y = c(dt_group1$number_all_games[dt_group1$game_id==input_game_id], dt_group2$number_all_games[dt_group2$game_id==input_game_id])

# perform test of proportions
prop.test(x,y)

# 2-sample test for equality of proportions with continuity correction
# 
# data:  x out of y
# X-squared = 275.9, df = 1, p-value < 2.2e-16
# alternative hypothesis: two.sided
# 95 percent confidence interval:
#   0.02063233 0.02626776
# sample estimates:
#   prop 1    prop 2 
# 0.7017655 0.6783155 

The main point is that chisq.test compares counts/proportions, so you need to provide the number of "successes" and "failures" for each group you compare (a contingency table as input). prop.test is another counts/proportions test, for which you need to provide the number of "successes" and the "totals".
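
To make the difference between the two input formats concrete, here is a minimal sketch for game_id 1 using the counts from the example above. The names played, total, tab, chi and prp are illustrative. For a 2 x 2 table both functions apply the same continuity-corrected statistic by default, so the two results should agree.

```r
# Counts for game_id 1, taken from the example above
played <- c(758565, 79310)    # "successes": plays of game 1 in group 1 and group 2
total  <- c(1080938, 116922)  # "totals": all plays in group 1 and group 2

# chisq.test takes successes and failures as a contingency table
tab <- cbind(played, not_played = total - played)
chi <- chisq.test(tab)

# prop.test takes successes and totals
prp <- prop.test(played, total)

# Compare the test statistic and p-value of the two calls
all.equal(unname(chi$statistic), unname(prp$statistic))
all.equal(chi$p.value, prp$p.value)

# prop.test additionally reports both proportions and a
# confidence interval for their difference
prp$estimate
prp$conf.int
```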

Now that you're happy with the result and have seen how the process works, I'll add a more efficient way to perform those tests.

The first one uses the dplyr and broom packages:

library(dplyr)
library(broom)

# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
                       number_games = c(758565,235289,87084),
                       group_id = 1)  ## adding the id of the group

# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
                       number_games = c(79310,28564,9048),
                       group_id = 2)  ## adding the id of the group

# combine datasets
dt = rbind(dt_group1, dt_group2)


dt %>%
  group_by(group_id) %>%                                           # for each group id
  mutate(number_all_games = sum(number_games),                     # create new columns
         number_rest_games = number_all_games - number_games,
         Prc = number_games / number_all_games) %>%
  group_by(game_id) %>%                                            # for each game
  do(tidy(prop.test(.$number_games, .$number_all_games))) %>%      # perform the test
  ungroup()


#   game_id  estimate1  estimate2 statistic      p.value parameter     conf.low    conf.high
#     (int)      (dbl)      (dbl)     (dbl)        (dbl)     (dbl)        (dbl)        (dbl)
# 1       1 0.70176550 0.67831546 275.89973 5.876772e-62         1  0.020632330  0.026267761
# 2       2 0.21767113 0.24429962 435.44091 1.063385e-96         1 -0.029216006 -0.024040964
# 3       3 0.08056336 0.07738492  14.39768 1.479844e-04         1  0.001558471  0.004798407

The other one uses the data.table and broom packages:

library(data.table)
library(broom)

# dataset of group 1
dt_group1 = data.frame(game_id = 1:3,
                       number_games = c(758565,235289,87084),
                       group_id = 1)  ## adding the id of the group

# dataset of group 2
dt_group2 = data.frame(game_id = 1:3,
                       number_games = c(79310,28564,9048),
                       group_id = 2)  ## adding the id of the group

# combine datasets
dt = data.table(rbind(dt_group1, dt_group2))

# create new columns for each group
dt[, number_all_games := sum(number_games), by=group_id]

dt[, `:=`(number_rest_games = number_all_games - number_games,
          Prc = number_games / number_all_games) , by=group_id]

# for each game id compare percentages
dt[, tidy(prop.test(.SD$number_games, .SD$number_all_games)) , by=game_id]


#    game_id  estimate1  estimate2 statistic      p.value parameter     conf.low    conf.high
# 1:       1 0.70176550 0.67831546 275.89973 5.876772e-62         1  0.020632330  0.026267761
# 2:       2 0.21767113 0.24429962 435.44091 1.063385e-96         1 -0.029216006 -0.024040964
# 3:       3 0.08056336 0.07738492  14.39768 1.479844e-04         1  0.001558471  0.004798407

You can see that each row represents one game and that the comparison is between group 1 and group 2. You can get the p-values from the corresponding column, along with the other information about the test/comparison.
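
The tables above come from two-sided tests, while the question asks for games played significantly more in group 1. One possible way to get at that, which goes beyond the code above and should be read as an assumption rather than as part of the original answer, is a one-sided prop.test (alternative = "greater") per game, followed by a multiple-testing adjustment with p.adjust. The names g1, g2, p_raw, p_adj and result are illustrative, and Holm's method is just one common choice of adjustment.

```r
# Counts from the 3-game example above
games <- 1:3
g1 <- c(758565, 235289, 87084)  # plays per game in group 1
g2 <- c(79310, 28564, 9048)     # plays per game in group 2

# One-sided test per game: is the share of this game
# larger in group 1 than in group 2?
p_raw <- sapply(seq_along(games), function(i) {
  prop.test(c(g1[i], g2[i]), c(sum(g1), sum(g2)),
            alternative = "greater")$p.value
})

# Adjust the p-values because several tests are run at once
p_adj <- p.adjust(p_raw, method = "holm")

result <- data.frame(game_id = games,
                     share_group1 = g1 / sum(g1),
                     share_group2 = g2 / sum(g2),
                     p_adj = p_adj)

# Games whose share is significantly larger in group 1 at the 5% level
result[result$p_adj < 0.05, ]
```

With counts this large, even small differences in share come out as significant, so it is worth reading the shares themselves next to the adjusted p-values.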
