Get `chisq.test()$p.value` for several groups using `dplyr::group_by()`


Problem description

I'm trying to conduct a chi-square test on several groups within the dplyr framework. The problem is that group_by() %>% summarise() doesn't seem to do the trick.

Simulated data (same structure as the problematic data, but random, so the p-values should be high):

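library(dplyr)

set.seed(1)
# Note: the third positional argument of sample() is `replace`, so 0.6, 0.7 and 0.8
# are coerced to TRUE (sampling with replacement); they are not probabilities.
foo <- data.frame(partido = sample(c("PRI", "PAN"), 100, 0.6),
                  genero  = sample(c("H", "M"), 100, 0.7),
                  GM      = sample(c("Bajo", "Muy bajo"), 100, 0.8))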

I want to compare several groups defined by GM to see whether the p-values for the crosstab of partido and genero change, conditional on GM.

The obvious dplyr way should be:

foo %>% 
  group_by(GM) %>% 
  summarise(pvalue=chisq.test(.$partido, .$genero)$p.value)  #just the p.value, so summarise is happy

But I get the p-value for the ungrouped data, just repeated twice, not the p-value for each table:

# A tibble: 2 × 2
        GM    pvalue
    <fctr>     <dbl>
1     Bajo 0.8660521
2 Muy bajo 0.8660521

Testing each group separately with filter() (and magrittr's %$% exposition pipe), I get:

foo %>% 
  filter(GM=="Bajo") %$% 
  table(partido, genero) %>% 
  chisq.test()

Returns: X-squared = 0.015655, df = 1, p-value = 0.9004

foo %>% 
  filter(GM=="Muy bajo") %$% 
  table(partido, genero) %>% chisq.test()

Returns: X-squared = 0.50409, df = 1, p-value = 0.4777
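The two filter() checks can also be run in one pass with base R, which gives a handy reference to compare any dplyr result against. This is only a sketch: the `demo` data frame and the name `pvals` are made up for illustration (a deterministic stand-in for `foo`, built so that one group shows no association and the other a perfect one).

```r
# Hypothetical, deterministic stand-in for `foo` (not the data from the question).
demo <- data.frame(
  GM      = rep(c("Bajo", "Muy bajo"), each = 40),
  partido = rep(c("PRI", "PAN"), times = 40),
  genero  = c(rep(c("H", "H", "M", "M"), times = 10), rep(c("H", "M"), times = 20))
)

# Split the rows by GM and run one chi-square test per piece.
pvals <- sapply(split(demo, demo$GM),
                function(d) chisq.test(d$partido, d$genero)$p.value)
print(pvals)
```

The result is a named numeric vector with one p-value per level of GM, so a grouped summarise() that returns the same number for every group is easy to spot.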

dplyr::summarise() works with functions that take more than one argument, so this shouldn't be the problem:

data.frame(a=1:10, b=10:1, c=sample(c("Grupo 1", "Grupo 2"), 10, 0.5)) %>% 
    group_by(c) %>% 
    summarise(r=cor(a, b))

This works like a charm. It just doesn't seem to work with chisq.test().

I managed to get what I wanted with nested models using tidyr::nest() and purrr::map(), but I find the code cumbersome, at least for my students. In fact, I've invested many hours teaching them dplyr (they are a group that finds math and programming very challenging) so that they could avoid vector functions as much as possible.

foo %>% 
  nest(-GM) %>% 
  mutate(tabla=map(data, ~table(.))) %>% 
  mutate(pvalue=map(tabla, ~chisq.test(.)$p.value)) %>% 
  select(GM, pvalue) %>% 
  unnest()

# A tibble: 2 × 2
       GM   pvalue
    <fctr>  <dbl>
1     Bajo  0.9004276
2 Muy bajo  0.4777095
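For readers on a recent tidyr (1.0 or later is assumed here), the nested approach can be shortened somewhat. The sketch below is not the asker's code: it uses a made-up deterministic data frame `demo` in place of `foo`, names the nested column explicitly, and uses purrr::map_dbl() so that no unnest() step is needed.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical, deterministic stand-in for `foo` (not the data from the question).
demo <- data.frame(
  GM      = rep(c("Bajo", "Muy bajo"), each = 40),
  partido = rep(c("PRI", "PAN"), times = 40),
  genero  = c(rep(c("H", "H", "M", "M"), times = 10), rep(c("H", "M"), times = 20))
)

res_nested <- demo %>%
  nest(data = -GM) %>%                                    # one row per GM, other columns in `data`
  mutate(pvalue = map_dbl(data, ~ chisq.test(table(.x))$p.value)) %>%
  select(GM, pvalue)
print(res_nested)
```

map_dbl() returns a plain numeric vector instead of a list, which is why the result is already a flat two-column table.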

do() also does the trick (tidy() comes from the broom package):

foo %>% 
  group_by(GM) %>% 
  do(tidy(chisq.test(.$partido, .$genero)))

Source: local data frame [2 x 5]
Groups: GM [2]
    GM statistic   p.value parameter
<fctr>     <dbl>     <dbl>     <int>
1     Bajo 0.0156553 0.9004276         1
2 Muy bajo 0.5040878 0.4777095         1
# ... with 1 more variables: method <fctr>
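In newer dplyr versions, group_modify() plays a similar role to do(). The following is a sketch under assumptions rather than a drop-in replacement: it presumes a dplyr version that provides group_modify() and an installed broom, and it uses a made-up deterministic data frame `demo` in place of `foo`.

```r
library(dplyr)
library(broom)

# Hypothetical, deterministic stand-in for `foo` (not the data from the question).
demo <- data.frame(
  GM      = rep(c("Bajo", "Muy bajo"), each = 40),
  partido = rep(c("PRI", "PAN"), times = 40),
  genero  = c(rep(c("H", "H", "M", "M"), times = 10), rep(c("H", "M"), times = 20))
)

res_tidy <- demo %>%
  group_by(GM) %>%
  # `.x` is the data frame for the current group (without the GM column)
  group_modify(~ tidy(chisq.test(.x$partido, .x$genero)))
print(res_tidy)
```

Unlike the `.` placeholder of the pipe, `.x` inside group_modify() is supplied by dplyr once per group, so each row of the result comes from that group's own table.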

As in: Fisher's and Pearson's test for independence

But why doesn't group_by() work with summarise(chisq.test()$p.value)?

Recommended answer

In dplyr you can generally just use unquoted variable names to access the relevant columns, whether you're working on grouped data or not. In the original code, `.` is the pipe's placeholder for the whole data frame passed into summarise(), not for the current group, so .$partido and .$genero always refer to all 100 rows and every group gets the same p-value. Removing the unneeded .$ accessors from .$partido and .$genero, I get:

foo %>% 
    group_by(GM) %>% 
    summarise(pvalue= chisq.test(partido, genero)$p.value) 

# A tibble: 2 × 2
        GM    pvalue
    <fctr>     <dbl>
1     Bajo 0.9004276
2 Muy bajo 0.4777095
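The difference between `.` and a bare column name inside a grouped summarise() can be seen on a tiny example. This is a minimal sketch with made-up data (`toy`, `n_dot` and `n_group` are illustrative names), and it assumes the magrittr pipe `%>%`, since the `.` placeholder belongs to that pipe.

```r
library(dplyr)

# Hypothetical toy data: 3 rows in group "a", 2 rows in group "b".
toy <- data.frame(g = c("a", "a", "a", "b", "b"), x = 1:5)

res <- toy %>%
  group_by(g) %>%
  summarise(
    n_dot   = length(.$x),  # `.` is the whole piped data frame: all 5 rows, in both groups
    n_group = length(x)     # bare `x` is the current group's slice: 3 rows, then 2
  )
print(res)
```

The n_dot column is the same in every row because `.$x` ignores the grouping, which mirrors the repeated p-value in the question.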
