Fisher's Exact in scipy as new column using pandas


Question

Using an IPython notebook, I have a pandas dataframe with 4 columns: numerator1, numerator2, denominator1 and denominator2.

Without iterating through each record, I am trying to create a fifth column titled FishersExact. I would like each value in that column to be the tuple returned by scipy.stats.fisher_exact, using values (or some derivation of the values) from each of the four columns as inputs.

df['FishersExact'] = scipy.stats.fisher_exact( [[df.numerator1, df.numerator2],
[df.denominator1 - df.numerator1 , df.denominator2 - df.numerator2]])

This returns:

/home/kevin/anaconda/lib/python2.7/site-packages/scipy/stats/stats.pyc in fisher_exact(table, alternative)
2544     c = np.asarray(table, dtype=np.int64)  # int32 is not enough for the algorithm
2545     if not c.shape == (2, 2):
-> 2546         raise ValueError("The input `table` must be of shape (2, 2).")
2547 
2548     if np.any(c < 0):

ValueError: The input `table` must be of shape (2, 2).

When I index only the first record of the dataframe:

odds,pval = scipy.stats.fisher_exact([[df.numerator1[0], df.numerator2[0]], 
[df.denominator1[0] - df.numerator1[0], df.denominator2[0] -df.numerator2[0]]])

This returns:

1.1825710754 0.581151431104

I'm essentially trying to emulate column-wise arithmetic such as:

df['freqnum1denom1'] = df.numerator1 / df.denominator1

which adds a new column to the dataframe, where each record's frequency is stored in the new column.

I'm probably missing something; any direction would be greatly appreciated, thank you!

Recommended answer

It looks like you're building a matrix of pandas Series and passing it to the function. The function wants a 2x2 matrix of scalars, so it has to be called once per row. Those two things are not the same: a nested list of four Series is not a (2, 2) table, which is why the shape check raises the ValueError.
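To see why the shape check fails, here is a small sketch with made-up data (the values are illustrative only, not from the question). It mirrors the np.asarray call visible in the traceback: stacking four Series of length n gives an array of shape (2, 2, n), not (2, 2).

```python
import numpy as np
import pandas as pd

# Hypothetical data with three records; only the column names come from the question.
df = pd.DataFrame({
    "numerator1": [10, 3, 25],
    "numerator2": [12, 9, 20],
    "denominator1": [100, 40, 60],
    "denominator2": [110, 45, 70],
})

# The same nested list that was passed to fisher_exact in the question.
table_of_series = [
    [df.numerator1, df.numerator2],
    [df.denominator1 - df.numerator1, df.denominator2 - df.numerator2],
]

# fisher_exact converts its input like this before checking the shape.
c = np.asarray(table_of_series, dtype=np.int64)
print(c.shape)  # one 2x2 table per record, stacked along the last axis

# Slicing out a single record recovers a valid (2, 2) table of scalars.
first_table = c[:, :, 0]
print(first_table.shape)
```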

There are (at least) two ways to go here.

Using apply

You can use pandas' apply:

df['FishersExact'] = df.apply(
    lambda r: scipy.stats.fisher_exact([[r.numerator1, ... ]]),
    axis=1)

Note the following:

  • axis=1 applies a function to each row.

  • Inside the lambda, each field such as r.numerator1 is a scalar.
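A fuller sketch of the apply approach, with the 2x2 table written out the way the question builds it. The sample data, the helper name fisher_row, and the intermediate column names odds_ratio and p_value are assumptions made for illustration.

```python
import pandas as pd
from scipy import stats

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "numerator1": [10, 3, 25],
    "numerator2": [12, 9, 20],
    "denominator1": [100, 40, 60],
    "denominator2": [110, 45, 70],
})

def fisher_row(r):
    """Run Fisher's exact test on the 2x2 table built from one row."""
    table = [
        [r.numerator1, r.numerator2],
        [r.denominator1 - r.numerator1, r.denominator2 - r.numerator2],
    ]
    odds, pval = stats.fisher_exact(table)
    return pd.Series({"odds_ratio": odds, "p_value": pval})

# apply still calls the function once per row; it only hides the loop.
result = df.apply(fisher_row, axis=1)

# Store the (odds ratio, p-value) pair as a tuple, as the question asks.
df["FishersExact"] = list(zip(result.odds_ratio, result.p_value))
print(df)
```

Keeping the odds ratio and the p-value in two separate columns (as in result) is usually easier to filter and sort on than a single column of tuples.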

Back to basics

Fisher's exact test can be expressed as vectorized operations on the original columns, which should be much faster. For maximum speed you would need a vectorized version of factorial (I don't know of one). This could even be a separate (good!) SO question.
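One possible sketch of the vectorized route, under stated assumptions: scipy.special.gammaln(n + 1) equals log(n!) and works elementwise on arrays, so it can stand in for a vectorized factorial, and scipy.stats.hypergeom.sf broadcasts over arrays. Note that this computes the probability of the observed table and the one-sided ("greater") p-value only; it is not the two-sided p-value that fisher_exact returns by default. The sample data and the new column names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import stats
from scipy.special import gammaln

# Hypothetical data; only the input column names come from the question.
df = pd.DataFrame({
    "numerator1": [10, 3, 25],
    "numerator2": [12, 9, 20],
    "denominator1": [100, 40, 60],
    "denominator2": [110, 45, 70],
})

# Cells of the 2x2 table [[a, b], [c, d]], one array entry per record.
a = df.numerator1.to_numpy()
b = df.numerator2.to_numpy()
c = (df.denominator1 - df.numerator1).to_numpy()
d = (df.denominator2 - df.numerator2).to_numpy()
total = a + b + c + d

def log_factorial(n):
    """Elementwise log(n!) via the log-gamma function."""
    return gammaln(n + 1)

# Hypergeometric probability of the observed table, computed in log space.
log_p = (
    log_factorial(a + b) + log_factorial(c + d)
    + log_factorial(a + c) + log_factorial(b + d)
    - log_factorial(total)
    - log_factorial(a) - log_factorial(b)
    - log_factorial(c) - log_factorial(d)
)
df["p_observed_table"] = np.exp(log_p)

# One-sided p-value P(X >= a) with fixed margins, for all rows at once.
df["p_greater"] = stats.hypergeom.sf(a - 1, total, a + b, a + c)
print(df)
```

The two-sided p-value sums the probabilities of every table that is no more likely than the observed one, and the set of such tables differs from row to row, so it does not reduce to a single broadcast call in the same way.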
