Fisher's Exact in scipy as new column using pandas
Question
Using an IPython notebook, I have a pandas dataframe with 4 columns: numerator1, numerator2, denominator1 and denominator2.
Without iterating through each record, I am trying to create a fifth column titled FishersExact. I would like each value in the column to store the tuple returned by scipy.stats.fisher_exact, using values (or some derivation of the values) from each of the four columns as inputs.
df['FishersExact'] = scipy.stats.fisher_exact( [[df.numerator1, df.numerator2],
[df.denominator1 - df.numerator1 , df.denominator2 - df.numerator2]])
This returns:
/home/kevin/anaconda/lib/python2.7/site-packages/scipy/stats/stats.pyc in fisher_exact(table, alternative)
2544 c = np.asarray(table, dtype=np.int64) # int32 is not enough for the algorithm
2545 if not c.shape == (2, 2):
-> 2546 raise ValueError("The input `table` must be of shape (2, 2).")
2547
2548 if np.any(c < 0):
ValueError: The input `table` must be of shape (2, 2).
When I index only the first record of the dataframe:
odds,pval = scipy.stats.fisher_exact([[df.numerator1[0], df.numerator2[0]],
[df.denominator1[0] - df.numerator1[0], df.denominator2[0] -df.numerator2[0]]])
This returns:
1.1825710754 0.581151431104
I'm essentially trying to emulate the arithmetic functionality similar to:
df['freqnum1denom1'] = df.numerator1 / df.denominator1
which returns a new column added to the dataframe where each record's frequency is in the new column.
Probably missing something, any direction would be greatly appreciated, thank you!
Recommended answer
It looks like you're building a matrix of pandas Series and passing it to the function in a single call. The function wants a 2x2 matrix of scalars, which means calling it once per row. These two things are not the same.
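The shape mismatch can be seen directly, without calling scipy at all. The following is a minimal sketch; the sample values are made up for illustration, and only the column names come from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data using the question's column names.
df = pd.DataFrame({
    "numerator1": [3, 5, 2],
    "numerator2": [1, 4, 6],
    "denominator1": [4, 10, 8],
    "denominator2": [4, 9, 12],
})

# Nesting whole columns produces a 2 x 2 x n_rows array, not a 2 x 2 table.
table_of_series = np.asarray(
    [[df.numerator1, df.numerator2],
     [df.denominator1 - df.numerator1, df.denominator2 - df.numerator2]]
)
series_shape = table_of_series.shape

# Nesting the scalars taken from a single row produces the expected 2 x 2 table.
first = df.iloc[0]
table_of_scalars = np.asarray(
    [[first.numerator1, first.numerator2],
     [first.denominator1 - first.numerator1,
      first.denominator2 - first.numerator2]]
)
scalar_shape = table_of_scalars.shape

print(series_shape, scalar_shape)
```

The extra trailing dimension in the first shape is what trips the (2, 2) check quoted in the traceback.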
There are (at least) two ways to go here.
Using apply
df['FishersExact'] = df.apply(
lambda r: scipy.stats.fisher_exact([[r.numerator1, ... ]]),
axis=1)
Note the following:
- axis=1 applies the function to each row.
- Inside the lambda, r.numerator1 is a scalar.
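For reference, a complete version of the apply approach might look like the sketch below. The sample data and the helper name fisher_row are hypothetical, and the 2x2 table layout is taken from the question's own attempt rather than from the elided snippet above.

```python
import pandas as pd
from scipy import stats

# Hypothetical sample data using the question's column names.
df = pd.DataFrame({
    "numerator1": [3, 5, 2],
    "numerator2": [1, 4, 6],
    "denominator1": [4, 10, 8],
    "denominator2": [4, 9, 12],
})


def fisher_row(r):
    """Run Fisher's exact test on the 2x2 table built from one row."""
    table = [
        [r.numerator1, r.numerator2],
        [r.denominator1 - r.numerator1, r.denominator2 - r.numerator2],
    ]
    odds, pval = stats.fisher_exact(table)
    # Return a plain tuple so each cell holds (odds ratio, p-value).
    return (float(odds), float(pval))


# axis=1 passes one row at a time, so every value inside fisher_row is a scalar.
df["FishersExact"] = df.apply(fisher_row, axis=1)

# Optionally split the tuples into two numeric columns.
df["odds_ratio"] = [t[0] for t in df["FishersExact"]]
df["p_value"] = [t[1] for t in df["FishersExact"]]
```

This still calls scipy once per row, so it is convenient rather than fast; it removes the explicit loop from the calling code, not the per-row work.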
Back to basics
Fisher's exact test can be expressed as vectorized operations on the original columns, which should be much faster. For maximum speed, you need a vectorized version of factorial (which I don't know of). This could even be a separate (good!) SO question.
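One candidate for a vectorized factorial is scipy.special.gammaln, since gammaln(n + 1) equals log(n!) and operates elementwise on arrays. This is a suggestion, not something the answer itself names. The sketch below uses it to compute, for every row at once, the hypergeometric probability of the observed 2x2 table. That is only the building block of the test: the two-sided p-value also requires summing the probabilities of all tables at least as extreme, which is not done here. The sample data and the column name observed_table_prob are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.special import gammaln


def log_factorial(n):
    """Elementwise log(n!) computed as gammaln(n + 1)."""
    return gammaln(np.asarray(n, dtype=float) + 1.0)


# Hypothetical sample data using the question's column names.
df = pd.DataFrame({
    "numerator1": [3, 5, 2],
    "numerator2": [1, 4, 6],
    "denominator1": [4, 10, 8],
    "denominator2": [4, 9, 12],
})

# Cells of the 2x2 table [[a, b], [c, d]], one array entry per row of df.
a = df.numerator1.to_numpy()
b = df.numerator2.to_numpy()
c = (df.denominator1 - df.numerator1).to_numpy()
d = (df.denominator2 - df.numerator2).to_numpy()

# log P(table) = log[(a+b)! (c+d)! (a+c)! (b+d)! / ((a+b+c+d)! a! b! c! d!)]
log_p = (
    log_factorial(a + b) + log_factorial(c + d)
    + log_factorial(a + c) + log_factorial(b + d)
    - log_factorial(a + b + c + d)
    - log_factorial(a) - log_factorial(b)
    - log_factorial(c) - log_factorial(d)
)

# Probability of the observed table only, not the full Fisher p-value.
df["observed_table_prob"] = np.exp(log_p)
```

Working in log space avoids overflow when the counts are large, which a direct factorial would hit quickly.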