在Python中合并交叉表 [英] Merging crosstabs in Python

查看:92
本文介绍了在Python中合并交叉表的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我正在尝试将多个交叉表合并为一个交叉表.请注意,提供的数据显然仅用于测试目的.实际数据要大得多,所以效率对我来说很重要.

I am trying to merge multiple crosstabs into a single one. Note that the data provided is obviously only for test purposes. The actual data is much larger so efficiency is quite important for me.

生成,列出交叉表,然后将其与word列上的lambda函数合并.但是,合并的结果不是我期望的那样.我认为问题在于,即使使用dropna = False时,也将删除仅具有交叉表的NA值的列,这将导致merge函数失败.我将首先显示代码,然后显示中间数据和错误.

The crosstabs are generated, listed, and then merged with a lambda function on the word column. However, the result of this merging is not what I expect it to be. I think the problem is that the columns with only NA values of the crosstabs are being dropped even when using dropna = False, which would then result in the merge function failing. I'll first show the code and after that present the intermediate data and errors.

import pandas as pd
import numpy as np
import functools as ft

def main():
    # Create dataframe
    df = pd.DataFrame(data=np.zeros((0, 3)), columns=['word','det','source'])
    df["word"] = ('banana', 'banana', 'elephant', 'mouse', 'mouse', 'elephant', 'banana', 'mouse', 'mouse', 'elephant', 'ostrich', 'ostrich')
    df["det"] = ('a', 'the', 'the', 'a', 'the', 'the', 'a', 'the', 'a', 'a', 'a', 'the')
    df["source"] = ('BE', 'BE', 'BE', 'NL', 'NL', 'NL', 'FR', 'FR', 'FR', 'FR', 'FR', 'FR')

    create_frequency_list(df)

def create_frequency_list(df):
    # Create a crosstab of ALL values
    # NOTE that dropna = False does not seem to work as expected
    total = pd.crosstab(df.word, df.det, dropna = False)
    total.fillna(0)
    total.reset_index(inplace=True)
    total.columns = ['word', 'a', 'the']

    crosstabs = [total]

    # For the column headers, multi-level
    first_index = [('total','total')]
    second_index = [('a','the')]

    # Create crosstabs per source (one for BE, one for NL, one for FR)
    # NOTE that dropna = False does not seem to work as expected
    for source, tempDf in df.groupby('source'):
        crosstab = pd.crosstab(tempDf.word, tempDf.det, dropna = False)
        crosstab.fillna(0)
        crosstab.reset_index(inplace=True)
        crosstab.columns = ['word', 'a', 'the']
        crosstabs.append(crosstab)

        first_index.extend((source,source))
        second_index.extend(('a','the'))

    # Just for debugging: result as expected
    for tab in crosstabs:
        print(tab)

    merged = ft.reduce(lambda left,right: pd.merge(left,right, on='word'), crosstabs).set_index('word')

    # UNEXPECTED RESULT
    print(merged)    

    arrays = [first_index, second_index]

    # Throws error: NotImplementedError: > 1 ndim Categorical are not supported at this time
    columns = pd.MultiIndex.from_arrays(arrays)

    df_freq = pd.DataFrame(data=merged.as_matrix(),
                      columns=columns,
                      index = crosstabs[0]['word'])
    print(df_freq)

main()

各个交叉表:与预期不符. NA列已删除

Individual crosstabs: not as expected. The NA columns are dropped

       word  a  the
0    banana  2    1
1  elephant  1    2
2     mouse  2    2
3   ostrich  1    1

       word  a  the
0    banana  1    1
1  elephant  0    1

       word  a  the
0    banana  1    0
1  elephant  1    0
2     mouse  1    1
3   ostrich  1    1

       word  a  the
0  elephant  0    1
1     mouse  1    1

这意味着数据帧之间不会共享所有值,这反过来可能会破坏合并.

That means that the dataframes do not share all values among each other which in turn will probably mess up the merging.

合并:显然不是预期的

          a_x  the_x  a_y  the_y  a_x  the_x  a_y  the_y
word                                                    
elephant    1      2    0      1    1      0    0      1

但是,错误仅在列分配时抛出:

However, the error only gets thrown at the columns assignment:

# NotImplementedError: > 1 ndim Categorical are not supported at this time
columns = pd.MultiIndex.from_arrays(arrays)

据我所知,问题从NA开始得很早,并且使整个过程失败了.但是,由于我对Python没有足够的经验,所以我不确定.

So as far as I can tell the problem starts early, with the NAs and makes the whole thing fail. However, as I a not experienced enough in Python, I cannot know for sure.

我期望的是多索引输出:

What I expected, was a multi index output:

    source       total        BE          FR          NL
    det         a   the     a   the     a   the     a   the
    word
0   banana      2   1       1   1       1   0       0   0
1   elephant    1   2       0   1       1   0       0   1
2   mouse       2   2       0   0       1   1       1   1
3   ostrich     1   1       0   0       1   1       0   0

推荐答案

我刚刚决定为您提供一种更好的方式来获取您想要的东西:

I just decided to give you a better way of getting you what you want:

我通常将df.groupby([col1, col2]).size().unstack()用作pd.crosstab的代理.您试图为每个source组创建一个交叉表.我可以通过df.groupby([col1, col2, col3]).size().unstack([2, 1])

I use df.groupby([col1, col2]).size().unstack() to proxy as my pd.crosstab as a general rule. You were trying to do a crosstab for every group of source. I can fit that in nicely with my existing groupby with df.groupby([col1, col2, col3]).size().unstack([2, 1])

sort_index(1).fillna(0).astype(int)只是为了美观.

如果您想更好地理解.尝试以下操作,看看会得到什么:

If you want to understand even better. Try the following things and look what you get:

  • df.groupby(['word', 'gender']).size()
  • df.groupby(['word', 'gender', 'source']).size()
  • df.groupby(['word', 'gender']).size()
  • df.groupby(['word', 'gender', 'source']).size()

unstackstack是将索引中的内容放入列中的便捷方法,反之亦然. unstack([2, 1])指定索引级别的堆积顺序.

unstack and stack are convenient ways to get things that were in the index into the columns instead and vice versa. unstack([2, 1]) is specifying the order in which index levels get unstacked.

最后,我再次使用我的xtabsstack并汇总各行和unstack以准备到pd.concat.瞧!

Finally, I take my xtabs and stack again and sum across the rows and unstack to prep to pd.concat. Voilà !

xtabs = df.groupby(df.columns.tolist()).size() \
          .unstack([2, 1]).sort_index(1).fillna(0).astype(int)

pd.concat([xtabs.stack().sum(1).rename('total').to_frame().unstack(), xtabs], axis=1)

您的代码 现在应如下所示:

Your Code should now look like this:

import pandas as pd
import numpy as np
import functools as ft

def main():
    # Create dataframe
    df = pd.DataFrame(data=np.zeros((0, 3)), columns=['word','gender','source'])
    df["word"] = ('banana', 'banana', 'elephant', 'mouse', 'mouse', 'elephant', 'banana', 'mouse', 'mouse', 'elephant', 'ostrich', 'ostrich')
    df["gender"] = ('a', 'the', 'the', 'a', 'the', 'the', 'a', 'the', 'a', 'a', 'a', 'the')
    df["source"] = ('BE', 'BE', 'BE', 'NL', 'NL', 'NL', 'FR', 'FR', 'FR', 'FR', 'FR', 'FR')

    return create_frequency_list(df)

def create_frequency_list(df):
    xtabs = df.groupby(df.columns.tolist()).size() \
              .unstack([2, 1]).sort_index(1).fillna(0).astype(int)

    total = xtabs.stack().sum(1)
    total.name = 'total'
    total = total.to_frame().unstack()

    return pd.concat([total, xtabs], axis=1)

main()

这篇关于在Python中合并交叉表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆