在Python中合并交叉表 [英] Merging crosstabs in Python
问题描述
我正在尝试将多个交叉表合并为一个交叉表.请注意,提供的数据显然仅用于测试目的.实际数据要大得多,所以效率对我来说很重要.
I am trying to merge multiple crosstabs into a single one. Note that the data provided is obviously only for test purposes. The actual data is much larger so efficiency is quite important for me.
生成,列出交叉表,然后将其与word
列上的lambda函数合并.但是,合并的结果不是我期望的那样.我认为问题在于,即使使用dropna = False
时,也将删除仅具有交叉表的NA值的列,这将导致merge
函数失败.我将首先显示代码,然后显示中间数据和错误.
The crosstabs are generated, listed, and then merged with a lambda function on the word
column. However, the result of this merging is not what I expect it to be. I think the problem is that the columns with only NA values of the crosstabs are being dropped even when using dropna = False
, which would then result in the merge
function failing. I'll first show the code and after that present the intermediate data and errors.
import pandas as pd
import numpy as np
import functools as ft
def main():
# Create dataframe
df = pd.DataFrame(data=np.zeros((0, 3)), columns=['word','det','source'])
df["word"] = ('banana', 'banana', 'elephant', 'mouse', 'mouse', 'elephant', 'banana', 'mouse', 'mouse', 'elephant', 'ostrich', 'ostrich')
df["det"] = ('a', 'the', 'the', 'a', 'the', 'the', 'a', 'the', 'a', 'a', 'a', 'the')
df["source"] = ('BE', 'BE', 'BE', 'NL', 'NL', 'NL', 'FR', 'FR', 'FR', 'FR', 'FR', 'FR')
create_frequency_list(df)
def create_frequency_list(df):
# Create a crosstab of ALL values
# NOTE that dropna = False does not seem to work as expected
total = pd.crosstab(df.word, df.det, dropna = False)
total.fillna(0)
total.reset_index(inplace=True)
total.columns = ['word', 'a', 'the']
crosstabs = [total]
# For the column headers, multi-level
first_index = [('total','total')]
second_index = [('a','the')]
# Create crosstabs per source (one for BE, one for NL, one for FR)
# NOTE that dropna = False does not seem to work as expected
for source, tempDf in df.groupby('source'):
crosstab = pd.crosstab(tempDf.word, tempDf.det, dropna = False)
crosstab.fillna(0)
crosstab.reset_index(inplace=True)
crosstab.columns = ['word', 'a', 'the']
crosstabs.append(crosstab)
first_index.extend((source,source))
second_index.extend(('a','the'))
# Just for debugging: result as expected
for tab in crosstabs:
print(tab)
merged = ft.reduce(lambda left,right: pd.merge(left,right, on='word'), crosstabs).set_index('word')
# UNEXPECTED RESULT
print(merged)
arrays = [first_index, second_index]
# Throws error: NotImplementedError: > 1 ndim Categorical are not supported at this time
columns = pd.MultiIndex.from_arrays(arrays)
df_freq = pd.DataFrame(data=merged.as_matrix(),
columns=columns,
index = crosstabs[0]['word'])
print(df_freq)
main()
各个交叉表:与预期不符. NA列已删除
Individual crosstabs: not as expected. The NA columns are dropped
word a the
0 banana 2 1
1 elephant 1 2
2 mouse 2 2
3 ostrich 1 1
word a the
0 banana 1 1
1 elephant 0 1
word a the
0 banana 1 0
1 elephant 1 0
2 mouse 1 1
3 ostrich 1 1
word a the
0 elephant 0 1
1 mouse 1 1
这意味着数据帧之间不会共享所有值,这反过来可能会破坏合并.
That means that the dataframes do not share all values among each other which in turn will probably mess up the merging.
合并:显然不是预期的
a_x the_x a_y the_y a_x the_x a_y the_y
word
elephant 1 2 0 1 1 0 0 1
但是,错误仅在列分配时抛出:
However, the error only gets thrown at the columns assignment:
# NotImplementedError: > 1 ndim Categorical are not supported at this time
columns = pd.MultiIndex.from_arrays(arrays)
据我所知,问题从NA开始得很早,并且使整个过程失败了.但是,由于我对Python没有足够的经验,所以我不确定.
So as far as I can tell the problem starts early, with the NAs and makes the whole thing fail. However, as I a not experienced enough in Python, I cannot know for sure.
我期望的是多索引输出:
What I expected, was a multi index output:
source total BE FR NL
det a the a the a the a the
word
0 banana 2 1 1 1 1 0 0 0
1 elephant 1 2 0 1 1 0 0 1
2 mouse 2 2 0 0 1 1 1 1
3 ostrich 1 1 0 0 1 1 0 0
推荐答案
我刚刚决定为您提供一种更好的方式来获取您想要的东西:
I just decided to give you a better way of getting you what you want:
我通常将df.groupby([col1, col2]).size().unstack()
用作pd.crosstab
的代理.您试图为每个source
组创建一个交叉表.我可以通过df.groupby([col1, col2, col3]).size().unstack([2, 1])
I use df.groupby([col1, col2]).size().unstack()
to proxy as my pd.crosstab
as a general rule. You were trying to do a crosstab for every group of source
. I can fit that in nicely with my existing groupby with df.groupby([col1, col2, col3]).size().unstack([2, 1])
sort_index(1).fillna(0).astype(int)
只是为了美观.
如果您想更好地理解.尝试以下操作,看看会得到什么:
If you want to understand even better. Try the following things and look what you get:
-
df.groupby(['word', 'gender']).size()
-
df.groupby(['word', 'gender', 'source']).size()
df.groupby(['word', 'gender']).size()
df.groupby(['word', 'gender', 'source']).size()
unstack
和stack
是将索引中的内容放入列中的便捷方法,反之亦然. unstack([2, 1])
指定索引级别的堆积顺序.
unstack
and stack
are convenient ways to get things that were in the index into the columns instead and vice versa. unstack([2, 1])
is specifying the order in which index levels get unstacked.
最后,我再次使用我的xtabs
和stack
并汇总各行和unstack
以准备到pd.concat
.瞧!
Finally, I take my xtabs
and stack
again and sum across the rows and unstack
to prep to pd.concat
. Voilà !
xtabs = df.groupby(df.columns.tolist()).size() \
.unstack([2, 1]).sort_index(1).fillna(0).astype(int)
pd.concat([xtabs.stack().sum(1).rename('total').to_frame().unstack(), xtabs], axis=1)
您的代码 现在应如下所示:
Your Code should now look like this:
import pandas as pd
import numpy as np
import functools as ft
def main():
# Create dataframe
df = pd.DataFrame(data=np.zeros((0, 3)), columns=['word','gender','source'])
df["word"] = ('banana', 'banana', 'elephant', 'mouse', 'mouse', 'elephant', 'banana', 'mouse', 'mouse', 'elephant', 'ostrich', 'ostrich')
df["gender"] = ('a', 'the', 'the', 'a', 'the', 'the', 'a', 'the', 'a', 'a', 'a', 'the')
df["source"] = ('BE', 'BE', 'BE', 'NL', 'NL', 'NL', 'FR', 'FR', 'FR', 'FR', 'FR', 'FR')
return create_frequency_list(df)
def create_frequency_list(df):
xtabs = df.groupby(df.columns.tolist()).size() \
.unstack([2, 1]).sort_index(1).fillna(0).astype(int)
total = xtabs.stack().sum(1)
total.name = 'total'
total = total.to_frame().unstack()
return pd.concat([total, xtabs], axis=1)
main()
这篇关于在Python中合并交叉表的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!