Comparing Boolean Values of Pandas Dataframes - Returning String


Problem description

I have 4 dataframes that I want to compare. Each one looks like this:

ID    Jan    Feb    Mar
1     True   True   False
2     True   True   True
3     False  False  False

Each has anywhere from 2 to 3000 rows. They all have exactly the same column names, but they may not always share all the same index IDs.

What I would like to do is compare them and generate a new dataframe based on their values. For any cell that is False in at least one dataframe, I want to assign a string naming that dataframe (e.g. "False in Dataframe1"), and if it is False in several, list all of them (e.g. "False in Dataframe1, Dataframe2").

The output would look like

ID    Jan            Feb              Mar
1     True           True             False in A, B, C
2     True           False in B       True
3     False in A     False in A, B    False in A

Is there some kind of direct dataframe-to-dataframe comparison I can use? Or do I need to concat the dataframes so that I can compare the columns to each other?

EDIT: I do not want a row-wise (positional) comparison. The comparison should be based on the index, to cover cases where one dataframe does not have the same records as the others.

Recommended answer

This gets very close to what you want:

import pandas as pd
import numpy as np
import io

# test data: df1, df2, df3, each indexed by ID
temp=u"""ID,Jan,Feb,Mar
1,True,True,False
2,True,True,True
3,False,False,False"""
df3 = pd.read_csv(io.StringIO(temp), sep=",", index_col=[0])
print(df3)
temp1=u"""ID,Jan,Feb,Mar
1,True,False,False
2,False,True,True
3,False,True,True"""
df1 = pd.read_csv(io.StringIO(temp1), sep=",", index_col=[0])
print(df1)
temp2=u"""ID,Jan,Feb,Mar
1,False,False,False
2,False,False,True
3,False,True,True"""
df2 = pd.read_csv(io.StringIO(temp2), sep=",", index_col=[0])
print(df2)

# concat all dataframes side by side; the dict keys become the top column level
# and rows are aligned on the ID index
pieces = {'df1': df1, 'df2': df2, 'df3': df3}
df = pd.concat(pieces, axis=1)
print(df)

# build a dataframe shaped like df whose cells hold the name of the source dataframe
names = df.columns.get_level_values(0)
months = df.columns.get_level_values(1)
xyz = pd.DataFrame(np.tile(names.to_numpy(), (len(df.index), 1)), index=df.index, columns=df.columns)
print(xyz)

# use df as a mask: keep the names where df is False, NaN elsewhere
out_false = xyz.mask(df)
# and the reverse: keep the names where df is True
# (~df needs boolean columns, which holds while all dataframes share the same IDs)
out_true = xyz.mask(~df)

# flatten the two-level columns to plain month names
out_false.columns = months
out_true.columns = months
print(out_false)
print(out_true)

# create empty output dataframes, one for False values and one for True values
result_false = pd.DataFrame(index=out_false.index)
result_true = pd.DataFrame(index=out_true.index)

# group the columns by month and join the names left in each row into one string
# (transposing first avoids groupby(axis=1), which newer pandas versions deprecate)
for name, group in out_false.T.groupby(level=0):
    series = group.apply(lambda x: ','.join(map(str, x.dropna())), axis=0).rename(name)
    print(series)
    result_false = pd.concat([result_false, series], axis=1)
print(result_false)
#        Feb          Jan          Mar
#ID                                   
#1   df1,df2          df2  df1,df2,df3
#2       df2      df1,df2             
#3       df3  df1,df2,df3          df3 

for name, group in out_true.T.groupby(level=0):
    series = group.apply(lambda x: ','.join(map(str, x.dropna())), axis=0).rename(name)
    result_true = pd.concat([result_true, series], axis=1)
print(result_true)
#        Feb      Jan          Mar
#ID                               
#1       df3  df1,df3             
#2   df1,df3      df3  df1,df2,df3
#3   df1,df2               df1,df2
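
The result above lists the dataframe names but does not yet have the exact layout asked for in the question ("False in A, B" where at least one dataframe is False, True otherwise). The following is a minimal sketch of one way to get there, not part of the original answer. The helper name describe_false, the labels A, B, C and the sample frames are made up for illustration, and it assumes that an ID missing from a dataframe should simply not be reported as False for that dataframe.

```python
import pandas as pd


def describe_false(frames):
    """Hypothetical helper: frames maps a label to a boolean DataFrame.

    All frames are assumed to have the same column names; their indexes may differ.
    """
    # concat aligns rows on the index (outer join), so an ID that is missing
    # from one frame shows up as NaN in that frame's columns.
    wide = pd.concat(frames, axis=1)
    # NaN == False evaluates to False, so missing IDs are never reported.
    is_false = wide.eq(False)
    # Month names in their original order, without repeats.
    months = list(dict.fromkeys(wide.columns.get_level_values(1)))
    out = {}
    for month in months:
        # One boolean column per source frame for this month.
        block = is_false.xs(month, axis=1, level=1)
        out[month] = [
            "False in " + ", ".join(block.columns[flags]) if flags.any() else True
            for flags in block.to_numpy()
        ]
    return pd.DataFrame(out, index=wide.index)


ids = pd.Index([1, 2, 3], name="ID")
a = pd.DataFrame({"Jan": [True, True, False],
                  "Feb": [True, True, False],
                  "Mar": [False, True, False]}, index=ids)
b = pd.DataFrame({"Jan": [True, True, True],
                  "Feb": [True, False, False],
                  "Mar": [False, True, True]}, index=ids)
# c has no record for ID 3, to exercise the index-based alignment
c = pd.DataFrame({"Jan": [True, True],
                  "Feb": [True, True],
                  "Mar": [False, True]}, index=pd.Index([1, 2], name="ID"))

result = describe_false({"A": a, "B": b, "C": c})
print(result)
```

With these sample frames the cells match the table in the question, for example "False in A, B, C" for ID 1 in Mar and "False in A, B" for ID 3 in Feb. If a missing ID should be flagged differently, the wide.eq(False) line is the place to change.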
