比较两个 pandas 数据帧中的值以保持运行计数 [英] comparing values in two pandas dataframes to keep a running count

查看:24
本文介绍了比较两个 pandas 数据帧中的值以保持运行计数的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我很抱歉这么长,但我想尽可能完整地解释一下.我完全不知道如何解决这个问题.

My apologies for the length of this but I want to explain as fully as possible. I am completely stumped on how to solve this.

设置:

我有两个数据框,第一个在第一列中有所有可能值的列表,此列中没有重复值.我们称之为 df_01.这些是每个列表中所有常见的可能值.所有附加列代表独立列表.每个都包含一个数字,表示所有可能值的任何给定值在该列表中出现的天数.此数据框 (df_01) 的形状为 (9277, 32).这些尺寸可以改变,但一般会保持不变.下面是它的外观的一个小例子.

I have two dataframes the first has a list of all possible values in the first column there are no duplicate values in this column. Let's call it df_01. Theses are all the common possible values in each list. All additional columns represent independent lists. Each contains a number that represents how many days any given value of all possible values has been on that list. This dataframe (df_01) has a shape of (9277, 32). These dimensions can change but will general stay the same. The following is a small example of what it looks like.

df_01 在任何操作之前:

df_01 before any actions:

index   values   list01   list02  ... list30   list31
  0       aaa      5         1    ...   NaN      83
  1       bbb     NaN       NaN   ...   NaN      4
  2       ccc      20       NaN   ...   NaN      32
  3       ddd      1         27   ...   NaN     NaN
  .        .       .         .    ...    .       .
  .        .       .         .    ...    .       .
  .        .       .         .    ...    .       .  
  9274    qqq     NaN        15   ...   NaN      6 
  9275    rrr     238       NaN   ...   NaN     102
  9276    sss      3         2    ...   NaN     NaN
  9277    ttt      12       NaN   ...   NaN      99

第一个数据帧 (df_01) 将始终是前一天的值.

This first dataframe (df_01) will always be the values as they were the previous day.

第二个数据框.我们称之为 df_2 的行数总是较少,并且长度每天都在变化,但列数始终与 (df_01) 相同.它目前的形状为 (1351, 32).在此数据帧 (df_2) 中,第一列具有截至今天每个列表中的所有公共值,并且没有重复项.如果该值今天在列表中,则此数据框中的其他列 (df_2) 为 1,否则为 NaN.这是一个例子.

The second dataframe. Let's call it df_2 will always have less row, and change from day to day in length, but always the same number of columns as (df_01). It currently has a shape of (1351, 32). In this dataframe (df_2) the first column has all the common values from each list as of today, and has no duplicates. The other columns in this dataframe (df_2) have a 1 if the value is on the list today and NaN if it does not. Here's an example.

df_02 在任何操作之前:

df_02 before any actions:

index   values   list01   list02  ... list30   list31
  0       aaa      1         1    ...   NaN      1
  1       bbb     NaN        1    ...    1       1
  2       ddd      1         1    ...   NaN     NaN
  .        .       .         .    ...    .       .
  .        .       .         .    ...    .       .
  .        .       .         .    ...    .       .  
  1349    qqq     NaN       NaN   ...    1       1 
  1350    rrr      1        NaN   ...    1      NaN
  1351    sss     NaN        1    ...   NaN      1

问题:

我想要完成的事情如下.

What I want to accomplish is as follows.

1) 对于每列中的每个值,如果该值存在于第一个数据帧 (df_01) 中而不存在于第二个 (df_02) 中,则其 (df_01) 中的计数器将基于每列重置为 NaN.

1) For every value in each column if the value exists in the first dataframe (df_01) and not in the second (df_02) its counter in (df_01) will reset to NaN on a per column basis.

2) 然后对于第二个数据帧 (df_02) 的每一列中的每个值,如果该值存在于两个数据帧的同一列中,则对这些值求和.

2) Then for every value in each column of the second dataframe (df_02) if the value exists in the same column on both dataframes sum the values.

3) 如果 (df_01) 的 list01 上的 aaa = 5 和 (df_2) 的 list01 上的 aaa = 1,那么 (df_02) 的 list01 上的 aaa 将变为 6.这将保持运行计数.

3) If aaa on list01 of (df_01) = 5 and aaa on list01 of (df_2) = 1 then aaa on list01 of (df_02) will become 6. This will keep a running count.

4) 如果两者中的值都是 NaN,则不需要任何操作.

4) If the value is NaN in both no action is needed.

5) 如果值在 (df_01) 上为 NaN,在 (df_02) 上为 1,则它保持为 1.

5) If a value is NaN on (df_01) and 1 on (df_02) it stays 1.

*** 在求和之前,df_02 中的值将始终为 1 或 NaN.值列中的值是否在当天的单个列表中是一个二元选择.

*** The value in df_02 will always be a 1 or an NaN prior to the summation. It is a binary choice of whether or not the value in the values column is in the individual list that day or not.

*** 注意值 ccc 和 ttt 位于 (df_01) 中,因为它们是可能的值,但不是 (df_02),因为它们今天不在任何列表中.

*** Notice value ccc, and, ttt are in (df_01) as they are possible values but not (df_02) as they were on none of the lists today.

*** 像 *NaN**6* 这样的值周围的星号表示将改变它的值实际上不会出现在数据中.

*** The asterisk around values like *NaN* or *6* is to denote the values that will change it would not actually be in the data.

程序后的数据框应如下所示:

The dataframes should look like this after the procedure:

df_01

index   values   list01   list02  ... list30   list31
  0       aaa      5         1    ...   NaN      83
  1       bbb     NaN       NaN   ...   NaN      4
  2       ccc     NaN       NaN   ...   NaN     NaN
  3       ddd      1         27   ...   NaN     NaN
  .        .       .         .    ...    .       .
  .        .       .         .    ...    .       .
  .        .       .         .    ...    .       .  
  9274    qqq     NaN      *NaN*  ...   NaN      6 
  9275    rrr     238       NaN   ...   NaN     *NaN*
  9276    sss    *NaN*       2    ...   NaN      24
  9277    ttt      12       NaN   ...   NaN      99

df_02

index   values   list01   list02  ... list30   list31
  0       aaa     *6*       *2*   ...   NaN     *84*
  1       bbb     NaN        1    ...    1      *5*
  2       ddd     *2*      *28*   ...   NaN     NaN
  .        .       .         .    ...    .       .
  .        .       .         .    ...    .       .
  .        .       .         .    ...    .       .  
  1349    qqq     NaN       NaN   ...    1      *7*
  1350    rrr    *239*      NaN   ...    1      NaN
  1351    sss     NaN       *3*   ...   NaN    *25*

我将如何完成这样的事情?我什至不知道从哪里开始.任何想法,即使没有完全奏效,只要能指出我正确的方向,我们将不胜感激.如果有任何需要澄清的地方,请告诉我.

How would I go about accomplishing something like this? I don't even know where to begin. Any ideas, even if not completely working, just to point me in the right direction would be appreciated. Please let me know if anything needs clarification.

谢谢

推荐答案

df1 = df1.set_index('values')
df2 = df2.set_index('values')

cols = [*df1.columns]
for col in cols:
    #Update to df1
    df1[col].update(df2.loc[df2[col].isnull(), col].fillna('-'))
    df1[col].replace('-', np.NaN, inplace = True)

    #Update to df2, sum if they both have numbers
    df2[col].update(df2.loc[~df2[col].isnull(), col] + df1.loc[~df1[col].isnull(), col])

这应该可以满足您的需求.我们将遍历每一行然后单独更新它们.确保 cols 列表包含基于您的 df 的正确列.

This should do what you want. We will loop over each row then update them individually. Make sure the cols list contains the correct columns based on your df's.

我们必须在 df1 的更新中使用 .fillna('-') 的原因是因为你不能用 NaN 替换一个值,所以我们必须用其他东西来填充它,然后我们可以将其替换回 NaN.

The reason we have to use .fillna('-') in the update to df1 is because you can't replace a value with NaN, so we have to fill it with something else, then we can replace it back to NaN.

这篇关于比较两个 pandas 数据帧中的值以保持运行计数的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆