Comparing DataFrames and Series and generating new DataFrames on the fly in Python pandas


Question

I am writing a function that compares a DataFrame (df) to a Series (S) and returns a new DataFrame. The common column is 'name'. The function should return a DataFrame with the same number of rows as S and the same number of columns as df. It should look up each name from S in the name column of df. If a match is found, the corresponding row of the result should be a copy of the df row for that name. If no match is found, the row should still be created, but with 0.0 in every cell. I have been trying to figure this out for the past 6 hours, and I believe I am having trouble with broadcasting. Here is what I have tried.

Here is some sample data.

Series:

  S[500:505]
  500                 Nanotechnology
  501                          Music
  502       Logistics & Supply Chain
  503    Computer & Network Security
  504              Computer Software
  Name: name, dtype: object

DataFrame: NOTE: there is also a column called name, which holds the same industries. So row 0 here has Defense & Space in the name column.

          Defense & Space  Computer Software  Internet  Semiconductors  \
  0              1.0                0.0       0.0             0.0   
  1              0.0                1.0       0.5             0.5   
  2              0.0                0.5       1.0             0.5   
  3              0.0                0.5       0.5             1.0   
  4              0.5                0.0       0.0             0.0   


S.shape = (31454,)
df.shape = (100,101)

Generate an array of all zeros with one row per entry in S

all_zeros = np.zeros((len(S),len(df.columns)))

Put the numpy array into a DataFrame

result = pd.DataFrame(data=all_zeros, columns=df.columns, index=range(len(S)))

I don't want the name column to be in the final result

result = result.drop('name', axis=1)

Build a function to be used in a lambda to set the new values for the result DataFrame

def set_cell_values(row):
    return df.iloc[1,:]

Here is the part where I set the new values for the new DataFrame

for index in range(len(df)):
    names_are_equal = df['name'][index] == result['name']
    map(lambda x: set_cell_values(row), result[names_are_equal]))

To me this makes sense, but it does not seem to work. Is there an easy way to do this that I am unaware of? The map is there because I need to broadcast the df row into several rows of the new DataFrame (not just one).

Recommended answer

Don,
So, let's go:

# with these tables
In [66]: S
Out[66]:
0    aaa
1    bbb
2    ccc
3    ddd
4    eee
Name: name, dtype: object

In [84]: df
Out[84]:
    a   b   c name
0  39  71  55  aaa
1   9  57   6  bbb
2  72  22  52  iii
3  68  97  81  jjj
4  30  64  78  kkk

# transform the series to a dataframe
Sd = pd.DataFrame(S)
# merge them with an outer join (keeps the columns and values of both tables).
# fill the NAs with 0
In [86]: pd.merge(Sd,df, how='outer').fillna(0)
Out[86]:
  name   a   b   c
0  aaa  39  71  55
1  bbb   9  57   6
2  ccc   0   0   0
3  ddd   0   0   0
4  eee   0   0   0
5  iii  72  22  52
6  jjj  68  97  81
7  kkk  30  64  78

Is that it?
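
One thing to watch for: the outer join above also keeps the df rows whose names are not in S (iii, jjj, kkk), so it returns 8 rows, whereas the question asks for exactly len(S) rows. A left join may be closer to what was asked. The following is only a sketch on the answer's sample data, not a tested drop-in for the real 31454-row series. It assumes the names in df are unique; duplicated names would make the left join return more rows than S, and would make the reindex alternative raise an error. The variable names left, result and alt are made up for illustration.

```python
import pandas as pd

# Sample data copied from the answer above.
S = pd.Series(["aaa", "bbb", "ccc", "ddd", "eee"], name="name")
df = pd.DataFrame({
    "a": [39, 9, 72, 68, 30],
    "b": [71, 57, 22, 97, 64],
    "c": [55, 6, 52, 81, 78],
    "name": ["aaa", "bbb", "iii", "jjj", "kkk"],
})

# A left join keeps every row of S, in S's order, and drops df rows
# whose name does not appear in S.
left = S.to_frame().merge(df, on="name", how="left")

# Names with no match in df come back as NaN; fill them with 0.0 and
# drop the key column, as the question requests.
result = left.drop(columns="name").fillna(0.0)

# Alternative under the same uniqueness assumption: look rows up by name.
alt = df.set_index("name").reindex(index=S).fillna(0.0).reset_index(drop=True)

print(result)
```

Because unmatched rows pass through NaN before being filled, the numeric columns end up as floats, which matches the 0.0 values the question describes.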
