Comparing dataframes and series and generating new dataframes on the fly in Python pandas
Problem description
I am creating a function that compares a dataframe (df) to a series (S) and returns a new dataframe. The common column is 'name'. I want the function to return a dataframe with the same number of rows as the series (S) and the same number of columns as the df. The function should search the name column of the df and find all of the matching names in the series (S). If a match is found, I want the new dataframe to get a row that matches the df row for that specific name. If no match is found, I still want a row to be created in the result dataframe, but with 0.0 in every cell of that row. I've been trying to figure this out for the past 6 hours. I believe I'm having issues with broadcasting. Here is what I have tried.
Here is some sample data.
Series:
S[500:505]
500 Nanotechnology
501 Music
502 Logistics & Supply Chain
503 Computer & Network Security
504 Computer Software
Name: name, dtype: object
DataFrame: NOTE: there is also a column called name, which holds the same industries. So row 0 here has Defense & Space in the name column.
   Defense & Space  Computer Software  Internet  Semiconductors  \
0              1.0                0.0       0.0             0.0
1              0.0                1.0       0.5             0.5
2              0.0                0.5       1.0             0.5
3              0.0                0.5       0.5             1.0
4              0.5                0.0       0.0             0.0
S.shape = (31454,)
df.shape = (100,101)
Generate a placeholder array filled with zeros
all_zeros = np.zeros((len(S),len(df.columns)))
Put the numpy array into a dataframe
result = pd.DataFrame(data=all_zeros, columns=df.columns, index=range(len(S)))
I don't want the name column to be in the final result
result = result.drop('name', axis=1)
Build a function to be used in a lambda to set the new values for the result dataframe
def set_cell_values(row):
    return df.iloc[1,:]
Here is the part where I set the new values for the new dataframe
for index in range(len(df)):
    names_are_equal = df['name'][index] == result['name']
    map(lambda x: set_cell_values(row), result[names_are_equal])
To me this makes sense, but it does not seem to work. Is there an easy way to make this work that I am unaware of? The map is there because I need to broadcast the df row into several rows of the new dataframe (not just one).
Recommended answer
Don,
So, let's go:
# with these tables
In [66]: S
Out[66]:
0 aaa
1 bbb
2 ccc
3 ddd
4 eee
Name: name, dtype: object
In [84]: df
Out[84]:
a b c name
0 39 71 55 aaa
1 9 57 6 bbb
2 72 22 52 iii
3 68 97 81 jjj
4 30 64 78 kkk
# transform the series to a dataframe
Sd = pd.DataFrame(S)
# merge them with an outer join (keeps the columns and values of both tables)
# and fill the NAs with 0
In [86]: pd.merge(Sd,df, how='outer').fillna(0)
Out[86]:
name a b c
0 aaa 39 71 55
1 bbb 9 57 6
2 ccc 0 0 0
3 ddd 0 0 0
4 eee 0 0 0
5 iii 72 22 52
6 jjj 68 97 81
7 kkk 30 64 78
Is that it?
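The outer join above returns eight rows because it also keeps the df names that are missing from S (iii, jjj, kkk). If the result has to contain exactly one row per entry of S, as the question asks, a left join may be closer to what is wanted. The following is only a sketch under stated assumptions: `align_to_series` is a hypothetical helper name, the sample data is copied from the answer, and each name is assumed to appear at most once in df (duplicate names in df would produce duplicate rows).

```python
import pandas as pd

# Sample data copied from the answer above.
S = pd.Series(["aaa", "bbb", "ccc", "ddd", "eee"], name="name")
df = pd.DataFrame({
    "a": [39, 9, 72, 68, 30],
    "b": [71, 57, 22, 97, 64],
    "c": [55, 6, 52, 81, 78],
    "name": ["aaa", "bbb", "iii", "jjj", "kkk"],
})


def align_to_series(S, df, key="name"):
    """Hypothetical helper: one output row per entry of S.

    Assumes each value of `key` occurs at most once in df.
    """
    # A left join keeps every row of S, in the order S lists them,
    # and leaves NaN where df has no matching name.
    merged = S.to_frame(name=key).merge(df, on=key, how="left")
    # Drop the key column (the question does not want it in the result)
    # and replace the NaN cells of unmatched rows with 0.0.
    return merged.drop(columns=key).fillna(0.0)


result = align_to_series(S, df)
print(result)
```

With this data, the rows for aaa and bbb carry the values from df and the rows for ccc, ddd and eee are all zeros. The numeric columns come back as floats, because the join introduces NaN before `fillna` replaces it.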