Creating a custom cumulative sum that calculates the downstream quantities given a list of locations and their order


Problem description


I am trying to come up with some code that will essentially calculate, for each location, the cumulative value of everything upstream of it. Taking the cumulative sum almost accomplishes this, but some locations contribute to the same downstream point. Additionally, the most upstream points (or starting points) will not have any values contributing to them and can keep their starting value in the final cumulative DataFrame.

Let's say I have the following DataFrame for each site.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Site 1": np.random.rand(10),
    "Site 2": np.random.rand(10),
    "Site 3": np.random.rand(10),
    "Site 4": np.random.rand(10),
    "Site 5": np.random.rand(10)})

I also have a table of data that has each site and its corresponding downstream component.

# One row: each column is a site, each value is its downstream site
df_order = pd.DataFrame({
    "Site 1": "Site 3",
    "Site 2": "Site 3",
    "Site 3": "Site 4",
    "Site 4": "Site 5",
    "Site 5": None}, index=[0])

I want to do the following:

1) Sum the upstream values to get the cumulative sum at the respective downstream site. For instance, Site 1 and Site 2 contribute to the value at Site 3. So, I want to add Site 1, Site 2, and Site 3 together to get a cumulative value at Site 3.

2) Now that I have that cumulative value at Site 3, I want to save that cumulative value to Site 3 in "df". Now I want to propagate that value to Site 4, save it by updating the DataFrame, and then proceed to Site 5.

I can get close-ish using cumsum to get the cumulative value at each site, like this:

df = df.cumsum(axis=1)

However, this does not take into account that Site 1 and Site 2 are contributing to Site 3, and not each other.

Well, I can solve this manually using:

df['Site 3'] = df.loc[:,'Site 1':'Site 3'].sum(axis = 1)
df['Site 4'] = df.loc[:,'Site 3':'Site 4'].sum(axis = 1)
df['Site 5'] = df.loc[:,'Site 4':'Site 5'].sum(axis = 1)

However, my actual list of sites is much more extensive and the manual method doesn't automatically take into account the "df_order" provided. Is there a way to logically link the "df_order" DataFrame in such a way that it can calculate this automatically? I know how to do this manually, but how would I expand this to handle a larger DataFrame and order of sites?

Think of a larger DataFrame, potentially up to 50 sites, that looks like:

# One row: each column is a site, each value is its downstream site
df_order = pd.DataFrame({
    "Site 1": "Site 3",
    "Site 2": "Site 3",
    "Site 3": "Site 4",
    "Site 4": "Site 5",
    "Site 5": "Site 8",
    "Site 6": "Site 8",
    "Site 7": "Site 8",
    "Site 8": "Site 9",
    "Site 9": None}, index=[0])

Solution

You can use networkx to deal with the relationships. First, make your order DataFrame like:

print(df_order)
   source  target
0  Site 1  Site 3
1  Site 2  Site 3
2  Site 3  Site 4
3  Site 4  Site 5
4  Site 5    None
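
If the site order starts out as a mapping from each site to its downstream site, as in the question, one way to get it into the source/target layout above might look like the sketch below. The `downstream` dict is a name assumed here for illustration; it is not part of the original question or answer.

```python
import pandas as pd

# Hypothetical mapping of each site to its immediate downstream site.
# None marks the outlet, which has nothing downstream of it.
downstream = {
    "Site 1": "Site 3",
    "Site 2": "Site 3",
    "Site 3": "Site 4",
    "Site 4": "Site 5",
    "Site 5": None,
}

# One row per site: 'source' is the site, 'target' is where it flows to.
df_order = pd.DataFrame({
    "source": list(downstream.keys()),
    "target": list(downstream.values()),
})

print(df_order)
```

The row with a missing target is kept here on purpose, so that the outlet still appears in the `source` column; it should be dropped before the edges are built.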


Create the directed graph

import networkx as nx
G = nx.from_pandas_edgelist(df_order.dropna(), 
                            source='source', target='target', 
                            create_using=nx.DiGraph)

nx.draw(G, with_labels=True)


With this directed graph you want to get all of the predecessors. We can do this recursively. (Your graph should be a Directed Acyclic Graph, otherwise the recursion will keep going around the cycle until it hits Python's recursion limit.)

def all_preds(G, target):
    preds=[target]
    for p in list(G.predecessors(target)):
        preds += all_preds(G, p)
    return preds

#Ex.
all_preds(G, 'Site 4')
['Site 4', 'Site 3', 'Site 1', 'Site 2']
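
As a rough sketch of the same idea without recursion, the traversal can also be written with an explicit stack and a record of visited sites, so that a site reachable by more than one path is listed only once. This version uses plain Python instead of networkx; the names `edges`, `direct_upstream`, and `all_upstream` are assumptions made for illustration. (networkx also provides `nx.ancestors(G, node)`, which returns the upstream nodes as a set, not including the node itself.)

```python
from collections import defaultdict

# Edges copied from the example order table: (source, target).
edges = [
    ("Site 1", "Site 3"),
    ("Site 2", "Site 3"),
    ("Site 3", "Site 4"),
    ("Site 4", "Site 5"),
]

# Map each site to the sites that flow directly into it.
direct_upstream = defaultdict(list)
for source, target in edges:
    direct_upstream[target].append(source)


def all_upstream(target):
    """Return target plus every site upstream of it, each listed once."""
    seen = [target]
    stack = [target]
    while stack:
        site = stack.pop()
        for parent in direct_upstream.get(site, []):
            if parent not in seen:
                seen.append(parent)
                stack.append(parent)
    return seen


print(all_upstream("Site 4"))
# ['Site 4', 'Site 3', 'Site 1', 'Site 2']
```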

And we can now create your downstream sums by looping over your unique sites and summing the columns this function returns for each one.

pd.concat([
    # sum each site's own column together with all of its upstream columns
    df[all_preds(G, target)].sum(axis=1).rename(target)
    for target in df_order['source'].unique()
    ], axis=1)


Output using np.random.seed(42)

     Site 1    Site 2    Site 3    Site 4    Site 5
0  0.374540  0.020584  1.006978  1.614522  1.736561
1  0.950714  0.969910  2.060118  2.230642  2.725819
2  0.731994  0.832443  1.856581  1.921633  1.956021
3  0.598658  0.212339  1.177359  2.126245  3.035565
4  0.156019  0.181825  0.793914  1.759546  2.018326
5  0.155995  0.183405  1.124575  1.932972  2.595495
6  0.058084  0.304242  0.562000  0.866613  1.178324
7  0.866176  0.524756  1.905167  2.002839  2.522907
8  0.601115  0.431945  1.625475  2.309708  2.856418
9  0.708073  0.291229  1.045752  1.485905  1.670759
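
The question describes a second way to think about this: compute the cumulative value at a site, store it, and pass it on to the next site downstream. A minimal sketch of that propagation is shown below, assuming every site has at most one downstream site (otherwise upstream values would be counted more than once). It uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) in place of networkx, and the names `direct_upstream` and `cumulative` are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

import numpy as np
import pandas as pd

np.random.seed(42)
sites = [f"Site {i}" for i in range(1, 6)]
df = pd.DataFrame({site: np.random.rand(10) for site in sites})

df_order = pd.DataFrame({
    "source": ["Site 1", "Site 2", "Site 3", "Site 4", "Site 5"],
    "target": ["Site 3", "Site 3", "Site 4", "Site 5", None],
})

# Map every site to the set of sites that flow directly into it.
edges = df_order.dropna()
direct_upstream = {site: set() for site in df.columns}
for source, target in zip(edges["source"], edges["target"]):
    direct_upstream[target].add(source)

# static_order() yields a site only after all of its direct upstream
# sites, so each parent already holds its full cumulative value.
cumulative = df.copy()
for site in TopologicalSorter(direct_upstream).static_order():
    for parent in direct_upstream[site]:
        cumulative[site] += cumulative[parent]

print(cumulative)
```

Each column is visited once, so this avoids re-walking the upstream graph for every site, which may matter as the number of sites grows.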
