How to define the structure of a sankey diagram using a pandas dataframe?


Problem description

This may sound like a very broad question, but if you'll let me describe some details I can assure you it's very specific. As well as discouraging, frustrating and rage-inducing.

The following plot describes a Scottish election and is based on code from plot.ly:

Plot 1:

Dataset 1:

data = [['Source','Target','Value','Color','Node, Label','Link Color'],
        [0,5,20,'#F27420','Remain+No – 28','rgba(253, 227, 212, 0.5)'],
        [0,6,3,'#4994CE','Leave+No – 16','rgba(242, 116, 32, 1)'],
        [0,7,5,'#FABC13','Remain+Yes – 21','rgba(253, 227, 212, 0.5)'],
        [1,5,14,'#7FC241','Leave+Yes – 14','rgba(219, 233, 246, 0.5)'],
        [1,6,1,'#D3D3D3','Didn’t vote in at least one referendum – 21','rgba(73, 148, 206, 1)'],
        [1,7,1,'#8A5988','46 – No','rgba(219, 233, 246,0.5)'],
        [2,5,3,'#449E9E','39 – Yes','rgba(250, 188, 19, 1)'],
        [2,6,17,'#D3D3D3','14 – Don’t know / would not vote','rgba(250, 188, 19, 0.5)'],
        [2,7,2,'','','rgba(250, 188, 19, 0.5)'],
        [3,5,3,'','','rgba(127, 194, 65, 1)'],
        [3,6,9,'','','rgba(127, 194, 65, 0.5)'],
        [3,7,2,'','','rgba(127, 194, 65, 0.5)'],
        [4,5,5,'','','rgba(211, 211, 211, 0.5)'],
        [4,6,9,'','','rgba(211, 211, 211, 0.5)'],
        [4,7,8,'','','rgba(211, 211, 211, 0.5)']
        ]

How the plot is built:

I've picked up some important details about the behavior of sankey charts from various sources, like:

Links are assigned in the order they appear in the dataset (row-wise).

Node colors are assigned in the order the plot is built.

The challenge:

As you'll see in the details below, nodes, labels and colors are not applied to the chart in the same order in which the source dataframe is structured. Some of that makes perfect sense, since various elements describe the same node: color, targets, values and link color. One node, 'Remain+No – 28', looks like this:

And the accompanying part of the dataset looks like this:

[0,5,20,'#F27420','Remain+No – 28','rgba(253, 227, 212, 0.5)'],
[0,6,3,'#4994CE','Leave+No – 16','rgba(242, 116, 32, 1)'],
[0,7,5,'#FABC13','Remain+Yes – 21','rgba(253, 227, 212, 0.5)'],

So this part of the source describes a node [0] with three corresponding targets [5, 6, 7] and three links with the values [20, 3, 5]. '#F27420' is the orange(ish) color of the node, and the colors 'rgba(253, 227, 212, 0.5)', 'rgba(242, 116, 32, 1)' and 'rgba(253, 227, 212, 0.5)' describe the colors of the links from the node to its targets. So far, the information from the sample above that has not been used is:

Data sample 2 (partial)

[-,-,-,'-------','---------------','-------------------'],
[-,-,-,'#4994CE','Leave+No – 16','-------------------'],
[-,-,-,'#FABC13','Remain+Yes – 21','-------------------'],

And that information is used as the remaining elements of the diagram are introduced.

So, what's the question? In the further details below, you'll see that everything makes sense as long as a new row of data in the dataset inserts a new link, and makes other changes to other elements (colors, labels) if that information has not yet been used. I'll be even more specific with the use of two screenshots from a setup I've made, with the plot to the left and the code to the right:

The following data sample produces the diagram below, following the logic described above:

Data sample 3

data = [['Source','Target','Value','Color','Node, Label','Link Color'],
        [0,5,20,'#F27420','Remain+No – 28','rgba(253, 227, 212, 0.5)'],
        [0,6,3,'#4994CE','Leave+No – 16','rgba(242, 116, 32, 1)'],
        [0,7,5,'#FABC13','Remain+Yes – 21','rgba(253, 227, 212, 0.5)'],
        [1,5,14,'#7FC241','Leave+Yes – 14','rgba(219, 233, 246, 0.5)'],
        [1,6,1,'#D3D3D3','Didn’t vote in at least one referendum – 21','rgba(73, 148, 206, 1)']]

Screenshot 1 - partial plot with data sample 3

THE QUESTION:

Adding the row [1,7,1,'#8A5988','46 – No','rgba(219, 233, 246,0.5)'] to the dataset produces a new link between source [1] and target [7], but at the same time applies a color and label to target 5. I would think that the next label to be applied to the chart would be 'Remain+Yes – 21', since it hasn't been used. But what happens here is that the label '46 – No' is applied to target 5. WHY?

Screenshot 2 - partial plot with data sample 3 + [1,7,1,'#8A5988','46 – No','rgba(219, 233, 246,0.5)']:

And how do you discern what is a source and what is a target based on that dataframe?

I know that the question is both strange and hard to answer, but I'm hoping someone has a suggestion. I also know that a dataframe may not be the best source for a sankey chart. Perhaps json instead?

Complete code and data sample for an easy copy and paste into a Jupyter Notebook:

import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# Original data
data = [['Source','Target','Value','Color','Node, Label','Link Color'],
    [0,5,20,'#F27420','Remain+No – 28','rgba(253, 227, 212, 0.5)'],
    [0,6,3,'#4994CE','Leave+No – 16','rgba(242, 116, 32, 1)'],
    [0,7,5,'#FABC13','Remain+Yes – 21','rgba(253, 227, 212, 0.5)'],
    [1,5,14,'#7FC241','Leave+Yes – 14','rgba(219, 233, 246, 0.5)'],
    [1,6,1,'#D3D3D3','Didn’t vote in at least one referendum – 21','rgba(73, 148, 206, 1)'],
    [1,7,1,'#8A5988','46 – No','rgba(219, 233, 246,0.5)'],
    [2,5,3,'#449E9E','39 – Yes','rgba(250, 188, 19, 1)'],
    [2,6,17,'#D3D3D3','14 – Don’t know / would not vote','rgba(250, 188, 19, 0.5)'],
    [2,7,2,'','','rgba(250, 188, 19, 0.5)'],
    [3,5,3,'','','rgba(127, 194, 65, 1)'],
    [3,6,9,'','','rgba(127, 194, 65, 0.5)'],
    [3,7,2,'','','rgba(127, 194, 65, 0.5)'],
    [4,5,5,'','','rgba(211, 211, 211, 0.5)'],
    [4,6,9,'','','rgba(211, 211, 211, 0.5)'],
    [4,7,8,'','','rgba(211, 211, 211, 0.5)']
    ]



headers = data.pop(0)
df = pd.DataFrame(data, columns = headers)
scottish_df = df

data_trace = dict(
    type='sankey',
    domain = dict(
      x =  [0,1],
      y =  [0,1]
    ),
    orientation = "h",
    valueformat = ".0f",
    node = dict(
      pad = 10,
      thickness = 30,
      line = dict(
        color = "black",
        width = 0
      ),
      label =  scottish_df['Node, Label'].dropna(axis=0, how='any'),
      color = scottish_df['Color']
    ),
    link = dict(
      source = scottish_df['Source'].dropna(axis=0, how='any'),
      target = scottish_df['Target'].dropna(axis=0, how='any'),
      value = scottish_df['Value'].dropna(axis=0, how='any'),
      color = scottish_df['Link Color'].dropna(axis=0, how='any'),
  )
)

layout =  dict(
    title = "Scottish Referendum Voters who now want Independence",
    height = 772,
    font = dict(
      size = 10
    ),    
)

fig = dict(data=[data_trace], layout=layout)
iplot(fig, validate=False)

Recommended answer

This problem looks really strange, but only until you analyze how the sankey plot in plotly is created.

When you create the sankey plot, you send it:

  1. A list of nodes
  2. A list of links

These lists are bound to each other. When you create a node list of length 5, any edge can only refer to 0,1,2,3,4 as its start and end. In your program, you create the nodes incorrectly: you create the list of links and then iterate over it to create the nodes. Look at your diagram. It has two black nodes with undefined inside. And what is the length of your dataset? Yes, 5. Your node indices end at 4, so no target nodes are actually defined. When you add the sixth row to your dataset, bingo, nodes[5] exists! Just try adding another new line to your dataset:

[1,7,1,'#FF0000','WAKA','rgba(219, 233, 246,0.5)']

And you will see that another black bar turns red. You have five nodes (because you have 5 links and you create the nodes by iterating over the link list), but the link target indices are 5,6,7. You can fix it in two ways:

  1. Change the Target values in your dataset to 2,3,4
  2. Create nodes and links separately (the right way)
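To see why '46 – No' lands on node 5, it can help to spell out which node index each row ends up defining. This is a minimal sketch in plain Python, assuming (as in the question's code) that node labels are read row by row from the link table, so row i defines node i. The helper name undefined_endpoints is hypothetical and not part of plotly.

```python
# Rows of data sample 3: (source, target, node label stored on that row).
rows = [
    (0, 5, 'Remain+No – 28'),
    (0, 6, 'Leave+No – 16'),
    (0, 7, 'Remain+Yes – 21'),
    (1, 5, 'Leave+Yes – 14'),
    (1, 6, 'Didn’t vote in at least one referendum – 21'),
]


def undefined_endpoints(rows):
    """Return the link endpoints that have no entry in the node list."""
    # Assumption: one node per row, because labels are taken from the rows.
    node_count = len(rows)
    endpoints = {i for source, target, _ in rows for i in (source, target)}
    return sorted(i for i in endpoints if i >= node_count)


print(undefined_endpoints(rows))  # [5, 6, 7]

# A sixth row defines node 5, and node 5 takes its label from that row.
rows.append((1, 7, '46 – No'))
print(rows[5][2])                 # 46 – No
print(undefined_endpoints(rows))  # [6, 7]
```

The label on a row has nothing to do with that row's Source or Target; it simply fills the next free position in the node list.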

I hope this helps with your problem and, more importantly in my opinion, with understanding how the plot is created.

Edit 1: Here is an example of creating nodes and links separately (note that the node part of data_trace uses only nodes_df data, the link part of data_trace uses only links_df data, and nodes_df and links_df do not have the same length):

import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

nodes = [
    ['ID', 'Label', 'Color'],
    [0,'Remain+No – 28','#F27420'],
    [1,'Leave+No – 16','#4994CE'],
    [2,'Remain+Yes – 21','#FABC13'],
    [3,'Leave+Yes – 14','#7FC241'],
    [4,'Didn’t vote in at least one referendum – 21','#D3D3D3'],
    [5,'46 – No','#8A5988']
]
links = [
    ['Source','Target','Value','Link Color'],
    [0,3,20,'rgba(253, 227, 212, 0.5)'],
    [0,4,3,'rgba(242, 116, 32, 1)'],
    [0,2,5,'rgba(253, 227, 212, 0.5)'],
    [1,5,14,'rgba(219, 233, 246, 0.5)'],
    [1,3,1,'rgba(73, 148, 206, 1)'],
    [1,4,1,'rgba(219, 233, 246,0.5)'],
    [1,2,10,'rgba(8, 233, 246,0.5)'],
    [1,3,5,'rgba(219, 77, 246,0.5)'],
    [1,5,12,'rgba(219, 4, 246,0.5)']
]

nodes_headers = nodes.pop(0)
nodes_df = pd.DataFrame(nodes, columns = nodes_headers)
links_headers = links.pop(0)
links_df = pd.DataFrame(links, columns = links_headers)

data_trace = dict(
    type='sankey',
    domain = dict(
      x =  [0,1],
      y =  [0,1]
    ),
    orientation = "h",
    valueformat = ".0f",
    node = dict(
      pad = 10,
      thickness = 30,
      line = dict(
        color = "black",
        width = 0
      ),
      label =  nodes_df['Label'].dropna(axis=0, how='any'),
      color = nodes_df['Color']
    ),
    link = dict(
      source = links_df['Source'].dropna(axis=0, how='any'),
      target = links_df['Target'].dropna(axis=0, how='any'),
      value = links_df['Value'].dropna(axis=0, how='any'),
      color = links_df['Link Color'].dropna(axis=0, how='any'),
  )
)

layout =  dict(
    title = "Scottish Referendum Voters who now want Independence",
    height = 772,
    font = dict(
      size = 10
    ),    
)

fig = dict(data=[data_trace], layout=layout)
iplot(fig, validate=False)
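If the data already lives in one combined table like the question's, the two tables can be derived from it instead of being typed by hand. This is a rough sketch, assuming that the labeled rows come first, that the label and color on labeled row i are meant for node i, and that empty strings mark rows carrying no node information. The function name split_nodes_links and the reduced sample data are made up for illustration.

```python
import pandas as pd


def split_nodes_links(df):
    """Split a combined table into a node table and a link table."""
    # Rows with a non-empty label define a node; their order gives the node ID.
    nodes_df = df.loc[df['Node, Label'] != '', ['Node, Label', 'Color']]
    nodes_df = nodes_df.rename(columns={'Node, Label': 'Label'})
    nodes_df = nodes_df.reset_index(drop=True)
    nodes_df.insert(0, 'ID', list(range(len(nodes_df))))
    # Every row defines exactly one link.
    links_df = df[['Source', 'Target', 'Value', 'Link Color']].copy()
    return nodes_df, links_df


# Reduced, illustrative sample: 2 sources, 3 targets, 6 links, 5 labeled rows.
combined = pd.DataFrame(
    [[0, 2, 20, '#F27420', 'Remain+No', 'rgba(253, 227, 212, 0.5)'],
     [0, 3, 3, '#4994CE', 'Leave+No', 'rgba(242, 116, 32, 1)'],
     [0, 4, 5, '#8A5988', 'No', 'rgba(253, 227, 212, 0.5)'],
     [1, 2, 14, '#449E9E', 'Yes', 'rgba(219, 233, 246, 0.5)'],
     [1, 3, 1, '#D3D3D3', 'Would not vote', 'rgba(73, 148, 206, 1)'],
     [1, 4, 1, '', '', 'rgba(219, 233, 246, 0.5)']],
    columns=['Source', 'Target', 'Value', 'Color', 'Node, Label', 'Link Color'])

nodes_df, links_df = split_nodes_links(combined)
print(len(nodes_df), len(links_df))  # 5 6
```

The resulting nodes_df and links_df have the same column names as the tables above, so they can be passed to the same data_trace dictionary.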

Edit 2: Let's dive in even deeper :) Nodes and links in sankey diagrams are nearly fully independent. The only information that binds them is the indices in the Source and Target columns of the links. So we can create many nodes with no links for them (just replace the nodes/links in the Edit 1 code with this):

nodes = [
    ['ID', 'Label', 'Color'],
    [0,'Remain+No – 28','#F27420'],
    [1,'Leave+No – 16','#4994CE'],
    [2,'Remain+Yes – 21','#FABC13'],
    [3,'Leave+Yes – 14','#7FC241'],
    [4,'Didn’t vote in at least one referendum – 21','#D3D3D3'],
    [5,'46 – No','#8A5988'],
    [6,'WAKA1','#8A5988'],
    [7,'WAKA2','#8A5988'],
    [8,'WAKA3','#8A5988'],
    [9,'WAKA4','#8A5988'],
    [10,'WAKA5','#8A5988'],
    [11,'WAKA6','#8A5988'],

]
links = [
    ['Source','Target','Value','Link Color'],
    [0,3,20,'rgba(253, 227, 212, 0.5)'],
    [0,4,3,'rgba(242, 116, 32, 1)'],
    [0,2,5,'rgba(253, 227, 212, 0.5)'],
    [1,5,14,'rgba(219, 233, 246, 0.5)'],
    [1,3,1,'rgba(73, 148, 206, 1)'],
    [1,4,1,'rgba(219, 233, 246,0.5)'],
    [1,2,10,'rgba(8, 233, 246,0.5)'],
    [1,3,5,'rgba(219, 77, 246,0.5)'],
    [1,5,12,'rgba(219, 4, 246,0.5)']
]

And these nodes will not appear in the diagram.

We can also create only links without nodes:
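A quick way to find such nodes before plotting is to compare the node IDs with the indices the links actually use. This is a small sketch in plain Python; the helper name unused_node_ids is hypothetical, and the links are reduced to (source, target) pairs taken from the example above.

```python
def unused_node_ids(node_ids, links):
    """Return the node IDs that no link uses as source or target."""
    used = {i for source, target, *_ in links for i in (source, target)}
    return [i for i in node_ids if i not in used]


node_ids = list(range(12))  # IDs 0..11, as in the node list above
links = [(0, 3), (0, 4), (0, 2), (1, 5), (1, 3),
         (1, 4), (1, 2), (1, 3), (1, 5)]

print(unused_node_ids(node_ids, links))  # [6, 7, 8, 9, 10, 11]
```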

nodes = [
    ['ID', 'Label', 'Color'],
]
links = [
    ['Source','Target','Value','Link Color'],
    [0,3,20,'rgba(253, 227, 212, 0.5)'],
    [0,4,3,'rgba(242, 116, 32, 1)'],
    [0,2,5,'rgba(253, 227, 212, 0.5)'],
    [1,5,14,'rgba(219, 233, 246, 0.5)'],
    [1,3,1,'rgba(73, 148, 206, 1)'],
    [1,4,1,'rgba(219, 233, 246,0.5)'],
    [1,2,10,'rgba(8, 233, 246,0.5)'],
    [1,3,5,'rgba(219, 77, 246,0.5)'],
    [1,5,12,'rgba(219, 4, 246,0.5)']
]

And we will have only links from nowhere to nowhere.

If you want to add (1) a new source with a link, you should add a new list to nodes, work out its index (this is why I have the ID column) and add a new list to links with Source equal to that node index.
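The index bookkeeping for case (1) can be wrapped in a small helper. This is a sketch under the assumption that the header rows have already been popped off nodes and links, so a node's ID equals its position in the list. The name add_source_with_link and the 'New source' label are made up for illustration.

```python
def add_source_with_link(nodes, links, label, color, target, value, link_color):
    """Append a new node and a link that starts at it; return the new node ID."""
    # Nodes are addressed by position, so the next free ID is the list length.
    new_id = len(nodes)
    nodes.append([new_id, label, color])
    links.append([new_id, target, value, link_color])
    return new_id


nodes = [
    [0, 'Remain+No – 28', '#F27420'],
    [1, 'Leave+No – 16', '#4994CE'],
    [2, 'Remain+Yes – 21', '#FABC13'],
]
links = [
    [0, 2, 5, 'rgba(253, 227, 212, 0.5)'],
]

new_id = add_source_with_link(nodes, links, 'New source', '#8A5988',
                              target=2, value=7,
                              link_color='rgba(219, 233, 246, 0.5)')
print(new_id)     # 3
print(links[-1])  # [3, 2, 7, 'rgba(219, 233, 246, 0.5)']
```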

If you want to add (2) a new target for existing nodes, just add a new list to links and set its Source and Target properly:

    [1,100500,10,'rgba(219, 233, 246,0.5)'],
    [1,100501,10,'rgba(8, 233, 246,0.5)'],
    [1,100502,10,'rgba(219, 77, 246,0.5)'],
    [1,100503,10,'rgba(219, 4, 246,0.5)']

(Here I created 4 new links for 4 new targets. The source is the node with index 1 for all of them.)

(3+4): Sankey diagrams do not distinguish between sources and targets. To Sankey, all of them are just nodes. Every node can be both a source and a target. Look at this:

nodes = [
    ['ID', 'Label', 'Color'],
    [0,'WAKA WANNA BE SOURCE','#F27420'],
    [1,'WAKA WANNA BE TARGET','#4994CE'],
    [2,'WAKA DON\'T KNOW WHO WANNA BE','#FABC13'],

]
links = [
    ['Source','Target','Value','Link Color'],
    [0,1,10,'rgba(253, 227, 212, 1)'],
    [0,2,10,'rgba(242, 116, 32, 1)'],
    [2,1,10,'rgba(253, 227, 212, 1)'],
]

Here you will have a 3-column Sankey diagram. Node 0 is a source, node 1 is a target, and node 2 is a source for 1 and a target for 0.
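Since the role of a node follows only from how the links use it, it can be read off the links table directly. This is a minimal sketch in plain Python; the function name node_roles is hypothetical, and the links are reduced to (source, target) pairs from the example above.

```python
def node_roles(links):
    """Classify each node index as 'source', 'target' or 'both'."""
    sources = {source for source, _ in links}
    targets = {target for _, target in links}
    roles = {}
    for node in sorted(sources | targets):
        if node in sources and node in targets:
            roles[node] = 'both'
        elif node in sources:
            roles[node] = 'source'
        else:
            roles[node] = 'target'
    return roles


links = [(0, 1), (0, 2), (2, 1)]
print(node_roles(links))  # {0: 'source', 1: 'target', 2: 'both'}
```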
