Gremlin:AWS Neptune-以CSV格式获取图中每个节点的所有叶节点 [英] Gremlin : AWS Neptune - Get all Leaf Nodes for each Node in the Graph as CSV

查看:85
本文介绍了Gremlin:AWS Neptune-以CSV格式获取图中每个节点的所有叶节点的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个简单的图,其中的节点以以下形式表示重复的记录ID

I have a simple graph that has Nodes which represent duplicate record id in the below form

Duplicate Id, Original Id
A,B
B,C
C,D
X,Y
Y,Z

有向图看起来像A->B-> C-> D,我想要下面的CSV结果,该结果的每个节点都具有最终叶节点,并且没有更多的出局边缘

The directed graph looks like A -> B ->C ->D and I want CSV result that looks like below that has each Node with ultimate leaf node with no more outgoing edges

A,D
B,D
C,D
X,Z
Y,Z

上面是一个简单的场景,用于解释该问题,但是实际数据将具有更复杂的场景,如下所示,其中我有24个从A到X的节点,每个节点都连接到其他23个具有24C2 = 276有向边的节点,并且对于每个24个节点,我需要没有更多输出边缘的最终节点.

The above is a simplistic scenario to explain the problem however actual data will have more complex scenarios like below where I have 24 Nodes from A to X with each node connect to other 23 having 24C2=276 directed edges and for each of the 24 nodes I require the ultimate node that has no more outgoing edges.

A,B
A,C
A,D
A,E
A,F
A,G
A,H
A,I
A,J
A,K
A,L
A,M
A,N
A,O
A,P
A,Q
A,R
A,S
A,T
A,U
A,V
A,W
A,X
B,C
B,D
B,E
B,F
B,G
B,H
B,I
B,J
B,K
B,L
B,M
B,N
B,O
B,P
B,Q
B,R
B,S
B,T
B,U
B,V
B,W
B,X
C,D
C,E
C,F
C,G
C,H
C,I
C,J
C,K
C,L
C,M
C,N
C,O
C,P
C,Q
C,R
C,S
C,T
C,U
C,V
C,W
C,X
D,E
D,F
D,G
D,H
D,I
D,J
D,K
D,L
D,M
D,N
D,O
D,P
D,Q
D,R
D,S
D,T
D,U
D,V
D,W
D,X
E,F
E,G
E,H
E,I
E,J
E,K
E,L
E,M
E,N
E,O
E,P
E,Q
E,R
E,S
E,T
E,U
E,V
E,W
E,X
F,G
F,H
F,I
F,J
F,K
F,L
F,M
F,N
F,O
F,P
F,Q
F,R
F,S
F,T
F,U
F,V
F,W
F,X
G,H
G,I
G,J
G,K
G,L
G,M
G,N
G,O
G,P
G,Q
G,R
G,S
G,T
G,U
G,V
G,W
G,X
H,I
H,J
H,K
H,L
H,M
H,N
H,O
H,P
H,Q
H,R
H,S
H,T
H,U
H,V
H,W
H,X
I,J
I,K
I,L
I,M
I,N
I,O
I,P
I,Q
I,R
I,S
I,T
I,U
I,V
I,W
I,X
J,K
J,L
J,M
J,N
J,O
J,P
J,Q
J,R
J,S
J,T
J,U
J,V
J,W
J,X
K,L
K,M
K,N
K,O
K,P
K,Q
K,R
K,S
K,T
K,U
K,V
K,W
K,X
L,M
L,N
L,O
L,P
L,Q
L,R
L,S
L,T
L,U
L,V
L,W
L,X
M,N
M,O
M,P
M,Q
M,R
M,S
M,T
M,U
M,V
M,W
M,X
N,O
N,P
N,Q
N,R
N,S
N,T
N,U
N,V
N,W
N,X
O,P
O,Q
O,R
O,S
O,T
O,U
O,V
O,W
O,X
P,Q
P,R
P,S
P,T
P,U
P,V
P,W
P,X
Q,R
Q,S
Q,T
Q,U
Q,V
Q,W
Q,X
R,S
R,T
R,U
R,V
R,W
R,X
S,T
S,U
S,V
S,W
S,X
T,U
T,V
T,W
T,X
U,V
U,W
U,X
V,W
V,X
W,X

以下是使用Neo4j的图形可视化-

Here is a graphical visualization using Neo4j -

对于上述情况,每个节点A到W都将X作为最终叶节点.

For the above case each Node A to W will have X as the ultimate Leaf Node.

在整体解决方案中,还需要避免循环循环.一步一步完成所有内容可能会太多,但是希望您能得到指导.

There will also be Cyclic loops which I need to avoid in the overall solution. It may be too much to have everything in one step however will appreciate guidance.

更新时间:2020-10-15 遍历优化需要优化执行过程,以查找从起始节点到叶节点的路径

Update : 2020-10-15 Traversal Optimization need to optimize execution in finding Path from starting Node to Leaf Node

对于以下数据方案,顶点A和B的结果应为

For the Data Scenario below, The result for Vertex A and B should be

A,G
A,H
B,G
B,H

从A到G的最短路径是A-> C-> E-> G;找到任何一条通往叶节点的最短路径时,是否有可能抑制从A到G的任何进一步遍历?否则,这将导致不必要的执行,尤其是对于连接的节点的较大群集.需要再次从A到H重复此步骤,其路径将为A-> C-> F-> H,然后阻止进一步尝试查找A和H之间的路径.

The shortest path from A to G is A->C->E->G ; is it possible to suppress any further traversals from A to G when any 1 shortest path to a leaf node is found ? else it will lead to unwanted execution especially for larger clusters of connected Nodes. This steps needs to be repeated again from A to H whose path will be A->C->F->H and then prevent any further attempts to find paths between A and H.

谢谢

推荐答案

根据您提供的内容,可以构建以下图形.

Based on what you have provided, the following graph can be built.

g.addV('A').as('a').
  addV('B').as('b').
  addV('C').as('c').
  addV('D').as('d').
  addV('X').as('x').
  addV('Y').as('y').
  addV('Z').as('z').
  addE('dup').from('a').to('b').
  addE('dup').from('b').to('c').
  addE('dup').from('c').to('d').
  addE('dup').from('x').to('y').
  addE('dup').from('y').to('z').
  iterate()  

和以下用于查找结果的查询

and the following query used to find the results

gremlin> g.V().
           repeat(out().simplePath()).
           until(__.not(out())).
           path().
             by(label).
           local(
             unfold().
             union(
               limit(1),
               tail()).
             fold())


==>[A,D]
==>[B,D]
==>[C,D]
==>[X,Z]
==>[Y,Z] 

已于2020年10月11日更新

使用您提供的较大数据集,我对查询进行了一些调整,以便对于每个起始顶点,仅找到到叶子的一条路径.这运行非常快.如果不这样做,从"A"开始,实际上有数百万个唯一路径以"X"结尾,这就是为什么上一个查询变得如此复杂的原因.

Using the larger data set you provided, I tweaked the query a bit so that for each starting vertex only one path to a leaf is found. This runs very quickly. Without doing this, starting at say 'A', there are literally millions of unique paths that end up at 'X' which is why the previous query becomes so complex.

gremlin> g.V().
......1>   local(
......2>     repeat(out().simplePath()).
......3>     until(__.not(out())).
......4>     path().
......5>       by(label).
......6>       limit(1)).
......7>   local(
......8>     unfold().
......9>     union(
.....10>       limit(1),
.....11>       tail()).
.....12>     fold())

==>[A,X]
==>[B,X]
==>[C,X]
==>[D,X]
==>[E,X]
==>[F,X]
==>[G,X]
==>[H,X]
==>[I,X]
==>[J,X]
==>[K,X]
==>[L,X]
==>[M,X]
==>[N,X]
==>[O,X]
==>[P,X]
==>[Q,X]
==>[R,X]
==>[S,X]
==>[T,X]
==>[U,X]
==>[V,X]
==>[W,X]

有趣的是,以下查询显示了图中高度连接的风扇.

Purely for interest the following query shows the highly connected fan out of the graph.

gremlin> g.V().group().by(label).by(local(out().label().order().fold())).unfold()

==>A=[B, C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>B=[C, D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>C=[D, E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>D=[E, F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>E=[F, G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>F=[G, H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>G=[H, I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>H=[I, J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>I=[J, K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>J=[K, L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>K=[L, M, N, O, P, Q, R, S, T, U, V, W, X]
==>L=[M, N, O, P, Q, R, S, T, U, V, W, X]
==>M=[N, O, P, Q, R, S, T, U, V, W, X]
==>N=[O, P, Q, R, S, T, U, V, W, X]
==>O=[P, Q, R, S, T, U, V, W, X]
==>P=[Q, R, S, T, U, V, W, X]
==>Q=[R, S, T, U, V, W, X]
==>R=[S, T, U, V, W, X]
==>S=[T, U, V, W, X]
==>T=[U, V, W, X]
==>U=[V, W, X]
==>V=[W, X]
==>W=[X]
==>X=[]

和计数(将这些数字相乘得到一个很大的数字),这解释了为什么从'A'查找所有路径是一个昂贵的查询.请注意, simplePath 步骤可确保不遵循任何周期,从而为我们提供了帮助.在该数据集中,从任何顶点到"X"的路径数最终为2 ^(C-1),其中C是下面列表中给定起点的数字.

and the counts (multiplying out those numbers gives a large number) which explains why finding all paths from 'A' is an expensive query. Note that the simplePath step helps us by making sure we do not follow any cycles. The number of paths from any vertex to 'X' in this data set ends up being 2^(C-1) where C is the number in the list below for a given start.

gremlin> g.V().group().by(label).by(local(out().count())).unfold()

==>A=23
==>B=22
==>C=21
==>D=20
==>E=19
==>F=18
==>G=17
==>H=16
==>I=15
==>J=14
==>K=13
==>L=12
==>M=11
==>N=10
==>O=9
==>P=8
==>Q=7
==>R=6
==>S=5
==>T=4
==>U=3
==>V=2
==>W=1
==>X=0

这篇关于Gremlin:AWS Neptune-以CSV格式获取图中每个节点的所有叶节点的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆