How to iterate over a GraphML file with lxml


Question

I have the following GraphML file 'mygraph.gml' that I want to parse with a simple Python script. It represents a simple graph with two nodes and an edge between them:

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
         http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
  <key id="name" for="node" attr.name="name" attr.type="string"/>
  <key id="weight" for="edge" attr.name="weight" attr.type="double"/>
  <graph id="G" edgedefault="directed">
    <node id="n0">
      <data key="name">node1</data>
    </node>
    <node id="n1">
      <data key="name">node2</data>
    </node>
    <edge source="n1" target="n0">
      <data key="weight">1</data>
    </edge>
  </graph>
</graphml>

This represents a graph with two nodes, n0 and n1, connected by an edge of weight 1. I want to parse this structure with Python.

I wrote a script using lxml (I need it because the real dataset is much bigger than this simple example, with more than 10^5 nodes, and Python's minidom is too slow):

import lxml.etree as et

tree = et.parse('mygraph.gml')

root = tree.getroot()

graphml = {
"graph": "{http://graphml.graphdrawing.org/xmlns}graph",
"node": "{http://graphml.graphdrawing.org/xmlns}node",
"edge": "{http://graphml.graphdrawing.org/xmlns}edge",
"data": "{http://graphml.graphdrawing.org/xmlns}data",
"label": "{http://graphml.graphdrawing.org/xmlns}data[@key='label']",
"x": "{http://graphml.graphdrawing.org/xmlns}data[@key='x']",
"y": "{http://graphml.graphdrawing.org/xmlns}data[@key='y']",
"size": "{http://graphml.graphdrawing.org/xmlns}data[@key='size']",
"r": "{http://graphml.graphdrawing.org/xmlns}data[@key='r']",
"g": "{http://graphml.graphdrawing.org/xmlns}data[@key='g']",
"b": "{http://graphml.graphdrawing.org/xmlns}data[@key='b']",
"weight": "{http://graphml.graphdrawing.org/xmlns}data[@key='weight']",
"edgeid": "{http://graphml.graphdrawing.org/xmlns}data[@key='edgeid']"
}

graph = tree.find(graphml.get("graph"))
nodes = graph.findall(graphml.get("node"))
edges = graph.findall(graphml.get("edge"))

This script correctly gets the nodes and edges, so I can simply iterate over them:

for n in nodes:
    print(n.attrib)

or similarly over the edges:

for e in edges:
    print (e.attrib['source'], e.attrib['target'])

But I can't figure out how to get the "data" tags of the edges or nodes, so that I can print the edge weight and the "name" of each node.

This doesn't work for me:

weights = graph.findall(graphml.get("weight"))

The last list is always empty. Why? I'm missing something, but I can't work out what.

Answer

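You can't get them with that single lookup on graph: a path made of a bare tag name only matches direct children, and the data elements are nested inside each node and edge, so the list comes back empty. Instead, for each node or edge found, you can build a dict from the key/value pairs of its data elements: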

graph = tree.find(graphml.get("graph"))
nodes = graph.findall(graphml.get("node"))
edges = graph.findall(graphml.get("edge"))

for node in nodes + edges:
    attribs = {}
    for data in node.findall(graphml.get('data')):
        attribs[data.get('key')] = data.text
    print('Node', node, 'have', attribs)

This gives the result:

Node <Element {http://graphml.graphdrawing.org/xmlns}node at 0x7ff053d3e5a0> have {'name': 'node1'}
Node <Element {http://graphml.graphdrawing.org/xmlns}node at 0x7ff053d3e5f0> have {'name': 'node2'}
Node <Element {http://graphml.graphdrawing.org/xmlns}edge at 0x7ff053d3e640> have {'weight': '1'}
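
As a rough, self-contained sketch of the same idea, the snippet below embeds a trimmed copy of the sample file as a string, shows the empty direct-child lookup next to a descendant search written with a ".//" prefix, and then builds per-element dicts as above. The names NS, GRAPHML, data_dict, node_names and edge_list are made up for this illustration, and the fallback to the standard library's ElementTree is only an assumption so that the sketch still runs where lxml is not installed.

```python
try:
    import lxml.etree as et
except ImportError:
    # Assumption: the stdlib ElementTree API behaves the same for these calls.
    import xml.etree.ElementTree as et

# Namespace in Clark notation, as used by the paths in the question.
NS = "{http://graphml.graphdrawing.org/xmlns}"

# Trimmed copy of the sample file (XML declaration and schema hints omitted).
GRAPHML = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <key id="name" for="node" attr.name="name" attr.type="string"/>
  <key id="weight" for="edge" attr.name="weight" attr.type="double"/>
  <graph id="G" edgedefault="directed">
    <node id="n0"><data key="name">node1</data></node>
    <node id="n1"><data key="name">node2</data></node>
    <edge source="n1" target="n0"><data key="weight">1</data></edge>
  </graph>
</graphml>"""

root = et.fromstring(GRAPHML)
graph = root.find(NS + "graph")

# A bare tag path only matches direct children of <graph>, so this is empty.
direct = graph.findall(NS + "data[@key='weight']")

# Prefixing the path with ".//" searches all descendants instead.
nested = graph.findall(".//" + NS + "data[@key='weight']")
weights = [float(d.text) for d in nested]


def data_dict(element):
    """Collect the <data> children of one node or edge into a dict."""
    return {d.get("key"): d.text for d in element.findall(NS + "data")}


# Map each node id to its "name" value.
node_names = {
    n.get("id"): data_dict(n).get("name")
    for n in graph.findall(NS + "node")
}

# One (source, target, weight) tuple per edge.
edge_list = [
    (e.get("source"), e.get("target"), float(data_dict(e)["weight"]))
    for e in graph.findall(NS + "edge")
]

print(direct)
print(weights)
print(node_names)
print(edge_list)
```

The descendant search is convenient when only the values matter. The per-element dict keeps each value attached to its node or edge, which is usually what is needed to rebuild the graph.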
