Python解析非标准XML文件 [英] Python to parse non-standard XML file

查看:36
本文介绍了Python解析非标准XML文件的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我的输入文件实际上是附加到一个文件的多个 XML 文件.(来自 Google 专利).它具有以下结构:

My input file is actually multiple XML files appending to one file. (It's from Google Patents). It has below structure:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-grant SYSTEM "us-patent-grant.dtd" [ ]>
<root_node>...</root_node>

Python xml.dom.minidom 无法解析这个非标准文件.解析此文件的更好方法是什么?我不是下面的代码性能好不好.

Python xml.dom.minidom can't parse this non-standard file. What's a better way to parse this file? I am not below code has good performance or not.

for line in infile:
  if line == '<?xml version="1.0" encoding="UTF-8"?>': 
    xmldoc = minidom.parse(XMLstring)
  else:
    XMLstring += line

推荐答案

这是我的看法,使用生成器和 lxml.etree.仅以提取信息为例.

Here's my take on it, using a generator and lxml.etree. Extracted information purely for example.

import urllib2, os, zipfile
from lxml import etree

def xmlSplitter(data,separator=lambda x: x.startswith('<?xml')):
  buff = []
  for line in data:
    if separator(line):
      if buff:
        yield ''.join(buff)
        buff[:] = []
    buff.append(line)
  yield ''.join(buff)

def first(seq,default=None):
  """Return the first item from sequence, seq or the default(None) value"""
  for item in seq:
    return item
  return default

datasrc = "http://commondatastorage.googleapis.com/patents/grantbib/2011/ipgb20110104_wk01.zip"
filename = datasrc.split('/')[-1]

if not os.path.exists(filename):
  with open(filename,'wb') as file_write:
    r = urllib2.urlopen(datasrc)
    file_write.write(r.read())

zf = zipfile.ZipFile(filename)
xml_file = first([ x for x in zf.namelist() if x.endswith('.xml')])
assert xml_file is not None

count = 0
for item in xmlSplitter(zf.open(xml_file)):
  count += 1
  if count > 10: break
  doc = etree.XML(item)
  docID = "-".join(doc.xpath('//publication-reference/document-id/*/text()'))
  title = first(doc.xpath('//invention-title/text()'))
  assignee = first(doc.xpath('//assignee/addressbook/orgname/text()'))
  print "DocID:    {0}\nTitle:    {1}\nAssignee: {2}\n".format(docID,title,assignee)

产量:

DocID:    US-D0629996-S1-20110104
Title:    Glove backhand
Assignee: Blackhawk Industries Product Group Unlimited LLC

DocID:    US-D0629997-S1-20110104
Title:    Belt sleeve
Assignee: None

DocID:    US-D0629998-S1-20110104
Title:    Underwear
Assignee: X-Technology Swiss GmbH

DocID:    US-D0629999-S1-20110104
Title:    Portion of compression shorts
Assignee: Nike, Inc.

DocID:    US-D0630000-S1-20110104
Title:    Apparel
Assignee: None

DocID:    US-D0630001-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630002-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630003-S1-20110104
Title:    Hooded shirt
Assignee: None

DocID:    US-D0630004-S1-20110104
Title:    Headwear cap
Assignee: None

DocID:    US-D0630005-S1-20110104
Title:    Footwear
Assignee: Vibram S.p.A.

这篇关于Python解析非标准XML文件的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆