Parsing XPath within non-standard XML using lxml in Python


Question

I'm trying to create a database of all patent information from Google Patents. Much of my work so far has built on this very good answer from MattH in "Python to parse non-standard XML file". My Python script is too large to display, so it is linked here.

The source files are here: a bunch of XML files appended together into one file with multiple headers. The issue is finding the correct XPath expression when parsing this unusual "non-standard" XML file, which has multiple XML and DTD declarations. I have been trying to use "-".join(doc.xpath(...)) to tie everything together once it is parsed, but the output contains blanks separated by hyphens for the <document-id> and <classification-national> elements shown below:

<references-cited>
  <citation>
    <patcit num="00001">
      <document-id>
        <country>US</country>
        <doc-number>534632</doc-number>
        <kind>A</kind>
        <name>Coleman</name>
        <date>18950200</date>
      </document-id>
    </patcit>
    <category>cited by examiner</category>
    <classification-national>
      <country>US</country>
      <main-classification>249127</main-classification>
    </classification-national>
  </citation>

Note: not every child element exists within each <citation>; sometimes none of them are present at all.

How can I write the XPath so that the values are joined with hyphens, one hyphen between each data entry, for each of the multiple entries under <citation>?

Recommended answer

From this XML (references.xml),

<references-cited> 
  <citation> 
    <patcit num="00001"> 
      <document-id>
        <country>US</country> 
        <doc-number>534632</doc-number> 
        <kind>A</kind>
        <name>Coleman</name> 
        <date>18950200</date> 
      </document-id> 
    </patcit>
    <category>cited by examiner</category>
    <classification-national>
      <country>US</country>
      <main-classification>249127</main-classification>
    </classification-national>
  </citation>

  <citation>
    <patcit num="00002">
      <document-id>
        <country>US</country>
        <doc-number>D28957</doc-number>
        <kind>S</kind>
        <name>Simon</name>
        <date>18980600</date>
      </document-id>
    </patcit>
    <category>cited by other</category>
  </citation>
</references-cited>

you can get the text content of every descendant of <citation> that has any content as follows:

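from lxml import etree

doc = etree.parse("references.xml")
cits = doc.xpath('/references-cited/citation')

for c in cits:
    # Every element below this <citation>, in document order
    descs = c.xpath('.//*')
    for d in descs:
        # Skip container elements whose text is empty or whitespace only
        if d.text and d.text.strip():
            print("%s: %s" % (d.tag, d.text))
    # Blank line between citations
    print()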

Output:

country: US
doc-number: 534632
kind: A
name: Coleman
date: 18950200
category: cited by examiner
country: US
main-classification: 249127

country: US
doc-number: D28957
kind: S
name: Simon
date: 18980600
category: cited by other
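
The second block above is shorter because that <citation> has no <classification-national>. If every row needs the same columns, one option is to look up each field by its path and substitute a placeholder when it is missing. The following is only a sketch under assumptions: the FIELDS list and the citation_record helper are hypothetical names that are not part of the original answer, the sample data is inlined so the code runs without references.xml, and only ElementTree-compatible calls (findall, findtext) are used so that it falls back to the standard library if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback: only ElementTree-compatible calls are used
    import xml.etree.ElementTree as etree

# The answer's sample data, inlined so the sketch runs without references.xml.
XML = """<references-cited>
  <citation>
    <patcit num="00001">
      <document-id>
        <country>US</country>
        <doc-number>534632</doc-number>
        <kind>A</kind>
        <name>Coleman</name>
        <date>18950200</date>
      </document-id>
    </patcit>
    <category>cited by examiner</category>
    <classification-national>
      <country>US</country>
      <main-classification>249127</main-classification>
    </classification-national>
  </citation>
  <citation>
    <patcit num="00002">
      <document-id>
        <country>US</country>
        <doc-number>D28957</doc-number>
        <kind>S</kind>
        <name>Simon</name>
        <date>18980600</date>
      </document-id>
    </patcit>
    <category>cited by other</category>
  </citation>
</references-cited>"""

# Hypothetical column list: paths are relative to each <citation>.
FIELDS = [
    "patcit/document-id/country",
    "patcit/document-id/doc-number",
    "patcit/document-id/kind",
    "patcit/document-id/name",
    "patcit/document-id/date",
    "category",
    "classification-national/country",
    "classification-national/main-classification",
]


def citation_record(citation, missing=""):
    """Return one value per entry in FIELDS, using `missing` for absent ones."""
    values = []
    for path in FIELDS:
        text = citation.findtext(path)  # None when the element does not exist
        values.append(text.strip() if text and text.strip() else missing)
    return values


root = etree.fromstring(XML)
records = [citation_record(c) for c in root.findall("citation")]
for record in records:
    print(record)
```

With this layout every record has eight entries, and the two classification columns of the second citation hold the placeholder instead of shifting the other values.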

This variation of the first script writes each value with a leading hyphen instead:

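import sys
from lxml import etree

doc = etree.parse("references.xml")
cits = doc.xpath('/references-cited/citation')

for c in cits:
    # Every element below this <citation>, in document order
    descs = c.xpath('.//*')
    for d in descs:
        # Skip container elements whose text is empty or whitespace only
        if d.text and d.text.strip():
            # Write without a newline so the values stay on one line
            sys.stdout.write("-%s" % (d.text))
    # End the line for this citation
    print()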

Result:

-US-534632-A-Coleman-18950200-cited by examiner-US-249127
-US-D28957-S-Simon-18980600-cited by other
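
Each line above starts with a hyphen because the separator is written before every value. If the hyphens should appear only between values, as the question's "-".join attempt intended, the non-empty texts can be collected first and joined afterwards. This is a minimal sketch, not the original answer's code: hyphenated_citations is a hypothetical helper name, the sample data is inlined so the code runs without references.xml, and only ElementTree-compatible calls (findall, iter) are used so that it falls back to the standard library if lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fallback: only ElementTree-compatible calls are used
    import xml.etree.ElementTree as etree

# The answer's sample data, inlined so the sketch runs without references.xml.
XML = """<references-cited>
  <citation>
    <patcit num="00001">
      <document-id>
        <country>US</country>
        <doc-number>534632</doc-number>
        <kind>A</kind>
        <name>Coleman</name>
        <date>18950200</date>
      </document-id>
    </patcit>
    <category>cited by examiner</category>
    <classification-national>
      <country>US</country>
      <main-classification>249127</main-classification>
    </classification-national>
  </citation>
  <citation>
    <patcit num="00002">
      <document-id>
        <country>US</country>
        <doc-number>D28957</doc-number>
        <kind>S</kind>
        <name>Simon</name>
        <date>18980600</date>
      </document-id>
    </patcit>
    <category>cited by other</category>
  </citation>
</references-cited>"""


def hyphenated_citations(root):
    """Return one hyphen-joined string per <citation> (hypothetical helper)."""
    lines = []
    for citation in root.findall("citation"):
        values = [
            el.text.strip()
            for el in citation.iter()
            # Keep real elements only, and drop whitespace-only container text
            if isinstance(el.tag, str) and el.text and el.text.strip()
        ]
        lines.append("-".join(values))
    return lines


root = etree.fromstring(XML)
for line in hyphenated_citations(root):
    print(line)
```

Filtering before joining matters: container elements such as <patcit> and <document-id> carry only whitespace text, and joining those values unfiltered is one plausible source of the "blanks separated by hyphens" described in the question.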
