How to split an XML file (into several other files) properly once a certain tag has been found?

Problem description

Problem:

I am trying to split an XML file by rewriting it each time a certain tag is found. However, the result does not come out properly: while iterating through the elements and adding them to a new ET, their children are not copied. The children are only added later, once the iterator reaches them, so even if I found a way to copy the children along with their parent, they would end up duplicated.

What I have tried:

I parse the XML with lxml's ElementTree and then iterate through the elements.

If the element's tag doesn't match, the element is added to an ET object, which is then written out with tostring. Once the iterated element matches the tag I want to split at, the file name changes, which effectively 'splits' the output by writing it to a new file.

from lxml import etree as ET

parser = ET.XMLParser()
context = ET.parse('activity-list(2).xml', parser=parser)
index = 0
root = context.getroot()

new_data = ET.Element('iati-activity')

for elem in context.iter('iati-activity'):
    for element in list(elem.iter()):
        if element.tag == 'iati-identifier':
            print("PASSED HERE")
            index = index + 1
        filename = format(str(index) + ".xml")
        print("ELEMENT IS", element.tag)
        new_sub = ET.SubElement(new_data, element.tag, attrib = 
        element.attrib)
        new_sub.text = element.text 
        with open(filename, 'wb') as f:
            f.write(ET.tostring(new_data))

Edit:

XML structure (input):

<iati-activities version="2.03">
    <iati-activity>
       <iati-identifier>
          <title>
               <narrative>
               </narrative>
          </title>
       </iati-identifier>
       <iati-identifier>
          <title>
               <narrative>
               </narrative>
          </title>
       </iati-identifier>
    </iati-activity>
</iati-activities>

XML structure (current output):

<iati-activities version="2.03">
    <iati-activity>
       <iati-identifier>
          <title>
          </title>
          <narrative>
          </narrative>
       </iati-identifier>
    </iati-activity>
</iati-activities>

... The same structure is created in the second file with the next iati-identifier's data

Current input:

<iati-activity>
    <iati-identifier>XM-DAC-6-4-011077</iati-identifier>
    <reporting-org ref="XM-DAC-6-4" type="10" secondary-reporter="0">
      <narrative xml:lang="it">AICS - Agenzia Italiana per la Cooperazione allo Sviluppo</narrative>
      <narrative>Italian Agency for Development Cooperation</narrative>
    </reporting-org>
    <title>
      <narrative>Protracted relief and recovery operation</narrative>
      <narrative xml:lang="it">Protracted relief and recovery operation </narrative>
    </title>
    <description>
      <narrative>Protracted relief and recovery operation</narrative>
    </description>
    <description>
      <narrative xml:lang="it">Protracted relief and recovery operation </narrative>
    </description>
    <participating-org ref="XM-DAC-6-4" type="10" role="1">
      <narrative>AICS - Italian Agency for Cooperation and Development</narrative>
    </participating-org>
    <other-identifier ref="011077" type="A1">
      <owner-org ref="XM-DAC-6-4">
        <narrative>AICS</narrative>
      </owner-org>
    </other-identifier>
    <activity-status code="2"/>
    <activity-date iso-date="2017-05-01" type="1"/>
    <activity-date iso-date="2018-04-30" type="3"/>
    <contact-info type="1">
      <organisation>
        <narrative>AICS - Italian Agency for Cooperation and Development</narrative>
      </organisation>
      <telephone>+ 39 06 32492 305</telephone>
      <email>info@aics.gov.it</email>
      <mailing-address>
        <narrative>via Salvatore Contarini 25, 00135 Roma</narrative>
      </mailing-address>
    </contact-info>
    <recipient-country code="SO" percentage="100.00"/>
    <location>
      <location-reach code="1"/>
      <location-id/>
      <point/>
    </location>
    <collaboration-type code="3"/>
    <related-activity ref="XM-DAC-6-4-011077-01-0" type="2"/>
    <iati-identifier>XM-DAC-6-4-011077-01-0</iati-identifier>
    <reporting-org ref="XM-DAC-6-4" type="10" secondary-reporter="0">
      <narrative xml:lang="it">AICS - Agenzia Italiana per la Cooperazione allo Sviluppo</narrative>
      <narrative>Italian Agency for Development Cooperation</narrative>
    </reporting-org>
    <title>
      <narrative>Protracted relief and recovery operation</narrative>
      <narrative xml:lang="it">Protracted relief and recovery operation</narrative>
    </title>
    <description>
      <narrative>The scope of the program is to support the population on food security and resilience. In particular, to support local agricultural products and vulnerable families on food security.</narrative>
    </description>
    <description>
      <narrative xml:lang="it">Contributo al PAM per il programma per la sicurezza alimentare e la resilienza. Le attività, che con programmi analoghi sono state realizzate già negli scorsi anni includono oltre al tradizionale aiuto alimentare, anche il sostegno alle attività generatrici di reddito, la realizzazione di infrastrutture, il sostegno ai produttori agricoli locali e il sostegno alle famiglie più vulnerabili, per l’acquisto di beni alimentari e non, nel mercato locale attraverso smartcard prepagate che includono anche i dati biometrici dei beneficiari</narrative>
    </description>
    <participating-org ref="XM-DAC-6-4" type="10" role="1">
      <narrative>AICS - Italian Agency for Cooperation and Development</narrative>
    </participating-org>
    <participating-org ref="41140" type="40" role="4">
      <narrative>WFP - WORLD FOOD PROGRAMME</narrative>
    </participating-org>
    <other-identifier ref="011077/01/0" type="A1">
      <owner-org ref="XM-DAC-6-4">
        <narrative>AICS</narrative>
      </owner-org>
    </other-identifier>
    <activity-status code="2"/>
    <activity-date iso-date="2017-05-02" type="1"/>
    <activity-date iso-date="2018-04-30" type="3"/>
    <contact-info type="1">
      <organisation>
        <narrative>AICS - Italian Agency for Cooperation and Development</narrative>
      </organisation>
      <telephone>+ 39 06 32492 305</telephone>
      <email>info@aics.gov.it</email>
      <mailing-address>
        <narrative>via Salvatore Contarini 25, 00135 Roma</narrative>
      </mailing-address>
    </contact-info>
    <recipient-country code="SO" percentage="100.00"/>
    <sector code="52010" vocabulary="1" percentage="100.00"/>
    <policy-marker vocabulary="1" code="1" significance="0">
      <narrative>Gender Equality</narrative>
    </policy-marker>
    <policy-marker vocabulary="1" code="2" significance="0">
      <narrative>Aid to Environment</narrative>
    </policy-marker>
    <policy-marker vocabulary="1" code="3" significance="2">
      <narrative>Participatory Development/Good Governance</narrative>
    </policy-marker>
    <policy-marker vocabulary="1" code="4" significance="0">
      <narrative>Trade Development</narrative>
    </policy-marker>
    <policy-marker vocabulary="1" code="5" significance="0">
      <narrative>Aid Targeting the Objectives of the Convention on Biological Diversity</narrative>
    </policy-marker>
    <policy-marker vocabulary="1" code="6" significance="0">
      <narrative>Aid Targeting the Objectives of the Framework Convention on Climate Change - Mitigation</narrative>
    </policy-marker>
    <policy-marker vocabulary="1" code="7" significance="0">
      <narrative>Aid Targeting the Objectives of the Framework Convention on Climate Change - Adaptation</narrative>
    </policy-marker>
    <policy-marker vocabulary="1" code="8" significance="0">
      <narrative>Aid Targeting the Objectives of the Convention to Combat Desertification</narrative>
    </policy-marker>
    <collaboration-type code="3"/>
    <default-flow-type code="10"/>
    <default-finance-type code="110"/>
    <related-activity ref="XM-DAC-6-4-011077" type="1"/>
    </iati-activity>

Expected output:

<iati-activity>
  <iati-identifier>XM-DAC-6-4-011077</iati-identifier>
  <reporting-org ref="XM-DAC-6-4" type="10" secondary-reporter="0">
      <narrative xml:lang="it">AICS - Agenzia Italiana per la Cooperazione allo Sviluppo</narrative>
      <narrative>Italian Agency for Development Cooperation</narrative>
  </reporting-org>
  <title>
      <narrative>Protracted relief and recovery operation</narrative>
      <narrative xml:lang="it">Protracted relief and recovery operation 
      </narrative>
  </title>
  <description>
      <narrative>Protracted relief and recovery operation</narrative>
  </description>
</iati-activity>

... next XML starts with next <iati-identifier>

Current output:

<iati-activity>
  <iati-identifier>XM-DAC-6-4-011077</iati-identifier>
  <reporting-org ref="XM-DAC-6-4" type="10" secondary-reporter="0">
      </reporting-org>
  <narrative xml:lang="it">AICS - Agenzia Italiana per la Cooperazione allo Sviluppo</narrative>
  <narrative>Italian Agency for Development Cooperation</narrative>
  <title>
      </title>
  <narrative>Protracted relief and recovery operation</narrative>
  <narrative xml:lang="it">Protracted relief and recovery operation </narrative>
  <description>
      </description>
  <narrative>Protracted relief and recovery operation</narrative>
</iati-activity>
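
The flattened output above is consistent with how iter() behaves: it visits every descendant in document order, so re-attaching each visited element directly under the new root loses the nesting. The following is a minimal sketch of that effect and of one possible remedy (deep-copying only the direct children). It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra packages; the SAMPLE string and the variable names are made up for illustration.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced input used only for this illustration.
SAMPLE = "<iati-activity><title><narrative>Relief</narrative></title></iati-activity>"
src = ET.fromstring(SAMPLE)

# Approach from the question: walk every descendant and attach each one
# directly under the new root, which discards the parent/child nesting.
flat = ET.Element("iati-activity")
for node in src.iter():
    if node is src:
        continue  # iter() also yields the starting element itself
    sub = ET.SubElement(flat, node.tag, node.attrib)
    sub.text = node.text

# Possible remedy: deep-copy only the direct children, so each subtree
# (title together with its narrative) stays intact.
nested = ET.Element("iati-activity")
for child in src:
    nested.append(copy.deepcopy(child))

flat_tags = [child.tag for child in flat]
nested_xml = ET.tostring(nested, encoding="unicode")
print(flat_tags)
print(nested_xml)
```

In the flat variant, title and narrative end up as siblings under the root, while the deep-copied variant keeps narrative inside title.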

Recommended answer

Consider a parameterized XSLT to split your large input source into individual XML files by <iati-identifier> nodes. Python's lxml can run XSLT 1.0 scripts and can even pass parameter values from the application layer to the stylesheet (not unlike passing parameters in another declarative, special-purpose language, SQL).

Specifically, Python can iteratively pass the position of each iati-identifier after running an XPath expression (XPath being a sibling of XSLT) to get the total count of such nodes in the document. The expression following-sibling::node_name[1] selects the first following sibling with that name.

XSLT (save as an .xsl file, which is a special kind of .xml file)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:strip-space elements="*"/>
    <xsl:output indent="yes"/>

    <!-- XSL PARAM -->
    <xsl:param name="item_num"/>

    <xsl:template match="/iati-activity">
        <xsl:apply-templates select="iati-identifier[position()=$item_num]"/>
    </xsl:template>

    <xsl:template match="iati-identifier">
        <iati-activity>
            <xsl:copy-of select="."/>
            <xsl:copy-of select="following-sibling::reporting-org[1]"/>
            <xsl:copy-of select="following-sibling::narrative[1]"/>
            <xsl:copy-of select="following-sibling::title[1]"/>
            <xsl:copy-of select="following-sibling::description[1]"/>
        </iati-activity>
    </xsl:template>

</xsl:stylesheet>

Python

import lxml.etree as ET

# LOAD XML AND XSL SCRIPT
xml = ET.parse('Input.xml')
xsl = ET.parse('Script.xsl')
transform = ET.XSLT(xsl)

# LOOP THROUGH ALL NODE COUNTS AND PASS PARAMETER TO XSLT
iati_count = len(xml.xpath('//iati-identifier'))

for i in range(iati_count):
   n = ET.XSLT.strparam(str(i+1))            
   result = transform(xml, item_num=n)         # NAME OF XSL PARAMETER

   # SAVE XML TO FILE
   with open('Output_{}.xml'.format(i+1), 'wb') as f:
       f.write(result)


Output

Output_1.xml

<?xml version="1.0"?>
<iati-activity>
  <iati-identifier>XM-DAC-6-4-011077</iati-identifier>
  <reporting-org ref="XM-DAC-6-4" type="10" secondary-reporter="0">
    <narrative xml:lang="it">AICS - Agenzia Italiana per la Cooperazione allo Sviluppo</narrative>
    <narrative>Italian Agency for Development Cooperation</narrative>
  </reporting-org>
  <title>
    <narrative>Protracted relief and recovery operation</narrative>
    <narrative xml:lang="it">Protracted relief and recovery operation </narrative>
  </title>
  <description>
    <narrative>Protracted relief and recovery operation</narrative>
  </description>
</iati-activity>

Output_2.xml

<?xml version="1.0"?>
<iati-activity>
  <iati-identifier>XM-DAC-6-4-011077-01-0</iati-identifier>
  <reporting-org ref="XM-DAC-6-4" type="10" secondary-reporter="0">
    <narrative xml:lang="it">AICS - Agenzia Italiana per la Cooperazione allo Sviluppo</narrative>
    <narrative>Italian Agency for Development Cooperation</narrative>
  </reporting-org>
  <title>
    <narrative>Protracted relief and recovery operation</narrative>
    <narrative xml:lang="it">Protracted relief and recovery operation</narrative>
  </title>
  <description>
    <narrative>The scope of the program is to support the population on food security and resilience. In particular, to support local agricultural products and vulnerable families on food security.</narrative>
  </description>
</iati-activity>
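
For comparison, the same split can probably be done without XSLT by grouping the direct children of iati-activity at each iati-identifier. The sketch below is not part of the answer above; it assumes that every sibling up to the next iati-identifier belongs in the same output file (the XSLT instead copies only the first reporting-org, title and description). The helper split_by_tag and the SAMPLE data are hypothetical, and the standard library's xml.etree.ElementTree is used in place of lxml.

```python
import copy
import xml.etree.ElementTree as ET

def split_by_tag(activity, split_tag="iati-identifier"):
    """Return one new element per split_tag child, each holding that child
    and all following siblings up to (not including) the next split_tag."""
    groups = []
    current = None
    for child in activity:  # direct children only, in document order
        if child.tag == split_tag:
            # Start a new output root with the same tag and attributes.
            current = ET.Element(activity.tag, activity.attrib)
            groups.append(current)
        if current is not None:  # children before the first split_tag are skipped
            current.append(copy.deepcopy(child))
    return groups

# Hypothetical reduced input for illustration.
SAMPLE = """<iati-activity>
  <iati-identifier>A-1</iati-identifier>
  <title><narrative>First</narrative></title>
  <iati-identifier>A-2</iati-identifier>
  <title><narrative>Second</narrative></title>
  <description><narrative>More</narrative></description>
</iati-activity>"""

parts = split_by_tag(ET.fromstring(SAMPLE))
for number, part in enumerate(parts, start=1):
    # Writing to disk is left commented out so the sketch has no side effects:
    # ET.ElementTree(part).write("Output_{}.xml".format(number), encoding="utf-8")
    print(number, [child.tag for child in part])
```

Because whole subtrees are deep-copied, nested elements such as narrative stay inside their parents in each output element.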
