Iterparse big XML, with low memory footprint, and get all, even nested, Sequence Elements


Problem description

I have written a small Python script to parse XML data, based on Liza Daly's blog post. However, my code does not parse all the nodes. For example, when a person has had multiple addresses, it takes only the first available address. The XML tree looks like this:

- lgs
    - entities
        - entity
            - id
            - name
            - addresses
                - address
                    - address1
                - address
                    - address1
        - entity
            - id
              (...)

And this is the Python script:

import os
import time
from datetime import datetime
import lxml.etree as ET
import pandas as pd

xml_file = '.\\FILE.XML'

file_name, file_extension = os.path.splitext(os.path.basename(xml_file))

def fast_iter(context, *args, **kwargs):
    # Iterate through the XML file and build a dictionary of columns
    xml_dict = {  "Id":[]
    , "name":[]
    , "address":[]
    , "type":[] }

    def try_to_append(xml_column, node):
        # find and append XML nodes to the empty dictionary
        try:
            xml_dict[xml_column].append(elem.find(node).text)
        except AttributeError:  # find() returned None: the node is missing
            xml_dict[xml_column].append('')

    for event, elem in context:
        try_to_append('Id', 'id')
        try_to_append('name', 'name')
        try_to_append('address', 'addresses/address/address1')
        elem.clear()

        for ancestor in elem.xpath("ancestor-or-self::*"):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context
    return xml_dict

context = ET.iterparse(xml_file, tag='entity')
xml_dict = fast_iter(context)
df = pd.DataFrame(xml_dict)

So what I get is:

ID | name     | address        | Type
1  | John Doe | Pythonstreet 4 | A

But John Doe has had 3 addresses registered to his name. So I would expect:

ID | name     | address        | Type
1  | John Doe | Pythonstreet 4 | A
1  | John Doe | SQL street 33  | A
1  | John Doe | C++ street 99  | A

As you can see, the only repeated node is the address, so the other columns stay the same and only the address changes per row. SQL Server does this automatically when you use OPENROWSET() with BULK. However, this file is larger than 3 GB, so that would not work.

<?xml version='1.0' encoding='UTF-8'?>
<ffl>
<version>12345</version>
<entities>

    <entity id="1124353" version="12345">
        <name>DAVID, Beckham</name>
        <footId>1021</footId>
        <footCode>FIF</footCode>
        <createdDate>09/02/1991</createdDate>

        <source>FIF</source>
        <OriginalSource>FIFA</OriginalSource>
        <pobs>
            <pob>Leytonstone, London, United Kingdom</pob>
        </pobs>
        <dobs>
            <dob>May 02, 1975</dob>
        </dobs>
        <titles>
            <title>Football player</title>
        </titles>
        <addresses>
            <address>
                <address></address>
                <city>London</city>
                <country>UK</country>
                <countryName>UNITED KINGDOM</countryName>
                <postalCode>SE10 0JF</postalCode>
            </address>
            <address>
                <address1>35-37 Parkgate Road</address1>
                <city>London</city>
                <country>UK</country>
                <countryName>UNITED KINGDOM</countryName>
                <postalCode>SW11 4NP</postalCode>
            </address>
        </addresses>
    </entity>

    <entity id="1184359" version="12345">
        <name>CRISTIANO, Ronaldo</name>
        <footId>1022</footId>
        <footCode>FIF</footCode>
        <createdDate>20/03/2003</createdDate>
        <source>FIF</source>
        <OriginalSource>FIFA</OriginalSource>
        <pobs>
            <pob>Funchal, Madeira, Portugal</pob>
        </pobs>
        <dobs>
            <dob>February 05, 1985</dob>
        </dobs>
        <titles>
            <title>Football player</title>
        </titles>
        <addresses>
            <address>
                <address>Avenida da Boavista 1837</address>
                <city>Porto</city>
                <country>PT</country>
                <countryName>PORTUGAL</countryName>
                <postalCode>4100-133</postalCode>
            </address>
            <address>
                <address1>Extramuros 74</address1>
                <city>Madrid</city>
                <country>ES</country>
                <countryName>Spain</countryName>
                <postalCode>28400</postalCode>
            </address>
            <address>
                <address1>Viale Certosa 29</address1>
                <city>Turin</city>
                <country>IT</country>
                <countryName>Italy</countryName>
                <postalCode>10093</postalCode>
            </address>
        </addresses>
    </entity>

    <entity id="1984359" version="12345">
        <name>LIONEL, Messi</name>
        <footId>1023</footId>
        <footCode>FIF</footCode>
        <createdDate>09/02/2008</createdDate>
        <source>FIF</source>
        <OriginalSource>FIFA</OriginalSource>
        <pobs>
            <pob>Rosario, Argentina</pob>
        </pobs>
        <dobs>
            <dob>June 24, 1987</dob>
        </dobs>
        <titles>
            <title>Football player</title>
        </titles>
        <addresses>
            <address>
                <address>Almeyra 2588</address>
                <city>San Martin</city>
                <state>Buenos Aires</state>
                <country>AR</country>
                <countryName>ARGENTINA</countryName>
                <postalCode>N/A</postalCode>
            </address>
            <address>
                <address1>Comandante Izarduy 67</address1>
                <city>Barcelona</city>
                <country>ES</country>
                <countryName>SPAIN</countryName>
                <postalCode>08940</postalCode>
            </address>
            <address>
                <address1>Humahuaca 4425</address1>
                <city>Buenos Aires</city>
                <country>AR</country>
                <countryName>ARGENTINA</countryName>
                <postalCode>N/A</postalCode>
            </address>
        </addresses>
    </entity>

</entities>

</ffl>

Recommended answer

Comment: As it now only outputs results

Printing the results is only for demonstration, tracing and debugging.
To write a record and its addresses into a SQL database, for example using sqlite3, do:

c.execute("INSERT INTO entity(id, name) VALUES(:id, :name)", record)
addresses = []
for addr in record['addresses']:
    addr[1].update({'id': record['id']})
    addresses.append(addr[1])
c.executemany("INSERT INTO adresses(id, address, city) VALUES(:id, :address, :city)", addresses)
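
For context, the snippet above assumes an open cursor c and existing tables. A minimal self-contained sketch of the same idea follows. The in-memory database, the table layouts, the table name addresses and the sample record are all assumptions made for illustration, not part of the original answer. Because sqlite3 raises an error when a named placeholder has no value, each address row is built explicitly and falls back from address to address1.

```python
import sqlite3

# Hypothetical record in the shape the Entity class in this answer yields
record = {
    "id": "1184359",
    "name": "CRISTIANO, Ronaldo",
    "addresses": [
        ("address", {"address": "Avenida da Boavista 1837", "city": "Porto"}),
        ("address", {"address1": "Extramuros 74", "city": "Madrid"}),
    ],
}

# Assumed schema; a real database would have its own table definitions
conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE entity(id TEXT PRIMARY KEY, name TEXT)")
c.execute("CREATE TABLE addresses(id TEXT, address TEXT, city TEXT)")

# Insert the parent row using only the columns the table has
c.execute(
    "INSERT INTO entity(id, name) VALUES(:id, :name)",
    {"id": record["id"], "name": record["name"]},
)

# One row per address; every named placeholder gets a value (possibly None)
rows = [
    {
        "id": record["id"],
        "address": fields.get("address") or fields.get("address1"),
        "city": fields.get("city"),
    }
    for _tag, fields in record["addresses"]
]
c.executemany(
    "INSERT INTO addresses(id, address, city) VALUES(:id, :address, :city)", rows
)
conn.commit()

stored = c.execute("SELECT address, city FROM addresses ORDER BY rowid").fetchall()
entity_count = c.execute("SELECT COUNT(*) FROM entity").fetchone()[0]
```

For a multi-gigabyte file it would probably make sense to commit in batches rather than once per record, but that is a tuning choice outside this sketch.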


To flatten for pandas
Precondition outside the loop: df = pd.DataFrame()

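from copy import copy

import pandas as pd

addresses = copy(record['addresses'])
del record['addresses']

df_records = []
for addr in addresses:
    # Build a new dict per address; appending the same mutated `record`
    # would leave every row pointing at the last address
    df_records.append({**record, **addr[1]})

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df, pd.DataFrame(df_records)], ignore_index=True)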


Question: Use etree.iterparse to include all nodes in XML file

The following class Entity does this:

  • Parses the XML file using lxml.etree.iterparse.
  • There is no file size limit, as each <entity>...</entity> element tree is deleted after processing.
  • Builds a dict {tag: value, ...} from every <entity>...</entity> tree.
  • Uses a generator object to yield the dict.
  • Sequence elements, e.g. <addresses>/<address>, become a list of tuples [(address, {tag: text}), ...].

To do:

  • To flatten into many records, loop over record['addresses'].
  • To unify the differing tag names address and address1.
  • To flatten the sequence tags, e.g. <titles>, <pobs> and <dobs>.


from lxml import etree

class Entity:
    def __init__(self, fh):
        """
        Initialize 'iterparse' to only generate 'end' events on tag '<entity>'

        :param fh: File Handle from the XML File to parse
        """
        self.context = etree.iterparse(fh, events=("end",), tag=['entity'])

    def _parse(self):
        """
        Parse the XML File for all '<entity>...</entity>' Elements
        Clear/Delete the Element Tree after processing

        :return: Yield the current '<entity>...</entity>' Element Tree
        """
        for event, elem in self.context:
            yield elem

            elem.clear()
            while elem.getprevious() is not None:
                del elem.getparent()[0]

    def sequence(self, elements):
        """
        Expand a Sequence Element, e.g. <titles>, to a Tuple ('title', text).
        A nested Sequence Element, e.g. <address>, is expanded
          to a Tuple ('address', {tag: text})

        :param elements: The Sequence Element
        :return: List of Tuple [(tag1, value), (tag2, value), ..., (tagn, value)]
        """
        _elements = []
        for elem in elements:
            if len(elem):
                _elements.append((elem.tag, dict(self.sequence(elem))))
            else:
                _elements.append((elem.tag, elem.text))

        return _elements

    def __iter__(self):
        """
        Iterate all '<entity>...</entity>' Element Trees yielded from self._parse()

        :return: Dict 'entity' {tag1: value, tag2: value, ..., tagn: value}
        """
        for xml_entity in self._parse():
            entity = {'id': xml_entity.attrib['id']}

            for elem in xml_entity:
                # if elem is Sequence
                if len(elem):
                    # Append tuple(tag, value)
                    entity[elem.tag] = self.sequence(elem)
                else:
                    entity[elem.tag] = elem.text

            yield entity

if __name__ == "__main__":
    with open('.\\FILE.XML', 'rb') as in_xml:
        for record in Entity(in_xml):
            print("record:{}".format(record))

            for key, value in record.items():
                if isinstance(value, (list)):
                    #print_list(key, value)
                    print("{}:{}".format(key, value))
                else:
                    print("{}:{}".format(key, value))
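
The streaming idea can also be tried on a small in-memory document. The sketch below is not the answer's code: it swaps in the standard library's xml.etree.ElementTree so that it runs without lxml, and SAMPLE and iter_entities are hypothetical names. The standard library has no tag= filter and no getprevious()/getparent(), so the sketch filters on the tag itself and only calls clear(). It also shows why the question's code saw one address per person: find() returns the first match only, whereas findall() returns all of them.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature of the XML layout shown in the question
SAMPLE = b"""<ffl>
<entities>
  <entity id="1"><name>A</name>
    <addresses>
      <address><address>Street 1</address><city>London</city></address>
      <address><address1>Street 2</address1><city>Porto</city></address>
    </addresses>
  </entity>
  <entity id="2"><name>B</name>
    <addresses>
      <address><address1>Street 3</address1><city>Madrid</city></address>
    </addresses>
  </entity>
</entities>
</ffl>"""


def iter_entities(source):
    """Yield (id, name, [address dicts]) for each <entity> in the stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "entity":
            continue  # inner elements also fire 'end' events; skip them
        # findall() returns every <address>, not just the first one
        addresses = [
            {child.tag: child.text for child in addr}
            for addr in elem.findall("addresses/address")
        ]
        yield elem.get("id"), elem.findtext("name"), addresses
        # Free the children and text; the emptied <entity> shell stays
        # attached to its parent, unlike the lxml version above
        elem.clear()


entities = list(iter_entities(io.BytesIO(SAMPLE)))
```

With lxml, the tag='entity' argument and the sibling deletion used in the class above avoid both the manual tag check and the leftover empty elements.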

Output: Shows only the first record and only 4 fields.
Note: There is a pitfall with the differing tag names address and address1.

record:{'id': '1124353', 'titles': {'title': 'Foot... (omitted for brevity)
id:1124353
name:DAVID, Beckham
titles:[('title', 'Football player')]
addresses:
    address:{'city': 'London', 'address': None, 'post... (omitted for brevity)
    address:{'city': 'London', 'address1': '35-37 Par... (omitted for brevity)
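
Given records shaped like the output above, the open to-do items could be approached roughly as follows. This is a sketch under assumptions, not the answer's own code: flatten and its aliases parameter are hypothetical names, simple sequences such as titles are assumed to hold plain text and are joined into one string, and address1 is mapped onto address.

```python
def flatten(record, aliases=None):
    """Return one flat dict per address of a parsed record."""
    # Assumed mapping that unifies the two tag names used for the street line
    aliases = aliases or {"address1": "address"}

    base = {}
    for key, value in record.items():
        if key == "addresses":
            continue
        if isinstance(value, list):
            # Simple sequences, e.g. titles: assumed to be (tag, text) pairs
            base[key] = "; ".join(text or "" for _tag, text in value)
        else:
            base[key] = value

    rows = []
    # A record without addresses still produces one row
    for _tag, fields in record.get("addresses") or [("address", {})]:
        row = dict(base)  # fresh copy so rows do not share state
        for key, value in fields.items():
            row[aliases.get(key, key)] = value
        rows.append(row)
    return rows


# Hypothetical record modelled on the first entity of the sample file
record = {
    "id": "1124353",
    "name": "DAVID, Beckham",
    "titles": [("title", "Football player")],
    "addresses": [
        ("address", {"address": None, "city": "London"}),
        ("address", {"address1": "35-37 Parkgate Road", "city": "London"}),
    ],
}
rows = flatten(record)
```

A list of flat dicts like rows can be passed to pd.DataFrame(rows), which would give one line per address as asked for in the question.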

Tested with Python: 3.5 - lxml.etree: 3.7.1
