Concatenate XML tags to become a dataframe column name


Question

I am currently parsing an XML document and filling a dataframe from it. Suppose we have this toy XML:

<A>
  <AA>
      <AAA1 period='march'>ONE</AAA1>
      <AAA2>TWO</AAA2>
      <AAA3>THREE</AAA3>
      <AAA4>
           <B semester='4'>FOUR</B>
           <C>FIVE</C>
           <D>SIX</D>
      </AAA4>
  </AA>
</A>

What I am trying to get is something like: [{'A.AA.AAA1.period-march': 'ONE'}, {'A.AA.AAA2': 'TWO'}, {'A.AA.AAA3': 'THREE'}, {'A.AA.AAA4.B.semester-4': 'FOUR'}, {'A.AA.AAA4.C': 'FIVE'}, {'A.AA.AAA4.D': 'SIX'}], which would be much easier to work with.

I have already parsed the XML and transformed it into this form: [{'A': 'empty'}, {'AA': 'empty'}, {'AAA1': 'ONE'}, {'AAA2': 'TWO'}, {'AAA3': 'THREE'}, {'AAA4': 'empty'}, {'B': 'FOUR'}, {'C': 'FIVE'}, {'D': 'SIX'}]. The values of the parent tags are filled with 'empty' to mark them, so that the keys can be concatenated afterwards: whenever an 'empty' value is found, its key is saved as a prefix for the keys that follow, and so on.

I would appreciate any help. Thank you very much in advance.

Recommended answer

The tricky part is getting the path to the element you are interested in. One way to do this with XSLT is a recursive call to a named template that prints the names of the parent elements first.
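To make the recursion concrete, here is a minimal Python sketch of the same idea. It is not part of the original answer: it assumes the standard-library xml.etree.ElementTree module, a shortened copy of the toy XML, and a hypothetical helper named pwd (borrowing the name of the template in the stylesheet). Each element contributes its tag name plus any name-value attribute pairs, and the parents are resolved first.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the toy XML from the question (assumption: same shape).
XML = """<A>
  <AA>
      <AAA1 period='march'>ONE</AAA1>
      <AAA4>
           <B semester='4'>FOUR</B>
           <C>FIVE</C>
      </AAA4>
  </AA>
</A>"""

root = ET.fromstring(XML)

# ElementTree elements do not know their parent, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def pwd(elem):
    """Return the dotted path of elem, parents first (hypothetical helper)."""
    # This element's own segment: tag name plus ".name-value" per attribute.
    segment = elem.tag + "".join(f".{k}-{v}" for k, v in elem.attrib.items())
    parent = parent_of.get(elem)
    if parent is None:
        return segment  # reached the root element, recursion stops here
    return pwd(parent) + "." + segment


# Only elements that carry non-whitespace text are reported.
paths = [pwd(e) for e in root.iter() if e.text and e.text.strip()]
print(paths)
```

The XSLT version does the same walk with parent::* instead of a lookup table, because XPath can reach the parent node directly.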

The following uses this method to assemble string versions of the dictionaries and hand them to Python.

Here is the XSLT part, dataframe.xsl:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" />
    <xsl:strip-space elements="*" />

    <!-- match all elements that have text -->
    <xsl:template match="//*[text()]">
        <xsl:text>{'</xsl:text>
        <xsl:call-template name="pwd" />
        <xsl:text>': "</xsl:text>
        <xsl:value-of select="normalize-space(.)" />
        <xsl:text>"}&#xa;</xsl:text>
    </xsl:template>

    <!-- recursive template that prints parent element names -->
    <xsl:template name="pwd">
        <xsl:for-each select="parent::*">
            <xsl:call-template name="pwd" />
        </xsl:for-each>
        <xsl:if test="count(ancestor::*) > 0">
            <xsl:text>.</xsl:text>
        </xsl:if>
        <xsl:value-of select="name()" />
        <xsl:for-each select="@*">
            <xsl:value-of select="concat('.', name(), '-', .)" />
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>

To test the XSLT transformation with the xsltproc utility (part of libxslt, which is built on libxml2):

xsltproc dataframe.xsl source.xml
{'A.AA.AAA1.period-march': "ONE"}
{'A.AA.AAA2': "TWO"}
{'A.AA.AAA3': "THREE"}
{'A.AA.AAA4.B.semester-4': "FOUR"}
{'A.AA.AAA4.C': "FIVE"}
{'A.AA.AAA4.D': "SIX"}

Put it all together in Python, dataframe.py:

#!/usr/bin/env python3
import ast
from lxml import etree

# Compile the stylesheet; etree.parse accepts a file path directly.
transform = etree.XSLT(etree.parse('dataframe.xsl'))

# Apply the transformation; the text result holds one dict literal per line.
dataframe_str = str(transform(etree.parse('source.xml')))

# Safely evaluate each non-empty line into a Python dict.
dataframe_array = [ast.literal_eval(line)
                   for line in dataframe_str.splitlines() if line.strip()]

print(dataframe_array)

Result:

./dataframe.py
[{'A.AA.AAA1.period-march': 'ONE'}, {'A.AA.AAA2': 'TWO'}, {'A.AA.AAA3': 'THREE'}, {'A.AA.AAA4.B.semester-4': 'FOUR'}, {'A.AA.AAA4.C': 'FIVE'}, {'A.AA.AAA4.D': 'SIX'}]
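
Since the goal is to use these dotted paths as dataframe column names, one possible follow-up is to merge the single-key dictionaries into one row. This is a sketch and not part of the original answer: records is a hypothetical stand-in for dataframe_array, and merge_records is an illustrative helper name. With pandas installed, passing [row] to pandas.DataFrame would then give a one-row frame with these keys as columns.

```python
# Hypothetical stand-in for the dataframe_array produced by dataframe.py.
records = [
    {'A.AA.AAA1.period-march': 'ONE'},
    {'A.AA.AAA2': 'TWO'},
    {'A.AA.AAA3': 'THREE'},
    {'A.AA.AAA4.B.semester-4': 'FOUR'},
    {'A.AA.AAA4.C': 'FIVE'},
    {'A.AA.AAA4.D': 'SIX'},
]


def merge_records(items):
    """Merge single-key dicts into one row, refusing duplicate column names."""
    merged = {}
    for item in items:
        for key, value in item.items():
            if key in merged:
                # Repeated elements would silently overwrite each other otherwise.
                raise ValueError(f"duplicate column name: {key}")
            merged[key] = value
    return merged


row = merge_records(records)
columns = list(row)  # dicts keep insertion order, so this follows document order
print(columns)
print(row)
```

Raising on duplicates is a deliberate choice here: if the XML repeats an element at the same path, the paths alone are not unique column names and would need an index or similar suffix.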
